Same Symptom, Different Emergency
2026-03-31 · 12 min read
There is a particular failure mode in healthcare AI that keeps me up at night. A model that gives the same reassuring, competent-sounding response to two cases that look almost identical but require completely different clinical actions. A model that can't tell the difference.
A 2-week-old with a temperature of 100.4°F needs to be in an emergency room. A 3-year-old with the same temperature needs Tylenol and a movie. Same number on the thermometer. Wildly different clinical reality. If your LLM triage agent responds to both with "monitor at home and call your pediatrician in the morning," one of those families is going to be fine and the other is in serious danger. The terrifying part is that both responses would sound perfectly reasonable to a non-clinician. Both would pass a vibe check. Both would score well on most eval frameworks.
This is the problem that contrastive pair testing is designed to solve. The idea itself is not new. Contrast sets have a long history in NLP evaluation, going back to Gardner et al. (2020) and the broader tradition of minimal pair testing in linguistics. What I think is underexplored is applying this technique with domain-specific scoring dimensions designed for high-stakes clinical reasoning, and generating pairs systematically from the clinical knowledge your agent already needs to get right.
The poverty of isolated evals
The standard approach to evaluating an LLM is to run it through a set of test cases and grade each response independently. Did the model identify the emergency? Good, it passes. Did it recommend the right triage level? Great, check the box. This is necessary but nowhere near sufficient, for the same reason that testing a calculator by confirming it can compute 2 + 2 = 4 tells you nothing about whether it can distinguish 2 + 2 from 2 × 2.
Isolated evals answer the question: does the model get this case right? Contrastive evals answer a different and harder question: does the model understand why this case is different from that one?
The distinction matters because LLMs are spectacular at recognizing the gestalt of an emergency (the breathlessness, the urgency in the language, the clustering of scary-sounding symptoms). What they are not inherently good at is respecting the kind of sharp, quantitative decision boundaries that medicine is built on. A fever of 103.4°F in a toddler and a fever of 104.0°F in the same toddler may warrant different triage levels, not because 0.6 degrees is a lot, but because a clinical protocol draws a line there. The model doesn't know about the line unless it has internalized it, and the only way to test whether it has internalized it is to present both sides of the line and see what happens.
Three specific failure modes illustrate what falls through the cracks.
The reassurance machine. Some models develop a strong prior toward reassurance. They've seen thousands of conversations where the right answer was "this sounds normal, here's what to watch for." They become very good at generating warm, empathetic, reassuring text. They reassure parents of neonates with fevers. They reassure parents whose children are turning blue. The reassurance sounds great. It would score well on empathy, completeness, even clinical accuracy in a narrow sense (the facts they cite are usually correct). But it's applied indiscriminately. The model gives the same reassuring response to a 2-week-old with a fever and a 3-year-old with a fever, and only one of those families should be reassured.
The keyword detector. Other models learn to escalate based on the presence of specific trigger words ("not breathing," "seizure," "unconscious") without integrating those signals with the rest of the clinical picture. These models look excellent on isolated emergency scenarios because the emergencies in your test set contain trigger words (because of course they do). But present a genuine emergency described in a parent's casual language ("she's been really floppy and won't wake up for feeds") and the model hears a tired baby, not an altered mental status. Or present the trigger word in a benign context ("he held his breath during a tantrum and passed out for a second, he's totally fine now") and the model hears a 911 call.
The confident guesser. Perhaps the most insidious: a model that gets the right answer consistently but for the wrong reasons. The triage is correct, but the reasoning is either absent, generic, or based on the wrong variable. The model escalates the neonate not because it understands age-stratified fever protocols but because the word "2-week-old" appeared near "fever" often enough in its training data to create a statistical association. This model will be right until it encounters a neonate case phrased in a way it hasn't seen before, and when it's wrong, nobody will understand why because it was right about everything else.
The reassurance machine passes every test where reassurance is correct, which is most of them. The keyword detector passes every test that contains keywords, which is most of them. The confident guesser passes every test, period. You need a different kind of eval, one that tests whether the model can distinguish a case from a nearby case that requires a different answer.
How contrastive pairs work
The basic structure is simple enough to sketch on a napkin, which is usually a good sign.
You take two clinical scenarios that are identical in every respect except one distinguishing feature. You run both through your agent independently. Then you evaluate the pair of outputs across four dimensions. Each dimension exists because of a specific failure mode it catches: the three I described above, plus a fourth that's subtler.
Discrimination. Did the model assign different triage levels to the two cases? This is the bare minimum, and it's the dimension that catches the reassurance machine. If the model gave both cases the same warm, empathetic, clinically-accurate-sounding response, it has failed the most basic test of clinical reasoning. It's the equivalent of a radiologist who reads every scan as "looks fine." Technically right most of the time, which is exactly what makes it dangerous.
Direction. Did the model rank the more severe case as more severe? Discrimination alone isn't enough. A model could assign different tiers to the two cases and still get the ordering backwards, sending the neonate home and rushing the 3-year-old to the ER. It has distinguished the cases while being catastrophically wrong about which is which. This catches the keyword detector when it fires in the wrong direction: escalating the benign mention of a scary word while missing the real emergency described in plain language.
Feature attribution. Did the model actually reference the distinguishing feature in its reasoning? This is the dimension that catches the confident guesser, and it's the one I find myself thinking about the most. A model might get discrimination and direction right through sheer statistical luck, through correlations in training data that happen to align with clinical reality in this particular case. Feature attribution checks whether the model's reasoning is grounded in the right variable. If the distinguishing feature is the child's age and the model never mentions age in its response, you have a right answer for the wrong reasons, and wrong reasons eventually produce wrong answers.
Magnitude. Is the gap between the two triage levels proportional to the clinical significance of the difference? This catches a fourth failure mode that's easy to miss: the binary escalator. A model that has learned a simple "escalate / don't escalate" heuristic without any sense of proportionality. I'll come back to this one. It deserves its own section.
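Concretely, a pair evaluation reduces to a small scorer over the two triage outputs. Here's a minimal sketch, not a reference implementation: the five-tier scale, the names, and the crude token-overlap attribution check are all illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical five-tier triage scale: higher index = more severe.
TIERS = {"home_care": 0, "see_doctor": 1, "urgent_care": 2, "er_now": 3, "call_911": 4}

@dataclass
class PairScore:
    discrimination: bool  # did the two cases get different tiers at all?
    direction: bool       # is the severe case ranked strictly more severe?
    attribution: bool     # does the reasoning mention the distinguishing feature?
    magnitude: float      # actual tier gap / expected tier gap, capped at 1.0

def score_pair(mild_tier: str, severe_tier: str, severe_reasoning: str,
               feature: str, expected_delta: int) -> PairScore:
    delta = TIERS[severe_tier] - TIERS[mild_tier]
    # Crude attribution check: at least half the feature's tokens
    # appear somewhere in the model's reasoning for the severe case.
    tokens = feature.lower().split()
    hits = sum(1 for t in tokens if t in severe_reasoning.lower())
    return PairScore(
        discrimination=delta != 0,
        direction=delta > 0,
        attribution=hits >= len(tokens) / 2,
        magnitude=max(0.0, min(delta / expected_delta, 1.0)),
    )
```

For the allergy pair, `score_pair("home_care", "call_911", "Throat tightness suggests airway involvement...", "throat tightness", expected_delta=4)` would pass all four dimensions.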
Where pairs come from
There are two natural sources for contrastive pairs, and they test different things.
The first source is clinical protocols themselves. Medicine is full of branching logic: if the child is under 28 days old, treat any fever as an emergency; if older, assess the temperature. If breathing is labored, escalate; if breathing is fine, continue assessment. These branch points in clinical reasoning are natural contrastive pairs waiting to happen. You take a clinical scenario, hold everything constant, and vary the single factor that determines which path a clinician would follow.
This is not the same thing as building a brittle rule-based agent that walks a decision tree. The agent can reason however it wants: chain-of-thought, retrieval-augmented, fine-tuned intuition, whatever. The clinical protocols are the eval, not the implementation. You're testing whether the agent, by whatever means it arrives at its answer, reaches the same conclusion that established clinical reasoning would. The expected triage levels come from the protocol, which means you're testing against the same standard a human clinician would be held to.
A pediatric fever protocol that branches on age, then temperature, then the presence of red flags generates pairs naturally:
- Same fever, different age. Does the model apply age-stratified protocols? (The 2-week-old vs. 3-year-old example.)
- Same age, different temperature. Does it respect fever thresholds? (103.4°F vs. 104.0°F in a toddler.)
- Same age and temperature, with and without lethargy. Does it correctly weigh danger signs?
An allergic reaction protocol that branches on airway involvement, systemic spread, and epinephrine availability generates a different set:
- Localized hives vs. hives with throat tightness. Does the model recognize an airway emergency?
- Known allergy with epinephrine available vs. without. Does it account for rescue medication access?
- Local reaction vs. diffuse hives with vomiting. Does it distinguish local from systemic response?
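Branch points like these can be instantiated mechanically: hold a templated scenario constant and vary only the branching slot. The sketch below is one way to do it; the `Branch`/`Pair` structures, tier labels, and template slots are illustrative assumptions, not a real framework's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Branch:
    feature: str    # the single variable the protocol branches on
    low: str        # value on the milder side of the boundary
    high: str       # value on the more severe side
    low_tier: str   # expected triage tier for the milder case
    high_tier: str  # expected triage tier for the more severe case

@dataclass
class Pair:
    mild_case: str
    severe_case: str
    feature: str
    expected_tiers: Tuple[str, str]

def pairs_from_branches(template: str, base: Dict[str, str],
                        branches: List[Branch]) -> List[Pair]:
    """One contrastive pair per branch point: every slot in `base` is
    held constant and only the branching feature is varied."""
    return [
        Pair(
            mild_case=template.format(**{**base, b.feature: b.low}),
            severe_case=template.format(**{**base, b.feature: b.high}),
            feature=b.feature,
            expected_tiers=(b.low_tier, b.high_tier),
        )
        for b in branches
    ]
```

With a fever template like `"My {age} has had a fever of {temp} since this morning."`, a single `Branch("age", "3-year-old", "2-week-old", "home_care", "er_now")` yields the neonate pair from the opening example, with expected tiers attached straight from the protocol.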
The second source is mutations. You take an existing test scenario and introduce a controlled perturbation: inject a red flag symptom, change the child's age to newborn, add a complicating factor. The original scenario is one side of the pair; the mutated version is the other. This is classical mutation testing borrowed from software engineering, where you introduce bugs into code and check whether your test suite catches them.
Mutation pairs tend to be more adversarial than protocol pairs. They test whether the model can detect a single critical signal buried in an otherwise routine conversation. A parent casually mentioning that their child swallowed a button battery in the middle of describing what sounds like a stomach bug. A teenager's offhand comment about self-harm dropped into a conversation about acne.
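Generating these scales well: cross a catalog of perturbation rules with your existing regression scenarios. A sketch, with rules and names that are purely illustrative:

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical mutation catalog: each rule rewrites a base scenario
# to inject exactly one critical signal.
MUTATIONS: Dict[str, Callable[[str], str]] = {
    "red_flag_lethargy": lambda s: s + " She has also been unusually floppy and hard to wake.",
    "age_to_neonate": lambda s: s.replace("3-year-old", "2-week-old"),
    "swallowed_battery": lambda s: s + " Oh, and she may have swallowed a button battery yesterday.",
}

def mutation_pairs(scenarios: List[str]) -> List[Tuple[str, str, str]]:
    """Cross every base scenario with every mutation rule.
    Returns (original, mutated, mutation_name) triples, skipping
    rules that didn't change the scenario (e.g. no age to rewrite)."""
    pairs = []
    for s in scenarios:
        for name, mutate in MUTATIONS.items():
            mutated = mutate(s)
            if mutated != s:
                pairs.append((s, mutated, name))
    return pairs
```

Each new base scenario picks up the whole catalog for free, which is what makes mutation pairs cheap to maintain relative to hand-authored ones.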
Protocol pairs test whether the model has internalized the standard clinical reasoning. Mutation pairs test whether it can spot the exception that overrides everything else. Both are necessary. Neither is sufficient alone.
The feature attribution problem
Of the four scoring dimensions, feature attribution deserves a closer look, because it gets at a question that extends well beyond healthcare: is the model reasoning or pattern-matching?
Here's the allergy example in full. You present the model with two cases: a child with localized hives after eating peanut butter, and the same child with hives plus throat tightness and difficulty swallowing. The model correctly classifies the first as home care and the second as an emergency. Discrimination: pass. Direction: pass. But when you look at the model's reasoning for the emergency case, it says something like "given the severity of the allergic reaction described, this warrants immediate emergency evaluation." It never mentions the airway. It never explains why throat tightness transforms a manageable allergic reaction into a potential anaphylaxis emergency.
This model has learned that longer descriptions of allergic reactions with more symptoms correlate with higher severity. That's true on average, but it's the wrong rule. The right rule is specific: airway involvement in an allergic reaction is an emergency regardless of how many other symptoms are present. A child with only throat tightness and no hives at all is a bigger emergency than a child covered in hives who's breathing fine. The model that reasons from "more symptoms = more severe" will get this backwards.
The implementation is deliberately simple: tokenize the distinguishing feature into its component words, normalize the model's response, and check whether at least half the feature tokens appear in the output. This is a bag-of-words check, and it has obvious limitations. A model that says "regardless of the child's age" has mentioned "age" without actually attributing its decision to age. A model that says "neonatal sepsis protocols apply here" has attributed correctly without using the word "age" at all. In practice, these edge cases are rarer than you'd expect. When a model is reasoning about the right variable, it tends to name it, and when it's not, it tends to be conspicuously absent. The 50% token threshold is a heuristic, not a proof. But a heuristic that catches the confident guesser 80% of the time is vastly better than not checking at all, which is what most eval frameworks do.
If you need higher fidelity, you can replace the bag-of-words check with an LLM-as-judge call that asks "did the model's reasoning reference [feature] as a factor in its decision?" But I'd start simple and escalate only when the false positive or false negative rate on the simple check becomes a problem in your specific domain.
Magnitude and the problem of proportionality
Magnitude scoring catches a class of failures that the other three dimensions miss entirely, and it's the dimension that most eval frameworks would never think to include.
Consider two contrastive pairs:
- Localized hives vs. hives with throat tightness and difficulty breathing. Expected severity delta: 4 tiers (home care to 911 emergency).
- Fever of 102.8°F vs. fever of 104.5°F in a 4-year-old. Expected severity delta: 1 tier (home care to urgent).
A model that escalates both by the same amount has failed to calibrate its responses to the clinical significance of the distinguishing feature. Throat tightness in an allergic reaction is a potential airway emergency, a 4-tier jump. A 1.7-degree fever increase is concerning but not catastrophic, a 1-tier nudge. A model with no sense of proportionality will either over-escalate routine cases (alarm fatigue, wasted ER visits, parents who learn to ignore the system) or under-escalate genuine emergencies (the scenario that harms children).
Magnitude is scored as min(actual_delta / expected_delta, 1.0), capped at 1.0 to avoid rewarding over-escalation. A score of 0.5 means the model produced half the expected severity gap. A score of 0 means it got the direction wrong entirely. It's a single number, but it tells you more about the clinical calibration of your model than a hundred isolated test cases, because it measures something that no individual test case can measure: whether the model's internal sense of severity is scaled correctly relative to clinical reality.
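As code, with the cap and the wrong-direction case made explicit (the function name is illustrative):

```python
def magnitude_score(actual_delta: int, expected_delta: int) -> float:
    """Proportionality of the model's severity gap to the clinically
    expected gap. Capped at 1.0 so over-escalation isn't rewarded;
    zero or wrong-direction deltas score 0. Assumes expected_delta > 0."""
    if actual_delta <= 0:
        return 0.0
    return min(actual_delta / expected_delta, 1.0)
```

A binary escalator that nudges every pair by one tier scores `magnitude_score(1, 4) == 0.25` on the airway pair, a failure that discrimination and direction alone would never surface.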
When contrastive evals mislead
Every eval technique has blind spots, and pretending otherwise would make this post less useful. Here are the ones I've hit.
Pairs assume independence, but features interact. A contrastive pair that varies age while holding temperature constant assumes age and temperature contribute independently to severity. In reality, fever thresholds are different for different age groups. 100.4°F is an emergency in a neonate but unremarkable in a toddler, while 104.5°F is concerning at any age. If you construct a pair that varies age while holding temperature at 104.5°F, both sides might legitimately warrant the same triage level, and your pair would incorrectly flag a "discrimination failure." The fix is careful pair design: vary the feature at a point where it actually changes the correct outcome, not at an extreme where both sides converge. This requires understanding the clinical logic, which is the same requirement you'd have for writing any good test.
The distinguishing feature isn't always singular. Some clinical boundaries depend on the conjunction of multiple factors. A 102°F fever alone might be home care, but a 102°F fever in a child who was lethargic this morning and hasn't urinated in 8 hours is a different picture. Contrastive pairs that vary a single feature test single-feature reasoning, which is the common case but not the only one. For conjunctive reasoning, you need pairs where the mutation is the addition or removal of the second factor while the first is held constant. This is doable but requires more careful construction.
Magnitude scoring assumes a linear severity scale. The formula actual_delta / expected_delta treats the distance from "home care" to "urgent" as the same as the distance from "urgent" to "emergency." In clinical reality, the jump from "schedule this week" to "go to urgent care today" is a smaller deal than the jump from "urgent care" to "call 911." A log-scale or custom weighting might be more appropriate for your severity taxonomy. I use a linear scale because it's simple and the signal-to-noise ratio is already high enough, but this is a place where domain-specific calibration could improve things.
Building your own contrastive suite
If you're building a healthcare LLM and want to add contrastive pair testing, here's how I'd approach it.
Start with your clinical protocols. Whatever clinical reasoning your agent is supposed to follow, each point where the reasoning branches based on a single variable is a contrastive pair. You don't need to generate these creatively. They're already implicit in the clinical logic. The expected triage levels come directly from the protocol. If you don't have formalized protocols, start with a single clinical area (fever is a good one: common, well-understood, clear thresholds) and build out from there.
Add mutation pairs for robustness. Build a catalog of clinical perturbations: age changes (especially to neonate), red flag injections (breathing difficulty, altered consciousness, self-harm ideation), severity escalators (known allergy without epinephrine available). Apply each perturbation to your existing regression scenarios. The original is one side of the pair; the mutated version is the other. This scales well because each mutation rule can be applied to multiple base scenarios.
Score all four dimensions. Discrimination and direction are table stakes. Feature attribution and magnitude are where the real signal lives. If you're tempted to skip them because they're harder to implement, consider that the failure modes they catch (the confident guesser and the binary escalator) are precisely the ones that look fine in every other eval.
Expect to maintain the suite. Clinical protocols change, new evidence emerges, thresholds get revised. A fever protocol updated to lower the neonatal emergency threshold from 100.4°F to 100.0°F invalidates every pair built around the old boundary. If your pairs are generated from a machine-readable representation of the clinical logic, updating the source updates the pairs. If they're handwritten YAML files, you have a maintenance burden proportional to your pair count. For a triage agent covering the most common pediatric presentations, I'd expect somewhere in the range of 50 to 200 pairs to achieve useful coverage, weighted toward the areas where your agent handles the most volume.
Track suite-level metrics over time. Individual pair results are useful for debugging; suite-level rates are useful for decision-making. A discrimination rate of 96% means the model can almost always tell cases apart. A feature attribution rate of 72% means it's getting the right answer for the wrong reasons more than a quarter of the time. That gap between getting it right and knowing why it's right is where the real risk lives, and it's invisible without contrastive testing.
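Aggregation is a one-liner per dimension. A sketch assuming each pair result is a flat dict with the four scores:

```python
from statistics import mean
from typing import Dict, List

def suite_metrics(results: List[Dict]) -> Dict[str, float]:
    """Roll per-pair scores up into suite-level rates for tracking
    across model versions. Booleans average to pass rates."""
    return {
        "discrimination_rate": mean(r["discrimination"] for r in results),
        "direction_rate": mean(r["direction"] for r in results),
        "attribution_rate": mean(r["attribution"] for r in results),
        "mean_magnitude": mean(r["magnitude"] for r in results),
    }
```

The interesting dashboards are the deltas between rates: a high discrimination rate paired with a low attribution rate is the quantitative fingerprint of a suite full of confident guessers.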
Beyond healthcare
I've framed this in terms of healthcare because that's where I built it and where the stakes are most visceral. The underlying pattern (test whether a model can distinguish similar inputs that require different outputs) generalizes to any domain with precise, consequential boundaries.
The clearest parallel is content moderation. "I'm going to kill it at this presentation" and "I'm going to kill him at this presentation" differ by a single pronoun. One is enthusiasm, the other is a threat. A content moderation model that can't distinguish them has the same problem as a triage model that can't distinguish a neonate fever from a toddler fever: it has learned the gestalt without learning the boundary. You'd build pairs the same way. Hold the context constant, vary the word or phrase that flips the classification, score on discrimination, direction, and feature attribution. Magnitude is less relevant here (content moderation is closer to binary), but the other three dimensions transfer directly.
The technique also applies wherever regulatory or contractual thresholds create sharp boundaries: financial compliance (transaction amount above or below a reporting threshold), legal analysis (a clause that does or doesn't trigger indemnification), insurance underwriting (a pre-existing condition that does or doesn't affect coverage). In each case, the question is the same: does the model understand the boundary, or does it happen to land on the right side of it in your test data?
Contrastive pair testing won't tell you everything about your model. But it will tell you the one thing that matters most in high-stakes domains: whether your model knows what it's looking at, or whether it's very good at guessing. In healthcare, the distance between a good guess and a right answer is measured in outcomes that no eval score can undo.