Sunday, June 21, 2026

Healthcare Benchmarks Are Only as Good as Their Assumptions – Machine Learning Blog | ML@CMU


In healthcare settings where patients use LLMs as a medical assistant, LLM performance differs between evaluation and deployment. (a) Bean et al. (2025) find a 61 percentage point difference between evaluation and deployment. (b) We argue this gap arises not from poorly designed benchmarks, but from implicit assumptions embedded in evaluation protocols that fail to hold at deployment. (c) We propose a taxonomy that categorizes assumptions into two types, task and outcome, to diagnose where the gap arises and what is required to close it. Closing the gap requires making assumptions explicit, testing which assumptions hold, and updating evaluation protocols accordingly.

Healthcare LLM benchmarks are one of the primary paradigms by which LLMs are evaluated before use in medical settings. Benchmarks provide a stable goalpost that allows researchers to iterate quickly and measure progress consistently. However, in high-stakes domains like healthcare, that same abstraction becomes a liability. For example, a recent study found a 61 percentage point drop in accuracy when going from evaluation to deployment (see Figure). In this setting, patients use LLMs as a medical assistant to better understand their symptoms, identify the underlying condition, and take appropriate actions.

Moreover, the results showed that patients given access to a highly capable model as a medical assistant did no better at self-diagnosis than those without any model. That is, access to an LLM had no significant impact on patient understanding. The implication isn’t that the model underperformed. Rather, it’s that what we evaluate is disconnected from what matters in deployment. For example, during evaluation we ask “does the model get the right answer?” while during deployment we ask “does the patient act correctly on what the model tells them?”

We argue that this gap arises because of implicit assumptions embedded in evaluation that don’t hold in the real world. That is, the scenario that the benchmarks intend to capture and the real-world scenario differ due to implicit assumptions. This difference in turn challenges evaluation validity. In particular, we classify assumptions into two types: task, which concerns assumptions on conversation data, and outcome, which concerns assumptions over human behavior and outcomes. To address this, we propose a framework called BenchmarkCards that makes these assumptions explicit so practitioners can determine when benchmark results transfer to deployment.

Understanding the Evaluation–Deployment Gap through Assumptions

As an example of what our framework looks like, in Figure 1 we demonstrate our position in a healthcare setting where LLM-as-medical-assistant performance differs between evaluation and deployment, dropping from 95% to 34% (Bean et al., 2025). During evaluation, the model was given doctor-written, single-turn scenarios (one question, one answer, no follow-up) and asked to produce a diagnosis. During deployment, patients interacted with the model in a back-and-forth manner, and success was measured by whether they could correctly identify their diagnosis afterward.

In this setting, three assumptions underlie the gap:

  1. Query Distribution – Evaluation uses doctor-written queries, while real patients produce queries that may be incomplete or imprecise. 
  2. Interaction Type – Evaluation features single-turn interactions, while real deployments involve back-and-forth dialogue. 
  3. Decision Mediation – Evaluation measures whether the LLM produces the correct diagnosis, while deployment measures whether the patient acts on it correctly.

We note that these are broad categories of assumptions that are present across evaluation settings, and we return to them when introducing BenchmarkCards.

Stating benchmark assumptions explicitly allows us to estimate how much each assumption contributes to the evaluation-deployment gap, for example by measuring how the same LLM performs on multi-turn interactions versus single-turn ones. Doing so in our running example shows that the 61 percentage point gap between evaluation and deployment can be broken down into 12 points due to query distribution, 19 points due to interaction type, and 30 points due to decision mediation.
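The arithmetic behind this breakdown can be checked with a short sketch. The numbers are the ones quoted above; the function and variable names are hypothetical, and treating the three contributions as simply additive is an assumption of the sketch rather than a claim about how the underlying study was analyzed.

```python
# Accuracy figures quoted in this post (Bean et al., 2025), in percent.
EVALUATION_ACCURACY = 95
DEPLOYMENT_ACCURACY = 34

# Per-assumption contributions quoted in this post, in percentage points.
CONTRIBUTIONS = {
    "query distribution": 12,
    "interaction type": 19,
    "decision mediation": 30,
}


def decompose_gap(eval_acc, deploy_acc, contributions):
    """Return the total gap, each assumption's share of it, and the residual.

    Treating the contributions as additive is a simplifying assumption
    of this sketch.
    """
    total_gap = eval_acc - deploy_acc
    shares = {name: points / total_gap for name, points in contributions.items()}
    residual = total_gap - sum(contributions.values())
    return total_gap, shares, residual


total_gap, shares, residual = decompose_gap(
    EVALUATION_ACCURACY, DEPLOYMENT_ACCURACY, CONTRIBUTIONS
)
print(f"total gap: {total_gap} points, unexplained: {residual} points")
for name, share in shares.items():
    print(f"{name}: {share:.0%} of the gap")
```

Under these figures the three contributions account for the whole gap, and decision mediation alone makes up roughly half of it.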

That last number reflects something no benchmark can observe: whether patients actually follow what the model tells them. Unlike the first two assumptions, which concern how the task is structured, decision mediation depends entirely on human behavior. A model may correctly diagnose appendicitis, but if the patient dismisses the recommendation, the outcome is the same as a wrong answer. Even a perfectly designed benchmark can’t capture this failure mode, which means model evaluators, deployers, and users need a different way of thinking about assumptions altogether.

When assumptions go unstated, the very purpose of benchmark evaluation (quantifying and comparing model performance to guide deployment decisions) is defeated: practitioners have no way to assess whether benchmark results hold in their setting, or whether any available benchmark provides reliable guidance at all.

Closing the Gap through Benchmark Cards and Staged Evaluation

Assumptions fall into two categories, task and outcome, which differ based on whether they can be tested with conversation data alone. For example, assumptions on whether conversations are single- or multi-turn are task assumptions, while assumptions over proxy vs. clinical metrics are outcome assumptions.

More generally, we can view assumptions as clustering into two types: task and outcome. Task assumptions concern whether the benchmark faithfully represents the conditions of deployment. For example, if real-world conversations are multi-turn, does the benchmark reflect this? Outcome assumptions concern whether the benchmark’s evaluation criterion matches what actually matters in the real world. For example, a benchmark might measure LLM decision-making, while real-world performance depends on what the user does afterward.

Critically, we note that tackling outcome assumptions requires running real-world behavioral experiments. Task assumptions can be addressed by building benchmarks that more closely resemble real-world conversations, but outcome assumptions depend on human behavior that no benchmark can simulate. Understanding whether users act on LLM recommendations, for instance, requires actually observing them do so.

Closing the gap requires two pieces of information: what assumptions a benchmark makes, and whether those assumptions hold in a specific deployment context. To address the first point, we propose BenchmarkCards, structured documentation that benchmark designers fill out alongside their benchmark datasets to answer questions about their evaluation protocol without anticipating any particular downstream use (see Table). A practitioner facing a deployment decision then uses the cards to assess which assumptions hold in their setting and identify which benchmarks most closely match their use case. When no existing benchmark fits well, the card makes that gap visible and signals to the community where new benchmarks are needed.

A BenchmarkCard is filled out once by benchmark designers, explicitly documenting the assumptions built into their evaluation. A practitioner then uses it to assess which assumptions hold in their specific deployment context. The left columns document what the benchmark assumed; the right column shows where these assumptions broke down in this deployment.
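To make the idea concrete, here is a minimal sketch of how a card and a deployment check might be represented in code. This is not the schema from the paper: the class names, field names, and example values are all hypothetical, chosen only to mirror the three assumptions from the running example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumption:
    name: str             # e.g. "interaction type"
    kind: str             # "task" or "outcome"
    benchmark_value: str  # what the benchmark assumed


@dataclass(frozen=True)
class BenchmarkCard:
    benchmark: str
    assumptions: tuple

    def mismatches(self, deployment):
        """Return assumptions that the deployment context does not confirm.

        An assumption missing from `deployment` counts as a mismatch,
        since it has not been verified.
        """
        return [
            a for a in self.assumptions
            if deployment.get(a.name) != a.benchmark_value
        ]


# Hypothetical card for the running example; values paraphrase the post.
card = BenchmarkCard(
    benchmark="single-turn diagnosis benchmark",
    assumptions=(
        Assumption("query distribution", "task", "doctor-written"),
        Assumption("interaction type", "task", "single-turn"),
        Assumption("decision mediation", "outcome", "model diagnosis scored"),
    ),
)

# Hypothetical deployment context filled in by a practitioner.
deployment = {
    "query distribution": "patient-written",
    "interaction type": "multi-turn",
    "decision mediation": "patient action scored",
}

broken = card.mismatches(deployment)
print([(a.name, a.kind) for a in broken])
```

Counting unlisted assumptions as mismatches is a deliberately conservative choice in this sketch: an assumption nobody has checked is treated the same as one known to fail.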

Once assumptions are identified, we propose staged evaluation: an iterative process where assumptions are tested one at a time and evaluation protocols are updated accordingly. The stages are:

  1. Compare BenchmarkCards against Deployment – Use BenchmarkCards to identify which assumptions hold and which don’t. 
  2. Collect Data for Task Assumptions – For example, collect data on real user interactions to capture the difference in query distribution. This augments a pre-existing benchmark so it’s more applicable to a real-world setting. 
  3. Test Task Assumptions – Measure performance degradations and, for assumptions with large drops, improve the model or collect more targeted data. Once task assumptions are satisfied, move to outcome assumptions.
  4. Test Outcome Assumptions – Using domain expertise, prioritize which outcome assumptions matter most, then run behavioral studies or randomized controlled trials (RCTs) to test them.
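The ordering in stages 3 and 4 can be sketched as a small planning helper. Everything here is illustrative: `plan_stages` and the `threshold` cutoff for a "large" drop are hypothetical, and the drop values reuse the task-assumption figures from the running example.

```python
def plan_stages(measured_drops, outcome_assumptions, threshold):
    """Order the work implied by stages 3 and 4.

    measured_drops: task assumption name -> accuracy drop in percentage points
    outcome_assumptions: names that need behavioral studies or RCTs
    threshold: hypothetical cutoff for what counts as a "large" drop
    """
    large = {name: drop for name, drop in measured_drops.items() if drop >= threshold}
    # Largest drops first, so the most damaging task assumption is handled first.
    fix_first = sorted(large, key=large.get, reverse=True)
    return {
        "task_assumptions_to_fix": fix_first,
        # Outcome assumptions are tested only after task assumptions are addressed.
        "outcome_assumptions_to_study": list(outcome_assumptions),
    }


plan = plan_stages(
    measured_drops={"query distribution": 12, "interaction type": 19},
    outcome_assumptions=["decision mediation"],
    threshold=10,
)
print(plan)
```

In practice the threshold and the prioritization of outcome assumptions would come from domain expertise rather than a fixed number.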

A Call to Action

Better benchmarks are necessary but not sufficient for deploying LLMs safely in healthcare. The fix requires benchmark designers to state plainly what their evaluation does and doesn’t capture, practitioners to check those assumptions against their deployment context, and the community to build the infrastructure that makes this standard procedure rather than exceptional effort. The ask looks different depending on where you sit. For AI teams considering deployment: test assumptions before you ship, not after; don’t wait for real-world failure to tell you where your evaluation fell short. For researchers building the next healthcare benchmark: document your assumptions, so future users can judge for themselves whether your evaluation applies to their setting. For clinicians: treat high benchmark numbers as a starting point for conversation, not a green light.

Acknowledgements: This blog post is based on our paper Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions, co-authored with Santiago Cortes-Gomez, Mateo Dulce Rubio, Fei Fang, and Bryan Wilder. Many thanks to Lawrence Jang, Amanda Coston, Luke Guerdan, Sang Truong, and Tori Qiu for their comments on this work.
