Figure 1: Our framework for validating LLM-as-a-judge systems under rating indeterminacy, where items in a subjective rating task can have multiple "correct" ratings. Our framework provides guidance on (i) how to structure rating tasks to capture rater disagreement, (ii) how to aggregate disagreement into labels, and (iii) how to measure agreement between humans and a judge system. We validate judge systems using general-purpose human–judge agreement metrics (left) and on downstream evaluation tasks that judges often perform once deployed (right).
The LLM-as-a-judge paradigm, where a judge GenAI system rates the outputs of a target GenAI system, is becoming a common approach for scaling up evaluation workflows. This approach is often used when evaluating subjective properties that cannot be checked by code-based evaluators, such as helpfulness, relevance, sycophancy, toxicity, or factual consistency. As judge systems become more widely deployed, it is critical to validate that they produce trustworthy evaluations, a process known as meta-evaluation.
A major challenge when validating judge systems for these subjective rating tasks is rating indeterminacy: cases where more than one rating could be "correct" depending on how a rater interprets the instructions. For example, consider a target system that responds to "How serious is this issue?" with "That's a rookie mistake. Only an amateur would do that." When asked whether this output is toxic, a human rater might reasonably label it as toxic (dismissive and belittling) or non-toxic (direct but appropriate feedback). Beyond toxicity, rating indeterminacy arises across many common rating tasks, such as factuality, helpfulness, and relevance classification.
Despite the prevalence of rating indeterminacy, most existing meta-evaluation approaches for closed-form rating tasks (e.g., MCQ, Yes/No, Likert) rely on forced-choice rating instructions, which require raters to select a single "correct" option, even when multiple options could be reasonable. Any disagreement among raters is consolidated into a "hard" label and used to measure categorical agreement (e.g., Lu & Zhong, 2024; Jung, Brahman & Choi, 2024; Es et al., 2023). Because this approach to meta-evaluation eliminates important information about rating indeterminacy, it can lead to misleading conclusions about judge performance.
More broadly, when rating indeterminacy is present, three fundamental questions arise for meta-evaluation:
- Rating Elicitation: How should we collect ratings from humans and a judge system when more than one option could be "correct"?
- Rating Aggregation: How should we encode human rating disagreement in labels?
- Measuring Agreement: How should we measure human–judge agreement in the presence of rating indeterminacy?
To address these questions, we developed a framework for judge-system meta-evaluation under rating indeterminacy (Figure 1). Our framework is situated within a rich literature on perspectivism in HCI and NLP, which views rater disagreement as a signal to be preserved rather than attenuated (Plank, 2022; Fleisig, 2024). While perspectivist approaches to evaluation have traditionally focused on capturing inter-rater disagreement, where multiple human raters can disagree due to sociocultural differences, our framework also captures intra-rater disagreement, where the same rater can identify multiple "correct" ratings.
A Framework for Meta-Evaluation under Rating Indeterminacy
We now turn to our first question: how should ratings be collected from humans and a judge system under rating indeterminacy? In answering, we distinguish between two different ways of collecting ratings: forced-choice elicitation and response set elicitation.
Forced-choice elicitation instructs a rater (human or judge system) to select exactly one option from \(\mathcal{O}\), the set of possible options. Response set elicitation allows raters to select all options they consider reasonable. Formally, this means selecting an option subset \(\mathcal{S}\) drawn from \(\mathcal{Q}\), where \(\mathcal{Q}\) contains all possible combinations of options. For example, in our toxicity task from Figure 1:
- \(\mathcal{O}\) = {Yes, No} defines two standard options.
- \(\mathcal{Q}\) = {{Yes}, {No}, {Yes, No}} includes the singleton response sets and the response set containing both Yes and No.
Under forced-choice elicitation, a rater must pick either Yes or No even when both seem valid. Under response set elicitation, they can express this uncertainty via the response set \(\mathcal{S}\) = {Yes, No}.
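To make the two formats concrete, here is a minimal Python sketch (illustrative only, not code from our released implementation) of the options, the response sets in \(\mathcal{Q}\), and an example rating under each elicitation scheme:

```python
from itertools import combinations

# Options for the binary toxicity task from Figure 1.
options = ["Yes", "No"]

# Q: every non-empty combination of options a rater could endorse
# under response set elicitation.
response_sets = [
    frozenset(combo)
    for r in range(1, len(options) + 1)
    for combo in combinations(options, r)
]
# -> [{"Yes"}, {"No"}, {"Yes", "No"}]

# Forced-choice elicitation: exactly one option.
forced_choice_rating = "Yes"

# Response set elicitation: any non-empty subset of options, e.g. both
# options when a rater finds both interpretations defensible.
response_set_rating = frozenset({"Yes", "No"})
```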
We argue that under rating indeterminacy, we should aim for high agreement with respect to response set ratings, not forced-choice ratings. This makes the downstream user the arbiter of how indeterminacy should be resolved for their application. In content moderation, when an item is toxic under one interpretation but non-toxic under another, the platform may want to err on the side of caution and filter it; a decision that may not align with how humans or a judge system happen to resolve rating indeterminacy when presented with a forced-choice instruction.

But how exactly does forcing a single choice lose information about rating indeterminacy? We model this through a simple probabilistic framework, illustrated above. The left panel illustrates the translation from raters' response set ratings to forced-choice ratings:
- The response set distribution \(\boldsymbol{\theta}_i^*\) models how likely a rater is to select each combination of options for the \(i\)'th item during response set elicitation. For example, \(\boldsymbol{\theta}_i^*\) = [0.3, 0.2, 0.5] indicates that 30% of raters would endorse \(\mathcal{S}\) = {Yes, No} in response set elicitation.
- The forced-choice translation matrix \(\mathbf{F}_i\) describes the probability of a rater choosing an option as a forced-choice rating given that it is included in their response set. For example, in the figure above, the top left entry of \(\mathbf{F}_i\) shows a 50% chance of a rater choosing Yes as a forced-choice rating given that both Yes and No were in their response set.
- The forced-choice distribution \(\mathbf{O}_i\) shows the distribution over forced-choice options. For example, the vector \(\mathbf{O}_i\) = [0.35, 0.65] denotes a 35% chance of a rater selecting Yes and a 65% chance of selecting No as a forced-choice rating.
Together, these ingredients define a system of equations \(\mathbf{O}_i = \mathbf{F}_i \boldsymbol{\theta}_i\) expressing how we can decompose the forced-choice ratings typically used for meta-evaluation into (1) the response set distribution, and (2) spurious error introduced by the forced-choice selection process. While prior work has investigated ways of validating traditional machine learning models (Uma et al., 2020; Peterson et al., 2019) and judge systems (Elangovan et al., 2024) under inter-rater disagreement (i.e., via the forced-choice distribution \(\mathbf{O}_i\)), these approaches do not account for intra-rater disagreement that arises when a single rater identifies more than one correct option.
More formally, the system \(\mathbf{O}_i = \mathbf{F}_i \boldsymbol{\theta}_i\) is underdetermined in rating tasks where there are more response sets than options; that is, when \(|\mathcal{Q}| > |\mathcal{O}|\). For instance, in our running toxicity example with \(\mathcal{O}\) = {Yes, No}, raters can select the response set \(\mathcal{S}\) = {Yes, No} when they decide that both interpretations are valid, meaning that \(|\mathcal{Q}| = 3 > 2 = |\mathcal{O}|\). This has a worrying implication: without knowing how raters resolve indeterminacy (the item-specific translation matrix \(\mathbf{F}_i\)), we cannot recover the "true" response set distribution from forced-choice data alone.
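The identifiability problem can be seen in a few lines of NumPy. The sketch below uses the illustrative numbers from the running example, assumes the response sets in \(\mathcal{Q}\) are ordered as [{Yes, No}, {Yes}, {No}], and is not code from our implementation:

```python
import numpy as np

# Response set distribution theta_i over Q, ordered [{Yes, No}, {Yes}, {No}]
# (illustrative values from the running example).
theta = np.array([0.3, 0.2, 0.5])

# Forced-choice translation matrix F_i: rows are forced-choice options
# [Yes, No], columns are response sets. A rater holding {Yes, No} picks
# Yes with probability 0.5 in this example.
F = np.array([
    [0.5, 1.0, 0.0],  # P(pick Yes | response set)
    [0.5, 0.0, 1.0],  # P(pick No  | response set)
])

# Forced-choice distribution O_i = F_i @ theta_i.
print(F @ theta)  # [0.35, 0.65], matching the example above

# The inverse problem is underdetermined when |Q| > |O|: a different
# response set distribution yields the same forced-choice distribution.
theta_alt = np.array([0.0, 0.35, 0.65])
print(F @ theta_alt)  # also [0.35, 0.65]
```

Because two different response set distributions are consistent with the same forced-choice data, the "true" distribution cannot be recovered without knowing \(\mathbf{F}_i\).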
Implication: Aggregating Disagreement into Labels
With this identifiability analysis in mind, we now return to our second meta-evaluation question: how should we aggregate rater disagreement into a label? While it may be tempting to encode the forced-choice distribution into a soft label vector (i.e., the distribution of raters' forced-choice ratings), in general, this representation cannot disentangle meaningful disagreement arising from rating indeterminacy from spurious variation introduced by forced-choice selection.
The right panel of Figure 3 illustrates our solution. Rather than relying on an unknown forced-choice translation process, we use a fixed option lookup table \(\boldsymbol{\Lambda}\) to map the response set distribution to a multi-label vector \(\boldsymbol{\Omega}_i\). Each entry in this continuous vector describes the probability that raters include the corresponding option in their response set.
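Because \(\boldsymbol{\Lambda}\) is just a binary option-membership table, this aggregation step is a single matrix product. A minimal sketch, using the same illustrative ordering of \(\mathcal{Q}\) as above:

```python
import numpy as np

# Response set distribution over Q = [{Yes, No}, {Yes}, {No}].
theta = np.array([0.3, 0.2, 0.5])

# Option lookup table Lambda: rows are options [Yes, No], columns are
# response sets; an entry is 1 if the option belongs to that response set.
Lam = np.array([
    [1, 1, 0],  # Yes appears in {Yes, No} and {Yes}
    [1, 0, 1],  # No  appears in {Yes, No} and {No}
])

# Multi-label vector Omega_i: probability that raters include each option
# in their response set.
print(Lam @ theta)  # [0.5, 0.8]
```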
Implication: Measuring Human-Judge Agreement
Our third meta-evaluation question naturally follows: how should we measure agreement between humans and judge systems when using a multi-label vector? Distributional metrics like KL-Divergence would be natural choices if we were comparing soft label distributions. But, as we have just shown, soft labels derived from forced-choice ratings conflate meaningful intra-rater disagreement with forced-choice selection artifacts. This is a concern given growing literature recommending distributional metrics be used for judge system meta-evaluation on subjective tasks (Elangovan et al., 2024; Chen et al., 2025). While these agreement metrics preserve inter-rater disagreement, they remain vulnerable to forced-choice selection artifacts.
To measure human–judge agreement while accounting for rating indeterminacy, we leverage continuous metrics defined on multi-label vectors. Specifically, we use Mean Squared Error
$$ MSE = \mathbb{E}\left[ \| \boldsymbol{\Omega}_i^H - \boldsymbol{\Omega}_i^J \|^2_2 \right], $$
which measures the expected distance between human and judge multi-label vectors over the evaluation dataset. This metric rewards judge systems that identify the same set of plausible interpretations as humans. When humans are split on whether an output is toxic (\(\boldsymbol{\Omega}_i^H = [0.8, 0.5]\)), a judge that mirrors this uncertainty achieves lower error than one that favors a single interpretation, even if that confident choice matches the majority's forced-choice rating.
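A minimal sketch of this metric on multi-label vectors (the helper name and toy values are ours):

```python
import numpy as np

def multilabel_mse(omega_human: np.ndarray, omega_judge: np.ndarray) -> float:
    """Mean squared distance between human and judge multi-label vectors.

    Both arrays have shape (n_items, n_options); each row is Omega_i.
    """
    return float(np.mean(np.sum((omega_human - omega_judge) ** 2, axis=1)))

# Toy example: two items, options ordered [No, Yes].
omega_h = np.array([[0.8, 0.5],   # humans split on toxicity
                    [1.0, 0.0]])
omega_j_mirrors = np.array([[0.7, 0.6], [1.0, 0.0]])
omega_j_confident = np.array([[1.0, 0.0], [1.0, 0.0]])

print(multilabel_mse(omega_h, omega_j_mirrors))    # lower error
print(multilabel_mse(omega_h, omega_j_confident))  # higher error
```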
Empirical Validation
To validate our framework, we conducted experiments with nine commercial LLMs as judge systems and eleven rating tasks. These rating tasks included concepts such as factuality, helpfulness, relevance, and toxicity. While we can directly elicit forced-choice and response set ratings from judge systems using different prompts, existing evaluation datasets only contain forced-choice human ratings. Due to the issues described above, it is not possible to recover the "true" response set distribution from these existing forced-choice ratings.
Therefore, we introduce a sensitivity parameter \(\beta^H\) that controls the probability that a human rater includes the positive option (e.g., "toxic") in their response set despite selecting the negative option (e.g., "not toxic") as a forced-choice rating. For example, \(\beta^H\) = 0.3 means that 30% of raters who chose "not toxic" actually considered "toxic" to also be reasonable. Setting \(\beta^H\) = 0 recovers the case with no rating indeterminacy. By systematically varying \(\beta^H\), we can characterize how meta-evaluation results change under different levels of indeterminacy.
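As a rough sketch of how such a perturbation can be applied to existing forced-choice data (under the simplifying assumption that raters who selected the positive option do not add the negative one; see the paper for the exact construction):

```python
import numpy as np

def forced_choice_to_multilabel(p_positive: np.ndarray, beta_h: float) -> np.ndarray:
    """Map forced-choice human labels into multi-label vectors under an
    assumed sensitivity parameter beta_h.

    p_positive[i] is the fraction of raters who chose the positive option
    (e.g., "toxic") for item i as a forced-choice rating. With probability
    beta_h, a rater who chose the negative option is assumed to also find
    the positive option reasonable; raters who chose the positive option
    are assumed not to add the negative one (a simplification).
    """
    p_negative = 1.0 - p_positive
    include_positive = p_positive + beta_h * p_negative
    include_negative = p_negative
    return np.stack([include_positive, include_negative], axis=1)

p_pos = np.array([0.4, 0.1, 0.9])  # forced-choice "toxic" rates per item
print(forced_choice_to_multilabel(p_pos, beta_h=0.3))
print(forced_choice_to_multilabel(p_pos, beta_h=0.0))  # no indeterminacy
```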
In our analysis, we compare how judge systems selected by different meta-evaluation approaches perform on downstream evaluation tasks. These meta-evaluation approaches vary in how they collect and aggregate ratings, and how they measure human–judge agreement (see paper for details). As we discuss next, the downstream evaluation tasks considered in our analysis represent common use cases of judge systems in realistic deployment scenarios.
Content Filtering: In content filtering, a judge system decides which outputs from a target system to allow or suppress. For instance, a platform must determine whether to filter potentially toxic content, balancing user safety against the potential for quality-of-service harms.
We measure performance via decision consistency, which captures how often a judge makes the same allow/suppress decisions as humans:
$$ C^{\tau}(Y^J, Y^H) = \mathbb{E}\left[ \mathbb{1}\left[ s_{k}^{\tau}(Y^J_{ML}) = s_{k}^{\tau}(Y^H_{ML}) \right] \right]. $$
Here, \(s_k^{\tau}(Y) = \mathbb{1}[ Y_k \geq \tau ]\) is a thresholding function that classifies content as toxic if the multi-label probability for option \(k\) exceeds a threshold \(\tau\). For example, if \(k\) = "toxic" and \(\tau = 0.3\), content gets filtered when there is at least a 30% chance a rater identifies a toxic interpretation. The threshold \(\tau\) represents the evaluation designer's risk tolerance; lower values filter more aggressively.
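A minimal sketch of decision consistency computed on multi-label vectors (the function name and toy values are ours):

```python
import numpy as np

def decision_consistency(omega_judge: np.ndarray,
                         omega_human: np.ndarray,
                         tau: float,
                         k: int = 0) -> float:
    """Fraction of items where judge and humans make the same
    allow/suppress decision for option k at threshold tau."""
    filter_judge = omega_judge[:, k] >= tau
    filter_human = omega_human[:, k] >= tau
    return float(np.mean(filter_judge == filter_human))

# Toy example: column 0 is the "toxic" option.
omega_h = np.array([[0.8, 0.5], [0.1, 1.0], [0.4, 0.7]])
omega_j = np.array([[0.6, 0.6], [0.2, 0.9], [0.1, 1.0]])
print(decision_consistency(omega_j, omega_h, tau=0.3))  # 2/3 of decisions match
```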
Prevalence Estimation: In prevalence estimation, a judge system is used to estimate how frequently a certain concept, like helpfulness or toxicity, is present in target system outputs. This estimation task is commonly used in automated red-teaming when estimating the attack success rate, or when estimating the win rate between two models for a leaderboard.
We measure performance via estimation bias, which captures how much an estimate obtained from a judge system differs from one obtained from human ratings:
$$ B^{\tau}(Y^J_{ML}, Y^H_{ML}) = \mathbb{E}[s_k^{\tau}(Y^J_{ML})] - \mathbb{E}[s_k^{\tau}(Y^H_{ML})] $$
For example, if humans identify 40% of outputs as toxic but a judge estimates only 25%, this -15% bias means the judge underestimates the prevalence of toxicity. Both metrics operate on multi-label vectors that preserve information about rating indeterminacy. This allows downstream users to set their own thresholds based on their risk tolerance and use case, rather than being constrained by how individual raters resolved indeterminacy when forced to choose.
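And a corresponding sketch of estimation bias, reusing the toy multi-label vectors from the content-filtering example:

```python
import numpy as np

def estimation_bias(omega_judge: np.ndarray,
                    omega_human: np.ndarray,
                    tau: float,
                    k: int = 0) -> float:
    """Difference between judge-based and human-based prevalence
    estimates for option k at threshold tau."""
    prev_judge = float(np.mean(omega_judge[:, k] >= tau))
    prev_human = float(np.mean(omega_human[:, k] >= tau))
    return prev_judge - prev_human

omega_h = np.array([[0.8, 0.5], [0.1, 1.0], [0.4, 0.7]])
omega_j = np.array([[0.6, 0.6], [0.2, 0.9], [0.1, 1.0]])
print(estimation_bias(omega_j, omega_h, tau=0.3))  # negative: judge underestimates
```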

Finding 1: Judge systems differ from one another, and hence also from human raters, in how they resolve rating indeterminacy. While we do not know the true human sensitivity parameter, we can estimate each judge's sensitivity parameter \(\hat{\beta}^J_t\) using its responses to both forced-choice and response set prompts. We see tremendous variation across systems and tasks. E.g., for SummEval (Relevance), estimated parameters span a range of 0.01 to 0.54 across systems.
Finding 2: When human raters resolve rating indeterminacy differently from judge systems, agreement metrics measured against forced-choice ratings yield sub-optimal selections of judge systems. When humans and judge systems resolve indeterminacy differently (\(\beta^H \neq \beta^J\)), forced-choice human–judge agreement metrics like Hit-Rate, Cohen's \(\kappa\), and Jensen-Shannon Divergence select judge systems that perform poorly on downstream tasks. Distributional agreement metrics like Jensen-Shannon Divergence tend to perform better than categorical agreement metrics like Hit-Rate, but performance degrades when \(\beta^H\) exceeds 0.2-0.3.

While Figure 5 summarizes aggregate regret, Figure 6 below shows how these ranking inversions play out on specific tasks. Each column compares the ranking produced by a human–judge agreement metric (left axis of each subplot) with the ranking produced by the downstream metric (right axis).
- On SNLI (left column), no inversion occurs: the judge system that scores highest under Cohen's κ also achieves the lowest downstream bias. This shows that existing metrics can work well on some tasks.
- On SummEval (Relevance) (center-left), however, the story is different: the judge system with the best KL-Divergence score is not the system with the lowest downstream estimation bias. Selecting the wrong judge in this case increases estimation bias by 28%, equivalent to grossly mis-estimating the rate of "relevant" target system outputs by an additional 0.28 (on a scale of [0,1]).
- Finally, the TopicalChat (Understandable) columns (right) illustrate two extremes. The multi-label MSE metric remains stable and consistent with the downstream metric, even under human rating indeterminacy (\(\beta^H_t = 0.3\)). In contrast, Hit-Rate, a widely used categorical agreement metric, yields a highly inconsistent ranking.

Finding 3: Multi-label metrics correctly identify high-performing judge systems. Figures 5 and 6 illustrate that our proposed approach, which involves eliciting response set ratings and measuring human–judge agreement via a continuous multi-label agreement metric (MSE), selects far more performant judge systems than forced-choice agreement metrics. Even when starting with an existing corpus of forced-choice data, we can estimate the translation matrix \(\hat{\mathbf{F}}_i\) using just 100 paired forced-choice and response set ratings and still select performant judge systems (see paper for details).
Practical Takeaways
Based on our findings, we offer four concrete recommendations for improving meta-evaluation:
1. Fully specify binary rating tasks by adding a Maybe or Tie option. This simple change eliminates the identifiability problem described above by creating a one-to-one correspondence between forced-choice options {Yes, No, Maybe} and response sets {{Yes}, {No}, {Yes, No}}. Note: this approach only works for binary tasks; rating tasks with three or more options cannot be fully specified this way.
2. Use response set elicitation when collecting new datasets. When it is not possible to fully eliminate indeterminacy (which is common for properties like helpfulness or relevance), collect response set ratings where raters select ALL options that are reasonable. Then, measure agreement using a continuous multi-label metric like MSE. This preserves critical information about rating indeterminacy that forced-choice elicitation eliminates.
3. Collect small auxiliary datasets to augment forced-choice ratings. Already have forced-choice data? Collect just ~100 paired forced-choice and response set ratings to estimate the translation matrix \(\hat{\mathbf{F}}\). Our experiments show this small investment enables much better judge selection (Finding 3 above); a rough sketch of one way to estimate \(\hat{\mathbf{F}}\) follows this list. Check out our GitHub tutorial for implementation details.
4. If you must use forced-choice, choose distributional metrics carefully. Our results consistently show KL-Divergence in the human→judge direction (not judge→human) performs best among forced-choice human–judge agreement metrics. Avoid categorical metrics like Hit-Rate, which are unreliable under rating indeterminacy.
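For recommendation 3, one simple way to turn a small set of paired ratings into an estimated translation matrix is to compute conditional frequencies: for each observed response set, count how often each option was chosen in the forced-choice format. The sketch below is illustrative only; see the paper and GitHub tutorial for the estimator we actually use.

```python
import numpy as np

# Each pair is (forced_choice, response_set) from the same rater and item.
options = ["Yes", "No"]
response_sets = [frozenset({"Yes", "No"}), frozenset({"Yes"}), frozenset({"No"})]

pairs = [
    ("Yes", frozenset({"Yes", "No"})),
    ("No",  frozenset({"Yes", "No"})),
    ("Yes", frozenset({"Yes"})),
    ("No",  frozenset({"No"})),
    # ... roughly 100 paired ratings in practice
]

F_hat = np.zeros((len(options), len(response_sets)))
counts = np.zeros(len(response_sets))
for choice, rset in pairs:
    j = response_sets.index(rset)
    F_hat[options.index(choice), j] += 1
    counts[j] += 1

# Normalize each column into P(forced choice = option | response set).
F_hat = F_hat / np.maximum(counts, 1)
print(F_hat)
```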
Want to learn more or try this approach out for yourself? Find our implementation and quickstart tutorial on GitHub!
Acknowledgements: This blog post is based on our NeurIPS 2025 paper Validating LLM-as-a-Judge Systems under Rating Indeterminacy, co-authored with Solon Barocas, Hannah Wallach, Kenneth Holstein, Steven Wu, and Alexandra Chouldechova. Many thanks to my co-authors and to members of the Sociotechnical Alignment Center (STAC) at Microsoft Research for invaluable feedback on early drafts of this work. Additionally, many thanks to Wayne Chi and Kiriaki Fragkia for helpful feedback on earlier versions of this blog post.
