Trust, but Trace

Variance is the problem with LLM-as-judge — and tools don't agree on how to fix it

Reading the source code of seven OSS LLM-as-judge tools: variance manifests across four axes, and the seven approaches don't share a framework — they share a problem.

They don’t agree on anything except the word “judge.”

I read the source code of seven OSS LLM-as-judge tools as part of my PhD research. I’m also building a testing framework for LLM agents — both shape how I read the tools below.

Variance is the problem. Run an LLM judge twice on the same input and the score moves. Swap the judge model, the score moves more. Reword the rubric, more again. None of these tools eliminates that variance. What each one does, in its own way, is constrain it. The choice of constraint is the part that matters, and the next posts walk through it.

Why a judge at all

The paradigm exists because most things you want to evaluate about an agent don’t have a reference answer. Plan coherence. Tool selection. Whether the response was helpful. There is no held-out set you can grade against. So you reach for a judge, and a strong LLM is the cheapest one available.

Zheng et al. (2023) made the case for letting that LLM be the judge. They named the paradigm “LLM-as-a-judge”, built MT-Bench around it, and showed that GPT-4 reached 85% agreement with human experts on the benchmark — slightly above the 81% humans reached with each other in the same setup. That number sold the paradigm. If a strong judge tracks human preference about as well as humans track each other, you can put a judge in a CI gate without paying a human to be in the loop.

Liu et al. (2023) sharpened the technique. Their G-Eval framework added two pieces: chain-of-thought reasoning steps generated by the LLM itself before scoring, and a scoring function that weights the final score by the token-level probabilities the model assigns to each option. The result was a continuous score that captured model uncertainty instead of collapsing it into a single integer. G-Eval’s Spearman correlation with human ratings on summarisation reached 0.514, outperforming every prior evaluator on summarisation, with UniEval (0.474) the closest competitor.
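The probability-weighted score is easy to see in a few lines. This is a minimal sketch, not G-Eval's code: the function names and the example probabilities are made up for illustration, and renormalising over the listed options is an assumption added here so the sketch also works when the probabilities do not sum to exactly one.

```python
def weighted_score(option_probs: dict[int, float]) -> float:
    """Expected score: sum of score * probability over the rubric options.

    Renormalises over the options given (an assumption of this sketch),
    so leftover probability mass on non-score tokens is ignored.
    """
    total = sum(option_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(score * p for score, p in option_probs.items()) / total


def argmax_score(option_probs: dict[int, float]) -> int:
    """What a plain integer judge effectively reports: the most likely option."""
    return max(option_probs, key=option_probs.get)


# Hypothetical token probabilities for a 1-5 rubric, for illustration only.
probs = {1: 0.02, 2: 0.08, 3: 0.45, 4: 0.40, 5: 0.05}

print(argmax_score(probs))              # the single integer the judge would emit
print(round(weighted_score(probs), 2))  # the continuous, uncertainty-aware score
```

In this example the integer judge reports 3 even though 4 is almost as likely; the weighted score lands between the two and keeps that information.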

Two years later, the paradigm is everywhere. Every OSS evaluation framework I read implements some version of it. The case is settled.

How variance shows up

Take any LLM judge in the wild. Four kinds of variance show up in its scores.

Run-to-run, same model. Set temperature to zero and the score still moves. Liu et al. (2023) noticed a related artefact: when you ask an LLM for an integer score on a 1-5 scale, the score distribution clusters on a single integer. Their fix, weighting the score by token probabilities rather than taking the argmax, is a workaround, not an elimination. The judge isn’t committing to a single answer; it holds a distribution over scores and collapses it to one value on every call.

Judge-to-judge. Swap the model behind the judge prompt and the verdicts move more. Zheng et al. (2023) measured this on positional consistency: GPT-4 holds the same verdict across position swaps 65% of the time. Claude-v1 holds it 23.8% of the time. Same prompt, same answers, different model — different judgment.

Prompt-to-prompt. Same rubric, different wording, different score. Zheng et al. test two prompt variants for the same task; renaming the assistants alone drops Claude-v1’s “biased toward first” rate from 75% to 11.2%. The judge is sensitive to the surface of its instructions, not only their content.

Systematic biases that don’t average out. Zheng et al. named three. Position: judges favour the answer presented first; GPT-4 changes its verdict toward the first answer in 30% of swap tests. Verbosity: a “repetitive list” attack (pad an answer with reformulated bullets, no new content) gets the longer answer chosen by Claude-v1 and GPT-3.5 91.3% of the time. Self-enhancement: GPT-4 favours its own answers with a 10% higher win rate; Claude-v1 with a 25% higher win rate. Gu et al. (2024) document more in their later survey, including length, concreteness, and cultural biases, but the three Zheng et al. named are the ones every CI gate runs into first.
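A position-swap test of the kind behind those consistency numbers can be sketched as follows. This is an illustration under stated assumptions, not Zheng et al.'s harness: the `Judge` signature, the verdict labels, the `inconsistent` bucket for tie-versus-win mismatches, and both stub judges are hypothetical. A real judge would wrap a model call behind the same callable.

```python
from collections import Counter
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def swap_test(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with the answer order swapped and classify the pair."""
    v1 = judge(question, answer_a, answer_b)  # A is shown first
    v2 = judge(question, answer_b, answer_a)  # B is shown first
    winner1 = {"first": "A", "second": "B", "tie": "tie"}[v1]
    winner2 = {"first": "B", "second": "A", "tie": "tie"}[v2]
    if winner1 == winner2:
        return "consistent"
    if v1 == v2 == "first":
        return "biased_first"
    if v1 == v2 == "second":
        return "biased_second"
    return "inconsistent"  # e.g. a tie in one order and a win in the other


def swap_rates(judge: Judge, cases: list[tuple[str, str, str]]) -> dict[str, float]:
    """Fraction of cases falling into each outcome."""
    counts = Counter(swap_test(judge, q, a, b) for q, a, b in cases)
    return {outcome: n / len(cases) for outcome, n in counts.items()}


# Stub judges standing in for real model calls.
def always_first(question: str, first: str, second: str) -> str:
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


cases = [("q1", "short", "a much longer answer"), ("q2", "one two three", "one")]
print(swap_rates(always_first, cases))
print(swap_rates(prefers_longer, cases))
```

The second stub is consistent across swaps and still biased, toward length. Consistency under swapping rules out position bias only; it says nothing about the other two.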

A scope note. The numbers above come from MT-Bench in 2023, on the GPT-4 and Claude-v1 model classes of the time. Newer model classes have changed the magnitudes, but the structural pattern hasn’t moved. The judge is still sensitive to surface details. Drift between model upgrades is its own axis of variance and stays outside the scope of this post.

Two concrete faithfulness checks, scored by a real Claude Haiku judge. The only thing that changes between strict and lenient is the rubric wording, and both verdicts flip with it.

Strict rubric: Faithful only if the context explicitly states the claim. If checking it needs arithmetic, outside dates, or world knowledge, it is unfaithful, even if the claim is true.
Lenient rubric: Faithful if the context states the claim or it can be reasonably inferred from the context using common knowledge.

Check 1
Context: The Eiffel Tower opened to the public on 31 March 1889.
Claim: The Eiffel Tower has been standing for well over a century.
Strict verdict: UNFAITHFUL. Requires arithmetic from the context date to verify; faithfulness demands explicit statement, not calculation.
Lenient verdict: FAITHFUL. Opening in 1889 means it has stood for 137 years as of 2026, which is well over a century.

Check 2
Context: The release notes for version 2.1 list a set of bug fixes and several performance improvements.
Claim: Version 2.1 adds no new features.
Strict verdict: UNFAITHFUL. Context lists bug fixes and performance improvements but does not explicitly state whether new features were added.
Lenient verdict: FAITHFUL. Release notes listing only bug fixes and performance improvements reasonably implies no new features were added.

Under the strict rubric both claims are scored unfaithful; under the lenient rubric both are scored faithful. Verdicts and reasoning are verbatim Claude Haiku output, recorded once rather than called live. Each held across three runs; the variance here is the rubric, not run-to-run noise.

Four axes, none of which go away on their own. None of the seven OSS tools I read eliminates them. What they do — each one differently — is constrain them. The next section: why that distinction matters once a judge is in a production pipeline.

Why this matters in production

LLM judges in academic settings get evaluated against benchmarks. LLM judges in production get put behind quality gates. The two situations look similar; they aren’t.

Picture the gate. Every pull request runs the test suite. The suite includes a set of LLM-judge evaluations — plan coherence on a fixed input, tool-call appropriateness on another, response helpfulness on a third. Each scores between 0 and 1. The PR merges if all scores cross a threshold.

Now run the previous section inside that pipeline. The same judge, same prompt, same input produces different scores on consecutive runs. Swap the underlying model as a routine upgrade — many gates change verdict. Tweak the rubric in a docstring — the threshold no longer means what it meant last week.

[Interactive simulation: 120 pull requests through a quality gate whose LLM judge is noisy, shown as a grid with one square per PR. Green: the gate handled the PR correctly. Red: a bad PR was wrongly merged. Amber: a good PR was wrongly blocked. Raising the judge noise makes the gate start merging bugs and blocking good work; a re-run toggle shows what retrying on failure does to the bug count.]

Teams react in two ways. Both make the gate worse.

The first is to re-run on failure. If the test failed once but passed on the retry, ship it. This is false-positive masking dressed as flakiness handling. Once it’s the team norm, the gate is informational at best.
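The effect of retrying is easy to reproduce in a toy model. The sketch below is an assumption-laden illustration, not any tool's behaviour: each PR gets a hidden true quality, the judge reports that quality plus Gaussian noise, and the gate merges if any attempt crosses the threshold. The quality range, noise level, threshold, and all names are invented for the example.

```python
import random


def judge_score(true_quality: float, noise_sd: float, rng: random.Random) -> float:
    """One noisy judge call: true quality plus Gaussian noise, clipped to [0, 1]."""
    return min(1.0, max(0.0, true_quality + rng.gauss(0.0, noise_sd)))


def simulate_gate(n_prs: int = 120, threshold: float = 0.7, noise_sd: float = 0.1,
                  max_attempts: int = 1, seed: int = 0) -> tuple[int, int]:
    """Return (bugs_merged, good_blocked) for a gate that retries on failure."""
    bugs_merged = 0
    good_blocked = 0
    for i in range(n_prs):
        # A per-PR generator keeps the PR's quality and first judge call
        # identical whatever max_attempts is, so runs are directly comparable.
        rng = random.Random(seed * 1_000_003 + i)
        true_quality = rng.uniform(0.4, 1.0)  # assumed spread of PR quality
        is_good = true_quality >= threshold
        # Re-run on failure: the PR merges as soon as any attempt passes.
        merged = any(
            judge_score(true_quality, noise_sd, rng) >= threshold
            for _ in range(max_attempts)
        )
        if merged and not is_good:
            bugs_merged += 1
        elif is_good and not merged:
            good_blocked += 1
    return bugs_merged, good_blocked


for attempts in (1, 3):
    bugs, blocked = simulate_gate(noise_sd=0.15, max_attempts=attempts)
    print(f"attempts={attempts}: bugs merged={bugs}, good PRs blocked={blocked}")
```

Because a retry can only turn a fail into a pass, it can never reduce the number of bad PRs that get through. It trades blocked good work for merged bugs, which is the masking described above.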

The second is to ignore the gate when convenient. Senior engineers learn which scores are noisy and which aren’t. The gate becomes social, not enforced. Governance erodes one PR at a time.

Both reactions are downstream of the same fact: a quality gate needs bounded variance, not just low variance. None of the four axes above produces a bound by default. That’s the gap this series examines — different ways the seven tools try to put a bound on what a judge can say.

Tags: LLM-as-judge, Evaluation methodology, Production ML