They don’t agree on anything except the word “judge.”
I read the source code of seven OSS LLM-as-judge tools as part of my PhD research. I’m also building a testing framework for LLM agents — both shape how I read the tools below.
Variance is the problem. Run an LLM judge twice on the same input and the score moves. Swap the judge model, the score moves more. Reword the rubric, more again. None of these tools eliminate that variance. What they do — each one differently — is constrain it. The choice of constraint is the part that matters, and the next posts walk through it.
Why a judge at all
The paradigm exists because most things you want to evaluate about an agent don’t have a reference answer. Plan coherence. Tool selection. Whether the response was helpful. There is no held-out set you can grade against. So you reach for a judge, and a strong LLM is the cheapest one available.
Zheng et al. (2023) made the case for letting that LLM be the judge. They named the paradigm “LLM-as-a-judge”, built MT-Bench around it, and showed that GPT-4 reached 85% agreement with human experts on the benchmark — slightly above the 81% humans reached with each other in the same setup. That number sold the paradigm. If a strong judge tracks human preference about as well as humans track each other, you can put a judge in a CI gate without paying a human to be in the loop.
Liu et al. (2023) sharpened the technique. Their G-Eval framework added two pieces: chain-of-thought evaluation steps generated by the LLM itself before scoring, and a scoring function that weights each candidate score by the token-level probability the model assigns to it. The result was a continuous score that captured model uncertainty instead of collapsing it into a single integer. With GPT-4 as the backbone, G-Eval reached an average Spearman correlation of 0.514 with human ratings on the SummEval summarisation benchmark, ahead of every prior evaluator the paper compared against, with UniEval (0.474) the closest competitor.
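The probability-weighted scoring step fits in a few lines. This is a minimal sketch, not G-Eval’s code: the function names and the example probabilities are mine, and it assumes you already have a probability for each score option (for instance from the judge’s token log-probabilities).

```python
def expected_score(option_probs: dict[int, float]) -> float:
    """Probability-weighted score: sum of score * p(score).

    `option_probs` maps each rubric option (e.g. 1..5) to the probability
    the judge assigned to it. Values are renormalised over the options so
    that probability mass on non-score tokens does not shrink the result.
    """
    total = sum(option_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(score * p / total for score, p in option_probs.items())


def argmax_score(option_probs: dict[int, float]) -> int:
    """The usual integer verdict: the single most likely option."""
    return max(option_probs, key=option_probs.get)


# Hypothetical distribution over a 1-5 scale (illustrative numbers only).
probs = {1: 0.02, 2: 0.08, 3: 0.50, 4: 0.30, 5: 0.10}
print(argmax_score(probs))               # 3
print(round(expected_score(probs), 2))   # 3.38
```

The argmax throws away the 40% of mass sitting on 4 and 5; the weighted score keeps it.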
Two years later, the paradigm is everywhere. Every OSS evaluation framework I read implements some version of it. The case is settled.
How variance shows up
Take any LLM judge in the wild. Four kinds of variance show up in its scores.
Run-to-run, same model. Set temperature to zero and the score still moves. Liu et al. (2023) noticed a related artefact: when you ask an LLM for an integer score on a 1-5 scale, the score distribution clusters on a single integer. Their fix — weighting the score by token probabilities rather than taking the argmax — is a workaround, not an elimination. The judge isn’t committing to a single answer; it’s collapsing a distribution to one each time.
Judge-to-judge. Swap the model behind the judge prompt and the verdicts move more. Zheng et al. (2023) measured this on positional consistency: GPT-4 holds the same verdict across position swaps 65% of the time. Claude-v1 holds it 23.8% of the time. Same prompt, same answers, different model — different judgment.
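A swap test of this kind is simple to write down. The sketch below is my own illustration of the idea, not the MT-Bench harness: the `Judge` signature, the verdict labels, and the two toy judges are assumptions made for the example.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: (question, first_answer, second_answer)
# -> "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge,
                         cases: Sequence[tuple[str, str, str]]) -> float:
    """Fraction of cases where the verdict survives swapping answer order."""
    if not cases:
        raise ValueError("need at least one case")
    flip = {"first": "second", "second": "first", "tie": "tie"}
    consistent = 0
    for question, a, b in cases:
        original = judge(question, a, b)
        swapped = judge(question, b, a)
        # After the swap the same underlying answer carries the other label,
        # so map the swapped verdict back before comparing.
        if flip[swapped] == original:
            consistent += 1
    return consistent / len(cases)


# Toy judges showing the two extremes.
def always_first(question: str, first: str, second: str) -> str:
    return "first"  # pure position bias


def prefers_longer(question: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


cases = [("q1", "short", "a longer answer"), ("q2", "detailed reply", "ok")]
print(position_consistency(always_first, cases))    # 0.0
print(position_consistency(prefers_longer, cases))  # 1.0
```

Note that `prefers_longer` scores a perfect 1.0 here while being biased in a different way: consistency under swaps measures one axis only.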
Prompt-to-prompt. Same rubric, different wording, different score. Zheng et al. (2023) tested two prompt variants for the same task; renaming the assistants alone dropped Claude-v1’s “biased toward first” rate from 75% to 11.2%. The judge is sensitive to the surface of its instructions, not only their content.
Systematic biases that don’t average out. Zheng et al. (2023) named three. Position: judges favour the answer presented first; GPT-4 changes its verdict toward the first answer in 30% of swap tests. Verbosity: a “repetitive list” attack (pad an answer with reformulated bullets, no new content) fools Claude-v1 and GPT-3.5 into choosing the longer answer 91.3% of the time. Self-enhancement: GPT-4 favours its own answers with a 10% higher win rate; Claude-v1 with a 25% higher win rate. Gu et al. (2024) document more in their later survey (length, concreteness, cultural), but the three Zheng et al. named are the ones every CI gate runs into first.
A scope note. The numbers above come from MT-Bench in 2023, on the GPT-4 and Claude-v1 model classes of the time. Newer model classes have changed the magnitudes, but the structural pattern hasn’t moved. The judge is still sensitive to surface details. Drift between model upgrades is its own axis of variance and stays outside the scope of this post.
Four axes, none of which go away on their own. None of the seven OSS tools I read eliminates them. What they do — each one differently — is constrain them. The next section: why that distinction matters once a judge is in a production pipeline.
Why this matters in production
LLM judges in academic settings get evaluated against benchmarks. LLM judges in production get put behind quality gates. The two situations look similar; they aren’t.
Picture the gate. Every pull request runs the test suite. The suite includes a set of LLM-judge evaluations — plan coherence on a fixed input, tool-call appropriateness on another, response helpfulness on a third. Each scores between 0 and 1. The PR merges if all scores cross a threshold.
Now put the four axes from the previous section inside that pipeline. The same judge with the same prompt and the same input produces different scores on consecutive runs. Swap the underlying model as a routine upgrade, and many gates change verdict. Tweak the rubric in a docstring, and the threshold no longer means what it meant last week.
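To make the gate concrete, here is a minimal sketch. The threshold of 0.70, the evaluation names, and both sets of scores are invented for illustration; the only thing taken from the description above is the rule that every score must cross the threshold.

```python
def gate_passes(scores: dict[str, float], threshold: float) -> bool:
    """Merge only if every judge score reaches the threshold."""
    if not scores:
        raise ValueError("gate needs at least one score")
    return all(score >= threshold for score in scores.values())


THRESHOLD = 0.70  # hypothetical value

# Two consecutive runs on the same commit (made-up scores).
run_1 = {"plan_coherence": 0.82, "tool_call": 0.71, "helpfulness": 0.90}
run_2 = {"plan_coherence": 0.80, "tool_call": 0.68, "helpfulness": 0.91}

print(gate_passes(run_1, THRESHOLD))  # True
print(gate_passes(run_2, THRESHOLD))  # False
```

Nothing about the code under test changed between the two runs. A 0.03 wobble on one score near the threshold is enough to flip the merge decision.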
Teams react in two ways. Both make the gate worse.
The first is to re-run on failure. If the test failed once but passed on the retry, ship it. This is failure masking dressed as flakiness handling: the passing retry doesn’t show the change is good, only that the judge can be made to say so. Once it’s the team norm, the gate is informational at best.
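The effect of a retry policy is easy to put numbers on. The sketch below assumes each attempt is an independent draw with the same pass probability, which real judge runs may not satisfy, and the 0.3 figure is invented for illustration.

```python
def pass_probability_with_retries(p_single: float, max_attempts: int) -> float:
    """Chance that at least one of `max_attempts` independent runs passes."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return 1.0 - (1.0 - p_single) ** max_attempts


# A change that should fail, but that the noisy judge passes 30% of the time.
for attempts in (1, 2, 3, 5):
    print(attempts, round(pass_probability_with_retries(0.3, attempts), 3))
# 1 0.3
# 2 0.51
# 3 0.657
# 5 0.832
```

Under those assumptions, two retries are enough to let a bad change through more often than not.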
The second is to ignore the gate when convenient. Senior engineers learn which scores are noisy and which aren’t. The gate becomes social, not enforced. Governance erodes one PR at a time.
Both reactions are downstream of the same fact: a quality gate needs bounded variance, not just low variance. None of the four axes above produces a bound by default. That’s the gap this series examines — different ways the seven tools try to put a bound on what a judge can say.