Trust, but Trace

Evaluation methodology

Who calibrates the judge? The gap nothing fills

Six posts, five strategies, five tools. Two ship the primitive that measures judge-vs-human agreement; none gates a CI build on it with a published confidence band. That's the gap.

Six posts. Five strategies walked through. Five tools out of the seven I read for the underlying research. One question the series kept gesturing at and never quite answered: how reliable is the judge — really, on the task it’s running, with the rubric it’s been given?

I’m building AgentAnvil, a contract-driven testing framework for LLM-based agents, and reading other evaluators is how I work out what those contracts should constrain. The strategies in this series narrow what the judge can say (Parts 2-3), measure how sure it sounds (Part 4), pin down what it should ignore (Part 5), or skip it where deterministic checks suffice (Part 6). Each is honest, each works, each leaves something behind. I assumed none of them shipped the primitive that measures how often the judge agrees with a human on your task. A second pass through the source forced a correction. Two of the five ship it. What none of them ships is the workflow that gates a CI build on that agreement, with a published confidence band the build can read.

This post is the series closer. It maps the gap — narrower than I thought, still real.

The intuition

A judge is a measurement instrument. Run a thermometer on the same liquid twice and you read two numbers; the spread between them is the instrument’s variance. The thermometer is useful because somewhere upstream a manufacturer characterised that spread against a reference standard — boiling water, freezing point, NIST-traceable cells. The thermometer carries an implicit margin of error because that calibration step was paid for.

An LLM judge has no such step out of the box. You write a rubric, you pick a model, you point it at agent traces, you get scores. The score is what the framework returns; the framework rarely tells the CI gate what the score’s margin of error is on your task. To know that, you’d run the same items past one or more human raters, compute agreement against the judge, and treat the residual disagreement as the instrument’s noise floor. Cohen’s κ (Cohen, 1960; see Landis & Koch, 1977 for the magnitude bins) is the standard primitive. That measurement is the calibration step.
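For reference, Cohen's κ corrects raw agreement for the agreement two raters would reach by chance. The definition below is the standard one; the worked numbers are an illustrative assumption, not data from this series.

```latex
% Cohen's kappa from an N-item contingency table n_{jk}
% (rows: human label, columns: judge label).
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad
p_o = \frac{1}{N}\sum_{k} n_{kk}, \qquad
p_e = \sum_{k} \frac{n_{k\cdot}}{N}\,\frac{n_{\cdot k}}{N}

% Illustrative example: N = 100 items, binary pass/fail.
% Both say pass on 40, both say fail on 40, and they disagree on 20 (10 each way),
% so each rater's marginals are 50/50.
p_o = \frac{40 + 40}{100} = 0.80, \qquad
p_e = 0.5 \cdot 0.5 + 0.5 \cdot 0.5 = 0.50, \qquad
\kappa = \frac{0.80 - 0.50}{1 - 0.50} = 0.60
```

On the Landis & Koch bins, κ = 0.60 sits at the top edge of "moderate" (0.41 to 0.60), even though raw agreement is 80%.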

When I started this series I assumed the primitive itself was missing across the OSS tools. It isn’t, quite. Re-grepping the five trees with calibration vocabulary in mind, two of them ship it — Ragas through validate_alignment(), TruLens through GroundTruthAggregator(true_labels=...) — and one (ARES) ships a coarser version. What none of them ships is the regating workflow around those primitives: a published confidence band on the κ that a CI build consumes as a fail-the-build threshold, with a re-calibration trigger when the judge model or rubric changes. The primitives are there. The gate around them isn’t. That’s the gap this post is about.

The gap, mapped

Here’s the strategy-by-tool matrix at the end of the series — extended with two rows the walkthroughs didn’t reach.

What each tool's source code ships. Filled dots are first-class primitives — sourced from the walkthroughs in Parts 2-6, plus a second-pass grep across each repo for agreement-vs-gold APIs. The last row is the gap: no tool ships a CI-time gate keyed off a published confidence band.
Columns (tools): Ragas, DeepEval, ARES, TruLens, Inspect AI
Rows (strategy, source):
Scope constraint (Part 2)
Structural decomposition (Part 3)
Probabilistic calibration (Part 4)
Rubric anchoring (Part 5)
Deterministic hybrid (Part 6)
Agreement primitives: κ, ECE (second-pass grep)
Regating hook on κ + CI
Primitives shipped: 6 of 7
Tools shipping regating hook: 0 of 5

The first five rows record what each post unpacked. Ragas ships scope constraint cleanly. DeepEval ships decomposition and a token-probability nod at calibration. ARES ships the PPI bound. TruLens ships the negative rubric. Inspect AI ships the deterministic-vs-graded partition. Five squares filled, in five different places.

The sixth row was the surprise. Ragas’s validate_alignment() in metrics/base.py takes your judge and a held-out human-labeled dataset and returns Cohen’s κ — discrete metrics through cohen_kappa_score, ranking metrics through weighted κ, numeric metrics through Pearson r. The first-party how-to is literally titled “How to Align an LLM as a Judge.” TruLens goes further. GroundTruthAggregator(true_labels=...) in feedback/groundtruth.py ships cohens_kappa, ece (Expected Calibration Error), brier_score, matthews_correlation, plus rank-correlation primitives — exposed at the top of trulens.feedback and exercised by dedicated tests. ARES emits raw judge-vs-gold accuracy alongside its PPI bound. The primitives are not exotic and not absent — two of the five ship them, one of the five ships a coarser version.

The seventh row is the one still empty everywhere. None of the five wraps those primitives into a CI-time gate keyed off a published confidence band, along the lines of “fail the build when κ drops below 0.7 with a 95% lower bound under 0.6.” Ragas returns a κ point estimate; the bootstrap is yours. TruLens returns the same. The workflow (a published confidence interval on the κ, a threshold the build reads, a re-calibration trigger when the judge model or the rubric changes) doesn’t exist in any of the five trees.

ARES is the closest, and worth disambiguating. Its PPI (prediction-powered inference) machinery wraps a confidence interval around the agent’s aggregate score, which sounds like it ought to fill the regating row. But the interval is over the population statistic, not over agreement with the held-out humans, and a regating workflow needs the latter. PPI tells you the aggregate score sits inside a band (Angelopoulos et al., 2023); it doesn’t tell the build to fail when judge-human agreement drifts.

That’s the gap. Narrower than I framed it at the top of this post, but real.

What the gap costs

Imagine the gate. A pull request modifies an agent’s plan generator. CI runs the eval suite — twenty rubrics, a thousand items each. Two rubrics drop by 3 points. Is that real?

Without calibration, the question has no answer. The 3-point drop could be a regression, a judge-noise blip, a rubric drift after a prompt edit. You can re-run the gate (which the team will do, and which Part 1 already warned about), flag the change for human review (which doesn’t scale), or pick a threshold and call it. None of those produces a defensible “we are X% confident the drop is real” statement, because none has a measured noise floor for the judge under this rubric.

The κ statistic compares the judge to a hidden gold label, on a balanced binary task. Pick the judge's true accuracy and the number of human-labeled items. The band is what you can publish.
κ axis: −0.2 to 1.0
κ̂ (point estimate): 0.60
95% CI: 0.49 – 0.71
Band width: 0.22

The widget lets you set how reliable your judge is against a hidden gold standard, then watch what the confidence band on Cohen’s κ looks like at different sample sizes. It’s not a calibration tool — it’s the cost demo for not having one. The point: even when you know your judge is 80% accurate, the band you can publish is wide unless you’ve annotated hundreds of items. Below about 100 human-labeled items, the noise on your noise estimate swallows the regression signal you care about.
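One way to reproduce the shape of that band is a large-sample normal approximation. This is a sketch under stated assumptions, not the widget's actual code: it assumes a balanced binary task with symmetric judge errors (so chance agreement is 0.5 and κ = 2 × accuracy − 1) and treats chance agreement as fixed. The function name `kappa_band` is hypothetical.

```python
import math


def kappa_band(accuracy: float, n_items: int, z: float = 1.96):
    """Approximate 95% band on Cohen's kappa for a balanced binary task.

    Assumption (illustrative): labels are balanced and the judge's errors are
    symmetric, so chance agreement p_e is 0.5.
    """
    p_o = accuracy                     # observed judge-vs-gold agreement
    p_e = 0.5                          # chance agreement under the balance assumption
    kappa = (p_o - p_e) / (1 - p_e)
    # Large-sample standard error of kappa, treating p_e as a constant.
    se = math.sqrt(p_o * (1 - p_o) / n_items) / (1 - p_e)
    return kappa, kappa - z * se, kappa + z * se


# Same judge (80% accurate), different numbers of human-labeled items.
for n in (25, 50, 100, 200, 400):
    k, lo, hi = kappa_band(0.80, n)
    print(f"n={n:4d}  kappa={k:.2f}  band=({lo:.2f}, {hi:.2f})  width={hi - lo:.2f}")
```

Under these assumptions the band narrows with the square root of the sample size: going from 50 labeled items to 200 halves the width, it does not quarter it.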

Most teams are below 100 human-labeled items per rubric. Most teams are at zero. The series has been documenting tools that operate cleanly in that regime — strategies that need no labels at all. The gap is what those strategies can’t substitute for.

Why no one ships the gate

The primitives are the easy part. The regating workflow around them is hard in a specific way that doesn’t map to a general OSS evaluator.

A framework can ship a judge, rubric templates, decomposition graphs, retry logic, structured output parsers — all the things you read in Parts 2-6. It can also wrap sklearn.metrics.cohen_kappa_score once the labels exist, which is exactly what TruLens does at feedback/groundtruth.py:17. What a general framework can’t ship is the threshold. The CI gate that fails when κ < 0.7 needs to know which 0.7 is yours, on which rubric, with what tolerance for noisy jitters that aren’t real regressions. None of that is universal across tasks, and none of it can be defaulted without being wrong somewhere important.

What it costs to calibrate one judge on one rubric. Pick the sample size, how many raters annotate each item, and how long an item takes. The bill is the bill.
Raters per item: 1
Annotations: 100
Human-hours: 5.0
Cost at $50/hr: ~$250
Days at 6 hrs/day: 0.8

The widget shows what producing the labels actually costs. Pick the number of items, the raters, the per-item annotation time. The output is the bill, and it’s bigger than most people guess. That bill — plus the per-team threshold question above — is why the regating hook hasn’t been packaged. ARES’s design pressure — “annotate a small set, use it efficiently” — is the literature’s response to the labelling cost. ARES applies it to retrieval QA only.
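The arithmetic behind that bill is short enough to write down. The sketch below is an assumption about how such a calculator might work, not the widget's code; `calibration_cost` is a hypothetical name, and the 3 minutes per item is inferred from the readout above (100 annotations taking 5.0 human-hours).

```python
def calibration_cost(items: int, raters_per_item: int, minutes_per_item: float,
                     hourly_rate: float = 50.0, hours_per_day: float = 6.0) -> dict:
    """Labelling bill for calibrating one judge on one rubric."""
    annotations = items * raters_per_item          # every rater labels every item
    human_hours = annotations * minutes_per_item / 60
    return {
        "annotations": annotations,
        "human_hours": human_hours,
        "dollars": human_hours * hourly_rate,
        "days": human_hours / hours_per_day,       # working days for one annotator
    }


# 100 items, one rater, 3 minutes each (inferred from the readout above).
print(calibration_cost(items=100, raters_per_item=1, minutes_per_item=3))
```

The bill scales linearly with raters, so adding a second human rater to measure human-human agreement doubles every number in it.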

What’s left to integrate is bounded and specific: a bootstrap confidence-interval wrapper around the κ Ragas and TruLens already compute, a config flag that says “these are my thresholds; gate on them”, and a hook that fires re-calibration when the judge model or the rubric version changes. The pieces are off-the-shelf: sklearn.metrics.cohen_kappa_score for the point estimate, the irrCAC family (Gwet’s inter-rater agreement package) for confidence bands, statsmodels.stats.inter_rater for multi-rater extensions. The wrapper is the work no tool has done end to end.
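The re-calibration hook is the least obvious of those three pieces, so here is a minimal sketch of one way it could work. It assumes the judge model name and the rubric text are available as strings at gate time. `CalibrationRecord`, `fingerprint`, and `gate_status` are hypothetical names for illustration, not APIs from Ragas, TruLens, or AgentAnvil.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CalibrationRecord:
    """What a calibration run would store alongside the human labels."""
    fingerprint: str     # hash of the judge model + rubric the labels were scored under
    kappa_lower: float   # lower bound of the published band on kappa


def fingerprint(judge_model: str, rubric_text: str) -> str:
    """Any change to the judge model or the rubric yields a new fingerprint."""
    payload = f"{judge_model}\n{rubric_text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def gate_status(record: Optional[CalibrationRecord], judge_model: str,
                rubric_text: str, min_kappa_lower: float) -> str:
    """Return 'recalibrate', 'below-threshold', or 'ok'."""
    current = fingerprint(judge_model, rubric_text)
    if record is None or record.fingerprint != current:
        return "recalibrate"        # no calibration, or it belongs to an older judge/rubric
    if record.kappa_lower < min_kappa_lower:
        return "below-threshold"    # calibrated, but the band is too low to gate on
    return "ok"
```

Hashing the rubric text rather than a hand-maintained version number is a design choice: a prompt edit that nobody remembered to version still invalidates the calibration.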

What you can do today

Three moves you can make this quarter without changing your tool.

First: budget a calibration sample. Start with 50 items per rubric, labelled by one careful human — your judge supplies the second set of labels, and the pair gives you a Cohen’s κ point estimate. Scale to 100-200 once the rubric stabilises, because below ~100 items the band on the κ is wide enough to swallow most regressions you’d want to catch (the widget above demonstrates this). Whether your tool is Ragas (validate_alignment()), TruLens (GroundTruthAggregator), or something else, this is the one place the labels are unavoidable. Re-run the calibration once a quarter, and again whenever the judge model or the rubric changes.

Second: bootstrap the confidence interval yourself and write the threshold into the gate. Neither Ragas nor TruLens returns a confidence band on κ, so you compute it with a 1,000-resample bootstrap loop, or with irrCAC’s built-in confidence-interval helpers (kappa2.table in R, irrCAC.table.cohen() in Python). Then, when the lower bound slips under your threshold, fail the build with a calibration-required error rather than a quality-regression error. The two failure modes feel similar from the outside; they require different fixes (rubric edit vs. agent fix). Disambiguating them is what an unwrapped judge cannot do for you.
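A minimal sketch of that bootstrap-and-gate step follows. To stay dependency-free it computes κ in plain Python instead of calling sklearn.metrics.cohen_kappa_score, and it uses a simple percentile bootstrap. The names `bootstrap_kappa_band`, `check_gate`, and `CalibrationRequired` are hypothetical, and the 0.6 default is only the illustrative lower bound quoted earlier in this post, not a recommendation.

```python
import random
from collections import Counter


class CalibrationRequired(Exception):
    """Raised when the judge's agreement band is too low to gate on."""


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label lists; None if undefined."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return None  # both raters used one identical label; kappa is 0/0
    return (p_o - p_e) / (1 - p_e)


def bootstrap_kappa_band(human, judge, n_resamples=1000, alpha=0.05, seed=0):
    """Point estimate plus a percentile-bootstrap band on kappa."""
    rng = random.Random(seed)
    n = len(human)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        k = cohen_kappa([human[i] for i in idx], [judge[i] for i in idx])
        if k is not None:
            stats.append(k)
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return cohen_kappa(human, judge), lo, hi


def check_gate(human, judge, min_lower=0.6):
    """Fail with a calibration error (not a quality regression) if the band is too low."""
    kappa, lo, hi = bootstrap_kappa_band(human, judge)
    if lo < min_lower:
        raise CalibrationRequired(
            f"kappa={kappa:.2f}, 95% band=({lo:.2f}, {hi:.2f}), need lower bound >= {min_lower}"
        )
    return kappa, lo, hi
```

Raising a dedicated exception type is what lets the CI job report "re-calibrate the judge" separately from "the agent regressed".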

Third: be honest in the eval report about what the judge can and cannot tell you. “Faithfulness dropped 3 points” is one kind of claim. “The judge, which on a 200-item calibration set agrees with humans at κ=0.72 (95% CI: 0.62-0.82), scored faithfulness 3 points lower this run; given the band, the drop is at the edge of what we can call a real regression” is another. The second is what a quality gate would have to publish to actually gate quality.

None of these requires your OSS tool to ship more than it does today. They require accepting that the regating wrapper is yours, and treating it as a budgeted, repeatable engineering activity rather than a research project.

The series, end to end

The strategies in this series aren’t competing approaches. They are complementary moves against the same problem — five different cuts at narrowing what an LLM judge can mess up. A mature evaluation harness combines them: deterministic where it can (Part 6), scope-constrained where it can’t (Part 2), structurally decomposed where the judgment is complex (Part 3), rubric-anchored where it overlaps with siblings (Part 5), probabilistically calibrated where the output needs an interval (Part 4). The architecture is plural.

Visually the five stack like this. Toggle a strategy off and the bar exposes the share it was bounding; the right-most slice never moves — the regating residual no toggle reaches. (The slice widths are illustrative.)

One horizontal bar = total judge variance. Toggle each strategy to see its slice get bounded (ink) or left exposed (amber). The right-most slice is the regating gap — the variance two tools (Ragas, TruLens) ship the primitives to measure but no tool gates on. Slice widths are illustrative.
Slices, left to right: P2, P3, P4, P5, P6, REGATE
Bounded share: 65%
Residual + regating gap: 35%

I’m working on this gap in AgentAnvil. Whether AgentAnvil ships the closure or some other tool does is not the interesting part — the gap exists, it has a name, and the strategies in this series do not fill it.

Rule of thumb: a judge is a measurement instrument, and a measurement without a published, gated noise floor isn’t one a build can defend. Five tools in this series narrow the variance; none of them ships the gate.
