Trust, but Trace

1 series · 7 pieces

Series

Multi-part essays — read in order, or jump to any part. Each series has a thread holding it together; the parts compound.

Reading the source: how OSS tools judge LLM agents

Every LLM-as-judge tool fights the same thing: variance. Few say so out loud. I read the source of seven OSS evaluators and found five strategies, taken straight from the code, not the docs. Each contains the variance in a different way; each pays for it somewhere. The series ends where they all stop short: none gates a build on whether the judge still agrees with a human.

7 of 7 published
Part 1/7

Evaluation methodology

Variance is the problem with LLM-as-judge — and tools don't agree on how to fix it

Reading the source code of seven OSS LLM-as-judge tools: variance manifests across four axes, and the seven approaches don't share a framework — they share a problem.

6 min
Part 2/7

Evaluation methodology

Binary verdicts: how Ragas keeps judges honest

Ragas decomposes complex judgments into atomic yes/no questions. The trade-off: each binary verdict carries less information in exchange for lower variance.

8 min
Part 3/7

Evaluation methodology

From rubric to graph: how DeepEval splits a judgment

DeepEval encodes evaluation as a graph traversal. Each node is a simpler decision than the overall judgment.

10 min
Part 4/7

Evaluation methodology

When judges show their uncertainty: DeepEval and ARES

DeepEval weights scores by token probabilities. ARES applies Prediction-Powered Inference for formal confidence intervals. Two routes to the same goal.

10 min
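
To make the first of those two routes concrete, here is a minimal sketch of probability-weighted scoring: instead of taking the single score token the judge emitted, take the expectation over every candidate score token. This is not DeepEval's code. The function name `weighted_score` and the input shape (a mapping from candidate token to log-probability, as a judge API might return for the score position) are assumptions made for illustration.

```python
import math


def weighted_score(token_logprobs: dict[str, float]) -> float:
    """Expected score over the candidate score tokens.

    token_logprobs maps each candidate token at the score position to its
    log-probability (hypothetical input shape). Non-numeric tokens are ignored.
    """
    probs: dict[int, float] = {}
    for token, logprob in token_logprobs.items():
        text = token.strip()
        if text.isdigit():
            score = int(text)
            probs[score] = probs.get(score, 0.0) + math.exp(logprob)

    total = sum(probs.values())
    if total == 0:
        raise ValueError("no numeric score tokens found")

    # Renormalise over the numeric tokens only, then take the expectation.
    return sum(score * p for score, p in probs.items()) / total


candidates = {"4": math.log(0.6), "5": math.log(0.3), "3": math.log(0.1)}
print(round(weighted_score(candidates), 3))
```

A judge that answers "4" with 60% confidence and "5" with 30% lands near 4.2 rather than a flat 4, so small shifts in the judge's confidence move the score smoothly instead of flipping it.
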
Part 5/7

Evaluation methodology

The negative rubric: telling a judge what NOT to evaluate

TruLens uses rubrics that tell the judge what falls outside scope. Few rubrics state that boundary so explicitly.

9 min
Part 6/7

Evaluation methodology

Deterministic where you can, judge where you must

Tool calls deserve deterministic comparison. Goal completion needs LLM-based assessment. Some tools draw the line; others don't.

9 min
Part 7/7

Evaluation methodology

Who calibrates the judge? The gap nothing fills

Six posts, five strategies, five tools. Two ship the primitive that measures judge-vs-human agreement; none gates a CI build on it with a published confidence band. That's the gap.

11 min
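
For a sense of what such a gate could look like, here is a minimal sketch: compute Cohen's kappa between judge labels and human labels on a shared sample, and fail the build below a threshold. None of the tools in the series ships this. The function names and the 0.6 threshold are hypothetical, and the sketch uses a point estimate only, with no confidence band.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    labels = set(judge) | set(human)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def agreement_gate(judge: list[str], human: list[str], threshold: float = 0.6) -> bool:
    """True if the judge still agrees with humans well enough (hypothetical threshold)."""
    return cohens_kappa(judge, human) >= threshold


judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(judge_labels, human_labels), agreement_gate(judge_labels, human_labels))
```

In a CI job, a false return from `agreement_gate` would map to a non-zero exit code. A real gate would also need enough labelled samples for the estimate to mean anything, which is where a confidence band comes in.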