1 series · 7 pieces
Series
Multi-part essays — read in order, or jump to any part. Each series has a thread holding it together; the parts compound.
Reading the source: how OSS tools judge LLM agents
Every LLM-as-judge tool fights the same thing: variance. Few say so out loud. I read the source of seven OSS evaluators — five strategies, straight from the code, not the docs. Each contains the variance a different way; each pays for it somewhere. The series ends where they all stop short: none gates a build on whether the judge still agrees with a human.
Evaluation methodology
Variance is the problem with LLM-as-judge — and tools don't agree on how to fix it
Reading the source code of seven OSS LLM-as-judge tools: variance manifests across four axes, and the seven approaches don't share a framework — they share a problem.
Evaluation methodology
Binary verdicts: how Ragas keeps judges honest
Ragas decomposes complex judgments into atomic yes/no questions. The trade-off: each verdict carries less information in exchange for lower variance.
Evaluation methodology
From rubric to graph: how DeepEval splits a judgment
DeepEval encodes evaluation as a graph traversal. Each node is a simpler decision than the overall judgment.
Evaluation methodology
When judges show their uncertainty: DeepEval and ARES
DeepEval weights scores by token probabilities. ARES applies Prediction-Powered Inference for formal confidence intervals. Two routes to the same goal.
Evaluation methodology
The negative rubric: telling a judge what NOT to evaluate
TruLens writes rubrics that tell the judge what falls outside scope. Stating exclusions this directly is unusual.
Evaluation methodology
Deterministic where you can, judge where you must
Tool calls deserve deterministic comparison. Goal completion needs LLM-based assessment. Some tools draw the line; others don't.
Evaluation methodology
Who calibrates the judge? The gap nothing fills
Six posts, five strategies, five tools. Two ship the primitive that measures judge-vs-human agreement; none gates a CI build on it with a published confidence band. That's the gap.