Evaluation methodology
Evaluator design, calibration, and the gap between benchmark performance and production reliability — testing methodologies for non-deterministic systems.
Who calibrates the judge? The gap nothing fills
Six posts, five strategies, five tools. Two ship the primitive that measures judge-vs-human agreement; none gates a CI build on it with a published confidence band. That's the gap.
Deterministic where you can, judge where you must
Tool calls deserve deterministic comparison. Goal completion needs LLM-based assessment. Some tools draw that line explicitly; others don't.
The negative rubric: telling a judge what NOT to evaluate
TruLens uses rubrics that tell the judge what falls outside scope. The pattern is unusually explicit.
When judges show their uncertainty: DeepEval and ARES
DeepEval weights scores by token probabilities. ARES applies Prediction-Powered Inference for formal confidence intervals. Two routes to the same goal.
From rubric to graph: how DeepEval splits a judgment
DeepEval encodes evaluation as a graph traversal. Each node is a simpler decision than the overall judgment.
Binary verdicts: how Ragas keeps judges honest
Ragas decomposes complex judgments into atomic yes/no questions. The trade-off: each verdict carries less information in exchange for lower variance.
Variance is the problem with LLM-as-judge — and tools don't agree on how to fix it
Reading the source code of seven OSS LLM-as-judge tools: variance manifests across four axes, and the seven approaches don't share a framework — they share a problem.