Trust, but Trace
Topic · 7 pieces

Evaluation methodology

Evaluator design, calibration, and the gap between benchmark performance and production reliability — testing methodologies for non-deterministic systems.

01
Evaluation methodology

Who calibrates the judge? The gap nothing fills

Six posts, five strategies, five tools. Two ship the primitive that measures judge-vs-human agreement; none gates a CI build on it with a published confidence band. That's the gap.

11 min →
02
Evaluation methodology

Deterministic where you can, judge where you must

Tool calls deserve deterministic comparison. Goal completion needs LLM-based assessment. Some tools draw the line; others don't.

9 min →
03
Evaluation methodology

The negative rubric: telling a judge what NOT to evaluate

TruLens uses rubrics that tell the judge what falls outside scope. The pattern is unusually explicit.

9 min →
04
Evaluation methodology

When judges show their uncertainty: DeepEval and ARES

DeepEval weights scores by token probabilities. ARES applies Prediction-Powered Inference for formal confidence intervals. Two routes to the same goal.

10 min →
05
Evaluation methodology

From rubric to graph: how DeepEval splits a judgment

DeepEval encodes evaluation as a graph traversal. Each node is a simpler decision than the overall judgment.

10 min →
06
Evaluation methodology

Binary verdicts: how Ragas keeps judges honest

Ragas decomposes complex judgments into atomic yes/no questions. The trade-off: you give up information in exchange for lower variance.

8 min →
07
Evaluation methodology

Variance is the problem with LLM-as-judge — and tools don't agree on how to fix it

Reading the source code of seven OSS LLM-as-judge tools: variance manifests across four axes, and the seven approaches don't share a framework — they share a problem.

6 min →