Topic · 7 pieces

Evaluation methodology

Evaluator design, calibration, and the gap between benchmark performance and production reliability — testing methodologies for non-deterministic systems.

Evaluation methodology May 30, 2026

Who calibrates the judge? The gap nothing fills

Six posts, five strategies, five tools. Two ship the primitive that measures judge-vs-human agreement; none gates a CI build on it with a published confidence band. That's the gap.

11 min → 02

Evaluation methodology May 27, 2026

Deterministic where you can, judge where you must

Tool calls deserve deterministic comparison. Goal completion needs LLM-based assessment. Some tools draw the line; others don't.

9 min → 03

Evaluation methodology May 23, 2026

The negative rubric: telling a judge what NOT to evaluate

TruLens rubrics tell the judge what falls outside its scope. Few tools spell this out so directly.

9 min → 04

Evaluation methodology May 20, 2026

When judges show their uncertainty: DeepEval and ARES

DeepEval weights scores by token probabilities. ARES applies Prediction-Powered Inference for formal confidence intervals. Two routes to the same goal.

10 min → 05

Evaluation methodology May 16, 2026

From rubric to graph: how DeepEval splits a judgment

DeepEval encodes evaluation as a graph traversal. Each node is a simpler decision than the overall judgment.

10 min → 06

Evaluation methodology May 7, 2026

Binary verdicts: how Ragas keeps judges honest

Ragas decomposes complex judgments into atomic yes/no questions. The trade-off: each verdict carries less information, but variance drops.

8 min → 07

Evaluation methodology May 3, 2026

Variance is the problem with LLM-as-judge — and tools don't agree on how to fix it

Reading the source code of seven OSS LLM-as-judge tools: variance manifests across four axes, and the seven approaches don't share a framework — they share a problem.

6 min →