Trust, but Trace

The archive · 9 pieces

Writing

Long essays, working notes, and series on agent reliability, evaluation, and the EU AI Act.

01
EU AI Act

Article 72 named the loop. Your dashboards aren't closing it.

Article 72 demands active, systematic collection of performance data across a system's whole lifetime, fed back into risk management. Observability gives you dashboards, not that feedback loop, and the template that was meant to help just got removed. Where the loop stays open, and what to build.

10 min →
02
Evaluation methodology

Who calibrates the judge? The gap nothing fills

Six posts, five strategies, five tools. Two ship the primitive that measures judge-vs-human agreement; none gates a CI build on it with a published confidence band. That's the gap.

11 min →
03
Evaluation methodology

Deterministic where you can, judge where you must

Tool calls deserve deterministic comparison. Goal completion needs LLM-based assessment. Some tools draw the line; others don't.

9 min →
04
Evaluation methodology

The negative rubric: telling a judge what NOT to evaluate

TruLens rubrics spell out what falls outside the judge's scope, an unusually explicit pattern.

9 min →
05
Evaluation methodology

When judges show their uncertainty: DeepEval and ARES

DeepEval weights scores by token probabilities. ARES applies Prediction-Powered Inference for formal confidence intervals. Two routes to the same goal.

10 min →
06
Evaluation methodology

From rubric to graph: how DeepEval splits a judgment

DeepEval encodes evaluation as a graph traversal. Each node is a simpler decision than the overall judgment.

10 min →
07
Evaluation methodology

Binary verdicts: how Ragas keeps judges honest

Ragas decomposes complex judgments into atomic yes/no questions. The trade-off: you give up information in exchange for lower variance.

8 min →
08
EU AI Act

Article 15 named the requirements. The toolchain hasn't caught up.

Article 15 of the AI Act names five attack vectors and a lifecycle requirement. The evaluation toolchain available today addresses a fraction of that. Where the gap lives, and what to do meanwhile.

10 min →
09
Evaluation methodology

Variance is the problem with LLM-as-judge — and tools don't agree on how to fix it

A read through the source code of seven OSS LLM-as-judge tools: variance shows up along four axes, and the seven approaches share a problem, not a framework.

6 min →