The archive · 9 pieces

Writing

Long essays, working notes, and series on agent reliability, evaluation, and the EU AI Act.

EU AI Act June 9, 2026

Article 72 named the loop. Your dashboards aren't closing it.

Article 72 demands active, systematic collection of performance data across a system's whole lifetime, fed back into risk management. Observability gives you dashboards, not that feedback loop, and the template that was meant to help just got removed. Where the loop stays open, and what to build.

10 min → 02

Evaluation methodology May 30, 2026

Who calibrates the judge? The gap nothing fills

Six posts, five strategies, five tools. Two ship the primitive that measures judge-vs-human agreement; none gates a CI build on it with a published confidence band. That's the gap.

11 min → 03

Evaluation methodology May 27, 2026

Deterministic where you can, judge where you must

Tool calls deserve deterministic comparison. Goal completion needs LLM-based assessment. Some tools draw the line; others don't.

9 min → 04

Evaluation methodology May 23, 2026

The negative rubric: telling a judge what NOT to evaluate

TruLens rubrics tell the judge what falls outside scope. The pattern is unusually explicit.

9 min → 05

Evaluation methodology May 20, 2026

When judges show their uncertainty: DeepEval and ARES

DeepEval weights scores by token probabilities. ARES applies Prediction-Powered Inference for formal confidence intervals. Two routes to the same goal.

10 min → 06

Evaluation methodology May 16, 2026

From rubric to graph: how DeepEval splits a judgment

DeepEval encodes evaluation as a graph traversal. Each node is a simpler decision than the overall judgment.

10 min → 07

Evaluation methodology May 7, 2026

Binary verdicts: how Ragas keeps judges honest

Ragas decomposes complex judgments into atomic yes/no questions. The trade-off: you give up information in exchange for lower variance.

8 min → 08

EU AI Act May 3, 2026

Article 15 named the requirements. The toolchain hasn't caught up.

Article 15 of the AI Act names five attack vectors and a lifecycle requirement. The evaluation toolchain available today addresses a fraction of that. Where the gap lives, and what to do meanwhile.

10 min → 09

Evaluation methodology May 3, 2026

Variance is the problem with LLM-as-judge — and tools don't agree on how to fix it

Reading the source code of seven OSS LLM-as-judge tools: variance manifests across four axes, and the seven approaches don't share a framework — they share a problem.

6 min →