The archive · 9 pieces
Writing
Long essays, working notes, and series on agent reliability, evaluation, and the EU AI Act.
Article 72 named the loop. Your dashboards aren't closing it.
Article 72 demands active, systematic collection of performance data across a system's whole lifetime, fed back into risk management. Observability gives you dashboards, not a closed feedback loop, and the template that was meant to help was just removed. Where the loop stays open, and what to build.
Who calibrates the judge? The gap nothing fills
Six posts, five strategies, five tools. Two ship the primitive that measures judge-vs-human agreement; none gates a CI build on it with a published confidence band. That's the gap.
Deterministic where you can, judge where you must
Tool calls deserve deterministic comparison. Goal completion needs LLM-based assessment. Some tools draw the line; others don't.
The negative rubric: telling a judge what NOT to evaluate
TruLens rubrics tell the judge what falls outside scope. The pattern is unusually explicit.
When judges show their uncertainty: DeepEval and ARES
DeepEval weights scores by token probabilities. ARES applies Prediction-Powered Inference for formal confidence intervals. Two routes to the same goal.
From rubric to graph: how DeepEval splits a judgment
DeepEval encodes evaluation as a graph traversal. Each node is a simpler decision than the overall judgment.
Binary verdicts: how Ragas keeps judges honest
Ragas decomposes complex judgments into atomic yes/no questions. The trade-off: you give up information to reduce variance.
Article 15 named the requirements. The toolchain hasn't caught up.
Article 15 of the AI Act names five attack vectors and a lifecycle requirement. The evaluation toolchain available today addresses a fraction of that. Where the gap lives, and what to do meanwhile.
Variance is the problem with LLM-as-judge — and tools don't agree on how to fix it
Reading the source code of seven OSS LLM-as-judge tools: variance manifests across four axes, and the seven approaches don't share a framework — they share a problem.