About

What this is, and why it exists

The site

About this site

The longer answer.

What you’ll find

01 Evaluation harnesses, in detail
02 Production ML failure modes
03 Notes on debugging multi-agent runs
04 Occasional opinionated takes

ML research and ML production look superficially similar — they share vocabulary, papers, even people — but they have different failure modes, different value functions, different ideas of what “done” means. A model that scores well on a benchmark can still leak data, hallucinate confidently, or regress silently after a deploy.

This site covers the engineering side of testing LLM-based and multi-agent systems — evaluation harnesses, trace and replay infrastructure, the things that let agents be debugged, audited, and trusted in deployment. Mostly engineering, not modelling.

Short pieces, real numbers, code where useful. If you ship ML systems — LLM agents or otherwise — and care about reliability, you’ll find familiar problems here.

Philosophy

How it’s written

The principles behind every essay here.

01 Concrete over abstract
Every essay anchors to OSS code, peer-reviewed methodology, or observable patterns.

02 Brevity over completeness
Short pieces (5–8 minutes), real numbers, code where useful. The reader’s time is the constraint.

03 Engineering, not modelling
The systems around language models (evaluation harnesses, traces, contracts, replays) are where reliability is won. That’s the focus.

04 Verifiable
Citations point to OSS code, merged pull requests, or DOI-indexed papers. Claims that can’t be backed by something public don’t get published.

05 AI-assisted, author-owned
AI is a writing partner; the analysis, design, and accountability stay with the author. See AI policy for details.
