About
What this is, and why it exists
The site
About this site
The longer answer.
- 01 Evaluation harnesses, in detail
- 02 Production ML failure modes
- 03 Notes on debugging multi-agent runs
- 04 Occasional opinionated takes
ML research and ML production look superficially similar — they share vocabulary, papers, even people — but they have different failure modes, different value functions, different ideas of what “done” means. A model that scores well on a benchmark can still leak data, hallucinate confidently, or regress silently after a deploy.
This site covers the engineering side of testing LLM-based and multi-agent systems — evaluation harnesses, trace and replay infrastructure, the things that let agents be debugged, audited, and trusted in deployment. Mostly engineering, not modelling.
Short pieces, real numbers, code where useful. If you ship ML systems — LLM agents or otherwise — and care about reliability, you’ll find familiar problems here.
Philosophy
How it’s written
The principles behind every essay here.
- 01 Concrete over abstract: Every essay anchors to OSS code, peer-reviewed methodology, or observable patterns.
- 02 Brevity over completeness: Short pieces (5–8 minutes), real numbers, code where useful. The reader’s time is the constraint.
- 03 Engineering, not modelling: The systems around language models (evaluation harnesses, traces, contracts, replays) are where reliability is won. That’s the focus.
- 04 Verifiable: Citations point to OSS code, merged pull requests, or DOI-indexed papers. Claims that can’t be backed by something public don’t get published.
- 05 AI-assisted, author-owned: AI is a writing partner; the analysis, design, and accountability stay with the author. See AI policy for details.