Trust, but Trace

About

What this is, and why it exists

The site

About this site

The longer answer.

What you’ll find
  • 01 Evaluation harnesses, in detail
  • 02 Production ML failure modes
  • 03 Notes on debugging multi-agent runs
  • 04 Occasional opinionated takes

ML research and ML production look superficially similar: they share vocabulary, papers, even people. But they have different failure modes, different value functions, and different ideas of what “done” means. A model that scores well on a benchmark can still leak data, hallucinate confidently, or regress silently after a deploy.

This site covers the engineering side of testing LLM-based and multi-agent systems: evaluation harnesses, trace and replay infrastructure, and the other tooling that lets agents be debugged, audited, and trusted in deployment. Mostly engineering, not modelling.

Short pieces, real numbers, code where useful. If you ship ML systems, LLM agents or otherwise, and care about reliability, you’ll find familiar problems here.

Philosophy

How it’s written

The principles behind every essay here.

  • 01
    Concrete over abstract
    Every essay anchors to OSS code, peer-reviewed methodology, or observable patterns.
  • 02
    Brevity over completeness
    Short pieces (5–8 minutes), real numbers, code where useful. The reader’s time is the constraint.
  • 03
    Engineering, not modelling
    The systems around language models (evaluation harnesses, traces, contracts, replays) are where reliability is won. That’s the focus.
  • 04
    Verifiable
    Citations point to OSS code, merged pull requests, or DOI-indexed papers. Claims that can’t be backed by something public don’t get published.
  • 05
    AI-assisted, author-owned
    AI is a writing partner; the analysis, design, and accountability stay with the author. See AI policy for details.