Engineering

End-to-end evals for agentic systems

Unit tests and benchmarks miss the failures that actually break agents. A pattern for evaluating the system as a whole.

Carlos Chinchilla Corbacho

March 12, 2026 · 1 min read

Most ML evaluation literature is about models. Most production failures are about systems.

Evaluation · Agents · Testing

Next up · Reliability series

Part 3 / 3

Notes

A note for engineers transitioning into ML — and for researchers wondering why their prototype keeps falling over in production.