Most ML evaluation literature is about models. Most production failures are about systems.
End-to-end evals for agentic systems
Unit tests and benchmarks miss the failures that actually break agents. A pattern for evaluating the system as a whole.