Topic · 3 pieces

Agents

On orchestration, tool use, memory, and the specific failure modes of LLM-driven multi-step systems.

How to build agent workflows you can replay, diff, and certify, even when the underlying LLM call is not replayable, diffable, or certifiable.

Why prompt-injection benchmarks tell you almost nothing about whether your agent is safe to deploy — and what to test instead.

Unit tests and benchmarks miss the failures that actually break agents. A pattern for evaluating the system as a whole.