An agent we shipped last quarter passed every model-level red-team evaluation we ran. It still leaked customer data in its first week of production. The leak had nothing to do with the model.
The vulnerability lived three layers up. A tool we exposed — a search over a private index — had no per-user authorisation check. The model, doing exactly what we asked, used the tool to answer the user’s question. The answer happened to contain another customer’s order history.
Where the benchmarks look#
Public prompt-injection benchmarks evaluate the model in isolation. You feed it adversarial text, you check whether it complies. This is a real and useful test, but it is testing the wrong layer for an agent.
What to test instead#
- Tool authorisation. Every tool call must carry the calling user’s identity, and every backend must enforce it.
- Memory isolation. Treat shared memory reads as untrusted input.
- Retrieval poisoning. Indexable surfaces are now part of your prompt.
- Output destinations. Email, webhook, file write — each is an exfiltration channel.