Carlos Chinchilla Corbacho

Security

Agent security is not model security

Why prompt-injection benchmarks tell you almost nothing about whether your agent is safe to deploy — and what to test instead.

An agent we shipped last quarter passed every model-level red-team evaluation we ran. It still leaked customer data in its first week of production. The leak had nothing to do with the model.

The vulnerability lived three layers up. A tool we exposed — a search over a private index — had no per-user authorisation check. The model, doing exactly what we asked, used the tool to answer the user’s question. The answer happened to contain another customer’s order history.

Where the benchmarks look#

Public prompt-injection benchmarks evaluate the model in isolation. You feed it adversarial text, you check whether it complies. This is a real and useful test, but it is testing the wrong layer for an agent.

What to test instead#

  • Tool authorisation. Every tool call must carry the calling user’s identity, and every backend must enforce it.
  • Memory isolation. Treat shared memory reads as untrusted input.
  • Retrieval poisoning. Indexable surfaces are now part of your prompt.
  • Output destinations. Email, webhook, file write — each is an exfiltration channel.
SecurityAgentsTesting