Article 15 named the requirements. The toolchain hasn't caught up.

I assumed the EU AI Act’s testing requirements were already covered. I was wrong.

Article 15 names five attack vectors. It requires consistent performance throughout the system’s lifecycle. It mentions feedback loops in systems that keep learning post-deployment. The article is specific.

The evaluation tools available today address a fraction of what the article names. That gap is the subject of this post.

What Article 15 actually says

Article 15 of Regulation (EU) 2024/1689 is titled Accuracy, robustness and cybersecurity. It governs what high-risk AI systems must achieve and what providers must prove. Five paragraphs. Each one says something specific.

Paragraph (1) is the frame. High-risk AI systems shall achieve an appropriate level of accuracy, robustness, and cybersecurity, and “perform consistently in those respects throughout their lifecycle”. Read the lifecycle clause twice. The article isn’t asking for a passing benchmark on day one. It’s asking for sustained performance for as long as the system is in service. That distinction matters. Most evaluation infrastructure ignores it.

“Appropriate level” is qualitative. The article doesn’t set numeric thresholds; it requires providers to justify what is appropriate for the system’s intended purpose, in writing. The justification is the deliverable.

Paragraph (2) anticipates that the methodologies for measuring accuracy and robustness don’t yet exist at scale. The Commission is to encourage the development of benchmarks and measurement methodologies in cooperation with metrology and benchmarking authorities. The article admits the gap.

Paragraph (3) requires that the actual levels of accuracy, and the metrics they are measured against, be declared in the instructions for use. This is the transparency anchor — what you tested, what you measured, what threshold you cleared.

Paragraph (4) covers robustness and post-deployment behaviour. High-risk AI systems must be resilient to errors, faults, and inconsistencies, with technical and organisational measures in place. Where systems continue to learn after deployment, the article requires that feedback loops — biased outputs influencing future inputs — are addressed with mitigation measures. This is the online-learning clause, written explicitly into the regulation. Systems that fine-tune in production, or that use user interactions to shape future behaviour, are inside this paragraph whether their builders think of them as “AI Act systems” or not.

Paragraph (5) covers cybersecurity, and this is where the article gets technically specific. High-risk AI systems must be resilient against unauthorised attempts to alter their use, outputs, or performance. The article enumerates five AI-specific attack vectors that providers must — where appropriate — prevent, detect, respond to, resolve, and control:

Data poisoning of the training set
Model poisoning of pre-trained components
Adversarial examples / model evasion at inference
Confidentiality attacks
Model flaws

Five categories. They are written into the law.

What the ecosystem actually tests

Take the five paragraphs of Article 15 and put them next to today’s evaluation tools. The categories don’t line up.

Most current evaluation tooling falls into four groups, each addressing a different part of what the article names.

Prompt injection and jailbreak benchmarks test the model in isolation against attempts to bypass its safety instructions. They probe what the article calls adversarial examples and model evasion — but only at the model boundary, and only at a single moment. They don’t address the lifecycle requirement, and they don’t test what happens once the model is wrapped in tool calls and memory.

Capability benchmarks test what the model knows and can do across domains, usually with multiple-choice or open-ended formats. They are useful as one input to the accuracy claim of Paragraph (1). They aren’t, on their own, sufficient evidence of robustness or cybersecurity. The article requires more than knowing the model can answer a question.

Red-teaming for safety is adversarial scenario generation, often manual or semi-manual. It addresses the cybersecurity dimension of Paragraph (5) — partially. The coverage depends on the imagination of the red team, not on a systematic mapping of the five attack vectors named in the law. Two organisations red-teaming the same system rarely produce the same coverage.

Runtime guardrails and output filtering intervene at request time, not at test time. They are a deployment-side defence, useful as compliance evidence under Paragraph (4), but they aren’t testing. They restrict what the system does in production; lifecycle performance is somewhere else entirely.

Three observations follow from this map.

First, every category operates predominantly at the model layer. Agent-level coverage is nascent — benchmarks like AgentDojo, AgentHarm, and Agent Security Bench have appeared in the last 18 months — but it is not yet absorbed into mainstream evaluation pipelines. The article’s appropriate level of accuracy, robustness, and cybersecurity applies to the high-risk AI system as a whole; the toolchain measures a fragment of that whole.

Second, every category is a point-in-time activity. Run a jailbreak benchmark before a release, ship the release, skip the re-run on the next release because the regression cost is too high. None of these categories produces evidence of consistent performance throughout the lifecycle. They produce a snapshot, and snapshots aren’t what Paragraph (1) requires.

Third, the named attack vectors of Paragraph (5) are not symmetrically covered. Adversarial examples and model evasion are well-served by the prompt-injection benchmarks. Data poisoning and model poisoning have growing coverage at training time, less so once a system is deployed and incremental fine-tuning is in scope. Confidentiality attacks and model flaws are largely under-served by the current toolchain.

Article 15 names what to test; the ecosystem tests a fraction of that. Where it misses comes next.

What Article 15 requires of a high-risk AI system, and what today’s evaluation toolchain actually tests. Nothing on the list is fully covered — every requirement is partial, or a gap the toolchain hasn’t closed.

partial gap

Accuracy§15(1) partial

Capability benchmarks are one input to the accuracy claim — not sufficient on their own.

Consistent lifecycle performance§15(1) gap

The toolchain produces snapshots; none of them is lifecycle evidence.

Failure recovery§15(4) gap

Almost no harness tests what the system does when something fails.

Adversarial examples / evasion§15(5) partial

Well-served by prompt-injection benchmarks — at the model boundary only.

Data poisoning§15(5) partial

Growing coverage at training time; thinner once the system is deployed.

Model poisoning§15(5) partial

Growing coverage at training time; thinner once the system is deployed.

Confidentiality attacks§15(5) gap

Largely under-served by the current toolchain.

Model flaws§15(5) gap

Largely under-served by the current toolchain.

Multi-agent coordination§15(4–5) gap

Cascading failures between agents are rarely tested.

Where the toolchain misses

Four gaps between what the article names and what the ecosystem currently tests. They are not exhaustive. They are the four where the distance is widest.

Adversarial robustness at the agent layer, not just the model layer. Paragraph (5) names attack vectors that current benchmarks address at the model boundary — data poisoning, adversarial examples, model evasion. Agent systems extend the boundary. Tool selection, plan formation, memory, and inter-system communication are all surfaces where adversarial inputs can shape behaviour without ever reaching the model in a recognisably hostile form. A manipulated tool output that influences the next plan step is an adversarial input at the agent level. A poisoned memory entry that biases future decisions is the same. Article 15’s wording is system-level; the tools are predominantly model-level.

Lifecycle, not one-shot. Paragraph (1) asks for consistent performance across the system’s lifecycle. The toolchain produces snapshots — pre-release benchmarks, occasional re-runs, ad-hoc red-team exercises. None of those is lifecycle evidence. A snapshot from six months ago is no evidence at all if the model has been swapped in the meantime, the prompt template has shifted, or a downstream service has changed its API. The ecosystem hasn’t built the regression-style infrastructure that would make Paragraph (1) checkable on every release, against every change, for as long as the system runs.

Failure recovery. Paragraph (4) requires resilience and accepts technical redundancy and fail-safe plans as a path to it. But almost no evaluation harness today tests what the system does when something fails. Recovery from a failed tool call. Fallback when the model’s confidence is low. Graceful degradation when an upstream service is down. These are operational properties of the system. They are also what Paragraph (4) names. The toolchain treats success as the normal case and ignores the failure modes the article cares about. A system that has never been tested under failure can’t be claimed to be resilient.

Multi-agent coordination. Paragraph (4)‘s feedback-loop clause and Paragraph (5)‘s named confidentiality-attacks vector both apply with extra force when the system is composed of multiple agents passing outputs to each other. Cascading failures — where one agent’s output becomes another agent’s input, with no observability between them — are an emerging concern in the academic literature. The current toolchain rarely tests for them. I’ll return to this gap in later writing. For this post, the point is that the article’s clauses already cover the multi-agent case — the tools haven’t caught up.

One lever could close some of these gaps: harmonised standards. They aren’t ready.

Harmonised standards still being written

Harmonised standards spell out the technical detail behind the Act. Compliance with one creates a legal presumption that the underlying requirement is met. Without a published standard, providers must demonstrate compliance directly — harder, longer, more open to dispute.

The state as of May 2026 is bleaker than the rhetoric around the AI Act suggests. Not a single harmonised standard under the AI Act has been cited in the Official Journal of the European Union. None.

CEN-CENELEC JTC 21’s most advanced draft, prEN 18286 (Quality Management System), failed its public Enquiry vote in January 2026. Publication has slipped to Q4 2026 at the earliest. That is the procedural standard.

The technical standards relevant to Article 15 — accuracy and robustness, cybersecurity — sit further back. Neither has entered Enquiry.

Even on JTC 21’s accelerated procedure, publication targets Q4 2026; OJEU citation is a separate step that pushes the realistic earliest window for Article 15-relevant standards to mid-to-late 2027.

The Commission has noticed. On 19 November 2025 it proposed the Digital Omnibus on AI, linking high-risk AI rules to standards availability, with backstops of December 2027 and August 2028. The proposal is in trilogue at the time of writing. Until adopted, the legal application date for high-risk obligations remains 2 August 2026. Plan against the date currently in law, not the date you hope is coming.

What to do meanwhile

Four things you can do without waiting for harmonised standards. The prescriptions assume you are a provider — the engineer building a high-risk AI system, not the team using one. Section 2’s requirements (Articles 8-15) apply to providers; deployers have a separate obligation list under Article 26.

Document what you test, what you don’t, and why. Article 15(1) requires an appropriate level of accuracy, robustness, and cybersecurity — qualitative, contextual, justified per system. The justification is the deliverable. Article 11 is unambiguous about what that documentation includes: dated and signed test logs and test reports are part of the technical documentation, not a nice-to-have. Write down what you measure, what threshold you cleared, what you chose not to test, and the technical reasoning. A notified body will read it line by line.

Run a continuous risk register, not a point-in-time audit. Article 9 requires a risk management system that operates iteratively across the system’s lifecycle; Article 72 requires post-market monitoring. Treat them as one discipline. The risk register is updated on every change — new model, new prompt template, new tool, new dependency. A risk audit done once at acceptance isn’t what the article asks for.

Start with the five attack vectors of Paragraph (5). Build red-team coverage against the named list first — data poisoning, model poisoning, adversarial examples or model evasion, confidentiality attacks, model flaws. Cover the legal anchors before extending into adjacent threats. The article’s list is the starting point, not the ceiling.

Diversify your testing toolchain. Standards are still being drafted, vendor offerings will shift, methods will mature. A testing harness wired to a single benchmark suite or a single provider is a risk in itself — not primarily for compliance reasons, but because the next three years of change will rebalance the landscape. The infrastructure should survive that rebalancing.

That’s the engineering work the article actually requires.

If you’ve started this work, I’d be interested to hear what you’ve found. The gap between what Article 15 names and what the ecosystem currently provides is large enough that no team is far ahead of any other. The methods that will inform the harmonised standards are being built right now in the teams that don’t wait.

EU AI Act