By 2 August 2026, providers of high-risk AI systems under Annex III of the EU AI Act must demonstrate conformity, compile Annex IV technical documentation, and operate a post-market monitoring system under Article 72. That deadline has changed how buyers read proposals for AI red teaming services. The question is no longer “should we test?” It is “what will we actually receive, in what shape, and how does it map to the compliance artifacts we need to produce?”
Buyers shortlisting vendors hit a market where the same phrase covers both a two-day automated scan and a 12-week adversarial engagement. Some deliverables are audit-ready. Others need translation before a Notified Body, a GRC team, or an internal auditor can use them.
This article breaks down what a credible external AI red teaming assessment includes: the four phases of a structured engagement, the three deliverables to treat as non-negotiable, the variables that move a timeline from 3 to 12 weeks, and the trade-off between point-in-time and continuous testing.
If you are a CTO, CISO, procurement lead, or compliance lead comparing proposals, this is the shape of a substantive offer.
The Four Phases of a Credible External Engagement
A structured AI red teaming engagement moves through four phases. The names vary across practitioners, but the structure is consistent. When a proposal collapses scoping into a “kickoff call” or treats threat modeling as a by-product of testing, that is a sourcing risk.
1. Scoping and Rules of Engagement
The first phase defines the perimeter: which systems are in scope and which are out, which environments the red team can touch, what constitutes a critical finding that pauses the engagement, and what the escalation path looks like. Expect written rules of engagement, named points of contact on both sides, a legal review of any data-handling commitments, and sign-off from the business owner of each system in scope.
This phase is where procurement and compliance leads earn back weeks of engagement time later. Arriving at kickoff with approvals in hand, an isolated testing environment, and pre-cleared access cuts the time the vendor spends chasing permissions rather than finding issues.
2. Threat Modeling and Attack Surface Mapping
The second phase is AI-specific and is where a vendor’s depth shows. The red team maps the system’s attack surface against recognized taxonomies (the latest edition of the OWASP Top 10 for LLM Applications, and MITRE ATLAS), then builds a threat model that reflects how your system is built. A retrieval-augmented generation (RAG) pipeline, a customer-facing chatbot with tool access, and an autonomous agent with privileged data reads are three different threat models and should produce three different test plans.
Watch for proposals that skip this phase or reuse a generic library of attacks. Adversarial testing without a system-specific threat model produces findings that look plausible but rarely land on the parts of the system that matter.
3. Active Adversarial Testing
Phase three is the testing itself, a combination of manual adversarial work and automated tooling. Manual testing surfaces the issues that require reasoning about your specific system: multi-turn jailbreaks, indirect prompt injection via retrieved documents, tool-call hijacking in agentic flows. Automated tooling provides breadth across known attack patterns at scale. Neither substitutes for the other.
Buyers should expect logged attempts, a reproducible proof of concept for each finding, and an agreed disclosure path for critical issues during the engagement rather than at the end. A red team that sits on a severity-critical finding for four weeks before reporting is running on the wrong incentive structure.
4. Reporting and Remediation Guidance
The fourth phase is where the engagement becomes a compliance artifact. The deliverable is not the attack. It is the documentation of the attack, the evidence that reproduced it, the severity, the impact, and the remediation path. The findings report, the risk register, and the remediation roadmap each have a specific purpose, and the next section breaks down what each should contain.
Three Deliverables You Should Treat as Non-Negotiable
Across practitioner documentation, three outputs recur in every credible AI red teaming engagement. If a proposal doesn’t commit to all three in writing, treat that as a gap.
1. Findings Report With Severity-Rated, Reproducible Vulnerabilities
Each finding should include: attack vector, access level required, proof of concept sufficient to reproduce the issue, severity rating with the methodology behind it, business impact, and remediation recommendation. An executive summary should open the report. A methodology section should state scope, tools used, access level, coverage, and known limits of the engagement. Anything less is a marketing summary, not an assessment.
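The per-finding fields listed above can be turned into a simple completeness check when reviewing a sample report. The sketch below is illustrative only: the `Finding` class, its field names, and the `missing_fields` helper are assumptions made for this example, not a standard report schema.

```python
from dataclasses import dataclass, fields


@dataclass
class Finding:
    """One report entry; field names are hypothetical, chosen for this sketch."""
    title: str
    attack_vector: str
    access_level: str       # e.g. "black box", "gray box", "white box"
    proof_of_concept: str   # steps or payload sufficient to reproduce
    severity: str           # the rating itself, e.g. "high"
    severity_method: str    # methodology behind the rating
    business_impact: str
    remediation: str


def missing_fields(finding: Finding) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(finding) if not getattr(finding, f.name).strip()]


example = Finding(
    title="Indirect prompt injection via retrieved document",
    attack_vector="Poisoned document in the RAG index",
    access_level="gray box",
    proof_of_concept="",
    severity="high",
    severity_method="",
    business_impact="Assistant discloses internal data to the user",
    remediation="Sanitize retrieved content before it reaches the prompt",
)

print(missing_fields(example))  # ['proof_of_concept', 'severity_method']
```

An entry that comes back with gaps like these has a rating but no way to reproduce or justify it, which is the difference between an assessment and a summary.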
2. Risk Register Mapped to Recognized Taxonomies
Findings should be mapped to OWASP LLM Top 10 categories and MITRE ATLAS tactics and techniques. This mapping is not cosmetic. For compliance work, it gives your GRC team a defensible structure, a classification an external auditor already knows, and a way to compare findings across engagements over time. For internal prioritization, it connects the red team’s output to a threat model your security team can operate against.
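As a rough illustration of comparing findings across engagements, the sketch below counts findings per OWASP LLM Top 10 (2025) category in two registers and reports the change. The register layout and the `category_delta` helper are assumptions for this example. The MITRE ATLAS field is deliberately left as a placeholder to be filled from the ATLAS matrix rather than guessed here.

```python
from collections import Counter

# Each register row: (finding title, OWASP category, ATLAS technique).
# The ATLAS value is a placeholder, not a real technique ID.
ATLAS_TBD = "ATLAS-ID-TO-BE-ASSIGNED"

first_engagement = [
    ("Direct jailbreak of support bot", "LLM01:2025 Prompt Injection", ATLAS_TBD),
    ("Injection via retrieved PDF", "LLM01:2025 Prompt Injection", ATLAS_TBD),
    ("Agent can issue refunds unprompted", "LLM06:2025 Excessive Agency", ATLAS_TBD),
]
second_engagement = [
    ("Injection via retrieved PDF", "LLM01:2025 Prompt Injection", ATLAS_TBD),
    ("System prompt echoed on request", "LLM07:2025 System Prompt Leakage", ATLAS_TBD),
]


def category_delta(before, after):
    """Change in finding count per OWASP category between two registers."""
    b = Counter(row[1] for row in before)
    a = Counter(row[1] for row in after)
    return {cat: a[cat] - b[cat] for cat in sorted(set(b) | set(a))}


for category, change in category_delta(first_engagement, second_engagement).items():
    print(f"{category}: {change:+d}")
```

Because both registers use the same category labels, the comparison is a plain count difference; without a shared taxonomy, each engagement's findings would have to be reconciled by hand.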
3. Prioritized Remediation Roadmap
The report should close with a remediation roadmap structured across immediate, short-term, and long-term horizons. “Patch this prompt template this week” is not the same recommendation as “redesign this tool-call interface next quarter”, and the report should distinguish them. Expect the vendor to rank findings by exploitability, impact, and remediation cost, and to say clearly which issues are blocking deployment and which are residual risks you can accept.
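One way to picture the ranking and the three horizons is the sketch below. The 1-to-5 scales, the priority formula, and the horizon rule are all assumptions made for illustration; a real vendor will use its own scoring method and should state it in the methodology section.

```python
from dataclasses import dataclass


@dataclass
class RankedFinding:
    name: str
    exploitability: int      # 1 (hard to exploit) to 5 (trivial); assumed scale
    impact: int              # 1 (minor) to 5 (severe); assumed scale
    remediation_cost: int    # 1 (cheap) to 5 (expensive); assumed scale
    blocks_deployment: bool


def priority(f: RankedFinding) -> float:
    # Hypothetical score: higher risk and cheaper fixes rank first.
    return f.exploitability * f.impact / f.remediation_cost


def horizon(f: RankedFinding) -> str:
    # Hypothetical rule: blockers are immediate; the rest split on fix cost.
    if f.blocks_deployment:
        return "immediate"
    return "short-term" if f.remediation_cost <= 3 else "long-term"


def roadmap(findings: list[RankedFinding]) -> dict[str, list[str]]:
    """Bucket findings by horizon, highest priority first within each bucket."""
    buckets: dict[str, list[str]] = {"immediate": [], "short-term": [], "long-term": []}
    for f in sorted(findings, key=priority, reverse=True):
        buckets[horizon(f)].append(f.name)
    return buckets


findings = [
    RankedFinding("Redesign tool-call interface", 3, 5, 5, False),
    RankedFinding("Patch prompt template", 5, 4, 1, True),
    RankedFinding("Add output filter", 4, 3, 2, False),
]
print(roadmap(findings))
```

With these sample scores, the prompt template patch lands in the immediate bucket, the output filter in short-term, and the interface redesign in long-term.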
Why These Three Matter for Compliance
For organizations in scope of the EU AI Act as providers of high-risk systems, the same three outputs feed directly into the compliance artifacts:
- The findings report and methodology section map to Annex IV technical documentation, specifically the testing and validation record.
- The risk register maps to Article 9 risk management and to Article 15 evidence on accuracy, robustness, and cybersecurity, including the Article 15(5) adversarial resilience requirement.
- The remediation roadmap and any retest evidence map to Article 72 post-market monitoring.
A report that cannot be handed to a Notified Body or internal audit team without translation is a report written for the wrong audience.
Timelines: What Drives Three Weeks Versus Twelve
Practitioners commonly report engagement windows of one to two weeks for narrowly scoped tests, three to six weeks for a typical enterprise LLM application, and eight to twelve weeks for complex agentic systems with multiple tool integrations and production data access. These are practitioner-reported ranges rather than formal benchmarks. Independent timing studies do not yet exist, and buyers should expect vendors to calibrate against the specifics of the system under test.
Four variables move the timeline:
- Scope. Model only, full application stack, or multi-system pipeline.
- Access level. Black, gray, or white box (covered below).
- Number of integrations and tools. Each external system, API, or tool-call surface extends the attack surface and the test plan.
- Realism of the threat model. A production-fidelity engagement against a realistic adversary takes longer than a narrow check against a short list of known issues.
The timeline you can compress is the pre-engagement window. Procurement leads who bring a signed statement of work, a ready-to-use sandbox, named approvers for each access request, and a pre-scheduled readout calendar cut weeks from the cycle. The ones who treat scoping as something the vendor will “figure out” after kickoff extend it.
Access Models: Black, Gray, or White Box
How much the red team can see determines what they can find. The conventional classification runs across three tiers.
| Access Model | What the Red Team Sees | Strengths | Limits |
|---|---|---|---|
| Black box | API or UI only. No architectural knowledge. | Realistic external-attacker simulation. Tests what an outside adversary would reach. | Shallow. Misses issues that require understanding of prompts, data flows, or tool graphs. |
| Gray box | Scoped access: architecture diagrams, system prompts, selected logs, limited data-flow documentation. | Balanced depth and realism. Most enterprise engagements land here. | Requires clear scoping of what is and isn’t shared. |
| White box | Full access: architecture, prompts, data flows, training data, sometimes weights. | Deepest coverage. Surfaces issues that black-box testing never reaches. | Heavier legal and data-handling controls. Less representative of an external attacker. |
Most mature engagements use gray box with explicit access scoping. Black box alone produces engagements that finish on time but leave the important issues undiscovered. White box without strong data-handling controls introduces its own risk. The right choice depends on what you are trying to demonstrate: a realistic attacker simulation, a depth-first AI vulnerability assessment, or an independent compliance artifact.
Point-in-Time Versus Continuous Assessment
A point-in-time external engagement produces a third-party report with a date on it. It suits pre-launch validation, audit responses, conformity assessment, and board-level assurance. It’s the right shape of evidence when what you need is an independent opinion at a specific moment.
Continuous assessment, whether platform-based or structured as recurring scoped engagements, produces sustained evidence of how a system behaves across its lifecycle. It is increasingly expected under EU AI Act Article 72 post-market monitoring, and it reflects how agentic and frequently updated AI systems actually change. A system that learns, or that is redeployed every two weeks, cannot credibly rely on a report written six months ago.
A hybrid model is emerging as the dominant shape of mature AI security programs: continuous automated testing for breadth, targeted manual engagements for depth and high-risk use cases, and internal ownership of triage, remediation, and governance. This is an emerging pattern rather than an empirical finding, but the logic is sound given Article 72 expectations and the reality of agentic system drift.
Two triggers often push programs from point-in-time to continuous assessment:
- Substantial modifications. Under Article 43(4) of the EU AI Act, a substantial modification to a high-risk system requires a new conformity assessment. Behavior drift in systems that continue learning post-deployment can qualify. Continuous testing surfaces this before the drift becomes a regulatory event.
- Multiple production systems or frequent release cycles. Point-in-time testing scales poorly across a portfolio. If you have more than a handful of AI systems in production, or if you ship changes weekly, the evidence gap between engagements widens quickly.
What the Output Has to Map To
For any high-risk AI system in scope of the EU AI Act, the engagement’s output must connect to specific regulatory artifacts. The table below maps the three core deliverables to the articles and annexes your compliance team will cite.
| Engagement Output | EU AI Act Reference | Purpose |
|---|---|---|
| Methodology and scope section | Annex IV, points 2 and 3 | Demonstrates the testing approach, tools, and coverage the provider relied on |
| Per-finding entries with severity and proof of concept | Article 15 (accuracy, robustness, cybersecurity); Article 15(5) (adversarial resilience) | Evidence that the system has been tested against attempts to alter use, outputs, or performance |
| Risk register mapped to OWASP LLM Top 10 and MITRE ATLAS | Article 9 (risk management) | Structured record of identified risks and their classification |
| Remediation roadmap and retest evidence | Article 72 (post-market monitoring); Annex IV point 9 | Demonstrates the post-market monitoring plan and ongoing risk management |
Parallel frameworks apply in the same documentation set. ETSI EN 304 223 (2025-12) sets European baseline cybersecurity requirements for AI systems across the lifecycle, addressing data poisoning, model manipulation, and indirect prompt injection. The NIST AI Risk Management Framework and its Generative AI Profile (NIST AI 600-1) cover adversarial evaluation under the MEASURE function. A well-structured engagement produces evidence that maps to each of these in parallel rather than requiring three separate testing cycles.
One note on penalties that matters for compliance planning: under Article 99, non-compliance with high-risk system obligations falls in the tier of up to EUR 15 million or 3% of worldwide annual turnover, whichever is higher. The EUR 35 million or 7% tier applies only to Article 5 prohibited practices. Secondary coverage frequently cites the 7% figure for high-risk obligations, which is an error.
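The "whichever is higher" rule is easy to misread, so a short worked calculation may help. The sketch below applies only the two tiers as stated above; the function name and the sample turnover figures are made up for illustration, and it ignores any special rules for particular categories of organization.

```python
# Tiers as described in the text: (fixed amount in EUR, percent of turnover).
HIGH_RISK_TIER = (15_000_000, 3)
PROHIBITED_PRACTICE_TIER = (35_000_000, 7)


def fine_ceiling(turnover_eur: int, tier: tuple[int, int]) -> int:
    """Maximum fine: the higher of the fixed amount and the turnover share."""
    fixed_amount, percent = tier
    # Integer arithmetic avoids floating-point rounding on large amounts.
    return max(fixed_amount, turnover_eur * percent // 100)


# Hypothetical companies with EUR 200 million and EUR 1 billion turnover.
print(fine_ceiling(200_000_000, HIGH_RISK_TIER))              # 15000000
print(fine_ceiling(1_000_000_000, HIGH_RISK_TIER))            # 30000000
print(fine_ceiling(1_000_000_000, PROHIBITED_PRACTICE_TIER))  # 70000000
```

For the smaller company the fixed EUR 15 million amount sets the ceiling, because 3% of its turnover is only EUR 6 million. For the larger one the percentage dominates, and quoting the 7% tier instead would more than double the exposure figure.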
What to Require in a Managed AI Red Teaming Service SOW
For an AI vulnerability assessment you can hand to a Notified Body, an internal auditor, or a board, the following should appear in the statement of work as contractual deliverables, not aspirations.
- A four-phase engagement structure with named phases, owners, and gating criteria: scoping, threat modeling, active testing, and reporting.
- Written rules of engagement, including access level, testing environment, logging commitments, data-handling controls, and a defined disclosure path for critical findings discovered mid-engagement.
- A findings report with executive summary, methodology, and per-finding entries covering attack vector, proof of concept, severity, impact, and remediation.
- A risk register mapped to OWASP LLM Top 10 (2025) and MITRE ATLAS.
- A prioritized remediation roadmap across immediate, short-term, and long-term horizons.
- Explicit mapping to EU AI Act articles and annexes where the system is in scope of high-risk obligations, and to ETSI EN 304 223 and NIST AI RMF where they apply.
- A retest commitment for critical findings, so the final artifact includes evidence that issues were resolved, not just identified.
- A named engagement lead with an AI security background, not a rotating cast of generalists.
Vendors who commit to this structure can deliver a scoped AI model security assessment in weeks rather than months.
The Shape of a Substantive Proposal
AI red teaming services are not a single product. They are a set of phased commitments, delivered against a scope the buyer has defined. The difference between a credible external assessment and an expensive scan comes down to four questions a CTO, CISO, procurement lead, or compliance lead should ask before signing:
- Does the proposal structure the engagement into four distinct phases, with named deliverables at each gate?
- Does it commit in writing to the three core outputs: findings report, risk register, remediation roadmap?
- Does the reporting map explicitly to OWASP LLM Top 10, MITRE ATLAS, and the EU AI Act articles that apply to you?
- Are the access model, timeline, and retest commitment specified in the SOW, or deferred to “after kickoff”?
If the answer to all four is yes, the engagement will produce the evidence you need. If not, the gap between what you paid for and what you can hand to an auditor will surface later, on a tighter clock.
Provion delivers managed AI red teaming services structured around this shape: four-phase engagements, three core deliverables, EU AI Act-mapped findings, and a retest path for critical issues. For CTOs, CISOs, procurement leads, and compliance leads preparing for the August 2026 deadline, the next step is a scoping call to define what your engagement should include.