Is Your AI System Ready for Production Review?

Production readiness is an evidence state. A working demo does not make a team ready. Readiness comes from being able to show what the system is meant to do, how it can fail, who owns each risk, what was tested, which controls exist, what evidence supports those controls, and what happens when the system is misused or modified.

The moment a named AI system moves from pilot toward production, the burden shifts. “It works” is no longer enough; now you have to prove it works and fails safely. Reviewers want scope, risks, tests, controls, owners, and a plan for change, all as evidence rather than assurances.

Whether the trigger is a production launch, a customer or security review, an EU AI Act readiness exercise, or a tool-connected workflow heading for wider use, the question is the same. Can you show, with evidence, that this system is ready?

Production Review Is an Evidence Gate, Not a Launch Meeting

A production review is a gate. Confident answers do not carry the day. Each readiness claim has to be defended with an artefact.

If the review asks only “does it work?”, the process is too shallow. A serious review asks what the system is for, how it fails, how it can be misused, who monitors it, what logs exist, who can intervene, which tests were run, what failed, what was fixed, and what residual risk remains. Each of those questions points to a document, a test record, or a named owner, not to an opinion.

Production risk usually shows up outside the happy path. Functional success tells you the system can produce the intended output under expected conditions. It says little about behaviour under adversarial input, misuse, model drift, tool misuse, data exposure, or operational change. The EU AI Act reflects the same logic for high-risk systems: Article 9 describes the risk management system as a continuous lifecycle process that must consider reasonably foreseeable misuse, post-market monitoring inputs, and testing before the system is placed on the market or put into service (AI Act Service Desk, Article 9).

Treat evidence as the unit of readiness. Every readiness claim should map to a named artefact: a system description, an intended-purpose memo, a risk register, a threat model, a misuse analysis, a test log, a findings report, a mitigation record, a human oversight design, a logging record, a monitoring plan, or an ownership map. If a claim has no artefact behind it, it’s a statement of confidence, not evidence.
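One way to make this rule checkable is to keep readiness claims in a small register and flag any claim with no artefact attached. The Python sketch below is a minimal illustration, not a prescribed format: the `ReadinessClaim` type, the claim wording, and the file paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReadinessClaim:
    """A single readiness claim and the artefacts offered as evidence for it."""
    statement: str
    artefacts: list[str] = field(default_factory=list)  # hypothetical document references


def unsupported_claims(claims: list[ReadinessClaim]) -> list[str]:
    """Return the statements that have no artefact behind them."""
    return [c.statement for c in claims if not c.artefacts]


# Illustrative register; the claims and paths are made up for this example.
claims = [
    ReadinessClaim("Intended purpose is documented", ["docs/intended-purpose-memo.md"]),
    ReadinessClaim("Prompt injection was tested", ["tests/logs/injection-run.json"]),
    ReadinessClaim("Humans can interrupt the system"),  # confidence only, no artefact
]

for statement in unsupported_claims(claims):
    print(f"No evidence: {statement}")
```

A check like this only confirms that an artefact is referenced. Whether the artefact actually supports the claim still needs a reviewer's judgement.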

The Six Evidence Areas a Production Review Should Cover

Walk through the six areas below in order. Together they give a CTO, CISO, AI lead, or governance lead enough structure to run the review without missing a layer.

  1. System Definition Evidence. Can you define the system boundary? This covers intended purpose, user groups, deployment environment, model or provider dependencies, retrieval sources, tools, application programming interfaces (APIs), data flows, permissions, version history, and the release path. Without this, no one in the room agrees on what’s being reviewed.
  2. Risk and Misuse Evidence. Can you show how the system can fail and be abused? This covers foreseeable misuse, user harm, business impact, adversarial scenarios, failure modes, and the assumptions you make about operators and roles.
  3. Testing Evidence. Can you show what has actually been tested? This covers the threat model, the test plan, the access model, adversarial testing, evaluation metrics, test logs, reproducible proof of concept, findings, and known limits. “We ran evals” isn’t testing evidence.
  4. Control Evidence. Can you show what stops or detects a failure? This covers guardrails, input and output handling, tool boundaries, least-privilege permissions, human oversight, logging, monitoring, incident response, fallback modes, rate limiting, and approval paths for high-impact actions.
  5. Ownership Evidence. Can you show who accepts and fixes each risk? This covers named owners for product risk, security risk, compliance evidence, model evaluation, data quality, incident response, human oversight, remediation, and residual-risk acceptance.
  6. Lifecycle Evidence. Can you show how readiness stays current? This covers the monitoring plan, post-market feedback, model update triggers, retest triggers, change control, and periodic evidence refresh.

A control that can’t be evidenced is only a claim. The areas where most internal programmes are thin are ownership and lifecycle: teams often have policies and dashboards, but no one can say who owns a harmful output, an unauthorised action, or a failed oversight path in production.

How This Maps to EU AI Act Readiness

For high-risk AI systems, these six areas map onto the Act’s requirements for documentation, risk management, logging, oversight, and monitoring. They help you produce the evidence. They don’t certify it, and nothing here should be read as a compliance claim.

The Act doesn’t ask whether a team feels confident. For high-risk systems, it asks for documented risk management, technical documentation, testing and validation records, logging, human oversight, accuracy and cybersecurity measures, deployer controls, and post-market monitoring. Each maps to an evidence area above:

  • Technical documentation and system definition. Article 11 requires technical documentation to be drawn up before a high-risk system is placed on the market or put into service, and kept up to date, with Annex IV as the minimum content set (Article 11). Annex IV names the intended purpose, system interactions, versions, APIs and deployment forms, development process, third-party tools or pre-trained systems, architecture, data requirements, human oversight assessment, validation and testing procedures, accuracy and robustness metrics, test logs, and test reports (Annex IV).
  • Testing, accuracy, and cybersecurity. Article 15 requires appropriate accuracy, robustness, and cybersecurity, and resilience against attempts by unauthorised third parties to alter the system’s use, outputs, or performance by exploiting vulnerabilities. It names data poisoning, model poisoning, adversarial examples, confidentiality attacks, and model flaws (Article 15).
  • Control, logging, and human oversight. Article 12 requires high-risk systems to technically allow automatic logging over their lifetime, relevant to risk identification, post-market monitoring, and operation monitoring (Article 12). Article 14 requires human oversight measures commensurate with the risks, including the ability to monitor operation, interpret outputs, override outputs, and interrupt the system where appropriate (Article 14).
  • Ownership and deployer duties. Article 26 places obligations on deployers around use according to instructions, competent human oversight, input data where under their control, monitoring, and log retention (Article 26).
  • Lifecycle and monitoring. Article 72 requires providers to establish documented post-market monitoring proportionate to the technology and risk, with the plan forming part of the technical documentation (Article 72).

Adversarial testing is now a recognised regulatory mechanism for AI risk evidence, though the obligations differ by system type. Article 55 explicitly requires documented adversarial testing for general-purpose AI models with systemic risk (Article 55). This is a different obligation from the high-risk system regime, and it shouldn’t be read as applying to every high-risk system. It does show that regulators treat documented adversarial testing as credible risk evidence.

Timing is a separate question, and a moving one. The Digital Omnibus political agreement announced on 7 May 2026 would move high-risk rules to 2 December 2027 for stand-alone systems and 2 August 2028 for high-risk systems embedded in products, according to the European Commission and the Council of the EU. As of 12 June 2026, the Council still describes this as a provisional agreement requiring endorsement, legal-linguistic revision, and formal adoption before it becomes binding law, so treat the dates as a planning input, not a settled deadline. Confirm the current legal status with counsel before you rely on it. For a fuller treatment, see our guide to what the EU AI Act requires from high-risk AI systems.

Non-EU frameworks ask for the same things. The voluntary NIST AI Risk Management Framework (AI RMF) 1.0 helps organisations build trustworthiness into AI design, development, use, and evaluation, and its Generative AI Profile, NIST AI 600-1, released on 26 July 2024, sets out suggested actions for regular adversarial testing, red teaming against prompt injection and adversarial examples, threat profiling, and testing to identify misuse scenarios (NIST AI RMF; NIST AI 600-1). A framework like ISO/IEC 42001:2023, an AI management system standard, can structure organisational governance, but it doesn’t replace system-specific testing, evidence, and risk ownership (ISO/IEC 42001).

AI Security Review Must Cover the System, Not Only the Model

Production reviews must test the application around the model. A review that tests model behaviour in isolation will miss most of what breaks systems in production.

The application layer is where prompts, retrieval sources, tools, permissions, APIs, logs, fallback behaviour, human approval paths, and deployment controls live. The OWASP Top 10 for Large Language Model (LLM) Applications 2025, a community-driven security resource, names risks that sit largely at this layer: prompt injection, sensitive information disclosure, supply chain, data and model poisoning, improper output handling, excessive agency, system prompt leakage, vector and embedding weaknesses, misinformation, and unbounded consumption (OWASP GenAI Security Project).

Tool-connected systems create excessive agency risk when they can call too many tools, use broad permissions, or act without approval.

OWASP describes it as risk created when an LLM-based system can take damaging actions because of excessive functionality, excessive permissions, or excessive autonomy. Its mitigation guidance is concrete: limit the tools available, limit each tool’s functionality, minimise permissions, execute in the user’s context, require user approval for high-impact actions, and add complete mediation, logging, monitoring, and rate limiting (OWASP LLM06:2025 Excessive Agency). These read directly as production review questions. Which tools can the system call? With what permissions? What requires human approval? What’s logged?
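Several of these mitigations can be expressed as a single gate in front of every tool call. The sketch below assumes a hypothetical `ToolPolicy` type, made-up tool names and scope strings, and a simple in-process allowlist. It illustrates an allowlist, least-privilege checks in the user's context, an approval requirement for high-impact actions, and logging of each decision. It does not cover rate limiting.

```python
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool_gate")


@dataclass(frozen=True)
class ToolPolicy:
    """Hypothetical per-tool policy: required scopes and whether a human must approve."""
    required_scopes: frozenset[str]
    needs_approval: bool = False


# Illustrative allowlist: tools not listed here cannot be called at all.
POLICIES = {
    "search_docs": ToolPolicy(frozenset({"docs:read"})),
    "issue_refund": ToolPolicy(frozenset({"payments:write"}), needs_approval=True),
}


def authorise(tool: str, user_scopes: set[str], approved: bool = False) -> bool:
    """Decide whether a tool call may run in the user's context, and log the decision."""
    policy = POLICIES.get(tool)
    if policy is None:
        decision, reason = False, "tool not on allowlist"
    elif not policy.required_scopes <= user_scopes:
        decision, reason = False, "user lacks required scopes"
    elif policy.needs_approval and not approved:
        decision, reason = False, "human approval required"
    else:
        decision, reason = True, "allowed"
    # Every decision is logged, including denials, so the record can support review.
    log.info("tool=%s decision=%s reason=%s", tool, decision, reason)
    return decision
```

With this shape, `authorise("issue_refund", {"payments:write"})` is denied until a human approval is recorded, and the logged decisions give a reviewer something concrete to inspect for each of the four questions above.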

European cybersecurity guidance now points at the same application-layer risks. ETSI announced EN 304 223 on 14 January 2026 as a European standard setting baseline cybersecurity requirements across secure design, development, deployment, maintenance, and end of life, naming AI-specific risks including data poisoning, model obfuscation, and indirect prompt injection (ETSI). Treat it as a baseline reference rather than a harmonised AI Act standard unless legal review confirms that status.

The Security Gaps That Usually Break Production Readiness

Most failed reviews break on the same patterns. If you recognise several of these in your own system, you have your testing agenda.

  • The system boundary is unclear, so the review can’t agree on what’s in scope.
  • The intended purpose isn’t written tightly enough to judge misuse against.
  • The team tested the model, not the application around it.
  • Retrieval-augmented generation (RAG) sources and third-party tools were never threat-modelled.
  • Tool permissions exceed what the workflow actually needs.
  • High-impact actions have no approval gate.
  • Logs exist, but they can’t support an investigation or live monitoring.
  • Findings have no severity, no owner, and no retest evidence.
  • Human oversight is named in a policy but isn’t operationally usable.
  • Model, prompt, data, or tool changes don’t trigger any reassessment.

Teams create these gaps when they treat governance documents and functional QA as security evidence. Governance documentation shows policy intent. Functional QA shows normal-path performance. Neither necessarily shows adversarial failure modes, misuse scenarios, system-level attack paths, evidence quality, or who owns the residual risk.

What a Review-Ready Evidence Pack Looks Like

A review-ready system can put a specific artefact behind each evidence area, and tie each one to the review question it answers. The table below is a working template you can take into a planning meeting.

Evidence Area | Artefacts | Review Question | Assessment Output
System definition | Scope memo, architecture, data flow, tool map | Do we know what is being reviewed? | Scope and threat model
Failure modes | Misuse scenarios, risk register, attack paths | Do we know how it can fail? | Adversarial test plan and findings
Testing | Test logs, metrics, proof of concept, limits | What has actually been tested? | Findings report with severity and evidence
Controls | Guardrails, permissions, monitoring, human review | What stops or detects failure? | Control gaps and remediation guidance
Ownership | Responsibility map, escalation path, residual-risk record | Who accepts and fixes risk? | Prioritised remediation roadmap
Lifecycle | Monitoring plan, change triggers, retest evidence | How does readiness stay current? | Retest and evidence refresh path

Use the table to make readiness claims falsifiable. If a row is empty, that’s a finding. The most useful evidence connects technical findings to ownership, remediation, and residual-risk decisions, so that security, engineering, governance, and compliance teams can all use the same pack.

Testing evidence should be specific. Useful records include test objectives, the threat model, test cases, the access model, attacker assumptions, manual testing notes, automated test runs, pass and fail criteria, a severity methodology, known limits, reproducible proof of concept, affected components, logs, remediation recommendations, and retest results. Open-source tools illustrate where this practice is heading. Promptfoo’s red-team documentation frames LLM red teaming as probing a system with adversarial inputs before deployment and separating model-layer threats from application-layer ones such as RAG leakage, tool-based vulnerabilities, privilege escalation, and data exfiltration (Promptfoo). Garak is an open-source LLM vulnerability scanner that probes for hallucination, data leakage, prompt injection, misinformation, toxicity, and jailbreaks (Garak). PyRIT is Microsoft’s open-source Python Risk Identification Tool for generative AI (PyRIT). These show current testing practice; they aren’t a substitute for a structured test plan and human judgement.
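To show what "specific" can mean in practice, a finding can be stored as a structured record and checked for completeness before it goes into a report. The sketch below is an assumption-laden illustration: the `Finding` fields, the severity label, and the file path are hypothetical, and a real record would carry more of the items listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    """Hypothetical finding record; field names are illustrative."""
    title: str
    affected_component: str
    severity: Optional[str] = None          # per the team's own severity methodology
    owner: Optional[str] = None             # named person or role accountable for the fix
    proof_of_concept: Optional[str] = None  # reference to reproducible steps
    retest_result: Optional[str] = None     # outcome recorded after remediation


def missing_fields(finding: Finding) -> list[str]:
    """List the evidence fields a reviewer would expect but that are still empty."""
    required = ("severity", "owner", "proof_of_concept", "retest_result")
    return [name for name in required if not getattr(finding, name)]


finding = Finding(
    title="Indirect prompt injection via retrieved document",
    affected_component="retrieval pipeline",
    severity="high",
    proof_of_concept="findings/poc-001.md",
)
print(missing_fields(finding))  # → ['owner', 'retest_result']
```

A finding that still reports missing fields is not yet usable as evidence: it has no accountable owner or no proof that the fix was verified.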

When Internal Testing Is Not Enough

Internal teams know the system better than any outsider will. That knowledge is useful, and it’s also where the blind spots come from. Familiarity with the design narrows the set of failures a team thinks to test, and internal incentives rarely reward finding reasons to delay a launch.

This is the practical answer to a fair objection: “we already have governance documentation, model evaluations, and internal QA, so why bring in an independent assessment?” Governance documentation may show policy intent. Functional QA may show normal-path performance. An external adversarial test adds a different perspective, structured evidence, and a report that can move across security, engineering, governance, procurement, and leadership without being rewritten for each audience.

Production review is also a lifecycle gate, not a single meeting. A model update, a new retrieval source, a new tool permission, a new user population, or a changed intended purpose can invalidate old evidence. Article 72’s post-market monitoring logic and NIST AI 600-1’s guidance on regular adversarial testing both point the same way: readiness needs a refresh after material change. Build retesting and evidence refresh into the plan before launch. For more on what external testing covers, see what an external AI red teaming assessment actually includes and our primer on AI red teaming and adversarial testing.
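The retest triggers described above can be approximated by fingerprinting the parts of the system that the evidence depends on, then comparing the deployed configuration against the one that was tested. This is a minimal sketch under stated assumptions: the configuration keys and values are hypothetical, and deciding which changes count as material remains a human decision.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Stable hash of the parts of the system that evidence depends on."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_keys(tested: dict, current: dict) -> list[str]:
    """Return the configuration keys whose values differ from the tested baseline."""
    all_keys = tested.keys() | current.keys()
    return sorted(k for k in all_keys if tested.get(k) != current.get(k))


# Hypothetical snapshots: what was tested versus what is deployed now.
tested = {
    "model": "provider-model-v1",
    "retrieval_sources": ["kb-internal"],
    "tools": ["search_docs"],
    "intended_purpose": "internal support assistant",
}
current = {**tested, "tools": ["search_docs", "issue_refund"]}

if fingerprint(current) != fingerprint(tested):
    print("Retest needed; changed:", changed_keys(tested, current))
```

Storing the tested fingerprint alongside each test report makes it straightforward to see which evidence predates a change to the model, retrieval sources, tools, or intended purpose.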

Readiness Is Something You Can Show

Readiness is the ability to show your work on demand: scope, failure modes, controls, evidence, ownership, and monitoring. A team that can’t do this doesn’t have a system ready for serious review, however good the demo looks.

Use the six evidence areas as a meeting agenda. Walk each row of the evidence pack and ask whether you can produce the artefact. The empty rows are your work list, and they’re also where a customer, a board, a procurement team, or a regulator will press hardest.

If your team is preparing a production review, a customer security review, or an EU AI Act readiness exercise for a specific AI system, Provion can scope an independent AI System Robustness Assessment and produce structured findings, evidence, and remediation guidance your security, engineering, and governance teams can use. Book a scoping call for your AI system or workflow, or ask to review a sample assessment report first to see the findings format, severity logic, and evidence structure before you commit.
