Is AI red teaming the same as penetration testing?

No. Penetration testing targets the infrastructure around a system: network, identity, and application surfaces. AI red teaming targets the model itself and the system's behaviour under adversarial input: jailbreaks, prompt injection, data exfiltration through model outputs, and model-output incidents that cause real-world harm. The two disciplines share a vocabulary but require different skills. A pen-tester is not an AI red-teamer by default, and the procurement mistake of assuming the existing pen-test vendor can do AI red teaming produces evidence logs that auditors discount.

How often should an enterprise red-team a deployed AI model?

Quarterly for high-risk customer-facing systems, semi-annually for high-impact internal systems, annually for the rest. The cadence aligns with the EU AI Act post-market monitoring obligations for high-risk systems and matches the CISO deployment-gate criteria I wrote about in the CISO governance piece. The cadence question is less important than the evidence-trail question — a once-a-year exercise that produces tracked findings and verified remediation is more useful than a continuous tool whose findings go nowhere.

Should we build an internal red-team capability or hire an agency?

For high-risk systems, both. The internal team owns the continuous evaluation against the eval harness and the production traffic samples; the external agency runs the periodic deep adversarial exercise and produces the independent evidence the auditor will want to see. For non-high-risk systems, an internal team alone is usually sufficient. A platform-tooling-only approach without named human ownership is the configuration I see fail most often — tools generate findings, nobody triages, the evidence log shows volume but not closure.

What does AI red-team work integrate with on the security workflow side?

The SIEM stack receives runtime adversarial-event telemetry; the incident-response platform receives the model-output incidents that cross the harm threshold; the GRC system receives the periodic exercise findings as evidence artefacts. The integration that matters most is the bidirectional one between the red-team findings and the deployment gate — a finding that surfaces in a red-team exercise has to flow back into the gate criteria for the next deployment, or the gate stays static while the threat landscape moves.

Enterprise AI Red Teaming: The Operational Version

Tom Prommer · CIO/CTO · Updated 2026-05-29 · 12 min read

Executive summary

What an enterprise AI red-team programme actually requires in 2026: not penetration testing, not safety evals, but documented adversarial testing against deployed models with named owners, tracked remediation, and an evidence log that survives audit. Also covered: the procurement consideration most teams miss.

The red-team finding that prompted the engagement was three months old when I got the call. A model deployed in a customer-facing assistant had been observed leaking another customer’s session data through a prompt-injection path during an internal exercise. The exercise had produced a finding. The finding had been logged in a ticket. The ticket had been closed when the engineer who wrote the original prompt added a guardrail. Nobody had re-tested. Nobody had verified the fix worked against the original injection. Nobody had checked whether the new guardrail produced new bypasses. The auditor for an EU AI Act compliance review pulled the evidence log, found a closed ticket with no re-test artefact, and asked the question the team could not answer: how do you know the fix is effective? The red-team exercise had happened. The evidence had not survived the question.

That is the operational gap most enterprise AI red-team programmes have. The exercise gets run. The findings get logged. The remediation gets shipped. The evidence log shows activity. The artefact that proves the programme actually reduces risk — the chain from finding to remediation to re-test to closure — does not exist. Auditors notice. The August 2026 EU AI Act enforcement timeline notices. The CISOs who have been running this work for two years notice, because they were the ones who told me the chain matters more than the exercise.

This page is the operational version of AI red teaming for enterprises. The policy-level discussion of red-team requirements lives at the governance hub and the deployment-gate criteria are in the CISO governance piece. What follows is the procurement, integration, cadence, and evidence-log work that turns “we have a red-team capability” from an assertion into an artefact.

What red-team capability actually means in 2026

Three things AI red-team work is not.

It is not penetration testing. Pen-testing targets the infrastructure: network, identity, and application surfaces. AI red-team work targets the model’s behaviour under adversarial input. The two share vocabulary but require different skills. A traditional pen-test team can probably extend into AI red teaming with significant retraining; assuming they can already do so without that retraining is the procurement mistake that produces evidence logs auditors discount.

It is not safety evaluation. Safety evals run against benchmarks and known categories of harm; they are the floor, not the test. A model that passes the standard safety eval set can still produce customer-harming outputs under deliberately adversarial input, and the safety-eval pass tells you nothing about that. Red-team work is adversarial; it actively tries to break the system in ways the eval suite did not anticipate.

It is not running a guardrails platform. A runtime guardrails tool — Lakera, Guardrails AI, the hyperscaler equivalents — is a control, not a test. Controls and tests are different artefacts in a governance programme, and conflating them produces a programme that has controls but cannot evidence whether the controls work. The CISO programme owns both; they are not the same thing.

What red-team capability is, then. A documented programme of adversarial testing against deployed models, with named owners, defined cadence, classified findings, tracked remediation, and re-test verification. Each of those words is load-bearing. “Documented” because the EU AI Act Article 72 post-market monitoring obligations want written evidence. “Named owners” because anonymous findings get closed without remediation. “Defined cadence” because ad-hoc work that fires when somebody thinks of it does not survive an auditor. “Classified findings” because the four categories below (jailbreak, prompt injection, data exfiltration, model-output incident) require different triage flows. “Tracked remediation” because untracked fixes regress silently. “Re-test verification” because the chain from finding to evidence of closure is the artefact that distinguishes a programme from a pile of tickets.

If a red-team programme cannot produce all six attributes for a randomly selected finding, it is not a programme. It is a set of exercises filed in a folder.

The three buying motions

Red-team capability is bought in three motions, and conflating them is the same procurement mistake as conflating the three AI-SRE buying motions covered on the AI-SRE tooling page. The vendor sales motion will present them as one decision; operationally they require different commitments, integrations, and evidence flows.

Internal team build. Hire a small team — typically two to four people for a large enterprise, one for a mid-sized one — with backgrounds in security research, model behaviour, and the specific risk surface of your industry. The team owns the continuous evaluation work, the eval-harness integration, the regression testing on new model deployments, and the day-to-day triage of platform-tooling findings. Cost: a fully-loaded team at 2026 market rates runs €400k–€800k per year. Lead time to staff: six to nine months for the first hire, longer for the team. The internal build is the right choice for enterprises with multiple high-risk systems and a sustained risk-management posture. It is a trap for organisations whose AI portfolio is two or three production workloads. The team is over-staffed for the work, spends six months building a bespoke evaluation harness, realises the scope does not justify the investment, and leaves for a larger problem before the programme has produced a return.

Agency partnership. Engage a specialist red-team agency for periodic deep exercises against specific systems. The agency does not run continuously; it runs scheduled exercises, two to four times a year per system, producing a written report with classified findings, recommended remediations, and an executive summary. Cost: roughly €40k–€150k per exercise depending on scope and the agency’s tier. Lead time: eight to twelve weeks from engagement to first exercise. The agency motion is the right choice for organisations that need the independence of external testing for evidence purposes (and the EU AI Act conformity-assessment work for high-risk systems does benefit from external evidence) and for organisations whose AI portfolio does not justify a full internal team. Agencies worth talking to in 2026 include the AI-specialist arms of the established security firms — NCC Group, IOActive, Bishop Fox — and the newer AI-native specialists like HiddenLayer’s red-team services and the Trail of Bits AI work.

Platform tooling. Buy a platform that runs continuous automated adversarial testing against your deployed models and surfaces findings through a dashboard. The category includes the runtime guardrails tools at one end (Lakera Guard, Robust Intelligence — now Cisco AI Defense) and the dedicated red-team tooling at the other (Microsoft PyRIT as the open-source reference, the commercial offerings from HiddenLayer, Mindgard, and others). Cost: roughly €30k–€200k per year depending on coverage and scale. Lead time: weeks. The platform motion is the right choice as a supplement to either of the other two; it is the wrong choice as a standalone red-team capability, because platforms produce findings volume without producing the named-owner triage and re-test verification that auditors require. The most common failure mode in this category is buying the platform, watching findings accumulate, and discovering at audit that nobody owns the closure flow.

The combination that works for most large enterprises in 2026: a small internal team owning the day-to-day triage and the eval-harness integration, an agency engagement for the periodic deep exercise against high-risk systems, and a platform-tooling layer feeding both. Three motions, three artefacts, one evidence log. The integration between them is where the work lives.

Integration with the security-workflow stack

Three integration surfaces matter. Get these wrong and the red-team findings sit in a separate system from the rest of the security programme, which is the configuration most often associated with evidence-log gaps at audit.

SIEM integration. Runtime adversarial events — observed jailbreaks, observed prompt injections, observed unusual model outputs that match attack patterns — should flow into the SIEM the same way other security telemetry does. The integration is bidirectional in the useful case: SIEM correlation rules can flag adversarial activity across multiple model interactions that individual observations would miss, and the red-team programme can use SIEM data to identify candidate scenarios for the next exercise. Most enterprises in 2026 have not yet wired this integration; the model-monitoring tools speak one telemetry vocabulary and the SIEM speaks another, and the work to bridge them has not been prioritised. It should be.
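The bridging work described above is mostly a matter of agreeing on one event shape. What follows is a minimal sketch, not a reference integration: the `AdversarialEvent` fields, the `ai.adversarial` event type, and the threshold of three are all assumptions made for illustration, and a real SIEM will impose its own schema and its own correlation-rule language.

```python
import json
from collections import Counter
from dataclasses import dataclass, asdict


# Hypothetical event shape; the field names are illustrative, not a SIEM standard.
@dataclass
class AdversarialEvent:
    timestamp: str   # ISO 8601, UTC
    model_id: str    # identifier of the deployed model in the inventory
    session_id: str  # conversation or caller session
    category: str    # e.g. "jailbreak" or "prompt_injection"
    detector: str    # which guardrail or monitor raised the event
    blocked: bool    # whether the runtime control stopped the attempt


def to_siem_line(event: AdversarialEvent) -> str:
    """Serialise one event as a JSON line that a log shipper could forward."""
    return json.dumps({"event_type": "ai.adversarial", **asdict(event)}, sort_keys=True)


def sessions_over_threshold(events, threshold=3):
    """Return the session ids with at least `threshold` adversarial events.

    One observation can look like noise; repeated attempts within a session
    are the kind of pattern a correlation rule is meant to surface.
    """
    counts = Counter(e.session_id for e in events)
    return sorted(s for s, n in counts.items() if n >= threshold)
```

The session list is also a plausible input for scoping the next exercise, which is the second direction of the integration.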

Incident-response platform integration. Model-output incidents that cross the harm threshold — a hallucination causing customer impact, a data-exfiltration event, a policy-violating output that reaches a user — flow into the existing incident-response platform the same way other security incidents do. The AI-specific runbook the CISO owns (covered in the CISO piece) is the entry point. The integration matters because AI incidents bleed into traditional incident response in both directions: a security incident can have an AI dimension (a compromised account using AI tooling to escalate), and an AI incident can have a security dimension (a prompt-injection that exfiltrates data). Treating them as separate workflows produces blind spots in both.

GRC integration. Periodic exercise findings and the re-test evidence flow into the GRC system as evidence artefacts attached to the relevant model in the inventory. The inventory I wrote about in the CISO piece is the anchor; every model has an inventory record, every red-team exercise produces evidence linked to that record, every finding is tracked from open to remediation to verified-closed against that record. The GRC integration is what makes the evidence log survive audit because it is the artefact the auditor will actually look at — not the red-team tool’s dashboard, not the SIEM, but the GRC record that ties findings to remediation to re-test.
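The open, remediation, verified-closed chain can be enforced rather than merely documented. The sketch below assumes a simple three-state lifecycle and hypothetical field names (`remediation_ref`, `retest_ref`); an actual GRC product will model this in its own workflow engine. The one rule it illustrates is that a finding cannot be closed without both a remediation artefact and a re-test artefact.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed lifecycle; the status names are illustrative.
ALLOWED_TRANSITIONS = {
    "open": {"remediating"},
    "remediating": {"open", "verified_closed"},
    "verified_closed": {"open"},  # reopened if a later re-test regresses
}


@dataclass
class Finding:
    finding_id: str
    model_id: str                           # links the finding to the inventory record
    status: str = "open"
    remediation_ref: Optional[str] = None   # diff or config change that fixed it
    retest_ref: Optional[str] = None        # artefact proving the fix was re-tested


def transition(finding: Finding, new_status: str) -> Finding:
    """Move a finding to a new status, refusing closure without evidence."""
    if new_status not in ALLOWED_TRANSITIONS[finding.status]:
        raise ValueError(f"cannot move from {finding.status} to {new_status}")
    if new_status == "verified_closed" and not (finding.remediation_ref and finding.retest_ref):
        raise ValueError("closure requires both remediation and re-test evidence")
    finding.status = new_status
    return finding
```

Under this guard, the closed-ticket-without-re-test situation from the opening anecdote would have been rejected at the point of closure instead of being discovered at audit.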

The integration work is unglamorous and load-bearing. Most red-team capability that fails at audit fails here, not at the testing itself. The testing produces findings; the integration is what turns findings into the chain of evidence the regulator wants to see.

The cadence question

The cadence set out in the CISO governance piece holds operationally. Quarterly for high-risk customer-facing systems, semi-annually for high-impact internal systems, annually for the rest. Three structural points are worth adding here.

First, the cadence applies to the periodic deep exercise, not to the continuous evaluation. The continuous work (the eval-as-CI work I wrote about on the orchestration architecture page, the runtime guardrails monitoring, the SIEM-fed adversarial-event detection) runs all the time. The cadence question is specifically about the human-led adversarial exercise that produces an audit-grade evidence artefact.

Second, the cadence is a floor, not a ceiling. A material change in the system — a new model version, a new tool surface, a new use case — triggers an extraordinary exercise regardless of the calendar cadence. Most teams remember this for major model swaps and forget it for tool-surface changes, which is the change-type most likely to introduce new prompt-injection paths.

Third, the cadence has to be defended at audit. “We do this quarterly” is the answer the auditor wants; “we did one last year” is not. The chain from policy to executed exercise to evidence is the artefact the regulator can verify. Teams that skip the exercise in a quarter under budget pressure pay for the gap two years later when the audit cycle reaches the missing quarter.
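The floor-not-ceiling rule is easy to state and easy to forget, so it helps to see it as a calculation. This is a rough sketch: the tier names and the day counts (91, 182, 365 as approximations of quarterly, semi-annual, and annual) are assumptions, and "due on the date of the change" is one possible reading of an extraordinary exercise.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed mapping from risk tier to the calendar floor, in days.
CADENCE_DAYS = {
    "high_risk_customer_facing": 91,   # roughly quarterly
    "high_impact_internal": 182,       # roughly semi-annual
    "other": 365,                      # annual
}


def next_exercise_due(tier: str, last_exercise: date,
                      material_change_on: Optional[date] = None) -> date:
    """Return the date by which the next deep exercise is due.

    The calendar cadence is a floor: a material change after the last
    exercise (new model version, new tool surface, new use case) pulls
    the due date forward to the date of the change.
    """
    scheduled = last_exercise + timedelta(days=CADENCE_DAYS[tier])
    if material_change_on is not None and material_change_on > last_exercise:
        return min(scheduled, material_change_on)
    return scheduled
```

Recording tool-surface changes as `material_change_on` events is the part most likely to be skipped in practice, and it is the part that matters most for prompt-injection coverage.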

The four categories of finding

Red-team findings classify cleanly into four categories, and the classification matters because the triage flows and the remediation owners are different.

Jailbreak. The model produces output that violates the system’s defined policy in response to crafted user input designed to bypass the system prompt or guardrails. Triage owner: the prompt-engineering or model-behaviour team. Remediation: typically a prompt change, a guardrail update, or a model-vendor escalation. Re-test: against the original jailbreak attempt plus a small set of variations to catch the next bypass.

Prompt injection. A third-party input (a document the user uploaded, a search result the model retrieved, a tool output) contains adversarial instructions that the model executes as if they were the user’s. Triage owner: the platform engineering team that owns the tool surface or the retrieval layer, with security review. Remediation: input sanitisation, output filtering at the tool boundary, or architectural changes to limit the model’s authority. Re-test: against the original injection plus variations and against the broader category of attacks the architectural change is supposed to mitigate. The prompt-injection category is the most operationally consequential in 2026 because of the rapid expansion of agentic patterns covered on the agentic patterns page: every new tool the agent can call is a new injection surface.

Data exfiltration. The model is induced to reveal information it should not — another user’s data, system prompts containing secrets, training data that should not be disclosed. Triage owner: the CISO and the DPO jointly, because the failure is both security and data-rights. Remediation: typically architectural (the model should not have had access to the leaked data) plus prompt and guardrail updates as defence-in-depth. Re-test: rigorous, with a specific focus on whether the architectural fix closed the underlying access or whether the prompt-and-guardrail layer is the only defence. The latter is fragile.

Model-output incident. The model produces output that causes real-world harm — a wrong medical answer, a biased decision, a customer-impact event. Triage owner: the CISO incident-response programme, with the business owner of the affected use case. Remediation: depends heavily on the incident. Re-test: against the specific scenario plus the broader class of harm.
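The four categories above amount to a routing table. A minimal sketch of that table follows; the owner and re-test strings paraphrase the text, while the enum values and the `route` helper are hypothetical names chosen for illustration.

```python
from enum import Enum


class Category(Enum):
    JAILBREAK = "jailbreak"
    PROMPT_INJECTION = "prompt_injection"
    DATA_EXFILTRATION = "data_exfiltration"
    MODEL_OUTPUT_INCIDENT = "model_output_incident"


# Triage owner and re-test scope per category, paraphrased from the text.
TRIAGE = {
    Category.JAILBREAK: {
        "owner": "prompt-engineering or model-behaviour team",
        "retest": "original attempt plus a small set of variations",
    },
    Category.PROMPT_INJECTION: {
        "owner": "platform engineering (tool surface or retrieval layer), with security review",
        "retest": "original injection, variations, and the broader attack category",
    },
    Category.DATA_EXFILTRATION: {
        "owner": "CISO and DPO jointly",
        "retest": "confirm the architectural fix closed the underlying access",
    },
    Category.MODEL_OUTPUT_INCIDENT: {
        "owner": "CISO incident-response programme with the business owner",
        "retest": "specific scenario plus the broader class of harm",
    },
}


def route(category_value: str) -> dict:
    """Look up the triage flow; an unknown category raises ValueError."""
    return TRIAGE[Category(category_value)]
```

Rejecting unknown categories is a deliberate choice in this sketch: a finding that cannot be classified has no triage owner, and an unowned finding is the one that gets closed without remediation.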

Findings get prioritised on a combination of likelihood (how reproducible is the exploit) and impact (what is the harm if exploited). Most teams under-prioritise prompt injection because the impact in a single test case looks small; the cumulative impact across an agentic system with twelve tool integrations is much larger and is the right basis for prioritisation.
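One way to make the cumulative-impact argument concrete is to score exposure across all the surfaces where the same weakness applies. The sketch below is a modelling assumption, not a method from the engagements described here: it treats each tool integration as an independent chance of exploitation, which real systems will only approximate, and the numbers in the usage note are invented.

```python
def exposure(likelihood_per_surface: float, surfaces: int) -> float:
    """Probability that at least one of `surfaces` surfaces is exploited.

    Assumes each surface is exploited independently with the same
    probability, which is a simplification.
    """
    if not 0.0 <= likelihood_per_surface <= 1.0:
        raise ValueError("likelihood must be between 0 and 1")
    if surfaces < 1:
        raise ValueError("at least one surface is required")
    return 1.0 - (1.0 - likelihood_per_surface) ** surfaces


def priority(likelihood_per_surface: float, impact: float, surfaces: int = 1) -> float:
    """Score a finding as impact weighted by cumulative exposure."""
    return impact * exposure(likelihood_per_surface, surfaces)
```

With an assumed 10% per-surface likelihood, a single test case gives an exposure of 0.10, while the same weakness across twelve tool integrations gives roughly 0.72. That difference of about seven times is why a finding that looks minor in isolation can deserve a place near the top of the queue.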

The evidence log is the load-bearing artefact

Most teams treat the red-team exercise as the load-bearing artefact and the evidence log as bookkeeping. The relationship is the other way around. The exercise is the activity; the log is the artefact that survives the question “how do you know the fix worked.” Under audit, under regulator inquiry, under board oversight, under post-incident review, the log is what gets shown. The exercise is invisible.

The minimum schema for a defensible evidence log: model identifier, exercise date, exercise owner, methodology used, scope statement, findings list with each finding’s category, severity, reproduction steps, status (open / remediating / closed), remediation evidence (the diff or configuration change that closed the finding), re-test date, re-test outcome, residual risk statement, and — critically — the formal business sign-off by the system owner accepting that residual risk. Every field is load-bearing. The fields most often missing in practice are the re-test outcome and the residual risk statement, which is precisely why the log fails at audit.

The log should live in the GRC system as the primary record, with copies of methodology and detailed reproduction steps in a secured artefact store (the reproduction steps are themselves sensitive; a leaked log is itself an attack vector). The GRC record is the auditor-facing artefact. The artefact store is the operational record. Both are required.
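The schema and the storage split described above can be sketched as a single record type with a completeness check. The field names are assumptions that mirror the list in the text, flattened to one record per finding for brevity; `reproduction_ref` is a hypothetical pointer into the secured artefact store, so that the sensitive steps never sit in the auditor-facing record.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvidenceRecord:
    model_id: str
    exercise_date: str
    exercise_owner: str
    methodology: str
    scope: str
    category: str
    severity: str
    reproduction_ref: str                 # pointer to the artefact store, not the steps
    status: str                           # "open", "remediating" or "closed"
    remediation_evidence: Optional[str] = None
    retest_date: Optional[str] = None
    retest_outcome: Optional[str] = None
    residual_risk: Optional[str] = None
    signoff_by: Optional[str] = None      # system owner accepting the residual risk


# Fields that only become mandatory once a finding is marked closed.
CLOSURE_FIELDS = ("remediation_evidence", "retest_date", "retest_outcome",
                  "residual_risk", "signoff_by")


def audit_gaps(record: EvidenceRecord) -> list:
    """Return the names of empty fields an auditor would ask about."""
    gaps = [f.name for f in fields(record)
            if f.name not in CLOSURE_FIELDS and not getattr(record, f.name)]
    if record.status == "closed":
        gaps += [name for name in CLOSURE_FIELDS if not getattr(record, name)]
    return gaps
```

Running a check like `audit_gaps` across every closed record before the auditor does is a cheap way to find the missing re-test outcomes and residual risk statements while there is still time to fix them.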

The procurement consideration most teams miss, framed earlier and restated here: red-team work generates evidence that has to live somewhere defensible. The evidence log is the load-bearing artefact, not the testing exercise. A team that buys the most expensive red-team platform and does not invest in the evidence log produces volume without verifiability. A team that runs one annual exercise and maintains the log rigorously produces less volume and more verifiability, and the auditor cares about the second much more than the first.

What I would commission in 2026

For an enterprise with one or more high-risk customer-facing AI systems: a small internal team (two to three people), an annual deep exercise from a specialist agency against each high-risk system, a platform-tooling layer fed into the SIEM and the GRC, and an evidence log in the GRC as the primary record. Budget: roughly €600k–€1.2M per year all-in.

For an enterprise with high-impact internal AI systems but no customer-facing high-risk surface: one internal red-team owner working with the existing security team, a semi-annual or annual external exercise on the most important system, a lighter platform-tooling layer, and the evidence log discipline. Budget: roughly €200k–€450k per year.

For an enterprise just starting: an annual external exercise against the most important system, a baseline platform-tooling layer, and one named owner inside the existing security function. Budget: roughly €80k–€200k per year. Plan to expand within twelve months as the AI portfolio grows.

None of these recommendations come with referral fees, affiliate links, or sponsorships. The vendor names above are illustrative of the buying motions, not endorsements; the right choice in any specific procurement depends on which existing relationships the enterprise has and which AI systems are in scope.

The honest signal of a working AI red-team programme is that the auditor reads the evidence log, asks two follow-up questions, and accepts the answers. The signal of a failing programme is that the auditor asks one question and the team cannot trace the chain from finding to remediation to verified closure. Build to withstand that one question, because it is the one the regulator will probe.

Sources

EU AI Act, Regulation (EU) 2024/1689 — Article 72 post-market monitoring; Annex IV technical documentation obligations
NIST AI Risk Management Framework — Generative AI Profile (NIST AI 600-1) — adversarial testing and red-team guidance for generative AI
Microsoft — PyRIT (Python Risk Identification Toolkit) — open-source reference implementation for AI red-team tooling
OWASP — Top 10 for LLM Applications — canonical taxonomy for prompt injection and related categories
Related: capabilities hub, governance hub, CISO governance responsibilities, agentic patterns, AI-SRE tooling

Methodology: programme design and procurement guidance drawn from fractional CTO and CISO advisory engagements (2024–2026), cross-checked against published regulator guidance and the working evidence-log practices of the small number of enterprises that have already passed an EU AI Act preparedness review.

Thomas Prommer CIO / CTO · 20 years · Practitioner, not consultant

Tom Prommer writes The AI Strategy Guide from the operator's seat: every tool covered, tested with real money before forming a view.