The Cost of Failed AI Projects: A Failure-Mode Catalogue From Forty Engagements
A platform leader at a mid-sized European bank called me in March of last year to ask for a second opinion on a generative AI rollout that had already cost €3.1M and was, by his own count, six months behind schedule. He wanted me to tell him whether to cut it or extend it. I asked one question before reading any of the architecture documents he had sent. What were you measuring in October that you are not measuring now? He thought about it for a long time and said, quietly, that they had stopped tracking the resolution-quality score in January because the dashboard had broken and nobody fixed it. They had been running on adoption numbers for four months. The adoption numbers were healthy. The resolution-quality numbers, when we reconstructed them from logs, had degraded by roughly 30% across the same period. The project was not six months behind schedule. It was already a failure that nobody had named.
That is what this page is about. Not the failure rate in the abstract — the surveys cover that adequately — but the failure modes underneath the rate, the per-cluster variance the published numbers hide, and the four diagnostic questions I work through before deciding whether a programme can be saved or has to be unwound. The honest cost data is harder to publish because the firms with the data are usually selling the remediation engagement that follows. The cost data below is drawn from roughly forty fractional CTO and CIO engagements between 2023 and 2026, anonymised but cross-checked against the published surveys where they overlap.
What the published failure rates actually say
The headline numbers everybody quotes are real but misleading. The RAND Corporation’s 2024 study on AI project failure puts the rate at over 80%. Gartner’s 2024 “AI in the Enterprise” survey puts it at roughly 85% for AI/ML pilots that fail to reach production. McKinsey’s 2025 State of AI survey puts the rate lower, around 60%, but their definition counts partial or non-scalable results as wins. The 60-to-85% range is therefore correct as a directional band, and entirely useless for budgeting.
The reason it is useless is that the range averages across workload types that fail for different reasons at different rates. From my engagement data, broken down to roughly match the same definitions:
Customer-facing generative AI deployments fail at roughly 70%. Chatbots, customer-service copilots, content-generation tools that customers see directly. The dominant failure mode is the acceptance-bar mismatch described below (failure mode two), usually compounded by budgets that cover build cost but underfund evaluation, monitoring, and the operational tail that customer-facing systems require.
Internal productivity deployments fail at roughly 40%. Copilots, internal knowledge assistants, generative tools deployed inside the organisation. The dominant failure mode is the adoption-vs-utility conflation. People log in. Nobody measures whether the work got done faster.
AI-SRE and infrastructure deployments fail at roughly 30%. Alert triage, post-incident analysis, the observability-adjacent tooling covered in the AI-SRE page. The dominant failure mode here is integration debt — the AI layer cannot rescue an unobservable system, and the diligence on the observability stack underneath rarely happens in the buy-decision.
Predictive ML in regulated workflows fails at roughly 25%, but with much higher per-failure cost. Credit decisioning, fraud detection, clinical decision support. The rate is lower because the bar to ship is higher; the cost is higher because a shipped failure is a regulatory incident, not a software bug.
The variance across clusters is the load-bearing number for budgeting, not the headline average. Treating a 70% failure-rate cluster the same as a 30% failure-rate cluster — which the published headline numbers encourage — is how programmes end up over-investing in the safe end and under-investing in the contingency they need on the dangerous end.
The four failure modes I check first
Before I read the architecture, the cost line, or the team org chart, I check four failure modes in this order. Roughly 80% of failed engagements I have audited fail on the first or second; the remaining 20% are spread across the third and fourth.
One. The metric stopped measuring the thing. The October-versus-March problem from the opening anecdote. Programmes drift from utility metrics (work displaced, decision quality, time saved) to adoption metrics (logins, prompt count, monthly active users) because the adoption metrics are easier to dashboard and prettier on a board slide. The drift is silent. By the time the board notices, the team has spent three quarters optimising for the wrong number. This is Goodhart’s Law in operational form, and it is the single most common failure I see. The fix is uncomfortable because it requires reconstructing the right metric from logs, retroactively, and presenting a worse number to the board than the dashboard implied.
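The drift described here can be checked mechanically once both series have been reconstructed from logs. The following is a minimal sketch, assuming one adoption series and one utility series sampled monthly; the function names, the 10% tolerance, and the sample values are illustrative assumptions, not engagement data.

```python
def relative_change(series):
    """Relative change from the first to the last value in a series."""
    first, last = series[0], series[-1]
    return (last - first) / first


def metric_drift(adoption, utility, tolerance=0.10):
    """Flag the case where adoption holds or rises while utility falls.

    `tolerance` is an assumed threshold: a utility drop larger than this
    fraction, with adoption not falling, is treated as silent drift.
    """
    adoption_change = relative_change(adoption)
    utility_change = relative_change(utility)
    drifting = adoption_change >= 0 and utility_change < -tolerance
    return drifting, adoption_change, utility_change


# Hypothetical monthly values, October to March.
monthly_active_users = [1200, 1350, 1400, 1480, 1500, 1520]
resolution_quality = [0.80, 0.78, 0.74, 0.66, 0.60, 0.56]

drifting, adoption_change, utility_change = metric_drift(
    monthly_active_users, resolution_quality
)
print(f"adoption {adoption_change:+.0%}, utility {utility_change:+.0%}, drift={drifting}")
```

The check is deliberately crude. Its only job is to force the utility number back onto the same page as the adoption number, so that a healthy adoption curve cannot hide a falling utility curve.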
Two. The acceptance bar was set by someone not in the room. Customer-facing programmes that defined success internally without naming the customer’s tolerance for wrong answers. Internal programmes that defined success by IT adoption goals without naming the line manager’s tolerance for trusting AI output. The failure mode is the same in both cases: the acceptance criterion lives outside the team’s control, and the team measures against an internal proxy that does not predict external acceptance. Fixing this requires going back to the audience — customer interviews, line-manager interviews, regulator conversations — and rewriting the success criteria. Most teams resist because the rewrite usually shrinks the scope.
Three. The run cost was not budgeted separately from the build cost. Especially in generative AI, where inference cost scales with traffic and evaluation cost scales with both. Programmes that built a working system on a pilot budget and then discovered the production run cost was 4x the pilot run cost. I have seen this pattern at roughly half the customer-facing failures I have audited. The fix is to model the unit economics on day one and refuse to ship until they close at production scale, not pilot scale.
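One way to make "model the unit economics on day one" concrete is a small run-cost model evaluated at both pilot and production traffic. This is a sketch under stated assumptions: the cost structure (per-request inference, sampled evaluation, a fixed monthly monitoring line), the class and function names, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunCostModel:
    """Monthly run-cost assumptions; every field is a hypothetical input."""
    inference_cost_per_request: float     # EUR per request
    eval_sample_rate: float               # fraction of traffic that gets evaluated
    eval_cost_per_sampled_request: float  # EUR per evaluated request
    fixed_monthly_cost: float             # EUR per month for monitoring and on-call

    def monthly_cost(self, requests: int) -> float:
        inference = requests * self.inference_cost_per_request
        evaluation = (
            requests * self.eval_sample_rate * self.eval_cost_per_sampled_request
        )
        return inference + evaluation + self.fixed_monthly_cost


def economics_close(model: RunCostModel, requests: int, value_per_request: float) -> bool:
    """True when the value delivered covers the run cost at this traffic level."""
    return requests * value_per_request >= model.monthly_cost(requests)


# Pilot: low traffic, no sampled evaluation, no dedicated monitoring.
pilot = RunCostModel(0.04, 0.0, 0.50, 0.0)
# Production: same inference price, but evaluation and monitoring are switched on.
production = RunCostModel(0.04, 0.05, 0.50, 8_000.0)

VALUE_PER_REQUEST = 0.09  # EUR, assumed value of one handled request

for label, model, requests in [("pilot", pilot, 10_000), ("production", production, 200_000)]:
    cost = model.monthly_cost(requests)
    closes = economics_close(model, requests, VALUE_PER_REQUEST)
    print(f"{label}: run cost EUR {cost:,.0f}/month, closes={closes}")
```

With these made-up inputs the economics close at pilot scale and fail at production scale, because the evaluation and monitoring lines only appear once the system is customer-facing. The point of the sketch is the shape of the check, not the specific figures.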
Four. The integration debt was treated as a future problem. The orchestration layer that worked for two services becomes a bottleneck at twenty. The vector store that scaled in development became the cost line in production. The observability stack the AI-SRE tool was supposed to plug into did not exist. Integration debt is the most expensive failure mode to remediate because the fix is usually a partial rebuild. It is also the most preventable, because a four-page readiness assessment catches it before the architecture is signed.
The diagnostic value of working in this order is that the first two failures are usually recoverable within the existing budget; the third and fourth often are not. A programme that fails on (one) or (two) can usually be rescued with a metric reset and a scope cut. A programme that fails on (three) or (four) usually needs to be unwound and restarted with different assumptions.
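The ordering and the recoverable-versus-structural split can be written down as a small triage helper. This is a sketch only: the mode labels and the verdict strings are shorthand invented for illustration, and the verdicts are tendencies ("usually", "often"), not rules.

```python
# Failure modes in the order they are checked; labels are illustrative shorthand.
CHECK_ORDER = ["metric_drift", "acceptance_bar", "run_cost", "integration_debt"]
# Modes three and four: often not recoverable within the existing budget.
STRUCTURAL = {"run_cost", "integration_debt"}


def triage(failed_modes):
    """Return the failed modes in check order plus a first-pass verdict."""
    failed = set(failed_modes)
    unknown = failed - set(CHECK_ORDER)
    if unknown:
        raise ValueError(f"unknown failure modes: {sorted(unknown)}")
    ordered = [mode for mode in CHECK_ORDER if mode in failed]
    if not ordered:
        verdict = "no failure mode identified"
    elif failed & STRUCTURAL:
        verdict = "likely unwind and restart"
    else:
        verdict = "likely rescue: metric reset and scope cut"
    return ordered, verdict


print(triage(["acceptance_bar", "metric_drift"]))
print(triage(["integration_debt", "metric_drift"]))
```

A programme that shows any structural mode gets the harsher verdict even if the earlier modes are also present, which matches the reasoning above: the cheap fixes do not help if the run cost or the integration layer cannot carry the system.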
The actual cost numbers, anonymised
The visible costs are not where the interesting variance lives. On the engagements I have data for, the visible-cost ranges are roughly:
- Customer-facing generative AI failures: €800k to €4.5M, median around €2.1M, across vendor fees, infrastructure, internal team time, and external consultancy. The wide range is driven almost entirely by how long the programme ran before the failure was named. A failure named at month nine costs roughly half what the same failure costs at month eighteen.
- Internal productivity failures: €150k to €600k, median around €280k. Cheaper because the deployment surface is smaller and the run cost lower, but the recovery rate is also lower because nobody wants to write the post-mortem on a tool that quietly nobody used.
- AI-SRE failures: €200k to €900k, median around €450k. The visible cost is dominated by the year-one platform spend; the failure mode is usually that the integration work the vendor promised was not done, and the team either eats the integration cost (recovery) or churns the tool (failure).
- Predictive ML in regulated workflows: €1.5M to €12M+ visible cost, but the regulatory-incident cost dwarfs everything visible. I have seen one engagement where the visible programme cost was roughly €4M and the eventual regulatory remediation work was over €18M. That ratio is not unusual at the dangerous end.
The invisible costs — opportunity cost on the displaced roadmap, trust cost with the funding committee, the talent attrition that follows a high-visibility failure — typically run at roughly the same magnitude as the visible cost. Doubling the visible number is a defensible first approximation when modelling the actual cost of a failed programme to a CFO.
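The doubling rule is simple enough to show as arithmetic. The sketch below applies it to the three median visible costs quoted above; the dictionary keys and the function name are illustrative, and the multiplier of 1.0 encodes the "roughly the same magnitude" approximation rather than a measured value.

```python
# Median visible costs in EUR, taken from the ranges listed above.
MEDIAN_VISIBLE_COST_EUR = {
    "customer_facing_genai": 2_100_000,
    "internal_productivity": 280_000,
    "ai_sre": 450_000,
}


def modelled_total_cost(visible_cost, invisible_multiplier=1.0):
    """Visible cost plus invisible cost, with invisible assumed to be
    `invisible_multiplier` times the visible cost (1.0 means "double it")."""
    return visible_cost * (1 + invisible_multiplier)


totals = {
    cluster: modelled_total_cost(cost)
    for cluster, cost in MEDIAN_VISIBLE_COST_EUR.items()
}
for cluster, total in totals.items():
    print(f"{cluster}: EUR {total:,.0f}")
```

Keeping the multiplier as a parameter makes it easy to show a CFO a band rather than a point estimate, for example by running the same calculation at 0.5 and 1.5.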
There is one cost line that almost everybody under-budgets and that does not appear in the visible totals above: the cost of the post-mortem itself. Done properly, a post-mortem on a €2M programme failure takes four weeks of senior time across two or three people. Done improperly — outsourced to the firm that scoped the failure, or sized as a slide deck rather than a written verdict — the post-mortem costs more than the original failure and produces no learning the organisation can use on the next engagement. The honest internal post-mortem is the highest-ROI activity in the entire failure-recovery sequence and the one most consistently skipped.
What the recovery actually looks like
The pattern that works, on the engagements I have seen recover successfully, is small and unglamorous. Stop adding scope. Reconstruct the right metrics from logs. Write the four-page failure-mode verdict: which of the four modes above caused this, with evidence. Make the cut decision (extend, restart, or kill) based on the verdict, not on sunk cost. If extending, rebuild the success criteria with the audience that sets the acceptance bar present in the room. If restarting, do not reuse the original team without rotating the lead; the team is not the problem, but the cognitive load of inheriting their own assumptions makes the restart twice as slow.
The pattern that does not work, and that I see attempted most often, is the rescue consulting engagement. A second firm is hired to recover the work the first firm scoped. The second firm has the same commercial incentive to extend rather than cut. The recovery engagement runs as long as the original engagement, costs roughly the same, and produces a programme that fails for the same reasons six months later. The structural problem is that the recovery market is shaped like the original engagement market, and both reward extension over honest scope reduction.
This is why the readiness assessment work and the capability planning work upstream matter so much more than the recovery work downstream. The cheapest failure is the one you avoid; the second-cheapest is the one you name at month six; everything after month twelve is expensive in ways that compound.
How to maximise enterprise AI ROI, the unsentimental version
The framing question most enterprises ask, how do we maximise ROI on AI, is the wrong question for the 2026 portfolio. The right question is how do we avoid the 70% failure rate in the customer-facing cluster while preserving optionality in the 25-to-40% failure-rate clusters. Phrased that way, the ROI calculation is not a multiplier on a single workstream; it is an expected-value calculation across the portfolio, with the failure-rate variance from above driving most of the spread.
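The expected-value framing can be sketched in a few lines. The failure rates below are the per-cluster rates from the breakdown earlier on this page, and the loss figures reuse the doubled median visible costs from the cost section. The payoff figures, the class name, and the two example portfolios are invented for illustration, so the outputs show the mechanics of the calculation and nothing more.

```python
from dataclasses import dataclass


@dataclass
class Programme:
    name: str
    failure_rate: float       # per-cluster failure rate
    payoff_if_success: float  # EUR, hypothetical
    loss_if_failure: float    # EUR, doubled median visible cost

    def expected_value(self) -> float:
        win = (1 - self.failure_rate) * self.payoff_if_success
        lose = self.failure_rate * self.loss_if_failure
        return win - lose


def portfolio_ev(programmes):
    """Sum of expected values, treating programmes as independent."""
    return sum(p.expected_value() for p in programmes)


customer_first = [
    Programme("customer-facing generative AI", 0.70, 6_000_000, 4_200_000),
]
internal_first = [
    Programme("internal productivity", 0.40, 1_000_000, 560_000),
    Programme("AI-SRE", 0.30, 1_500_000, 900_000),
]

print(f"customer-first EV: EUR {portfolio_ev(customer_first):,.0f}")
print(f"internal-first EV: EUR {portfolio_ev(internal_first):,.0f}")
```

Even with a payoff several times larger, the customer-facing programme carries a negative expected value under these assumptions, because the 70% failure rate multiplies the larger loss. Changing the invented payoffs changes the numbers, but the failure rate remains the dominant term.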
The practical version: shift portfolio weight toward the 25-to-40% failure-rate clusters (AI-SRE, predictive ML in well-governed workflows, internal productivity with measured utility) and away from the 70% failure-rate cluster (customer-facing generative AI without a controlled acceptance bar) until the team has shipped two successful internal-tier programmes. The enterprises I have seen sustain ROI on AI work over three years are the ones that built capability on the lower-risk clusters first and earned the right to take the customer-facing risk second. The enterprises I have seen fail are the ones that started with the customer-facing tier because that is where the board attention was, and lost the budget for the rest of the portfolio when the first failure landed.
That is the answer the Big-4 firms cannot give, because the customer-facing tier is where the remediation engagements are largest. The honest answer is to stay out of the dangerous tier until your capability layer is real, and the capability layer work is the long version of how to get there.
Sources & methodology
- RAND Corporation, “The Root Causes of Failure for Artificial Intelligence Projects” (RR-A2680-1), August 2024 — failure-rate baseline
- Gartner, “AI in the Enterprise” survey, 2024 — 85% pilot-to-production failure rate
- McKinsey, “The State of AI in 2025,” July 2025 — softer definition, ~60% failure rate
- NIST AI Risk Management Framework, v1.0 — failure-mode taxonomy baseline
Cost ranges drawn from fractional CTO and CIO engagements 2023–2026 (n≈40), anonymised by sector and headcount band. Where engagement experience and published survey data disagreed, the per-cluster engagement number is reported and the published average is cited as the comparison. The scoring sheet for the four-failure-mode diagnostic is published under CC-BY-4.0; email me for the working copy or to send a falsification.
