The Cost of Failed AI Projects: A Failure-Mode Catalogue From Forty Engagements

A platform leader at a mid-sized European bank called me in March of last year to ask for a second opinion on a generative AI rollout that had already cost €3.1M and was, by his own count, six months behind schedule. He wanted me to tell him whether to cut it or extend it. I asked one question before reading any of the architecture documents he had sent. What were you measuring in October that you are not measuring now? He thought about it for a long time and said, quietly, that they had stopped tracking the resolution-quality score in January because the dashboard had broken and nobody fixed it. They had been running on adoption numbers for four months. The adoption numbers were healthy. The resolution-quality numbers, when we reconstructed them from logs, had degraded by roughly 30% across the same period. The project was not six months behind schedule. It was already a failure that nobody had named.

That is what this page is about. Not the failure rate in the abstract — the surveys cover that adequately — but the failure modes underneath the rate, the per-cluster variance the published numbers hide, and the four diagnostic questions I work through before deciding whether a programme can be saved or has to be unwound. The honest cost data is harder to publish because the firms with the data are usually selling the remediation engagement that follows. The cost data below is drawn from roughly forty fractional CTO and CIO engagements between 2023 and 2026, anonymised but cross-checked against the published surveys where they overlap.

What the published failure rates actually say

The headline numbers everybody quotes are real but misleading. The RAND Corporation’s 2024 study on AI project failure puts the rate at over 80%. Gartner’s 2024 “AI in the Enterprise” survey puts the share of AI/ML pilots that fail to reach production at roughly 85%. McKinsey’s 2025 State of AI survey puts the rate lower, around 60%, but their definition counts partial or non-scalable results as wins. The 60-to-85% range is therefore correct as a directional band, and entirely useless for budgeting.

The reason it is useless is that the range averages across workload types that fail for different reasons at different rates. From my engagement data, broken down to roughly match the same definitions:

Customer-facing generative AI deployments fail at roughly 70%. Chatbots, customer-service copilots, content-generation tools that customers see directly. The dominant failure mode is the acceptance-bar mismatch described below: teams budget for build cost and underbudget for evaluation, monitoring, and the operational tail that customer-facing systems require.

Internal productivity deployments fail at roughly 40%. Copilots, internal knowledge assistants, generative tools deployed inside the organisation. The dominant failure mode is the adoption-vs-utility conflation. People log in. Nobody measures whether the work got done faster.

AI-SRE and infrastructure deployments fail at roughly 30%. Alert triage, post-incident analysis, the observability-adjacent tooling covered in the AI-SRE page. The dominant failure mode here is integration debt — the AI layer cannot rescue an unobservable system, and the diligence on the observability stack underneath rarely happens in the buy-decision.

Predictive ML in regulated workflows fails at roughly 25%, but with much higher per-failure cost. Credit decisioning, fraud detection, clinical decision support. The rate is lower because the bar to ship is higher; the cost is higher because a shipped failure is a regulatory incident, not a software bug.

The variance across clusters is the load-bearing number for budgeting, not the headline average. Treating a 70% failure-rate cluster the same as a 30% failure-rate cluster — which the published headline numbers encourage — is how programmes end up over-investing in the safe end and under-investing in the contingency they need on the dangerous end.

The four failure modes I check first

Before I read the architecture, the cost line, or the team org chart, I check four failure modes in this order. Roughly 80% of failed engagements I have audited fail on the first or second; the remaining 20% are spread across the third and fourth.

One. The metric stopped measuring the thing. The October-versus-March problem from the opening anecdote. Programmes drift from utility metrics (work displaced, decision quality, time saved) to adoption metrics (logins, prompt count, monthly active users) because the adoption metrics are easier to dashboard and prettier on a board slide. The drift is silent. By the time the board notices, the team has spent three quarters optimising for the wrong number. This is Goodhart’s Law in operational form, and it is the single most common failure I see. The fix is uncomfortable because it requires reconstructing the right metric from logs, retroactively, and presenting a worse number to the board than the dashboard implied.
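Once the utility metric has been reconstructed from logs, the drift described above can be checked mechanically. A minimal sketch in Python, assuming one adoption figure and one utility figure per month; the `MonthlySnapshot` type, its field names, the 10% tolerance, and the sample numbers are all hypothetical and not taken from any engagement:

```python
from dataclasses import dataclass


@dataclass
class MonthlySnapshot:
    month: str
    adoption: float  # hypothetical: e.g. monthly active users
    utility: float   # hypothetical: e.g. resolution-quality score rebuilt from logs


def relative_change(first: float, last: float) -> float:
    """Fractional change from first to last (-0.3 means a 30% drop)."""
    return (last - first) / first


def silent_drift(snapshots: list[MonthlySnapshot], tolerance: float = 0.10) -> bool:
    """True when adoption held up while utility fell by more than `tolerance`."""
    adoption_change = relative_change(snapshots[0].adoption, snapshots[-1].adoption)
    utility_change = relative_change(snapshots[0].utility, snapshots[-1].utility)
    return adoption_change >= -tolerance and utility_change < -tolerance


if __name__ == "__main__":
    # Illustrative numbers only: adoption up 10%, utility down 30%.
    history = [
        MonthlySnapshot("month 1", adoption=1000, utility=0.80),
        MonthlySnapshot("month 6", adoption=1100, utility=0.56),
    ]
    print(silent_drift(history))
```

The check compares only the first and last snapshot, which is enough to surface the healthy-adoption, degraded-utility pattern; a real review would look at the whole series and at how the utility score was reconstructed.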

Two. The acceptance bar was set by someone not in the room. Customer-facing programmes that defined success internally without naming the customer’s tolerance for wrong answers. Internal programmes that defined success by IT adoption goals without naming the line manager’s tolerance for trusting AI output. The failure mode is the same in both cases: the acceptance criterion lives outside the team’s control, and the team measures against an internal proxy that does not predict external acceptance. Fixing this requires going back to the audience — customer interviews, line-manager interviews, regulator conversations — and rewriting the success criteria. Most teams resist because the rewrite usually shrinks the scope.

Three. The run cost was not budgeted separately from the build cost. Especially in generative AI, where inference cost scales with traffic and evaluation cost scales with both. Programmes that built a working system on a pilot budget and then discovered the production run cost was 4x the pilot run cost. I have seen this pattern at roughly half the customer-facing failures I have audited. The fix is to model the unit economics on day one and refuse to ship until they close at production scale, not pilot scale.
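A minimal sketch of the day-one unit-economics check, in Python. It assumes a simple linear cost model (per-request inference cost, a sampled fraction of traffic sent to evaluation, and a fixed monthly monitoring line); the function names, the cost model, and every number in the example are hypothetical illustrations, not figures from the engagements:

```python
def monthly_run_cost(
    requests: int,
    inference_cost: float,    # cost per request (assumed)
    eval_sample_rate: float,  # fraction of requests that get evaluated (assumed)
    eval_cost: float,         # cost per evaluated request (assumed)
    fixed_monthly: float,     # monitoring and on-call overhead (assumed)
) -> float:
    """Run cost under a linear model: variable cost per request plus a fixed line."""
    per_request = inference_cost + eval_sample_rate * eval_cost
    return requests * per_request + fixed_monthly


def unit_economics_close(requests: int, value_per_request: float, run_cost: float) -> bool:
    """True when the value delivered covers the run cost at this traffic level."""
    return requests * value_per_request >= run_cost


if __name__ == "__main__":
    # Pilot: light evaluation sampling, small fixed overhead.
    pilot = monthly_run_cost(10_000, 0.02, 0.05, 0.50, 2_000)
    # Production: heavier inference and evaluation, larger fixed overhead.
    production = monthly_run_cost(400_000, 0.08, 0.50, 0.50, 6_000)
    print(unit_economics_close(10_000, 0.30, pilot))
    print(unit_economics_close(400_000, 0.30, production))
```

The point of running the same function twice is that the pilot parameters rarely carry over: the check has to pass with the inference, evaluation, and monitoring assumptions that hold at production scale.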

Four. The integration debt was treated as a future problem. The orchestration layer that worked for two services becomes a bottleneck at twenty. The vector store that scaled in development became the cost line in production. The observability stack the AI-SRE tool was supposed to plug into did not exist. Integration debt is the most expensive failure mode to remediate because the fix is usually a partial rebuild. It is also the most preventable, because a four-page readiness assessment catches it before the architecture is signed.

The diagnostic value of working in this order is that the first two failures are usually recoverable within the existing budget; the third and fourth often are not. A programme that fails on (one) or (two) can usually be rescued with a metric reset and a scope cut. A programme that fails on (three) or (four) usually needs to be unwound and restarted with different assumptions.
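The ordering above amounts to a small triage rule. A sketch in Python, assuming the verdict has already established which modes are present; the `FailureMode` enum and the `triage` function are hypothetical names, and the returned strings encode the "usually" in the text, not a guarantee:

```python
from enum import IntEnum


class FailureMode(IntEnum):
    METRIC_DRIFT = 1       # the metric stopped measuring the thing
    ACCEPTANCE_BAR = 2     # the bar was set by someone not in the room
    RUN_COST = 3           # run cost not budgeted separately from build cost
    INTEGRATION_DEBT = 4   # integration debt treated as a future problem


def triage(found: set[FailureMode]) -> str:
    """Map the failure modes found to the likely recovery path."""
    if not found:
        return "no failure mode identified"
    # Modes three and four dominate: their fix is usually a partial rebuild.
    if found & {FailureMode.RUN_COST, FailureMode.INTEGRATION_DEBT}:
        return "likely unwind and restart"
    return "likely recoverable: metric reset and scope cut"


if __name__ == "__main__":
    print(triage({FailureMode.METRIC_DRIFT, FailureMode.ACCEPTANCE_BAR}))
```

When modes from both groups are present, the sketch lets the more expensive group decide, on the assumption that a metric reset does not help if the run cost or the integration layer cannot be fixed within budget.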

The actual cost numbers, anonymised

The visible costs are not where the interesting variance lives. On the engagements I have data for, the visible-cost ranges are roughly:

  • Customer-facing generative AI failures: €800k to €4.5M, median around €2.1M, across vendor fees, infrastructure, internal team time, and external consultancy. The wide range is driven almost entirely by how long the programme ran before the failure was named. A failure named at month nine costs roughly half what the same failure costs at month eighteen.
  • Internal productivity failures: €150k to €600k, median around €280k. Cheaper because the deployment surface is smaller and the run cost lower, but the recovery rate is also lower because nobody wants to write the post-mortem on a tool that quietly nobody used.
  • AI-SRE failures: €200k to €900k, median around €450k. The visible cost is dominated by the year-one platform spend; the failure mode is usually that the integration work the vendor promised was not done, and the team either eats the integration cost (recovery) or churns the tool (failure).
  • Predictive ML in regulated workflows: €1.5M to €12M+ visible cost, but the regulatory-incident cost dwarfs everything visible. I have seen one engagement where the visible programme cost was roughly €4M and the eventual regulatory remediation work was over €18M. That ratio is not unusual at the dangerous end.

The invisible costs — opportunity cost on the displaced roadmap, trust cost with the funding committee, the talent attrition that follows a high-visibility failure — typically run at roughly the same magnitude as the visible cost. Doubling the visible number is a defensible first approximation when modelling the actual cost of a failed programme to a CFO.
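The two rules of thumb in this section (invisible cost roughly equal to visible cost, and a failure named at month nine costing roughly half of one named at month eighteen) can be written down as arithmetic. A sketch in Python; the function names are hypothetical, and the constant monthly burn in `visible_cost_at` is an assumption read off the month-nine versus month-eighteen comparison, not a measured burn curve:

```python
def total_cost_estimate(visible_cost: float, invisible_ratio: float = 1.0) -> float:
    """Visible cost plus invisible cost, with invisible taken as a ratio of visible.

    The default ratio of 1.0 is the "double the visible number" approximation.
    """
    return visible_cost * (1.0 + invisible_ratio)


def visible_cost_at(month: float, reference_cost: float, reference_month: float) -> float:
    """Scale a known visible cost to another month, assuming constant monthly burn."""
    return reference_cost * month / reference_month


if __name__ == "__main__":
    median_customer_facing = 2_100_000  # median visible cost quoted above, in euros
    print(total_cost_estimate(median_customer_facing))
    # Illustrative only: treats the median as the cost of a failure named at month 18.
    print(visible_cost_at(9, median_customer_facing, 18))
```

Neither function covers the regulatory-incident cost in the regulated cluster, which the text treats as a separate and much larger line.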

There is one cost line that almost everybody under-budgets and that does not appear in the visible totals above: the cost of the post-mortem itself. Done properly, a post-mortem on a €2M programme failure takes four weeks of senior time across two or three people. Done improperly — outsourced to the firm that scoped the failure, or sized as a slide deck rather than a written verdict — the post-mortem costs more than the original failure and produces no learning the organisation can use on the next engagement. The honest internal post-mortem is the highest-ROI activity in the entire failure-recovery sequence and the one most consistently skipped.

What the recovery actually looks like

The pattern that works, on the engagements I have seen recover successfully, is small and unglamorous. Stop adding scope. Reconstruct the right metrics from logs. Write the four-page failure-mode verdict — which of the four modes above caused this, with evidence. Make the cut decision (extend, restart, or kill) based on the verdict, not on sunk cost. If extending, rebuild the success criteria with the audience that sets the acceptance bar in the room. If restarting, do not reuse the original team without rotating the lead; the team is not the problem, but the cognitive load of inheriting their own assumptions makes the restart twice as slow.

The pattern that does not work, and that I see attempted most often, is the rescue consulting engagement. A second firm is hired to recover the work the first firm scoped. The second firm has the same commercial incentive to extend rather than cut. The recovery engagement runs as long as the original engagement, costs roughly the same, and produces a programme that fails for the same reasons six months later. The structural problem is that the recovery market is shaped like the original engagement market, and both reward extension over honest scope reduction.

This is why the readiness assessment work and the capability planning work upstream matter so much more than the recovery work downstream. The cheapest failure is the one you avoid; the second-cheapest is the one you name at month six; everything after month twelve is expensive in ways that compound.

How to maximise enterprise AI ROI, the unsentimental version

The framing question most enterprises ask, how do we maximise ROI on AI, is the wrong question for the 2026 portfolio. The right question is how do we avoid the 70% failure rate in the customer-facing cluster while preserving optionality in the 25-to-40% failure-rate clusters. Phrased that way, the ROI calculation is not a multiplier on a single workstream; it is an expected-value calculation across the portfolio, with the failure-rate variance from above driving most of the spread.
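A sketch of that expected-value calculation in Python. The failure rates and the failure costs (median visible cost, doubled) come from the figures earlier on this page; the `Workstream` type, the function names, and every `value_if_success` figure are hypothetical placeholders that a real portfolio review would replace with its own estimates:

```python
from dataclasses import dataclass


@dataclass
class Workstream:
    name: str
    failure_rate: float      # per-cluster failure rate
    value_if_success: float  # hypothetical net value of a successful programme
    cost_if_failure: float   # visible plus invisible cost of a failed programme


def expected_value(w: Workstream) -> float:
    """Probability-weighted payoff of one workstream."""
    return (1.0 - w.failure_rate) * w.value_if_success - w.failure_rate * w.cost_if_failure


def portfolio_expected_value(workstreams: list[Workstream]) -> float:
    """Sum of expected values, treating workstreams as independent."""
    return sum(expected_value(w) for w in workstreams)


if __name__ == "__main__":
    portfolio = [
        Workstream("customer-facing genAI", 0.70, 6_000_000, 4_200_000),
        Workstream("internal productivity", 0.40, 1_000_000, 560_000),
        Workstream("AI-SRE", 0.30, 1_500_000, 900_000),
    ]
    for w in portfolio:
        print(w.name, round(expected_value(w)))
    print("portfolio", round(portfolio_expected_value(portfolio)))
```

The sketch treats workstreams as independent, which understates the real downside: as argued in this section, a high-visibility failure in one tier can cost the budget for the rest of the portfolio.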

The practical version: shift portfolio weight toward the 25-to-40% failure-rate clusters (AI-SRE, predictive ML in well-governed workflows, internal productivity with measured utility) and away from the 70% failure-rate cluster (customer-facing generative AI without a controlled acceptance bar) until the team has shipped two successful internal-tier programmes. The enterprises I have seen sustain ROI on AI work over three years are the ones that built capability on the lower-risk clusters first and earned the right to take the customer-facing risk second. The enterprises I have seen fail are the ones that started with the customer-facing tier because that is where the board attention was, and lost the budget for the rest of the portfolio when the first failure landed.

That is the answer the Big-4 firms cannot give, because the customer-facing tier is where the remediation engagements are largest. The honest answer is to stay out of the dangerous tier until your capability layer is real, and the capability layer work is the long version of how to get there.


Sources & methodology

Cost ranges drawn from fractional CTO and CIO engagements 2023–2026 (n≈40), anonymised by sector and headcount band. Where engagement experience and published survey data disagreed, the per-cluster engagement number is reported and the published average is cited as the comparison. The scoring sheet for the four-failure-mode diagnostic is published under CC-BY-4.0; email me for the working copy or to send a falsification.

Frequently asked questions

What is the actual cost of a failed enterprise AI project?
The visible cost (vendor fees, internal headcount, infrastructure) is the small half. The invisible half is the opportunity cost of the workstream the failed project displaced and the trust cost with the board that funds the next round. On the engagements I have audited, a customer-facing generative AI failure runs between €800k and €4.5M in visible costs and roughly the same again in delayed-roadmap value. An internal productivity failure is cheaper to absorb (€150k to €600k visible) but harder to learn from because nobody wants to write the post-mortem on a tool nobody used.

Why do customer-facing generative AI projects fail so much more often than internal ones?
Three reasons stack. The acceptance bar is set by the customer, not by the team, so a 90% useful answer rate is a failure for credit decisions and a success for marketing copy. The evaluation cost scales with traffic, not with development effort, and most teams budget for build cost not run cost. And the failure modes (hallucination, prompt injection, drift) produce externally-visible incidents that internal projects can paper over. The 70% failure rate in this cluster is not a property of the technology; it is a property of how the workstream was scoped against an audience the team does not control.

What is the single most reliable predictor of AI project failure?
Whether the team measured the right thing before they shipped. Across forty engagements the strongest correlation I have found with eventual failure is not the model choice or the architecture; it is whether the team defined a utility metric (the thing the AI is supposed to displace or replace) separately from an adoption metric (the thing people do with the tool). Programmes that conflated the two, counting logins as success, failed at roughly twice the rate of programmes that measured both. The number stopped measuring the thing, and nobody noticed until the budget review.

Should I commission a Big-4 AI project recovery engagement after a failure?
Almost never. Recovery engagements are sized to the previous engagement, not to the actual remediation work, and the firms selling them are usually the firms that scoped the failure in the first place. The cheaper and more effective move is a four-week internal post-mortem run by someone outside the original team, against the four failure modes above, with a written verdict the CFO can read. If the post-mortem cannot fit on six pages it is not a post-mortem, it is a sales document.