Generative AI Maturity Model: A Four-Axis Operator's Read

The stage-rating debate that convinced me to write this page happened in a Zurich boardroom in February. The maturity assessment on the table put the firm at stage four on the Gartner model — Systemic, integrated, governed. The CTO had built a credible internal AI platform. The data team had a serious feature store. The MLOps function ran on schedule. By the published criteria, the rating was defensible. Then the CFO asked the question that surfaced the blind spot: “how much of this applies to the generative-AI work we shipped last quarter?” The answer, after fifteen minutes of honest accounting, was almost none of it. The platform was a classical-ML platform. The generative-AI workstreams ran on a separate stack assembled by three engineers in eight weeks, with no evaluation harness, no model-vendor governance, and a cost line nobody was reviewing because it was still on the founder’s credit card. On the classical-ML axes the firm was stage four. On the generative-AI axes it was stage one. The same organisation, scored against the wrong axes, had been rated three stages higher than it deserved.

That is the structural problem this page exists to fix. The published AI maturity models, including the six I cover in the parent hub, were designed in the classical-ML era and partially retrofitted for generative AI. The retrofit is leaky. The axes that mattered for ML maturity — data-pipeline depth, feature-store engineering, model-lifecycle discipline — are necessary but not sufficient for generative-AI maturity. The axes that matter specifically for generative AI either do not appear on the published models or appear as sub-bullets under larger categories where they get lost.

I run the four-axis version below on every engagement that touches generative-AI work. It is the framework that has surfaced the most expensive blind spots, and it is the framework that does not let a stage-four classical-ML organisation pretend to be a stage-four generative-AI organisation when the operational reality says otherwise.

Why generative AI deserves its own maturity read

The parent maturity hub makes the case that target stage is conditional on strategic posture, not absolute, and that the three axes that hold up across engagements are capability depth, governance maturity, and cost-base integration. That three-axis read is the right starting point for any AI maturity assessment. It is not sufficient for generative AI, because three of the four properties that matter most for generative-AI maturity sit underneath the surface of those three axes and do not get teased out unless you ask specifically.

Four properties need their own axes.

The build-vs-buy-vs-fine-tune posture, because the answer changes the cost structure, the talent requirements, and the failure modes in ways the classical capability-depth axis does not capture.

The context-engineering capability, because it is the practice that turns a frontier model into something useful for a specific workflow, and the maturity of this practice in a given organisation is almost entirely uncorrelated with the maturity of the engineering organisation overall.

The evaluation-harness maturity, because shipping a generative-AI feature without an evaluation harness is a category of risk that did not exist in the classical-ML world (where the equivalent risk was caught by holdout-set discipline) and that the published frameworks treat as a sub-bullet.

And the cost-engineering maturity, because the unit economics of generative AI move faster than any other category of AI cost, and the organisations that have not built a discipline around managing that movement are the organisations whose CFO discovers a 4x cost overrun in the third quarter after launch.

Four axes. Scored independently. Read against strategic posture. Not compressed into a single stage, for the same reasons the three-axis read in the parent hub refuses to compress.

Axis one: build-vs-buy-vs-fine-tune posture

The build-vs-buy-vs-fine-tune question is genuinely different from the classical ML version of the same question, because the economics flip on a quarterly cadence rather than annually. The decision an organisation made in early 2025 about fine-tuning an open-weights model — defensible at the time, with frontier-API costs at the level they were and open-weights capability narrowing the gap — is a different decision in mid-2026, with frontier costs an order of magnitude lower per quality unit and the open-weights catch-up trajectory having stalled in the second half of 2025.

The maturity read is not whether the organisation can build, buy, or fine-tune. Any engineering organisation with sufficient cloud spend can technically access all three; few can execute any of them to production-grade standard on a calendar the business will accept. The maturity read is whether the organisation has explicit positions on each, written down, with a documented review cadence that matches the quarterly economics of the underlying market, and the organisational discipline to actually reverse a position when the numbers move. The stages I score this axis on:

Stage one: implicit. The organisation is doing some combination of build, buy, and fine-tune, but the choices are tactical and not written down. The most common stage-one state in 2026 is “we use OpenAI for everything because that is what we started with two years ago,” with no review having been run since the original choice. About 40% of organisations I have scored sit here.

Stage two: documented. The organisation has a written position on which workflows use which approach, and the position has a date on it. It may be six months stale, but it exists. The CFO can be shown the document during a budget review. About 35% of the organisations I have scored sit here.

Stage three: reviewed. The position is reviewed quarterly against the underlying economics. The review changes the position when the numbers warrant it. There is a documented case in the last twelve months of the organisation reversing a build, buy, or fine-tune decision because the economics moved. About 20% of organisations sit here.

Stage four: portfolio-managed. The organisation manages the build-vs-buy-vs-fine-tune posture as a portfolio across multiple workflows, with explicit hedging: say, a primary frontier vendor with a defined fallback, or a fine-tuned model behind a frontier router for cost-sensitive paths. The portfolio is managed by a named role, reviewed against named metrics, and adjusted on a published cadence. About 5% of organisations sit here, and they are almost exclusively the AI-native firms.

The honest target on this axis for a follower-posture organisation is stage two or three. Stage four is leader-posture economics and is genuinely wasted effort for a follower; the portfolio-management overhead does not pay back unless the organisation is making enough generative-AI buying decisions per quarter to justify the cadence.
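The stage-four hedging pattern, a primary frontier vendor with a defined fallback, can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: the `Route` type, the `call_with_fallback` helper, and both stand-in vendor functions are hypothetical names, and a production router would add timeouts, retries, and logging.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Route:
    """One entry in the portfolio: a label plus a callable that serves a prompt."""
    name: str
    call: Callable[[str], str]


def call_with_fallback(prompt: str, routes: Sequence[Route]) -> Tuple[str, str]:
    """Try each route in order; return (route_name, response) from the first success."""
    errors = []
    for route in routes:
        try:
            return route.name, route.call(prompt)
        except Exception as exc:  # assumption: a real router would catch vendor-specific errors
            errors.append(f"{route.name}: {exc}")
    raise RuntimeError("all routes failed: " + "; ".join(errors))


# Stand-in vendors for illustration only; neither is a real API client.
def primary_vendor(prompt: str) -> str:
    raise TimeoutError("primary vendor unavailable")


def fallback_vendor(prompt: str) -> str:
    return f"fallback answer to: {prompt}"


portfolio = [Route("primary", primary_vendor), Route("fallback", fallback_vendor)]
served_by, answer = call_with_fallback("summarise this contract", portfolio)
print(served_by)
```

The ordering of `portfolio` is the written position: changing which vendor is primary is a one-line edit, which is what makes a reversal cheap when the quarterly review says the economics have moved.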

Axis two: context-engineering capability

Context engineering is the practice that turns a frontier model into a system that does useful work on a specific workflow — the design of the retrieval layer, the prompt scaffolding, the tool-calling structure, the data pipelines feeding the context window, the eval loop closing back on context quality. It is the closest thing generative AI has to a discipline analogous to feature engineering in classical ML, and the maturity of this practice in a given organisation is the second-most underweighted dimension in published assessments.

The reason this axis matters separately from the build-vs-buy-vs-fine-tune axis is that the organisations that are mature on procurement (axis one) are frequently immature on context engineering, and vice versa. A regulated firm that has correctly settled the build-vs-buy question — frontier API behind a tightly governed router — can still ship generative-AI features whose performance is bottlenecked entirely by poor retrieval design and brittle prompt scaffolding. The maturity reading on axis one is high. The maturity reading on axis two is low. The system performs worse than the procurement maturity suggests it should.

The stages I score this axis on:

Stage one: ad-hoc. Engineers write prompts by trial and error, retrieval is bolted on per-feature, no shared abstractions exist across the engineering organisation. The output quality varies by engineer. About 50% of organisations sit here.

Stage two: pattern-shared. Common prompt patterns and retrieval abstractions are shared across teams via internal libraries or platform components. The variance in output quality between teams has narrowed. About 30% of organisations sit here.

Stage three: measured. Context quality is measured systematically against eval datasets, and changes to context design are gated by eval performance. The team can answer “what is the marginal eval improvement from this retrieval change” with a number. About 15% of organisations sit here.

Stage four: continuous-improvement. Context engineering is a permanent function with its own roadmap, its own evaluation, and its own headcount allocation. Context changes flow through the same review discipline as code changes. About 5% of organisations sit here.

The follower-posture target is stage two; the leader-posture target is stage three or four. A stage-one organisation shipping generative-AI features into production is shipping an unstable surface area whose failures will not be reproducible.
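The stage-three question, “what is the marginal eval improvement from this retrieval change,” comes down to running the same eval set through two context designs and differencing the scores. A minimal sketch, assuming each eval case already carries a score between 0 and 1 from whatever grader the team uses; the function name and the sample scores are hypothetical.

```python
from statistics import mean
from typing import Dict


def marginal_improvement(baseline: Dict[str, float], candidate: Dict[str, float]) -> dict:
    """Compare two runs over the same eval cases, keyed by case id."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same eval cases")
    deltas = {case: candidate[case] - baseline[case] for case in baseline}
    return {
        "mean_delta": mean(deltas.values()),
        "improved": sorted(c for c, d in deltas.items() if d > 0),
        "regressed": sorted(c for c, d in deltas.items() if d < 0),
    }


# Hypothetical per-case scores before and after a retrieval change.
baseline_run = {"q1": 0.5, "q2": 1.0, "q3": 0.0, "q4": 0.5}
candidate_run = {"q1": 1.0, "q2": 0.5, "q3": 0.5, "q4": 0.5}

report = marginal_improvement(baseline_run, candidate_run)
print(report)
```

The `regressed` list matters as much as the mean: a positive average can hide individual cases that got worse, and those are the ones that tend to surface in production.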

Axis three: evaluation-harness maturity

The third axis is the one that lags every other axis in almost every organisation, and the lag is the single most expensive thing about generative-AI maturity in 2026. Capability adoption runs roughly twelve months ahead of evaluation-harness maturity. Teams ship features, declare them working on the basis of a demo and a small spot-check, and discover the actual failure surface six to twelve months later when a regulator, a customer, or an internal auditor asks the question the team never built the harness to answer. Goodhart’s law applies in a specific shape here: the demo metric — “does the output look right to a human reviewer” — stops measuring the thing it was supposed to measure the moment the system is shipped at production scale, because the production input distribution differs from the demo distribution in ways the team did not anticipate.

The stages I score this axis on:

Stage one: spot-checks. No formal eval harness exists. Quality assurance is performed by engineers manually reviewing outputs on sampled inputs. About 55% of organisations shipping generative AI in production sit here. This is the single most common failure mode in 2026.

Stage two: static eval sets. A static evaluation dataset exists for at least the primary use cases, and the team runs it on a regular cadence before deployment. The dataset is not refreshed against production traffic. About 25% of organisations sit here.

Stage three: production-grounded eval. The evaluation harness is grounded in production traffic. Failure modes observed in production feed back into the eval dataset. Coverage gaps are tracked as engineering work. About 15% of organisations sit here.

Stage four: continuous evaluation with regression gates. Every change to the generative-AI system — model, prompt, retrieval, tooling — is gated by an automated eval harness that runs the equivalent of a regression suite. Deployment without passing the gate is not possible. About 5% of organisations sit here, and the regulated-industry firms are over-represented in this group.

The follower-posture target on this axis is stage two minimum. Anything below stage two is a programme shipping into production with no measurement discipline, which is the configuration that produces the regulatory and reputational incidents that show up in the next year’s analyst surveys.
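The stage-four regression gate can be sketched as a function that refuses a deployment when the candidate's pass rate on the eval set falls below the current baseline by more than an agreed tolerance. This shows the shape of the check only: `EvalGateError`, `gate_deployment`, the tolerance parameter, and the sample results are all hypothetical.

```python
from typing import List


class EvalGateError(Exception):
    """Raised when a change fails the evaluation gate."""


def pass_rate(results: List[bool]) -> float:
    if not results:
        raise ValueError("eval set is empty; nothing to gate on")
    return sum(results) / len(results)


def gate_deployment(baseline: List[bool], candidate: List[bool], tolerance: float = 0.0) -> float:
    """Return the candidate pass rate, or raise if it regresses past the tolerance."""
    base, cand = pass_rate(baseline), pass_rate(candidate)
    if cand < base - tolerance:
        raise EvalGateError(f"pass rate fell from {base:.2%} to {cand:.2%}")
    return cand


# Hypothetical pass/fail results on the same eight eval cases.
current_prod = [True, True, True, False, True, True, True, True]      # 7 of 8 pass
proposed_change = [True, True, False, False, True, True, True, True]  # 6 of 8 pass

try:
    gate_deployment(current_prod, proposed_change)
    deploy_allowed = True
except EvalGateError as err:
    deploy_allowed = False
    print(f"blocked: {err}")
```

Raising an exception rather than returning a flag is a deliberate choice in this sketch: a deployment script that forgets to check a return value still stops.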

Axis four: cost-engineering maturity

The fourth axis is the one that connects most directly to the discretionary-to-operating-budget transition discussed in the parent hub. For classical ML, that transition is a one-time event when permanent staffing and infrastructure get pushed into the operating budget. For generative AI, the transition is continuous, because the unit economics move on a six-to-nine-month cadence. A generative-AI workload whose cost per query was acceptable in the planning model can be unacceptable two quarters later if usage scaled faster than expected or if the chosen model-vendor’s pricing did not move the way the planning model assumed.
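A worked sketch of how a per-query cost that fits the planning model stops fitting once usage compounds. Every number below is a hypothetical input chosen for illustration, not a figure from any engagement, and the model assumes a flat cost per query and steady month-on-month growth.

```python
from typing import Optional


def first_month_over_ceiling(
    queries_month_one: float,
    monthly_growth: float,
    cost_per_query: float,
    monthly_ceiling: float,
    horizon_months: int = 24,
) -> Optional[int]:
    """Return the first month (1-indexed) whose spend exceeds the ceiling, or None."""
    queries = queries_month_one
    for month in range(1, horizon_months + 1):
        if queries * cost_per_query > monthly_ceiling:
            return month
        queries *= 1 + monthly_growth
    return None


# Hypothetical planning inputs: 100,000 queries in month one, 30% monthly growth,
# 0.02 per query, and a discretionary ceiling of 10,000 per month.
month = first_month_over_ceiling(100_000, 0.30, 0.02, 10_000)
print(month)
```

Re-running the same function with a revised `cost_per_query` after each vendor price change is a cheap way to see whether the crossing point has moved closer or further away.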

The stages I score this axis on:

Stage one: untracked. The organisation knows the total monthly bill from each model vendor. It does not know cost per query, per resolved ticket, per generated artefact, or per business outcome. About 45% of organisations sit here.

Stage two: per-workflow tracking. Cost is attributed to specific workflows and tracked monthly. The CFO can answer “what does the customer-service assistant cost us per resolved ticket.” Variance is reviewed against budget. About 35% of organisations sit here.

Stage three: unit-economics governance. Unit costs are tracked as KPIs, model-vendor selection includes explicit cost-engineering analysis, and routing decisions (which model serves which query class) are made on cost-per-quality-unit grounds. About 15% of organisations sit here.

Stage four: cost-aware product engineering. Product engineering decisions explicitly consider cost-per-unit implications. The product team can articulate why a feature is built one way rather than another on cost-engineering grounds. The CFO is a regular consumer of generative-AI cost dashboards. About 5% of organisations sit here.

The follower-posture target on this axis is stage two minimum. The leader-posture target is stage three. Stage four is genuinely required only for organisations whose generative-AI work is at sufficient scale that single-digit-percentage cost differences translate into material P&L impact — typically AI-native firms or large enterprises with at least one generative-AI workload at €10M+ annual operating cost.
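Moving from stage one to stage two on this axis is largely an attribution exercise: tag each model call with the workflow it served, then divide by the business outcome. A minimal sketch, assuming usage records already carry a workflow tag and a cost; the record shape, the workflow names, and the figures are hypothetical.

```python
from collections import defaultdict


def cost_per_outcome(usage_records, outcomes):
    """Sum vendor cost per workflow and divide by that workflow's outcome count."""
    totals = defaultdict(float)
    for record in usage_records:
        totals[record["workflow"]] += record["cost"]
    report = {}
    for workflow, total in totals.items():
        count = outcomes.get(workflow, 0)
        # None flags a workflow that is spending money with no outcome recorded.
        report[workflow] = total / count if count else None
    return report


# Hypothetical month of usage: each record is one billed model call.
usage = [
    {"workflow": "support_assistant", "cost": 1.50},
    {"workflow": "support_assistant", "cost": 2.50},
    {"workflow": "contract_summary", "cost": 3.00},
    {"workflow": "internal_search", "cost": 0.75},
]
resolved = {"support_assistant": 8, "contract_summary": 2}

unit_costs = cost_per_outcome(usage, resolved)
print(unit_costs)
```

The `None` entries are worth reading first. A workflow with spend and no defined outcome is one the CFO cannot yet ask the per-unit question about.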

What the four-axis read tells you

The four scores together tell you what a single-stage rating does not. The classical-ML stage-four firm I opened with scored, on the four-axis generative-AI read, as: build-vs-buy stage one, context-engineering stage one, evaluation-harness stage one, cost-engineering stage one. Four ones. That is not a stage-four firm with a few gaps to close. That is a firm that has shipped generative AI without a maturity discipline, and the classical-ML maturity score in the audit report became decorative on the day the first generative-AI feature went live. Conway’s law applies here in an indirect way: the organisation shipping classical ML at stage four had structured itself around the classical-ML problem; the generative-AI work was being done by a different, smaller, structurally different team whose maturity was not the classical-ML team’s maturity. The audit had scored the wrong team.

The pragmatic next step from that diagnosis was not “do everything at once.” It was to write a six-month plan that targeted stage two on three axes (axis one, axis three, axis four) and stage three on the fourth (axis two, context engineering, because the underlying workflow demanded it). The plan was sized at twelve months of engineering effort plus one new senior hire. It was approved in the next planning cycle. The firm shipped the maturity work on schedule, and the audit eighteen months later told a defensible story for the first time.

That is the use of the four-axis read. It is not a vanity rating. It is the diagnostic that names the gap between where the organisation is and where its strategic posture says it should be, on each of the four axes separately, so the next quarter’s planning can target the highest-leverage gap first.
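The gap read can be sketched as a per-axis subtraction that is deliberately never summed into one number. The scores below are the four ones from the firm in the opening example, and the targets are the ones its plan set; the `AxisScore` type and the function name are hypothetical, and gap size is used only as a crude stand-in for leverage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisScore:
    axis: str
    current: int  # stage 1-4 as scored today
    target: int   # stage 1-4 implied by strategic posture

    @property
    def gap(self) -> int:
        # Being ahead of target is not a gap, so clamp at zero.
        return max(self.target - self.current, 0)


def largest_gaps_first(scores):
    """Order axes by gap size, largest first; ties keep their input order."""
    return sorted(scores, key=lambda s: s.gap, reverse=True)


scores = [
    AxisScore("build-vs-buy-vs-fine-tune", current=1, target=2),
    AxisScore("context engineering", current=1, target=3),
    AxisScore("evaluation harness", current=1, target=2),
    AxisScore("cost engineering", current=1, target=2),
]

for s in largest_gaps_first(scores):
    print(f"{s.axis}: stage {s.current} -> {s.target} (gap {s.gap})")
```

Keeping `target` as an input rather than a constant reflects the point that the target stage depends on strategic posture: the same current scores produce a different ordering for a leader than for a follower.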

How this connects to the rest of the maturity work

The four-axis generative-AI read sits underneath the three-axis read in the parent hub, not in opposition to it. The parent-hub axes — capability depth, governance maturity, cost-base integration — remain the right organisation-wide read. The four axes here are the read for the generative-AI slice specifically, and they expand the capability-depth and cost-base-integration axes into the four sub-properties that matter for generative AI in particular.

If you are scoring a generative-AI programme for the first time, do the parent-hub three-axis read first to establish the organisational context. Then do the four-axis read here for the generative-AI work specifically. Compare the two. The gaps between them — the places where the generative-AI scores are materially lower than the organisation-wide scores — are the highest-priority targets for the next planning cycle, because they are the places where the organisation’s overall maturity is being overstated by classical-ML strength masking generative-AI gaps.

If you are pressure-testing an existing generative-AI maturity assessment, the first question to ask is whether evaluation-harness maturity is scored as a primary axis or as a sub-bullet under capability depth. If it is a sub-bullet, the assessment is incomplete and the production failure surface is being understated. The second question is whether the assessment treats cost engineering as a separate axis with a quarterly review cadence or as a one-time consideration. If it is one-time, the unit-economics movement is being missed and the operating-budget surprise is coming.

For the Gartner-specific overlay, the Gartner read covers what to keep and what to overlay from the published model. For the strategy work upstream of any maturity assessment, the root hub and the four-question diagnostic apply equally to generative-AI maturity as to AI maturity overall.


Sources & methodology

If a stage distribution looks wrong against your engagement experience, send the disagreement and I will publish it with attribution.

Frequently asked questions

Is the generative-AI maturity question really different from the classical AI/ML maturity question?
Yes, and the difference is not cosmetic. Classical ML maturity is largely about data-pipeline depth, model-lifecycle discipline, and feature-store engineering. Generative-AI maturity has those concerns but is dominated by four others: the build-vs-buy-vs-fine-tune decision, context-engineering capability, evaluation-harness maturity, and a cost trajectory that flips on a six-to-nine-month cadence rather than annually. Scoring an organisation on classical-ML axes and calling the result generative-AI maturity is the most common assessment error I see in 2026.

Why does the build-vs-buy-vs-fine-tune question deserve its own axis?
Because the answer changes by quarter and the consequences are structural. An organisation that committed to fine-tuning a 70B open-weights model in early 2025 was making a defensible bet at the time. The same commitment in mid-2026, with frontier APIs an order of magnitude cheaper per quality unit, is a different bet, and usually a worse one. The maturity reading is not whether the organisation can fine-tune. It is whether the organisation can reverse course when the economics move. That second question is the one classical-ML maturity frameworks were not built to ask.

What is the single most-underweighted axis in published gen-AI maturity assessments?
The evaluation harness. Capability adoption runs roughly twelve months ahead of evaluation-harness maturity in almost every organisation I have seen. Teams ship a generative-AI feature, declare it working on the basis of a demo and a small spot-check, and discover the real failure surface six months in when a regulator, a customer, or an internal auditor asks the question the team never built the harness to answer. The published frameworks describe evaluation as a stage gate. In practice it is a permanent capability that always lags. Frameworks that do not name the lag understate the gap by twelve months.

How fast does the discretionary-to-operating-budget transition arrive for generative-AI work specifically?
Faster than for classical ML, and the reason is unit economics. A classical-ML deployment has a roughly stable inference cost once trained; the discretionary-to-operating transition is mostly about whether the data pipeline and the model lifecycle have permanent staffing. A generative-AI deployment has a per-query cost that moves with usage and a model-vendor cost that moves with the frontier, both of which can double or halve in a quarter. The operating-budget transition arrives the moment usage scales past the discretionary ceiling, which in 2026 is roughly six to nine months from launch for any successful internal tool. Most organisations are not ready when it arrives.