Are the published productivity numbers for AI coding tools (40%, 55%, 75% gains) real?

The individual-task numbers are usually real. The team-level shipping numbers inferred from them are usually not. The published studies measure time-to-complete on isolated coding tasks under controlled conditions, and engineers using AI tools do perform those tasks meaningfully faster: typically in the 20–40% range for routine work, much less for novel work. Translating that into team-level shipping speed requires the rest of the engineering pipeline (code review, integration, deployment, on-call) to absorb the higher volume without becoming the new bottleneck. In practice it does not. The team-level shipping speed-up I observe in engagements is usually in the 5–15% range in the first year, not the 40%+ the tool vendors quote.

Do senior engineers get paid more now because of AI leverage?

In a narrow segment of the market, yes. Senior engineers at AI-native firms and at FT500 platform organisations are seeing comp pressure upward in 2026 — roughly 10–20% at the senior staff and principal levels compared to 2024 benchmarks. Outside that segment, the comp pressure is more limited. The structural question is not whether comp is rising in aggregate but whether your specific seniors will leave for AI-leverage roles elsewhere if you do not adjust. The answer to that is firm-specific and depends on how visible AI work is in your engineering organisation. Invisible AI work does not retain seniors against a visible offer.

What is the single biggest mistake engineering organisations are making with AI coding tools in 2026?

Treating the rollout as a tools-procurement decision rather than an organisational-design decision. Buying Cursor or Copilot or Claude Code for the engineering team and declaring AI adoption complete is the failure pattern. The tools work; the organisational changes that make the tools actually pay back at the team level — code-review capacity expansion, on-call rotation adjustments, test-coverage investments, the platform-engineering capability for the tooling itself — do not happen by default. The tools without the organisational changes produce more code at the same team-level shipping speed, which is the worst of both worlds.

AI for Engineering Teams: What the VP of Engineering Actually Needs to Know

Should we hire fewer mid-level engineers and more seniors now?

Probably yes, but the reasoning matters more than the conclusion. The hiring shift is not because AI replaces mid-level engineering work — it does not, reliably. It is because AI generates higher volumes of plausible-looking code that needs to be reviewed for correctness, and the engineers who can review high-volume AI-generated code competently are senior. A team that scales AI tool usage without changing its hiring posture finds that code review becomes the bottleneck within three to six months. The fix is more senior-engineering density, not a wholesale replacement of mid-level engineers.

Tom Prommer · CIO/CTO · Updated 2026-06-06 · 12 min read

Executive summary

Generative AI is doing real things to engineering organisations, and most of the published numbers are wrong. This page covers the throughput question, the code-review bottleneck, the hiring-posture shift, and what reliably works once the demo glow wears off. Written for the VP Engineering, not the CTO.

The VP Engineering review of throughput claims that prompted this page happened in Berlin in March. The CTO had been told by the head of developer experience that the rollout of Cursor and Claude Code across the engineering organisation had produced a 47% productivity gain, measured by lines of code generated per engineer per week. The number was real. The measurement was wrong. The VP Engineering I was advising — who reported to the CTO and ran the actual delivery — pulled the previous six quarters of velocity data, normalised for team composition and on-call load, and showed that team-level shipping velocity had risen by roughly 8%. The two numbers were both correct. They were measuring different things. The 47% described the activity. The 8% described the outcome. The board memo went out with the 47% number because the head of developer experience had drafted it and the VP Engineering was not in the meeting. Six months later, when shipping velocity had failed to maintain even the 8% gain, the CTO had a credibility problem the original measurement framing had set him up for.

That is the structural problem this page exists to address. Generative AI is doing real things to engineering organisations in 2026. Most of the published numbers describing those things are wrong in a specific direction — they describe individual-task activity at the cost of describing team-level outcomes, and the gap between the two is the gap that turns published productivity claims into board-memo embarrassments six months later. The CTO-level view of this question lives separately because the CTO is the executive answering for the strategic decisions; this page is calibrated for the VP Engineering or platform lead who has to make the operational decisions and who will be the one asked, on a Tuesday afternoon in November, why the team-level shipping velocity has not moved the way the rollout deck said it would.

I have had this exact conversation across a dozen fractional CTO and CIO engagements in the last year. The shape is always the same: the org rolls out AI coding tools, changes nothing else about its delivery process, and then asks why the headline productivity numbers do not show up at the bottom line. The pattern is consistent enough to be worth writing down, and the fixes are operational enough that they can be implemented without a strategy refresh.

The throughput question, honestly

The published productivity numbers fall into roughly three categories. The first is vendor-published controlled studies — GitHub Copilot’s earlier published numbers, the Cursor benchmarks, Anthropic’s evaluations of Claude Code on standardised coding tasks. The numbers in this category are typically in the 30–55% range for time-to-complete on isolated coding tasks. They are real and they are reproducible. They also do not generalise to team-level shipping velocity in any predictable way.

The second category is industry-survey numbers. The DORA State of DevOps reports, the GitHub Octoverse data, and the McKinsey developer productivity studies all publish team-level productivity figures that show AI-tool-using teams outperforming non-using teams by margins in the 10–25% range. The numbers in this category are more representative but suffer from selection bias — the teams adopting AI tools earliest are also the teams with stronger engineering practices generally, and the AI tools are getting credit for productivity gains that would have happened regardless.

The third category is the academic and methodologically careful studies: METR’s 2025 work on developer productivity, the Stanford HAI productivity panels, and the smaller academic studies that control for task difficulty and engineer seniority. The numbers here are more sobering. METR’s widely cited mid-2025 study of experienced open-source developers working in their own mature repositories found that participants forecast a 24% speed-up from AI tools and still believed afterwards that they had been about 20% faster, while their measured task-completion time was 19% longer. The gap between perception and measurement in that study is the single most important data point in the literature, and it is the one most often left out of vendor decks.

The honest reading. Individual-task throughput rises with AI coding tools by a measurable amount — somewhere between 15% and 35% for routine work, depending on the task class, the engineer’s seniority, and the tool. Novel work — where the engineer is solving a problem genuinely new to them or the system — sees much smaller gains and sometimes regressions, because the AI tools confidently generate plausible-looking solutions that the engineer then spends time debugging. Team-level shipping velocity, integrated over all the work types and the surrounding pipeline, rises by a smaller amount — typically 5–15% in the first year, occasionally higher in organisations that paired the tool rollout with operational changes I cover below.

A 5–15% team-level gain is a real gain. It justifies the tool spend, the rollout effort, and the operational adjustments. It does not justify the 47% claim in the board memo, and the gap between the claim and the reality is where engineering organisations damage their own credibility.

The bottleneck moves to code review

The mechanism behind the gap between individual-task throughput and team-level shipping velocity is simple enough to state and surprisingly hard to internalise. The engineering pipeline has multiple stages, and AI tools accelerate one of them — the writing of the first-draft code. The other stages — code review, integration, test coverage, deployment, on-call response, post-incident learning — do not accelerate by default. When the writing stage gets faster while the others do not, the bottleneck moves downstream. The most common new location is code review.

Brooks’ law applies in a generalised form here: adding capacity to one stage of a pipeline without adding capacity downstream does not increase pipeline throughput. The engineering organisations adopting AI coding tools in 2025–2026 have been adding capacity to the code-writing stage. Most of them have not added capacity to code review. The predictable result is that code review queues grow, review quality drops, and the throughput gain from the tools is partially or wholly absorbed by the new bottleneck.
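
The bottleneck argument can be made concrete with a toy model. The sketch below treats the pipeline as a chain of stages, each with a weekly capacity, and takes steady-state throughput to be the smallest of those capacities. The stage names, the capacity figures, and the 40% writing-stage gain are illustrative assumptions, not measurements from any engagement.

```python
# Toy model: steady-state throughput of a serial pipeline is capped by its slowest stage.
# Stage names and capacities (changes per week) are hypothetical.

def pipeline_throughput(capacities: dict[str, float]) -> tuple[str, float]:
    """Return (bottleneck stage, throughput) for a serial pipeline."""
    bottleneck = min(capacities, key=capacities.get)
    return bottleneck, capacities[bottleneck]

before = {"writing": 40.0, "review": 45.0, "integration": 60.0, "deployment": 80.0}

# Assume the tools make the writing stage 40% faster and change nothing else.
after = dict(before, writing=before["writing"] * 1.4)

stage_before, rate_before = pipeline_throughput(before)
stage_after, rate_after = pipeline_throughput(after)
gain = rate_after / rate_before - 1

print(f"before: {rate_before:.0f}/week, limited by {stage_before}")
print(f"after:  {rate_after:.0f}/week, limited by {stage_after}")
print(f"team-level gain: {gain:.1%}")
```

With these made-up inputs a 40% gain at the writing stage yields a 12.5% gain at the team level, and the constraint moves from writing to review.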

The mechanism has a second-order effect that compounds the problem. AI-generated code looks more confident than equivalent human-written code in the first draft, because the tools produce syntactically clean, well-structured output. Reviewers, especially less-senior ones, are systematically more likely to approve AI-generated code without catching subtle correctness issues, because the surface signals — formatting, structure, naming — that reviewers use as priors for code quality are exactly the signals AI tools optimise. The result is that review queues grow and review effectiveness drops simultaneously. The first problem shows up in cycle-time metrics. The second problem shows up in incident rates three to six months later, when the unreviewed correctness issues hit production.

The operational fix that works. Code-review capacity has to scale alongside AI-tool adoption, and the scaling cannot be linear with line-count throughput because the per-review effort required for AI-generated code is higher, not lower. The engineering organisations that have made the tool rollout pay back at the team level are the ones that have, simultaneously, raised the seniority bar for reviewers on AI-generated code, expanded review staffing, and invested in automated testing that catches the correctness categories AI tools systematically miss. The organisations that did the tool rollout without those three adjustments are the ones now reporting flat or slightly negative team-level velocity changes despite enthusiastic engineer-level adoption.
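
A back-of-envelope calculation shows why review capacity cannot scale linearly with volume. The sketch assumes review load is simply the number of PRs multiplied by the hours each review takes; the baseline figures and the two uplift percentages are hypothetical placeholders to be replaced with an organisation's own data.

```python
# Back-of-envelope review load. All inputs are hypothetical placeholders.

def weekly_review_hours(prs_per_week: float, hours_per_review: float) -> float:
    """Total reviewer hours needed each week to keep the queue from growing."""
    return prs_per_week * hours_per_review

baseline = weekly_review_hours(prs_per_week=100, hours_per_review=0.75)

# Assumed effect of the rollout: 30% more PRs, each taking 20% longer to review properly.
with_tools = weekly_review_hours(prs_per_week=100 * 1.3, hours_per_review=0.75 * 1.2)

growth = with_tools / baseline - 1
print(f"baseline:   {baseline:.1f} reviewer-hours/week")
print(f"with tools: {with_tools:.1f} reviewer-hours/week")
print(f"review load growth: {growth:.0%} for a 30% rise in PR volume")
```

Because the two effects multiply, a 30% rise in PR volume becomes a 56% rise in required reviewer hours under these assumptions.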

The systems view: why velocity is the wrong lever

There is a cleaner way to state the whole argument, and it comes from Will Larson. In his systems model of LLM adoption and developer experience he draws the engineering pipeline as a flow — open tickets move through coding, testing, and deployment to closed tickets, with error flows running backwards from each stage — and then runs the obvious experiment: turn up the development-velocity dial. Tripling the coding rate on its own changes almost nothing about the rate of tickets actually closed. The conclusion is blunt: “The constraint on this system is errors discovered in production, and any technique that changes something else doesn’t make much of an impact.”

That is the mechanism the code-review section above describes, stated formally. AI coding tools turn up the one dial — development velocity — the system is least sensitive to, while leaving the binding constraint untouched. And if the faster first-draft code carries a higher defect rate into production, the move is not merely neutral: in Larson’s words, “any approach that increases development velocity while also increasing production error rate is likely net-negative.” That single sentence is the most useful thing a VP Engineering can carry into a tool-rollout review. The question is never how much faster the team wrote code; it is what the rollout did to the rate of errors discovered in production.
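
The net-negative case is easy to see in a much simpler toy than Larson's model. The sketch below is not his model: it assumes a fixed weekly engineering capacity, a writing cost per ticket, and a rework cost paid once for the fraction of tickets that fail in production. Every figure is hypothetical.

```python
# Toy capacity model, far simpler than Larson's systems model.
# Every ticket costs write_hours; a fraction error_rate of shipped tickets
# costs an extra rework_hours in production. All figures are hypothetical.

def tickets_closed_per_week(capacity_hours: float, write_hours: float,
                            error_rate: float, rework_hours: float) -> float:
    """Tickets finished per week once expected production rework is paid for."""
    expected_cost = write_hours + error_rate * rework_hours
    return capacity_hours / expected_cost

CAPACITY = 400.0   # engineer-hours available per week
REWORK = 30.0      # hours to diagnose and fix one production error

baseline = tickets_closed_per_week(CAPACITY, write_hours=10.0, error_rate=0.10, rework_hours=REWORK)
faster_same_errors = tickets_closed_per_week(CAPACITY, write_hours=7.0, error_rate=0.10, rework_hours=REWORK)
faster_more_errors = tickets_closed_per_week(CAPACITY, write_hours=7.0, error_rate=0.25, rework_hours=REWORK)

for label, value in [("baseline", baseline),
                     ("30% faster, same error rate", faster_same_errors),
                     ("30% faster, higher error rate", faster_more_errors)]:
    print(f"{label:32s} {value:5.1f} tickets/week")
```

Under these assumptions, writing 30% faster at an unchanged error rate lifts output from roughly 31 to 40 tickets a week, while the same speed-up paired with a rise in the production error rate from 10% to 25% drops it to roughly 28, below the baseline.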

The practical consequence is a measurement reframe. The number worth tracking is not lines, PRs, or self-reported speed-up; it is the production-error-discovery rate, and the time from start-of-work to a ticket that stays closed — error-free — for ninety days or more. Larson’s own defence of the approach is that its value lies in its simplicity: “simple models are usually more effective at refining thinking across a group than complex models.” A leadership team that aligns on one diagram of where errors enter and leave the system will make better rollout decisions than one arguing over which vendor’s productivity percentage to believe.
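
As a rough sketch of how those two numbers could be computed from ticket data, assuming a hypothetical record with a start date, a close date, and the date (if any) that a production error was traced back to the ticket. The `Ticket` schema and the sample records are invented for illustration, and the sketch does not handle tickets closed too recently to have completed the ninety-day window.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import median
from typing import Optional

@dataclass
class Ticket:
    """Hypothetical ticket record; field names are assumptions."""
    started: date
    closed: date
    production_error_on: Optional[date] = None  # date an error was traced to this ticket

STABLE_WINDOW = timedelta(days=90)

def stayed_closed(ticket: Ticket) -> bool:
    """True if no production error was traced to the ticket within 90 days of closing."""
    err = ticket.production_error_on
    return err is None or err - ticket.closed >= STABLE_WINDOW

def production_error_rate(tickets: list[Ticket]) -> float:
    """Share of closed tickets that caused a production error inside the window."""
    return sum(not stayed_closed(t) for t in tickets) / len(tickets)

def median_days_to_stable_close(tickets: list[Ticket]) -> float:
    """Median start-to-close time in days, counting only tickets that stayed closed."""
    return median((t.closed - t.started).days for t in tickets if stayed_closed(t))

tickets = [
    Ticket(date(2026, 1, 5), date(2026, 1, 9)),
    Ticket(date(2026, 1, 5), date(2026, 1, 7), production_error_on=date(2026, 2, 1)),
    Ticket(date(2026, 1, 12), date(2026, 1, 22)),
    Ticket(date(2026, 1, 13), date(2026, 1, 19), production_error_on=date(2026, 6, 1)),
]

print(f"production error rate: {production_error_rate(tickets):.0%}")
print(f"median days to stable close: {median_days_to_stable_close(tickets)}")
```

Tracking both figures before and after a rollout, on the same ticket population, is what lets a team see whether faster writing translated into tickets that stay closed.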

The hiring-posture shift

The hiring-posture implication of the code-review bottleneck is the part of the AI-for-engineering-teams story that the VP Engineering has to communicate to the CFO and the head of recruiting, and that most engineering organisations have not yet thought through.

The reasoning chain. AI coding tools generate high volumes of plausible-looking code that requires senior-level judgment to review effectively. The engineers who can review that code competently are senior: staff, principal, and the top half of senior engineers. Mid-level generalists writing routine business logic are not being substituted by the tools, which are not yet reliable enough for that substitution at production quality. They are instead the engineers the tools accelerate most directly, and they now produce work at a rate that requires senior review at higher volume than before. The hiring-posture conclusion follows: the engineering organisation in 2026 needs higher senior-engineering density to support AI-leveraged mid-level engineers, not lower senior density and replacement of mid-levels with tooling.

The conclusion is the opposite of the narrative that has dominated 2025–2026 engineering-leadership discourse, which has been that AI tools will allow engineering organisations to shrink. They will not, in any organisation I have advised that has actually run the numbers honestly. They will reshape the organisation toward higher senior density and a different ratio between senior, mid-level, and platform-engineering staffing. The headcount may stay flat or rise, not fall. The composition shifts.

The specific shifts I have seen work. A platform-engineering capability dedicated to the AI tooling itself — selecting, integrating, and maintaining the tools, building the evaluation harness, running the rollout discipline. A higher proportion of senior engineers in code-review roles, with explicit review responsibilities written into senior job descriptions and review load tracked as a workload metric. A continued investment in mid-level hiring, because the mid-level engineers are the population whose individual productivity rises most measurably with the tools. A retention focus on the senior engineers whose judgment the AI-leveraged workflow now depends on, including comp adjustments and explicit recognition that the senior role has become more valuable in the AI-leveraged organisation, not less.

The hiring-posture shift is the operational consequence of the throughput honesty above. Engineering organisations that did the tool rollout without the hiring-posture shift are now experiencing a structural problem whose cleanest statement is “we have plenty of code being written and not enough being shipped competently.”

The compensation question

The compensation pressure on senior engineering roles in AI-leveraged organisations is real but uneven. The AI-native firms and the FT500 platform organisations are seeing meaningful upward pressure at the senior staff and principal levels — roughly 10–20% above 2024 benchmarks at those firms. Outside that segment, the compensation pressure is more limited and more concentrated on specific roles: engineers with credible AI-platform experience, engineers who can lead the evaluation-harness work, engineers with credentialed experience on the agentic-architecture decisions that are the highest-leverage technical bets of the period.

The structural question for a VP Engineering is not whether to raise comp across the board. It is whether the specific seniors in the engineering organisation will leave for AI-leverage roles elsewhere if comp does not adjust, and the answer is firm-specific. The signals that indicate the retention risk is real: a visible AI-platform initiative at the firm that is staffed below the senior level required to execute it; senior engineers receiving credible inbound recruiting interest from firms with stronger AI-platform stories; the firm’s published technical work conspicuously silent on AI-related topics, which the senior engineers read as a signal that their career development is at risk by staying.

The fix is not always a comp adjustment. Often the more effective intervention is visibility — putting the senior engineers on the AI-platform initiative, giving them the technical authority over the agentic-architecture decisions, publishing their work externally with their bylines. Senior engineers whose careers are visibly developing inside the firm leave at lower rates than those whose careers are stalling regardless of comp.

The cases where a comp adjustment is the right intervention. When the firm is in direct competition for talent with AI-native firms or FT500 platform organisations in the same geography, and the comp gap has grown beyond what visibility-and-development levers can close. In those cases the adjustment should be targeted at the senior-staff and principal levels rather than spread across the engineering organisation, because that is where the leverage is and where the retention risk is highest.

Smaller teams shipping more — sometimes

The narrative that AI tools enable smaller teams to ship more is partially true and largely misunderstood. The teams that have successfully shrunk while shipping more have specific characteristics: high senior-engineering density to begin with, mature engineering practices that limit the code-review bottleneck the tools create, and a workload composition heavy on routine work where the AI tools are most reliably effective.

The mid-cap engineering organisation that has been told it can shrink because of AI tools, and that does not have those characteristics, is being set up to fail. The shrinkage happens and the shipping does not improve. On-call quality degrades because the remaining engineers are spread thinner across a code base that is now generating more incidents per week from unreviewed correctness issues. The team enters a failure spiral that the published vendor case studies do not describe.

The honest operational read for the VP Engineering who is being asked about shrinkage. The teams that can credibly shrink with AI tooling are the teams that did not need to shrink for cost reasons in the first place — the teams whose senior density and practice maturity meant they were already operating efficiently. Those teams can shrink modestly while maintaining velocity, and the shrinkage is more about being able to absorb attrition without backfill than about active reduction. The teams that need to shrink for cost reasons are typically the teams whose senior density and practice maturity will not absorb the operational hit of the tools, and AI tooling does not rescue that situation.

Conway’s law applies in a specific way here. An engineering organisation that ships a software architecture mirroring the team structure will ship a different architecture as the team structure changes. Shrinking a team and adding AI tooling changes both the structure and the implicit communication patterns simultaneously, and the resulting architecture drifts in ways the VP Engineering cannot easily predict. The architectural drift shows up six to twelve months later as integration issues, capability gaps, or the kind of platform debt that takes three years to unwind. The decision to shrink is not just a headcount decision; it is an architectural decision in disguise, and the consequences of getting it wrong land on the same VP Engineering who will be asked, in the next planning cycle, why the platform team is asking for more headcount.

What reliably works

The operational changes that have reliably translated AI-tool adoption into team-level shipping gains, across the engagements where I have watched it work:

A platform-engineering investment in the tooling itself, with named ownership and a roadmap. The organisations that treat the AI coding tools as a procurement decision and not a platform investment do not see the team-level gains. The investment includes integration work, evaluation work, rollout discipline, and ongoing maintenance — typically two to four engineers of permanent staffing for a 100–200 engineer organisation.

Code-review capacity scaling that anticipates the higher review load, with senior-engineer review responsibilities written into job descriptions and review effectiveness tracked as a metric. The organisations that watch review queue length and review-comment density alongside cycle time are the ones that catch the bottleneck before it becomes a delivery problem.

Investment in automated testing infrastructure that catches the correctness categories AI tools systematically miss — boundary conditions, error handling, state-management bugs, subtle concurrency issues. The investment is in test infrastructure more than in test coverage; the tests that catch AI-generated correctness issues are property-based, integration-level, and fuzz tests, not the unit tests the AI tools already generate competently themselves.

A measurement discipline that distinguishes individual-task throughput from team-level shipping velocity, and that reports both honestly to leadership. The organisations that report only the individual-task number to leadership are the organisations whose AI rollout becomes a credibility problem six months in when the team-level numbers do not follow.

A clear position on which workloads are AI-leveraged and which are not. The most common pattern that works: AI-tool usage is high for routine implementation work, moderate for refactoring, low for novel architectural work, and minimal for production-incident response. The organisations that try to use AI tools uniformly across all workload types see the worst quality outcomes; the organisations that match tool usage to workload type see the best.
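
To make the testing item above concrete, here is a minimal hand-rolled property-based test using only the Python standard library. In practice a dedicated library such as Hypothesis would do this better; the `chunk` function under test and the chosen properties are hypothetical stand-ins for the kind of boundary-condition logic where generated code tends to slip.

```python
import random

def chunk(items: list[int], size: int) -> list[list[int]]:
    """Function under test: split items into consecutive chunks of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]

def check_chunk_properties(trials: int = 500, seed: int = 0) -> int:
    """Check invariants on random inputs instead of a few hand-picked cases."""
    rng = random.Random(seed)
    for _ in range(trials):
        items = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]
        size = rng.randint(1, 60)  # deliberately allows size > len(items)
        chunks = chunk(items, size)
        # Property 1: nothing is lost, duplicated, or reordered.
        assert [x for c in chunks for x in c] == items
        # Property 2: every chunk is non-empty and respects the size limit.
        assert all(1 <= len(c) <= size for c in chunks)
        # Property 3: only the last chunk may be short.
        assert all(len(c) == size for c in chunks[:-1])
    return trials

print(f"{check_chunk_properties()} random cases passed")
```

The point of the design is that the test states what must always be true rather than listing examples, so empty inputs and oversized chunk sizes are exercised without anyone having to think of them.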

None of this is glamorous. None of it is what the tool vendors put in their slide decks. All of it is the operational discipline that turns the genuine 5–15% team-level shipping gain into a sustainable capability rather than a six-month enthusiasm cycle followed by the inevitable disappointment.

How this connects to the rest of the cluster

The parent capabilities hub covers the broader capability layer that sits underneath the strategy. This page is the engineering-team-specific read on what generative AI is doing to one of those capability layers. The context-engineering page covers the related but distinct discipline of designing the data and prompt context flowing into the model layer, which becomes a permanent capability inside the engineering organisation once the AI workload reaches production scale.

The AI-for-CTO page covers the executive-level view of the same questions and is calibrated for the CTO making the strategic decisions. This page is calibrated for the VP Engineering making the operational decisions. The two pages should be read together if you are in either role and the other role is your immediate stakeholder.

The strategy work upstream of all of this lives at the root hub and the frameworks cluster. If your engineering organisation has been asked to roll out AI tooling without a clear strategy underneath, the four-question diagnostic on the root hub is the place to start to surface the questions the rollout will otherwise stumble over.

Sources & methodology

METR, “Measuring the impact of AI on experienced open-source developer productivity,” July 2025 — the perception-vs-measurement gap reference
DORA State of DevOps Report 2025 — team-level productivity baselines and AI-tool-adoption correlation data
GitHub Octoverse 2025 — AI-tool adoption rates and language-mix data
Stanford HAI AI Index Report 2025 — developer-productivity panel data and methodology
Will Larson, systems model of LLM adoption and developer experience (lethain.com) — the systems argument that production-error rate, not development velocity, is the binding constraint on shipping
Brooks, F. (1975), “The Mythical Man-Month” — pipeline-throughput reasoning underlying the code-review bottleneck argument
Conway, M. (1968), “How Do Committees Invent?” — the architectural-drift consequence of changing team structure
Methodology: operational observations drawn from fractional CTO engagements across roughly fifteen engineering organisations adopting AI coding tools (2024–2026), with team sizes ranging from 40 to 800 engineers. The team-level shipping-velocity ranges (5–15%) reflect my engagement experience and are consistent with the published academic studies; the vendor-published 30–55% figures reflect individual-task throughput and are not directly comparable.

If a number disagrees with your own organisation’s measurement, send the disagreement and I will publish it with attribution.

Thomas Prommer CIO / CTO · 20 years · Practitioner, not consultant

Tom Prommer writes The AI Strategy Guide from the operator's seat: every tool covered, tested with real money before forming a view.