AI Software Engineering: An Engineering Leader's Map

What AI is doing to software engineering as a discipline, not a procurement line: the code review, testing, maintenance, and platform shifts a VP owns.

The platform-engineering standup I sat in on last month spent forty minutes on a single thread. The team had built a golden-path template for AI-augmented backend services three months earlier — the canonical opinionated path, the kind every platform team produces — and roughly nobody was using it. The application teams had quietly forked the template within a week, replaced the model-vendor abstraction layer with direct API calls because the abstraction was three model releases behind, and were now running a parallel stack the platform team had not authorised and could not observe. The platform lead’s question to the room was the right one: “is the golden path the wrong shape, or is the model market just moving faster than any platform team can absorb?” The answer turned out to be both, and the operating-model implications were the reason the team was in the room in the first place.

That is what this hub is for. Not the procurement question of which AI coding tool the engineering organisation buys: that lives at the AI coding tools hub and is genuinely a CIO-and-VPE decision with its own scoring frame. Not the C-suite strategy question of whether the engineering function itself is a capability the business is leading on, following on, or absent from: that lives at the AI-for-CTO page and at the root strategy hub. This hub is for the third question, the one in between: how does software engineering as a discipline actually run when half the code in the repository was generated by a tool, when the code-review queue is twice as long as it was eighteen months ago, when the testing harness has to detect categories of failure it was not designed for, when the platform team is being asked to support tooling that did not exist when the team was last sized? That is the engineering-leadership question, and it has almost no vendor literature underneath it because there is nothing to sell.

The audience for this hub is the VP Engineering, the Director Engineering, and the staff+ engineer who has been asked by their leadership to own one of the four discipline shifts below. The register is engineering-craft, not boardroom, closer to a senior architect’s notebook than to a McKinsey slide. The bias is toward what has actually worked across the engagements where I have watched it work, with the named failure modes from the engagements where it has not.

Why this cluster is not the capabilities cluster

The capabilities hub covers procurement: AI coding tools, RAG architecture, ML platforms, GPU inference, LLM observability, agentic orchestration, AI-SRE. The pages there answer the question a CIO or CAIO answers in a procurement engagement: which vendor, against which workload, on what security surface, with what cost trajectory at three times the current scale. They are decision frameworks for a buyer, and a buyer is exactly the right audience for them.

The pages here answer a different question. After the procurement decisions have been made — after Cursor or Claude Code has been rolled out to the engineering organisation, after the LLM observability vendor is integrated, after the orchestration framework is chosen — what does the engineering organisation actually do differently? What changes in the code-review process? What changes in how testing is approached? What changes in how the platform team is staffed? What is the maintenance posture for AI-generated code at the eighteen-month horizon? Those questions do not have vendors, do not have decks, and do not have a procurement category. They have a discipline, and the discipline has to be built by the engineering organisation itself.

The split matters because the same vendor will sell content on both sides and call them the same thing. They are not the same thing. A page that argues you should buy Sonar’s PR-bot is a procurement page. A page that argues your code review process needs to change in five specific ways regardless of which PR-bot you bought is a discipline page. This hub holds the second category.

The four spokes, named

The four pages under this hub each take one of the discipline shifts seriously.

AI for platform engineering. The platform-team operating model in an AI-augmented engineering organisation. The expanded charter that the platform team now has to carry: golden-path templates for AI-augmented services, model-vendor abstraction libraries, evaluation primitives, the observability surface for the agentic systems the application teams are now building. The headcount-and-staffing question, which is the one most platform teams have wrong: the platform team should scale with senior density, not with the size of the engineering organisation it serves. The build-versus-buy decision at the platform layer, which is different from the application-layer build-versus-buy decision and is more nuanced than most platform leads have time to think through. And the platform-engineering anti-patterns: the over-engineered orchestration that nobody uses, the evaluation harness divorced from the deployment pipeline, the golden-path template that lags the model market by six months and gets quietly forked.
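A model-vendor abstraction tends to lag when adding a model requires a platform release. One thinner shape, sketched minimally below, lets application teams register their own adapters while the platform keeps a single observable call path. Every name here (`ModelClient`, `ModelRegistry`, `EchoClient`, `draft-model`) is hypothetical and not taken from any real library.

```python
from dataclasses import dataclass, field
from typing import Protocol


class ModelClient(Protocol):
    """Hypothetical interface: the only surface the platform standardises."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class ModelRegistry:
    """Maps a logical model name to an adapter and records every call."""

    _clients: dict[str, ModelClient] = field(default_factory=dict)
    call_log: list[tuple[str, int]] = field(default_factory=list)

    def register(self, name: str, client: ModelClient) -> None:
        # Application teams can add an adapter without a platform release.
        self._clients[name] = client

    def complete(self, name: str, prompt: str) -> str:
        if name not in self._clients:
            raise KeyError(f"no adapter registered for model {name!r}")
        # Single choke point: the platform team observes usage here.
        self.call_log.append((name, len(prompt)))
        return self._clients[name].complete(prompt)


class EchoClient:
    """Stand-in adapter for illustration; a real one would wrap a vendor SDK."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


registry = ModelRegistry()
registry.register("draft-model", EchoClient())
```

The design choice worth noticing is that the registry owns observation, not model selection. That keeps the platform surface small enough to survive several model releases without a rewrite.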

AI code review. Two genuine shifts have happened. Review volume is up because AI generates more code, so the human reviewer becomes the bottleneck. And AI tools can now do first-pass review themselves, catching shape-of-change issues at PR time before a human looks. The page covers the practice changes (smaller PRs, enforced more strictly; structured commit messages that give AI review something to ground in; automated test-coverage gates that catch the categories AI tools miss) and the four buying motions for AI code review tools (PR-bot first-pass, pair-coding inline, security and vulnerability scanning, architectural-consistency enforcement). Named tool verdicts on Sonar, Bito, CodeRabbit, Greptile, Diamond, Codacy, plus Claude Code and Cursor in the inline-assist motion. Honest about what reliably works and what does not.
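The PR-size and commit-message practices can be enforced mechanically before a human reviewer spends any time. A minimal sketch follows; the 400-line limit and the `Why:` / `Test:` commit template are assumed placeholders for whatever a team agrees on, not values from this page.

```python
from dataclasses import dataclass

MAX_CHANGED_LINES = 400                # assumed limit, tune per team
REQUIRED_TRAILERS = ("Why:", "Test:")  # assumed commit template


@dataclass
class PullRequest:
    changed_lines: int
    commit_messages: list[str]


def review_gate(pr: PullRequest) -> list[str]:
    """Return the reasons a PR should be sent back before human review."""
    problems = []
    if pr.changed_lines > MAX_CHANGED_LINES:
        problems.append(
            f"{pr.changed_lines} changed lines exceeds {MAX_CHANGED_LINES}; split the PR"
        )
    for index, message in enumerate(pr.commit_messages, start=1):
        lines = [line.strip() for line in message.splitlines()]
        for trailer in REQUIRED_TRAILERS:
            # Each required trailer must start at least one line of the message.
            if not any(line.startswith(trailer) for line in lines):
                problems.append(f"commit {index} is missing a '{trailer}' line")
    return problems
```

An empty list means the PR can proceed to first-pass review. A non-empty list is the message to return to the author.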

AI testing strategy. The testing-discipline shift, and the two genuine problems behind it. AI generates plausible-looking tests that pass without actually testing anything: the vibe-passing-test pattern, which is Goodhart’s law in its most direct form. AI-generated code shifts the test-coverage failure modes, with fewer bugs in happy-path code and more bugs in edge-case handling that the model glossed over. The page covers the practice changes (test-first prompting, structured test-spec language the AI can ground in, evaluation harnesses that catch what standard test-runners miss), the four buying motions for AI test tooling, and the honest read on the LLM-output-as-test-target problem. You cannot ship an LLM output to a test runner without an eval harness; the standard unit-test discipline does not transfer to non-deterministic systems, and pretending it does is one of the more expensive category errors of the period.
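To make the eval-harness point concrete: an exact-match assertion is the wrong tool for a non-deterministic output, so a harness samples the system several times and gates on a pass rate against property checks. The sketch below is minimal; `fake_model` is a stand-in for a real model call, and the 0.9 threshold and 20 samples are illustrative assumptions.

```python
import random
from typing import Callable


def pass_rate(
    generate: Callable[[str], str],
    prompt: str,
    checks: list[Callable[[str], bool]],
    samples: int = 20,
) -> float:
    """Fraction of sampled outputs that satisfy every property check."""
    passed = 0
    for _ in range(samples):
        output = generate(prompt)
        if all(check(output) for check in checks):
            passed += 1
    return passed / samples


def eval_gate(rate: float, threshold: float = 0.9) -> bool:
    # Gate on a rate, not on a single run: one passing sample proves little.
    return rate >= threshold


# Stand-in for a model call; a real harness would call the deployed system.
rng = random.Random(0)


def fake_model(prompt: str) -> str:
    return rng.choice(["Refund approved.", "refund approved", "Approved: refund."])


checks = [
    lambda out: "refund" in out.lower(),  # a property, not an exact string
    lambda out: len(out) < 200,
]
rate = pass_rate(fake_model, "Summarise the decision", checks)
```

The checks assert properties of the output (mentions the refund, stays short) rather than a single expected string, which is what allows the same harness to survive a model upgrade.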

AI maintenance and tech debt. The 2026 reality. Organisations that have been using AI coding tools seriously for two years are starting to discover the long-tail consequences. The page covers the four categories of tech debt specific to AI-augmented work: model-vendor-coupled abstractions, prompt-as-implementation drift, eval-harness staleness, AI-generated code that nobody understands well enough to refactor. The maintenance-discipline shifts (codebase-grounded AI tools become more useful at year two than at year one for exactly the same reason). The deprecation calendar problem. And the honest read on whether AI-generated code is a net positive or net negative for maintainability at the two-year horizon. It depends on the discipline applied at write-time. The discipline is mostly absent in the engagements I have seen.
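The deprecation calendar problem is easier to manage when it is checked in CI rather than remembered. The sketch below is minimal and assumes a hand-maintained calendar of retirement dates; the model identifiers and dates are invented for illustration and do not describe any real vendor's schedule.

```python
from datetime import date, timedelta

# Assumed, hand-maintained calendar: model identifier -> retirement date.
DEPRECATION_CALENDAR = {
    "vendor-a-large-v1": date(2026, 9, 1),
    "vendor-b-fast-v2": date(2027, 3, 15),
}


def models_at_risk(
    pinned_models: dict[str, str],
    today: date,
    warning_window: timedelta = timedelta(days=90),
) -> list[str]:
    """List services whose pinned model retires within the warning window.

    pinned_models maps a service name to the model identifier it pins.
    Models absent from the calendar are reported too, since an unknown
    retirement date is itself a maintenance risk.
    """
    findings = []
    for service, model in sorted(pinned_models.items()):
        retires = DEPRECATION_CALENDAR.get(model)
        if retires is None:
            findings.append(f"{service}: {model} has no calendar entry")
        elif retires - today <= warning_window:
            # Also true for models that have already retired.
            findings.append(f"{service}: {model} retires on {retires.isoformat()}")
    return findings
```

Run against the list of model identifiers each service pins, this turns a vendor's retirement notice into a failing check with a named owner instead of an incident.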

How this hub connects outward

Several pages elsewhere on the site are direct neighbours of this work and should be read alongside.

The AI coding tools hub is the procurement frame I mentioned above. Read it before any tool decision; read this hub for what to do with the tool once it is rolled out. The two together cover most of what an engineering leader needs to know in 2026 about how AI changes the function. Either one alone is a partial view, and the partial views are how organisations end up either buying tools they cannot operate or trying to build disciplines that no tool supports.

The AI for engineering teams page is the throughput-versus-velocity argument in detail: the structural gap between individual-task productivity and team-level shipping velocity, and the operational changes that close some of it. The pages under this hub are the discipline-specific working version of that argument. They share the same diagnosis. They differ in the specificity of the prescription. If you only have time for one read, read the engineering-teams page; if you are running one of the discipline-specific changes, read the page under this hub that covers it.

The agentic AI architecture patterns page is the pattern-vocabulary sibling. When the platform-engineering page below talks about supporting application teams building agentic systems, the four patterns in that page are the vocabulary the platform team needs to standardise on. When the testing-strategy page talks about evaluation harnesses for non-deterministic outputs, the same pattern vocabulary applies. The pages here borrow the vocabulary; that page defines it.

The LLM observability cluster covers the tooling for monitoring AI-augmented systems in production. The platform-engineering page below treats it as a build-versus-buy decision (mostly buy). The maintenance-and-tech-debt page treats it as a year-two requirement most teams underinvest in until the incident postmortem forces the upgrade.

The ML platforms page and GPU inference page are the procurement-frame upstream of the platform-engineering page. The platform team’s build-versus-buy decisions on inference and on the model-vendor abstraction depend on what the rest of the AI infrastructure stack looks like, and those two pages cover that infrastructure stack from the procurement angle.

Three sibling pieces in this cluster cover the board-and-CAIO angle on software-engineering AI work, distinct from the practitioner-and-VPE angle the four spokes above carry. The AI strategy for software engineering page is the Gartner-2030 read framed as a board playbook. The software engineering AI readiness page is the assessment instrument that sits one step upstream of any of the spoke decisions on this page. The AI software engineering workforce strategy page is the headcount-and-talent question: why cutting engineering headcount in response to AI tooling underperforms upskilling, and what the workforce-planning posture should be instead.

The AI-for-CTO page covers the C-suite read of the same questions. The CTO answers for the strategic direction; the VP Engineering and the engineering leadership cohort below the CTO answer for the operational execution. The two pages should be read together if your role sits at either level and the other level is your immediate stakeholder. Most of the failure modes I have seen in the last two years happen when the CTO-level and VPE-level views of the same question disagree without anyone surfacing the disagreement until it has produced an incident or a board-memo embarrassment.

What is not in this hub

Tooling reviews. The procurement questions of which AI coding tool to standardise on, which LLM observability vendor to integrate, and which agentic framework to commit to belong in the capabilities cluster, which is where the named-vendor scoring sheets live. The pages here name tools where naming them is honest (Sonar, CodeRabbit, Greptile in the code review piece; the same kind of references in the testing piece), but the purpose is to describe what the discipline looks like with or without any specific tool, not to recommend which one to buy.

Strategy-level questions. Whether your firm should be a leader, follower, or absentee on AI-augmented engineering as a strategic posture: that question lives at the root hub and at the frameworks cluster. The pages here assume the strategic posture is set and focus on the engineering-leadership execution beneath it.

Individual-engineer technique. How to prompt Claude Code better, which Cursor configuration produces the best autocomplete, which inline assistant flow saves the most keystrokes. There is a lot of good writing about this elsewhere on the open web; my version would not add anything an engineer practising twice a week would not already know. The audience here is the leader who is responsible for how the engineering organisation operates, not the engineer writing the code.

How to use this hub

If you are setting the engineering organisation’s direction for the next two quarters, read this overview, then read whichever of the four spokes covers the discipline that is most immediately broken in your organisation. In most engagements I have walked into in the last six months, the answer has been code review. In about a third of them it has been platform engineering. In a smaller but growing number it has been maintenance, because the codebase has aged enough that the tech-debt categories the maintenance page covers are now expensive. Testing is the discipline most teams underinvest in proactively and the one whose underinvestment produces the slowest and most expensive consequences.

If you are a staff+ engineer leading one specific change inside the engineering organisation, the spoke pages are the working notes you would otherwise have to write yourself. They are written to be the document that exists when you walk into the meeting where the change is being scoped. The bias is operational; the named patterns are the ones that have worked across the engagements where I have observed them. The frameworks are the ones I use when I am the staff engineer in the room.

If you are a CTO using this hub as a reference for the C-suite conversation upstream, read the spoke pages once for the working detail, then read the AI-for-CTO page for the strategic-language version of the same diagnosis. The two map onto one another deliberately; the AI-for-CTO page is the version you would put in front of the board, and the pages here are the version you would give to the VP Engineering executing against the board’s approved direction.

The hub will get updated as the discipline moves. The two-year-old AI-augmented codebases that the maintenance page describes did not exist in 2024; by 2027 the relevant horizon will be the three-year-old codebase, and the discipline shifts will be different. The pages here are dated and will get refreshed; the diagnoses are the ones that look stable enough to commit to in writing in 2026.


Sources & methodology

  • METR — Measuring the impact of AI on experienced open-source developer productivity, July 2025 — the perception-versus-measurement gap that underlies the team-level shipping argument
  • DORA — State of DevOps Report 2025 — team-level productivity baselines and AI-tool-adoption correlation
  • Brooks, F. (1975), “The Mythical Man-Month” — pipeline-throughput reasoning underlying the code-review bottleneck and the maintenance argument; “No Silver Bullet” (1986) for the essential-versus-accidental complexity distinction the pages below borrow
  • Conway, M. (1968), “How Do Committees Invent?” — the organisational-shape-determines-architecture argument that the platform-engineering page leans on
  • Anthropic — Building effective agents — the minimal-orchestration argument that constrains what the platform team should standardise on
  • Methodology: discipline shifts drawn from fractional CTO engagements across roughly twenty engineering organisations (2024-2026) ranging from 40 to 1,200 engineers. The patterns are the ones I have watched succeed and fail across that sample; the named tools are the ones the engineering teams I respect have converged on; the failure modes are the ones produced by the engagements where I was not consulted at the right point.

If a claim disagrees with your organisation’s observed reality, send the disagreement and I will publish it with attribution. The discipline is moving fast enough that disagreement is the most valuable form of feedback this hub receives.

Frequently asked questions

How is this hub different from the AI coding tools hub?
The AI coding tools hub is procurement: which vendor, on what licence, with what data path, against which security surface. This hub is engineering discipline: what changes about how code review actually runs, how testing has to be rethought, how platform teams now staff against AI tooling, how a two-year-old AI-augmented codebase ages. The procurement question is the one the CIO signs. The discipline questions are the ones the VP Engineering owns the day after the licence closes, and they do not have vendor decks attached.

Why a separate cluster from capabilities?
Because the audience is different. The capabilities cluster is the CIO or CAIO procuring against a strategy: RAG, ML platforms, GPU inference, LLM observability. This cluster is the VPE, Director Engineering, or staff+ engineer running an engineering organisation in the AI-augmented era. The pages here cover the practice changes nobody puts on a slide because they cannot be bought. They have to be built into how the team operates. A capabilities page describes a market. A software-engineering page describes a craft.

What is the one practice shift you would prioritise if you only had budget for one?
Code review capacity and discipline. Every other failure mode in an AI-augmented engineering organisation routes through code review eventually. The unreviewed correctness bugs land in production three months later as incidents. The unreadable AI-generated code becomes unmaintainable a year later. The shipping bottleneck that the throughput-versus-velocity gap describes is, mechanically, a code-review bottleneck. If you only fix one thing in 2026, fix code review: both the staffing and the tooling, in that order.

Are AI-augmented engineering organisations actually shipping faster?
Some are. Most are shipping the same amount with more code being written, which is not the same thing. The teams that ship measurably faster have invested in the four discipline shifts these pages cover: platform-engineering operating model, code review at higher senior density, testing that catches what AI tools miss, and active maintenance of AI-generated code before it ossifies. The teams that bought the tools and changed nothing else are running the same delivery cadence as 2023 against a larger and lower-quality codebase. The visible gap between those two groups widens through 2026 and will be the defining engineering-organisation question of 2027.