AI Maintenance and Tech Debt: The Bill Comes Due

The maintenance crisis I walked into in February was, by the time I got involved, well past the point where any single intervention was going to fix it. The team had been using Cursor and Claude Code aggressively since early 2024, the codebase had grown to roughly 800,000 lines, and the company had hit a wall the previous quarter when a routine model-vendor migration — Anthropic deprecating an older Claude version on a published schedule — turned into a six-week emergency. The migration touched 47 distinct surfaces in the codebase, each one a place where some piece of code depended on the deprecated model’s specific behaviour, and approximately none of those surfaces had been documented as model-coupled. The engineers who had originally written the code had partly left the company and partly forgotten the context, the AI tooling that had written the code could not remember any of it, and the team spent the six weeks reverse-engineering their own codebase under deadline pressure. The post-incident review concluded that the codebase had three to four more deprecation events of similar magnitude already scheduled in the next eighteen months, and the team’s read was that they did not know how they were going to absorb them. This is what the two-year horizon on AI-augmented engineering looks like in 2026, and the discipline shifts to manage it are concrete enough to be worth writing down.

The 2026 reality is that organisations using AI coding tools seriously for two years are discovering the long-tail consequences. The discovery is uneven — some codebases have aged well and some have aged terribly, and the predictors of which is which are knowable in advance — but the maintenance burden specific to AI-augmented work is a real and growing line item, and the engineering organisations that have not budgeted for it are the ones the next eighteen months are going to be unkind to. Brooks’ Mythical Man-Month framing applies directly here, and the No Silver Bullet distinction between essential and accidental complexity is the working vocabulary the next sections lean on. Some of the tech debt the AI-augmented era has produced is essential — a real complexity intrinsic to the AI-native application stack that has to be managed. Most of it is accidental — complexity introduced by how the AI-generated code was written rather than by what it had to do, and the accidental kind is the more expensive one because it could have been avoided at write-time.

The four AI-specific tech-debt categories

The four categories below are the ones I see most often in 2026 maintenance engagements. They are distinct from the standard tech-debt taxonomy in ways that matter; the standard playbooks for paying down tech debt do not entirely transfer.

Model-vendor-coupled abstractions. The application code couples directly to a specific model vendor’s API surface — Anthropic’s tool-use shape, OpenAI’s Responses API shape, Google’s Gemini-specific structured-output handling — without abstracting the coupling cleanly. The result is that every code path touching the model has a vendor-specific shape baked into it, and every model version change becomes a cross-cutting refactor. The pattern is not the same as “we are locked in to Anthropic” — that is a procurement question. The pattern is “we are locked in to the specific shape of the Anthropic tool-use API as it existed in Q3 2024,” which is a more granular and more expensive lock-in because Anthropic itself moves the shape forward. The platform engineering page covers the model-vendor abstraction layer in detail; the maintenance consequence of not having built one is concentrated in this debt category.
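
One way to keep the shape-coupling in a single, named place is a thin normaliser at the boundary. The sketch below is illustrative only: `ToolCall`, `from_anthropic_block` and `from_openai_tool_call` are hypothetical names, and the two input shapes (an Anthropic-style `tool_use` content block and a Chat Completions-style function tool call) are assumptions that should be checked against the vendors' current documentation. It isolates the parsing of one response shape; it does not remove the underlying capability coupling.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolCall:
    """Vendor-neutral view of one tool invocation requested by a model."""
    call_id: str
    name: str
    arguments: dict[str, Any]


def from_anthropic_block(block: dict[str, Any]) -> ToolCall:
    # Assumed shape: {"type": "tool_use", "id": ..., "name": ..., "input": {...}}
    if block.get("type") != "tool_use":
        raise ValueError(f"not a tool_use block: {block.get('type')!r}")
    return ToolCall(block["id"], block["name"], dict(block["input"]))


def from_openai_tool_call(call: dict[str, Any]) -> ToolCall:
    # Assumed shape: {"id": ..., "type": "function",
    #                 "function": {"name": ..., "arguments": "<JSON string>"}}
    fn = call["function"]
    # In this assumed shape the arguments arrive as a JSON-encoded string.
    return ToolCall(call["id"], fn["name"], json.loads(fn["arguments"]))
```

Application code that only ever sees `ToolCall` has one file to change when a vendor moves the shape forward, rather than every call site.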

Prompt-as-implementation drift. Prompts embedded in the codebase as implementation details — the function that generates the user-facing summary uses a 400-token prompt template; the function that classifies the support ticket uses a 300-token prompt template — drift in three ways over an eighteen-month horizon. The model versions they were tuned against deprecate. The downstream functions consuming the model outputs evolve, while the prompts do not. The prompt-engineering knowledge that produced them in the first place was rarely written down, so the engineers maintaining them in year two do not know what the prompts were trying to optimise for and cannot tell whether a change is safe. The result is a codebase with hundreds of prompt strings, most of them outdated, none of them well-documented, and all of them somewhere on the spectrum between “still works fine” and “is silently producing worse output than it used to.” Tracking the drift is hard. Fixing it is harder. The eval-harness investment the testing strategy page recommends is the only mechanism I have seen work reliably for detecting prompt drift before it becomes a production-quality issue.
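
A minimal sketch of what writing the prompt knowledge down could look like, assuming prompts are pulled out of function bodies into records. `PromptRecord`, `stale_prompts` and the model identifiers used with them are hypothetical; the point is only that the intent, the model the prompt was tuned against, and the downstream consumers travel with the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptRecord:
    name: str
    template: str
    intent: str                  # what the prompt was optimising for
    tuned_against: str           # model identifier the prompt was last tuned on
    consumers: tuple[str, ...]   # downstream functions that consume the output


def stale_prompts(records, live_models):
    """Return names of prompts whose tuned-against model is no longer live."""
    return sorted(r.name for r in records if r.tuned_against not in live_models)
```

A check like this does not detect quality drift (that still needs the eval harness), but it does turn "hundreds of prompt strings" into a list that can be queried when a model version is retired.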

Eval-harness staleness. The eval harness was built in year one against a labelled dataset that reflected the application’s surface area at the time. The application’s surface area has grown by an order of magnitude. The labelled dataset has not grown with it, because labelling is expensive and the team that built the eval harness has rotated. The eval harness still runs in CI, still produces a passing metric, and still gates deployments — but it is now gating against a small fraction of the application’s actual behaviour, while the bulk of the behaviour is shipping without effective regression coverage. This is the more insidious version of the testing-strategy problems on the sibling page, because the harness still appears to be working. The fix is a deliberate investment in eval-dataset growth as part of the maintenance discipline; the cost is real but bounded, and the cost of not doing it compounds.

AI-generated code nobody understands well enough to refactor. The most expensive of the four. The engineer who originally accepted the AI-generated code has left the team, did not write down why the code is shaped this way, and the AI tool that produced it cannot reconstruct the original reasoning either. The code works in production. The code is not refactorable, because nobody on the current team can tell which behaviours are load-bearing and which are accidental. The pre-AI version of this pattern existed — every codebase has some — but the AI-augmented version is more pervasive because the production rate is higher and the original-author context-capture is lower. The teams I have watched manage this best treat AI-generated code with the same documentation discipline they apply to externally-sourced code, on the grounds that the model is effectively an external contributor whose context will not be there in eighteen months. This is uncomfortable for engineers who think of AI-generated code as their own code; it is, in practice, closer to library code in its maintenance characteristics.

These four categories cluster. A codebase with one of them tends to have all four, and the underlying mechanism is consistent — write-time discipline was absent, the AI tooling did not enforce what the engineering process did not enforce, and the year-two consequences accumulated in proportion to the absence. Codebases with one of them and not the others are rare; codebases with none of them are rarer and are almost always the codebases where the write-time discipline was rigorous from the start.

The maintenance-discipline shifts that work

The shifts below are the ones I have watched produce measurably better year-two outcomes across the engagements where they were applied early. None of them are revolutionary; all of them are the kind of disciplines that are easy to articulate and hard to maintain consistently.

Codebase-grounded AI tooling at year two — the compound advantage. A codebase-grounded AI tool produces better suggestions when the codebase has coherent patterns to ground against. A year-one codebase has fewer patterns and less consistency. A year-two codebase, if it has been reviewed competently, has more patterns and more consistency, which is exactly what gives the grounded tool a strong retrieval surface. Greptile’s deeper-grounding mode, Cursor’s codebase indexing, Claude Code’s project-context surface — all of these get noticeably more useful in a year-two codebase than they were in a year-one codebase, for the same reason a competent human engineer becomes more productive in a codebase they have worked in for two years. The implication is that the year-two maintenance investment can be made cheaper, not more expensive, if the codebase-grounded tooling is deployed deliberately and the codebase’s pattern coherence is maintained. This is one of the few cases where the year-two investment compounds positively, and the engineering organisations that have noticed it are the ones using their year-two AI tooling more effectively than their year-one version, not less.

Active documentation of model-coupling surfaces. Every place in the codebase that couples to a specific model vendor’s behaviour gets documented at write-time with a comment naming the coupling and the deprecation horizon, and the documentation is grep-able. The pattern reads as overkill in year one and is essential in year two. The cost is roughly two minutes per surface at write-time; the saving is several days per surface at deprecation time. The teams I have watched do this consistently are the teams whose model-vendor migrations are routine work rather than crises.
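
As an illustration of "grep-able", here is a sketch that assumes a comment convention of the form `MODEL-COUPLING: vendor=... model=... horizon=YYYY-MM-DD`. The tag name, its fields and the `find_couplings` helper are all hypothetical; any convention works as long as it is consistent enough to be matched by a regular expression.

```python
import re
from datetime import date

# Hypothetical tag format; adjust the fields to whatever the team standardises on.
COUPLING_RE = re.compile(
    r"MODEL-COUPLING:\s*vendor=(?P<vendor>\S+)\s+model=(?P<model>\S+)"
    r"\s+horizon=(?P<horizon>\d{4}-\d{2}-\d{2})"
)


def find_couplings(source: str, filename: str = "<memory>"):
    """Return one dict per tagged coupling surface found in the source text."""
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        m = COUPLING_RE.search(line)
        if m:
            found.append({
                "file": filename,
                "line": lineno,
                "vendor": m["vendor"],
                "model": m["model"],
                "horizon": date.fromisoformat(m["horizon"]),
            })
    return found
```

Run over the repository, the output is the list of model-coupled surfaces that the opening incident had to be reverse-engineered to produce.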

Eval-dataset growth as planned maintenance. The eval-harness staleness category above is fixed by treating eval-dataset growth as planned maintenance work — a recurring quarterly investment in adding labelled examples for new application surfaces, retiring labelled examples for deprecated surfaces, and auditing the dataset’s coverage against the application’s actual production traffic. The cost is real (engineering time, plus the labelling effort, plus the production-traffic-sampling infrastructure) and bounded. The cost of not doing it is the silent regression that ships because the eval harness was gating against the wrong surface.
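
The coverage audit can start very small. The sketch below assumes each eval example and each sampled production request carries a surface label (for example the name of the feature that issued the model call); `coverage_gaps` and the `min_share` threshold are hypothetical names chosen for illustration.

```python
from collections import Counter


def coverage_gaps(eval_surfaces, traffic_surfaces, min_share=0.01):
    """Surfaces carrying at least min_share of sampled traffic with no eval examples.

    Returns a dict mapping each uncovered surface to its share of the sample.
    """
    traffic = Counter(traffic_surfaces)
    total = sum(traffic.values())
    covered = set(eval_surfaces)
    gaps = {}
    for surface, count in traffic.items():
        share = count / total
        if share >= min_share and surface not in covered:
            gaps[surface] = share
    return gaps
```

A harness that passes in CI while this function reports large uncovered shares is the staleness pattern described earlier, made visible as a number.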

Refactoring AI-generated code requires more aggressive context-capture than refactoring hand-written code. The pattern that works: when refactoring AI-generated code, the engineer first uses the AI tool to reverse-engineer the code’s intent and writes that intent down in a structured comment or a separate document, then refactors against the documented intent rather than against the code’s surface behaviour. The reverse-engineering step takes longer than it would for hand-written code because the AI tool does not have the original context. The documentation step is essential because the next engineer touching the code will not have it either. Skipping either step produces refactors that miss subtle behaviours and create the kind of regression that does not show up in tests until production.
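
One possible form for the documented intent is a record that separates load-bearing behaviours from accidental ones and attaches an executable check to each load-bearing one. Everything here is a hypothetical sketch: `IntentRecord`, `broken_behaviours` and the `normalise_ticket_id` example are invented for illustration, not taken from any engagement.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IntentRecord:
    """Written-down intent for one piece of AI-generated code."""
    target: str     # function or module the record describes
    purpose: str    # what the code is for, in one sentence
    # Behaviours callers rely on, each paired with an executable check.
    load_bearing: dict[str, Callable[[Callable], bool]] = field(default_factory=dict)
    # Behaviours believed to be incidental and safe to change.
    accidental: list[str] = field(default_factory=list)


def broken_behaviours(record: IntentRecord, implementation: Callable) -> list[str]:
    """Names of load-bearing behaviours the candidate implementation violates."""
    return sorted(
        name for name, check in record.load_bearing.items()
        if not check(implementation)
    )


# Hypothetical example: a ticket-id normaliser and its recorded intent.
def normalise_ticket_id(raw: str) -> str:
    return raw.strip().upper()


record = IntentRecord(
    target="normalise_ticket_id",
    purpose="Canonical ticket ids for use as lookup keys.",
    load_bearing={
        "strips surrounding whitespace": lambda f: f("  ab-1 ") == f("ab-1"),
        "case-insensitive": lambda f: f("ab-1") == f("AB-1"),
    },
    accidental=["returns upper case rather than lower case"],
)
```

A refactor is then checked against the record rather than against the old output byte for byte: a version that lower-cases instead of upper-casing passes, while a version that forgets to strip whitespace is reported by `broken_behaviours`.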

Senior-engineer time allocated explicitly to year-two maintenance. The structural change behind all of the above. The pre-AI staffing-and-budgeting assumption was that maintenance took a minority share of engineering capacity, roughly 20 to 30 percent, well-distributed across the team. The AI-augmented version of the staffing question is that year-two maintenance specifically requires senior-engineer attention disproportionate to its line-of-code share, because the four debt categories above all require senior judgement to address. The teams that have noticed are the teams whose senior engineers spend visibly more time on maintenance than the pre-AI norm; the teams that have not noticed are the teams whose senior engineers are absorbed in the same review-queue bottleneck the code review page describes, with the maintenance debt accumulating in parallel and unattended.

The deprecation calendar problem

The structural-maintenance issue that is hardest to manage organisationally is the deprecation calendar. Every model vendor retires older model versions on a schedule — Anthropic, OpenAI, Google, the open-source-model providers, all of them. The schedules are published, the deprecation windows are typically several months long, and the migrations are non-negotiable. A model your code was built against in 2024 may not be callable in 2026, or may be callable only at significantly higher latency or cost or at the bottom of a priority queue. The deprecation calendar is, in the language Brooks would use, an essential complexity of building on a third-party model — it cannot be designed away.

The maintenance discipline that absorbs this cleanly has three pieces.

Track the vendor deprecation calendars actively. Every model vendor publishes a deprecation timeline; the engineering organisation should maintain a single internal view of the timelines for every model the codebase touches, with the model-coupling surfaces (from the documentation discipline above) mapped against the timeline. The view should be updated whenever a vendor publishes a new deprecation, and the engineering planning process should treat upcoming deprecations as scheduled work the way it treats upcoming library upgrades.
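
The single internal view can be as simple as a join between published retirement dates and the documented coupling surfaces. The sketch below is a minimal version under stated assumptions: `Retirement`, `upcoming_work`, the 180-day default horizon and the model identifiers in any usage are hypothetical, and the retirement dates would be entered by hand from each vendor's published timeline.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Retirement:
    model: str
    retires_on: date


def upcoming_work(retirements, surfaces_by_model, today, horizon_days=180):
    """List (retires_on, model, surfaces) for retirements inside the horizon.

    Rows are sorted soonest first. Retirements dated before `today` are
    excluded here and would need a separate overdue report.
    """
    cutoff = today + timedelta(days=horizon_days)
    rows = [
        (r.retires_on, r.model, sorted(surfaces_by_model.get(r.model, [])))
        for r in retirements
        if today <= r.retires_on <= cutoff
    ]
    return sorted(rows)
```

Feeding this list into quarterly planning is what turns a deprecation from a discovery into a scheduled item.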

Refactor away from model-version-coupled code where the abstraction earns its place. The model-vendor abstraction layer the platform engineering page recommends absorbs most of this for the operational surface — credentials, logging, retries — but does not and should not absorb the capability surface. Application code that uses a specific model’s tool-use shape is genuinely coupled to that capability surface, and refactoring it requires either pinning to a specific model version (which has its own deprecation horizon) or absorbing the differences between model vendors’ capability shapes (which is expensive). The right answer is usually to refactor incrementally as the deprecation horizon approaches rather than to over-abstract preemptively, but the refactor has to be planned and budgeted.

Treat the migration as planned work, not as emergency work. The maintenance crisis the page opened with happened because the deprecation was treated as emergency work. The work itself was not extraordinary in volume: 47 surfaces over six weeks (about 30 working days) is roughly one engineer-day per surface if the equivalent of one to two engineers carries it. The cost of doing it under deadline pressure with incomplete documentation, however, was several multiples of the cost of doing the same work as planned maintenance. The organisational change that prevents the crisis is treating model deprecations as part of the recurring maintenance backlog, planning the migration work in the quarter before the deprecation lands, and treating the deprecation date as a hard commitment that the planning process schedules around rather than a fire to be absorbed when it arrives.

The honest read on maintainability

The question the engineering leaders I work with ask most often is whether AI-generated code is net positive or net negative for maintainability at the two-year horizon. The honest answer is that it depends on the discipline applied at write-time, and the discipline is mostly absent in the engagements I have seen.

AI-generated code written with all of the disciplines above — code review, structured commit messages, eval-harness coverage, explicit human ownership of architectural decisions, documented model-coupling surfaces — ages roughly as well as carefully-written hand-coded software. The GitClear analyses of code-churn data over the last two years are useful here as one input among several; the picture they paint is consistent with the engagements I have observed, which is that the code-quality differences are smaller than the discourse on either side suggests, and the differences that exist are dominated by process variables rather than by the AI-versus-human variable directly.

AI-generated code written without those disciplines ages noticeably worse than the equivalent hand-coded software. The mechanism is consistent across the engagements where I have watched it happen: the lower write-time effort produces lower context-capture, the higher production rate produces less consistency across PRs, the absence of model-coupling documentation produces more expensive deprecation migrations, and the absence of eval-harness coverage produces silent regressions that compound over quarters. The result is a codebase that is harder to refactor, harder to onboard new engineers into, and more expensive to maintain than a hand-coded equivalent would have been.

The net answer on maintainability is therefore not a property of AI-generated code in the abstract; it is a property of how the AI-generated code was reviewed, grounded, and documented at write-time. The engineering leaders who internalise this are the ones whose two-year codebases age well. The engineering leaders who treat AI tooling as a write-time productivity intervention without addressing the year-two consequences are the ones whose two-year codebases have all four debt categories above, in roughly the proportions the production rate of the AI tooling predicted.

Brooks’ No Silver Bullet argument is the load-bearing one here. The essential complexity of software is the complexity intrinsic to the problem; the accidental complexity is the complexity introduced by the tools and processes used to solve it. AI tooling reduces some accidental complexity (the boilerplate, the look-up, the rote translation between specifications and code) and introduces some new accidental complexity (the model-coupling, the prompt-as-implementation, the context-capture failure). The net effect is approximately neutral on accidental complexity, which means the maintainability outcome is dominated by how well the engineering organisation manages the new accidental complexity: exactly the four debt categories and the five discipline shifts above. There is no silver bullet here either. The discipline is the work.

What this connects to

The parent software-engineering hub covers the four discipline shifts of which maintenance is the temporal one — the others happen at write-time, this one happens over the two-year horizon. The code review page covers the write-time discipline that prevents most of the debt categories above from accumulating; the two pages are best read in temporal sequence — code review at write-time, maintenance as the long-run consequence of how the review went. The testing strategy page covers the eval-harness investment that the eval-harness-staleness category above depends on; the maintenance version of the eval-harness question is the staleness pattern this page covers, and the build-it-right version is the page that covers the original investment.

The platform engineering page covers the model-vendor abstraction layer that absorbs much of the model-vendor-coupled-abstractions debt category at the operational layer; the application-side residual coupling is what this page covers. The AI coding tools hub covers the year-one procurement frame; the year-two re-evaluation of those tools’ usefulness as the codebase matures is part of what this page is implicitly arguing for. The LLM observability cluster covers the production-side observability that detects most of the prompt-as-implementation-drift category before it becomes a quality incident.

Conway’s law applies in a temporal form here. The engineering organisation’s shape at write-time determines the codebase’s shape at year two, and the maintenance discipline at year two is therefore a function of the staffing and process decisions made at the start of the AI-augmented period. The organisations that did the structural work in 2024 — senior-density rebalancing, code-review tightening, eval-harness investment, explicit model-coupling documentation — are the ones whose 2026 maintenance burden is manageable. The organisations that treated AI tooling as a write-time productivity intervention without the structural work are the ones discovering in 2026 that the structural work was the actual investment, and that they are doing it now under maintenance pressure rather than then under planning conditions. The cost is several multiples of what it would have been.


Sources

Methodology: maintenance-debt patterns and discipline-shift recommendations drawn from engagements across roughly twelve engineering organisations that had been using AI coding tools in production for 18 to 30 months as of Q1 2026. Codebase sizes ranged from 200,000 to several million lines; team sizes from 40 to 800 engineers. The four debt categories are the ones that appeared consistently across the engagements; the discipline shifts are the ones that produced measurably better year-two outcomes in the engagements where they were applied early. The “mostly absent in the engagements I have seen” verdict on write-time discipline reflects the modal engagement, not the best-case one. Disagreement from your organisation’s observed reality is the most useful form of feedback this page receives.

Frequently asked questions

Is AI-generated code a net positive or negative for long-term maintainability?
It depends on the discipline applied at write-time, and the discipline is mostly absent in the engagements I have seen. AI-generated code written with code review, structured commit messages, eval-harness coverage, and explicit human ownership of architectural decisions ages roughly as well as carefully-written hand-coded software — there is genuine evidence on both sides of the comparison and the differences are small. AI-generated code written without those disciplines ages noticeably worse than the equivalent hand-coded software, in three specific ways: it is harder to refactor because nobody remembers why it was written this way, it accumulates inconsistencies faster because the model varied across PRs, and the surfaces that depend on specific model behaviours decay as the model versions deprecate. The net answer on maintainability is therefore not a property of AI-generated code; it is a property of how the AI-generated code was reviewed and grounded. Most organisations have not understood this yet.

What is the model-deprecation calendar and why does it matter?
Every model vendor retires older model versions on a schedule — Anthropic, OpenAI, Google, the open-source-model providers, all of them. A model your code was built against in 2024 may not be callable in 2026, or may be callable only at significantly higher latency or cost. Any code that depends on specific model behaviours — particular prompt structures, specific token-limit assumptions, the exact shape of a tool-use response — has a maintenance window forced on it by the deprecation. The maintenance window is non-negotiable; the only choice is whether the engineering organisation knows about it in advance or discovers it when an API call starts returning an error in production. Track the deprecation calendars actively, treat them as planned-maintenance work, and refactor away from model-version-coupled code where the abstraction earns its place.

Why do codebase-grounded AI tools become more useful at year two than year one?
Because the codebase has more pattern in it to ground against, and the patterns are now established enough to be worth enforcing. A codebase-grounded AI tool (Greptile’s deeper-grounding mode, Cursor’s codebase indexing, Claude Code’s project-context surface) produces better suggestions when the codebase has a coherent set of patterns it can retrieve and apply. A year-one codebase has fewer patterns and less consistency. A year-two codebase, if it has been reviewed competently, has more patterns and more consistency, which is exactly what gives the grounded AI tool a strong retrieval surface. This is the same reason a competent human engineer becomes more productive in a codebase they have worked in for two years than in one they joined yesterday. The compound advantage is real, and it is one of the few cases where year-two maintenance investment compounds positively rather than negatively.