AI Code Review: When the Bottleneck Moves

A PR review last quarter ran an extra ninety minutes because three people in the thread disagreed about whether the change was actually correct. The diff was 1,400 lines, generated mostly by Cursor over a two-day stretch by a competent mid-level engineer, and the surface review had gone fine — the tests passed, the linter was happy, the PR-bot the team was trialling had left two minor nits and an approval. The senior reviewer who finally looked at it noticed that the change had quietly introduced a second pattern for handling timezone-aware datetimes that contradicted the one established eighteen months earlier in a different module. The author had not noticed because the AI tool had not noticed; the AI tool had not noticed because the contradicting module was not in the context window. The PR was eventually merged, after a two-day rewrite that nobody had budgeted time for, and the team’s retrospective the next week landed on the same conclusion most of these retrospectives land on: the review process worked exactly the way it always had, and the process is the wrong shape for the volume and the kind of code now arriving at it.

Code review is the discipline shift inside an AI-augmented engineering organisation that produces the most second-order consequences. Every other failure mode I have watched in the last eighteen months routes through code review eventually. The unreviewed correctness bug lands in production three months later as an incident. The unreadable AI-generated code becomes unmaintainable a year later. The shipping bottleneck that the throughput-versus-velocity gap describes is, in mechanism, a code-review bottleneck: the AI tools have moved the constraint from typing speed to reviewer attention, and the engineering organisation that has not noticed is the one whose senior engineers are quietly burning out behind a queue they cannot drain.

Two genuine shifts have happened. First, review volume is up because AI generates more code, and the human reviewer is now the bottleneck in a way they were not in 2022. Second, AI tools can now perform a first-pass review themselves, catching shape-of-change issues at PR time before a human looks. Both shifts are real, and both have implications the review process has to absorb. Most engineering organisations have absorbed neither cleanly, which is why this page exists.

Shift one — the reviewer is the bottleneck now

The mechanism is straightforward. Pre-AI, the binding constraint on shipping velocity was roughly the rate at which engineers could produce code that worked. Review was a tax on top of that constraint, important but rarely the binding one. Post-AI, the perceived rate at which engineers produce working code has approximately doubled across the teams I have audited (the METR 2025 developer productivity study gives a more nuanced perception-versus-measurement picture), while the rate at which reviewers can review has not changed. The reviewer’s eyes still move at human speed. The reviewer’s judgement still requires holding context the bot cannot hold. The PR queue lengthens. Senior engineers spend more of their time in review and less in design. The team’s effective shipping cadence stops tracking its individual-task productivity and starts tracking its review throughput.
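
A rough way to see the mechanism is to model shipping as a two-stage pipeline whose steady-state throughput is the smaller of the two stage rates. The sketch below is illustrative only: the function names are hypothetical and the weekly numbers are invented for the example, not measurements from any team.

```python
def shipped_per_week(produced_per_week: float, reviewed_per_week: float) -> float:
    """Steady-state shipping rate of a two-stage pipeline: the slower stage wins."""
    return min(produced_per_week, reviewed_per_week)


def queue_after(weeks: int, produced_per_week: float, reviewed_per_week: float) -> float:
    """PRs waiting for review after `weeks`, starting from an empty queue."""
    backlog = 0.0
    for _ in range(weeks):
        # The queue grows by the surplus of production over review, never below zero.
        backlog = max(0.0, backlog + produced_per_week - reviewed_per_week)
    return backlog


# Invented numbers: production doubles, review capacity stays flat.
before = shipped_per_week(produced_per_week=20, reviewed_per_week=25)
after = shipped_per_week(produced_per_week=40, reviewed_per_week=25)
backlog = queue_after(weeks=12, produced_per_week=40, reviewed_per_week=25)
```

With these assumed inputs, doubling production lifts shipped work only up to the review rate, and the surplus accumulates in the queue every week.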

The teams that have noticed this and adjusted have done four things, in roughly this order of leverage.

They have shrunk PRs more aggressively, with a hard ceiling enforced by tooling. The pre-AI working norm of “PRs should be small” was a guideline. The AI-augmented version has to be a rule the CI enforces. The pattern that works across the teams I have watched is a soft target of 200 lines changed, a hard ceiling of 500 lines that requires explicit reviewer approval to merge past, and a CI check that flags PRs above the ceiling for the engineering manager. The reasoning is that AI-generated code packs more decisions per line than hand-written code; a 500-line AI-generated PR puts roughly the cognitive load of a 1,500-line hand-written PR on the reviewer. Teams that do not adjust the size ceiling for this discover the consequences in the review queue.
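
A minimal sketch of the size check, assuming "lines changed" means added plus deleted lines and that the CI job already knows whether a reviewer has explicitly approved an oversize PR. The names SizeVerdict and classify_pr are hypothetical; the two thresholds are the ones given above. Notifying the engineering manager is left to whatever CI system runs the check.

```python
from enum import Enum

SOFT_TARGET = 200   # lines changed: soft target from the text
HARD_CEILING = 500  # lines changed: hard ceiling from the text


class SizeVerdict(Enum):
    OK = "ok"
    WARN = "warn"    # above the soft target: nudge the author to split the PR
    BLOCK = "block"  # above the hard ceiling without explicit reviewer approval


def classify_pr(lines_added: int, lines_deleted: int,
                override_approved: bool = False) -> SizeVerdict:
    """Classify a PR by size. Assumes size = added + deleted lines."""
    changed = lines_added + lines_deleted
    if changed > HARD_CEILING and not override_approved:
        return SizeVerdict.BLOCK
    if changed > SOFT_TARGET:
        return SizeVerdict.WARN
    return SizeVerdict.OK
```

In this sketch an approved oversize PR still returns WARN rather than OK, so the exception stays visible in the CI output.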

They have moved to structured commit messages and PR descriptions so the AI review tooling can ground itself. This sounds like process theatre and it is not. The PR-bots that work — CodeRabbit, Greptile, Sonar’s AI work, Graphite Diamond — perform meaningfully better when the PR description names the change’s intent, the affected surfaces, and the testing approach. The same is true for the human reviewer. The pattern that has stabilised across teams is a templated PR description (intent, surfaces touched, test approach, risk areas) that takes the author three minutes to write and saves both the bot and the reviewer ten minutes each. The teams that resist this on the grounds that “good engineers know how to write a PR description” are the teams whose review-queue length tells them otherwise.
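
One way to make the template stick is a small check that rejects a PR description with missing or empty sections. The sketch below assumes a plain "Heading: text" layout for the four fields named above; that layout and the missing_sections helper are hypothetical, not a convention of any particular PR platform or bot.

```python
REQUIRED_SECTIONS = ("Intent", "Surfaces touched", "Test approach", "Risk areas")


def missing_sections(description: str) -> list[str]:
    """Return the required sections that are absent or have no content."""
    filled = {name.lower(): False for name in REQUIRED_SECTIONS}
    current = None
    for line in description.splitlines():
        head, sep, rest = line.partition(":")
        if sep and head.strip().lower() in filled:
            # A new required section starts; text after the colon counts as content.
            current = head.strip().lower()
            line = rest
        if current is not None and line.strip():
            filled[current] = True
    return [name for name in REQUIRED_SECTIONS if not filled[name.lower()]]


example = """Intent: unify timezone handling in billing
Surfaces touched:
- billing/
Test approach: unit tests plus one integration test
Risk areas:
"""
problems = missing_sections(example)  # "Risk areas" is present but empty
```

A CI job could fail the check when the returned list is non-empty and print the missing headings back to the author.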

They have invested in automated test-coverage gates that catch the categories the AI tools systematically miss. The PR-bots are uneven on test coverage analysis. The thing that catches more is a coverage gate in CI that fails the PR when coverage drops on a touched file, combined with a mutation-testing pass on a sample of merged PRs that catches the vibe-passing-test pattern the testing strategy page covers in detail. This is a build-side investment, not a review-side investment, but its purpose is to reduce the review load by catching mechanical issues before the reviewer’s attention is required.
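
The coverage gate described above can be sketched as a comparison of per-file coverage between the base branch and the PR head, restricted to the files the PR touched. The input format (a mapping from file path to coverage percentage) and the coverage_regressions helper are assumptions for illustration; a real pipeline would first extract these numbers from its coverage tool's report.

```python
def coverage_regressions(
    base: dict[str, float],
    head: dict[str, float],
    touched_files: set[str],
    tolerance: float = 0.0,
) -> dict[str, tuple[float, float]]:
    """Return {path: (base_pct, head_pct)} for touched files whose coverage dropped."""
    regressions = {}
    for path in sorted(touched_files):
        if path not in head:
            continue  # file deleted or not measured on the PR head
        before = base.get(path)
        if before is None:
            continue  # new file: no baseline to compare against
        after = head[path]
        if after < before - tolerance:
            regressions[path] = (before, after)
    return regressions


base_cov = {"billing/tz.py": 90.0, "billing/invoice.py": 80.0, "reports/export.py": 70.0}
head_cov = {"billing/tz.py": 85.0, "billing/invoice.py": 80.0, "reports/export.py": 60.0,
            "billing/new_rule.py": 10.0}
touched = {"billing/tz.py", "billing/invoice.py", "billing/new_rule.py"}
failed = coverage_regressions(base_cov, head_cov, touched)
```

New files have no baseline in this sketch, so they pass the gate; a team would likely pair it with a separate minimum-coverage rule for new files.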

They have re-weighted the engineering team’s seniority composition. This is the structural change most engineering leaders resist and most engineering leaders eventually make. The team that ships in an AI-augmented organisation has more senior engineers per mid-level engineer than the pre-AI version did, because the binding constraint has moved from production to review. The platform-engineering page makes the same argument about the platform team specifically — see the senior density section there — and the same logic applies to application teams. The teams that do not adjust seniority composition discover that their throughput-versus-velocity gap is not closing despite continued tooling investment.

Shift two — AI tools can now do first-pass review competently

The PR-bot pattern has stabilised in 2026 in a way it had not in 2024. The early generation of AI code review tools (broadly, anything before about late 2024) produced reviews that were syntactically plausible and substantively shallow — they caught obvious bugs the linter would have caught anyway and missed the things a senior reviewer cared about. The current generation is meaningfully better. CodeRabbit and Greptile in particular have moved past the “wrapping an LLM around a diff” pattern into something closer to “indexing the codebase, retrieving the relevant context, and reviewing the diff against that context.” Sonar’s AI work sits on top of their decade-plus of static-analysis tooling, which gives the AI side genuine grounding. Graphite Diamond has converged on a similar pattern. The category is real now.

What the current generation reliably does well:

  • Shape-of-change review. Missing null handling, inconsistent naming, obvious edge-case gaps, missing tests for new code paths, log lines that leak secrets, basic security issues like SQL string concatenation or unparameterised template rendering. The kind of thing a senior reviewer flags in the first thirty seconds of looking at a PR.
  • Cross-file consistency in the small. If the change touches three files and the third file’s pattern contradicts the first file’s pattern, the bot will usually catch it. The bot’s context window is large enough for this in 2026 in a way it was not in 2024.
  • Surface-level architectural drift. The change that introduces a new dependency, the change that breaks a layering rule, the change that touches a module the author rarely touches. The bot flags these for human attention rather than judging them, which is the right division of labour.
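
To make the first item in the list above concrete, here is what the SQL string concatenation case looks like next to its parameterised fix, using Python's standard sqlite3 module. The table, data, and function names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "admin"), ("bob", "dev")])


def find_user_unsafe(name: str) -> list:
    # Flag-worthy: user input is concatenated into the SQL text.
    query = "SELECT name, role FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(name: str) -> list:
    # Parameterised: the value is passed separately from the SQL text.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "x' OR '1'='1"
leaked = find_user_unsafe(payload)   # the injected OR clause matches every row
contained = find_user_safe(payload)  # the payload is treated as a literal name
```

The difference is visible in the diff itself, which is why this category sits comfortably inside what a first-pass reviewer, human or bot, can flag.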

What the current generation reliably does not do:

  • Architectural consistency in the large. The change that contradicts a pattern established three modules away, in code the bot has not retrieved, gets through. The Cursor-generated 1,400-line timezone-handling PR I opened this page with is the canonical example, and the version of it that goes through a current-generation PR-bot still gets through.
  • Domain correctness in regulated or specialised contexts. Financial calculations with non-obvious precision constraints, healthcare workflows with regulatory edges, anything where the right answer requires domain context the bot has not been given. The bot will approve confidently, and the bot is wrong.
  • The wrong abstraction in the right shape. The pattern where the code is structurally clean, the tests pass, the linter is happy, the change looks like a competent refactor, and the abstraction it introduces is subtly wrong for the codebase’s actual evolution direction. This is a senior-engineer judgement call, and the bot does not make it.
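
The timezone case in the first item above can be made concrete. The source does not say what the two patterns were, so the sketch below is a hypothetical reconstruction: one helper follows an established "always timezone-aware UTC" convention, the other introduces naive datetimes. All names are invented.

```python
from datetime import datetime, timezone


def parse_aware(text: str) -> datetime:
    """Established pattern (hypothetical): every datetime carries UTC tzinfo."""
    return datetime.fromisoformat(text).replace(tzinfo=timezone.utc)


def parse_naive(text: str) -> datetime:
    """Second pattern (hypothetical): naive datetimes with no tzinfo."""
    return datetime.fromisoformat(text)


def is_overdue(due: datetime, now: datetime) -> bool:
    return now > due


def mixing_fails() -> bool:
    """True if comparing values from the two patterns raises TypeError."""
    try:
        is_overdue(parse_aware("2026-01-01T00:00:00"),
                   parse_naive("2026-01-02T00:00:00"))
    except TypeError:
        return True
    return False
```

Each helper is self-consistent, so tests written against either one pass. The failure only appears when values from the two conventions meet, which is the part a reviewer looking at the diff alone does not see.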

The honest read on this is that the PR-bots clear the floor competently, freeing the human reviewer to focus on the things that require senior judgement. They do not replace senior judgement. The vendors that claim otherwise are selling something that does not exist, and the engineering organisations that buy that claim are the ones whose codebases drift in ways nobody notices until the incident.

The four buying motions, named

The AI code review tooling category does not have one shape. It has four, and they are not interchangeable. An engineering leader scoping the investment should understand which motion they are buying into, because the four motions have different procurement decisions and different organisational consequences.

PR-bot first-pass review. The bot reviews every PR before a human looks. The model output is a structured set of comments inline on the diff. The leading vendors are CodeRabbit, Greptile, Sonar (the AI-enhanced layer on top of static analysis), and Graphite Diamond. The procurement decision is straightforward — pick one, integrate with the PR platform, calibrate the noise level downward over the first month until the false-positive rate is below about 15 percent. The cost is low (mostly per-seat or per-repo SaaS pricing), and the time-to-value is short.
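
Calibration needs a number to track. The sketch below assumes the team labels each bot comment during triage as either "useful" or "false_positive", and defines the false-positive rate as the share of triaged comments with the second label. The labels, the helper name, and that definition are assumptions; vendors may measure it differently.

```python
from collections import Counter

TARGET_RATE = 0.15  # "below about 15 percent", from the text


def false_positive_rate(labels: list[str]) -> float:
    """Share of triaged bot comments that reviewers marked as false positives."""
    counts = Counter(labels)
    triaged = counts["useful"] + counts["false_positive"]
    if triaged == 0:
        raise ValueError("no triaged comments yet")
    return counts["false_positive"] / triaged


week_one = ["useful"] * 12 + ["false_positive"] * 8
week_four = ["useful"] * 18 + ["false_positive"] * 2
needs_tuning = false_positive_rate(week_one) >= TARGET_RATE
calibrated = false_positive_rate(week_four) < TARGET_RATE
```

Tracking the rate weekly over the first month gives the calibration work a concrete stopping condition.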

Pair-coding inline assist. The AI sits inside the IDE and reviews code as it is written, before the PR is even opened. The leading tools are Cursor, Claude Code, GitHub Copilot’s review surface, and Windsurf. This is review-shifted-left, and it catches a different category of issue than the PR-bot — the bug the engineer would have introduced if the inline assist had not flagged it during writing. The procurement decision overlaps with the AI coding tools procurement decision generally (covered at the AI coding tools hub); the review value is one of several reasons to invest, not the primary one.

Security and vulnerability scanning. The AI-enhanced version of what Snyk, Sonar, and Codacy have done for years. The category was real before AI; the AI layer on top makes it materially better at catching the bugs that look correct in the immediate code but are exploitable in context. Snyk’s AI work, Sonar Cloud’s AI features, Codacy’s AI integration, and the security-focused PR-bots like Bito’s security mode all fit here. The procurement decision is usually orthogonal to the general PR-bot decision — security review is its own surface with its own compliance requirements, and the security team typically owns the procurement.

Architectural-consistency enforcement. The newest of the four motions, and the one with the least mature vendor landscape. The premise is that the AI tool indexes the codebase, learns the patterns established in it, and flags PRs that deviate from those patterns. Greptile’s deeper-grounding mode and CodeRabbit’s codebase-indexing features are pointing at this, and a handful of newer vendors (Bito at scale, plus several genuinely experimental products) are claiming the territory. The procurement decision here is harder because the category is moving fast and the vendor verdicts are still settling. I would not commit a multi-year procurement here in 2026; I would pilot two vendors against the same codebase for a quarter and see which one’s grounding matches the codebase’s actual structure.

The four motions can run simultaneously — most mature engineering organisations have at least two of them — but they should be procured as separate decisions with separate evaluation criteria. The vendor that wins the PR-bot motion is rarely the vendor that wins the architectural-consistency motion, and pretending they are the same procurement is how engineering organisations end up with three half-deployed tools and one fully-deployed quarterly invoice.

Named vendor verdicts, honest about what works

The vendors below are the ones I have watched in real engagements through 2025 and the first half of 2026. The verdicts are the patterns that have stabilised across the engagements; your codebase may produce different ones.

Sonar (SonarSource). Works, for the static-analysis-plus-AI motion. The decade of static-analysis tooling underneath gives Sonar’s AI layer genuine grounding, and the false-positive rate is the lowest in the category. The trade-off is that the AI layer is conservative — it catches what it is sure about and stays quiet about what it is not, which means it under-flags compared to the louder vendors. For regulated or enterprise contexts where false positives are expensive, this is the right trade.

CodeRabbit. Works, for the PR-bot first-pass motion. The codebase-indexing surface in 2026 is meaningfully better than the 2024 version, and the comment quality is competitive with senior-reviewer first-pass attention. The trade-off is volume — out of the box, the noise level is higher than most teams can absorb, and the calibration work in the first month is non-trivial. Worth it once tuned.

Greptile. Works, for the codebase-grounded review motion specifically. Greptile’s approach is closer to “retrieve relevant context from the codebase and review against it” than to “review the diff in isolation,” which matches the category of failure mode I described earlier — cross-file architectural consistency. The trade-off is that Greptile requires substantial indexing of the codebase to work well, which is fine for engineering organisations that can give it that access and a non-starter for ones that cannot.

Bito. Useful for the security-and-vulnerability motion, less differentiated for the general PR-bot motion. The product is competent across the board and best-in-class in none of the four motions, which is fine for a smaller engineering organisation that wants one vendor to cover multiple bases and wrong for a larger organisation that should be buying best-of-breed per motion.

Graphite Diamond. Works, for the PR-bot first-pass motion specifically inside Graphite’s stacked-PR workflow. If your team is on Graphite already, Diamond is the right PR-bot. If your team is not on Graphite, the Diamond product is not on its own a reason to migrate.

Codacy. Works, for the static-analysis-plus-AI motion in compliance-heavy contexts. Similar territory to Sonar with a slightly different shape — Codacy leans harder into the quality-gate-as-CI surface, which is useful for organisations where compliance review needs a paper trail. Trade-off similar to Sonar — conservative on false positives, under-flags compared to the louder vendors.

Claude Code and Cursor (inline assist). Both work for the pair-coding inline assist motion, and the verdict on which one to use is mostly the same verdict as the general AI coding tool decision. For review purposes specifically, Claude Code’s structured-output discipline produces marginally better in-line review comments; Cursor’s autocomplete is marginally smoother during writing. The two surfaces are complementary, and an engineering organisation that has standardised on one will not get meaningful additional review value from adding the other.

The vendors that claim to replace senior-engineer architectural review. They do not. The pattern is consistent — vendors who pitch this are pitching against the wrong problem. The senior reviewer’s job is not first-pass review; it is the judgement call the bot cannot make. Replacing the bot-replaceable surface is genuine value. Claiming to replace the senior reviewer is selling a product that does not work yet, and engineering leaders who buy it discover the consequence in the codebase a year later.

What this connects to

The parent software-engineering hub holds the four discipline shifts of which code review is one. The platform engineering page is the most adjacent — the platform team owns the integration of the PR-bot into the engineering organisation’s CI/CD pipeline, and the platform team’s own code is the highest-leverage place to apply the senior-density and small-PR disciplines this page recommends. The testing strategy page covers the test-coverage gates that catch what the PR-bots systematically miss; the two pages together cover the pre-merge quality surface. The maintenance and tech debt page covers what happens at the eighteen-to-twenty-four-month horizon to the code that did get merged, and the link between weak code review and expensive maintenance is the dominant signal in that page’s diagnosis.

The AI coding tools hub is the procurement frame for Cursor, Claude Code, and the inline-assist category generally; the review motion is one of several reasons to invest, and the procurement decision lives there. The LLM observability cluster covers the post-merge surface — monitoring what AI-augmented code actually does in production — and the review-and-observability pairing is one of the higher-leverage pre-and-post quality investments an engineering organisation can make in 2026.

The Brooks reading of all of this is the one the parent hub names: the Mythical Man-Month pipeline-throughput argument applied to code review. The reviewer is the slowest stage in the pipeline post-AI; adding more code-generation capacity upstream does not increase throughput unless the review stage gets faster too. The four practice changes above are, in mechanism, four ways to increase the review stage’s throughput. The teams that have done them are shipping faster. The teams that have not are running the same delivery cadence as 2023 against a larger queue, which is consistent with the perception-versus-measurement gap the METR 2025 study reported.


Sources

Methodology: vendor verdicts and discipline-shift patterns drawn from fractional CTO and engineering-advisory engagements across roughly eighteen engineering organisations using AI code review tooling in production (2024-2026). Codebase sizes ranged from 50,000 lines to several million; team sizes from 12 to 800 engineers. The verdicts reflect the patterns that stabilised across the engagements; your codebase may produce different ones, and disagreement is the most useful form of feedback this page receives — send it and I will publish it with attribution in the next refresh.

Frequently asked questions

Does an AI PR-bot replace a human reviewer?
No, and the vendors that say otherwise are selling the wrong thing. The PR-bot pattern that actually works (CodeRabbit, Greptile, Sonar’s AI work, Graphite Diamond) does the first-pass review competently. It catches naming inconsistencies, missing tests, obvious null-handling gaps, the kind of shape-of-change issues a senior reviewer would flag in the first thirty seconds. It does not catch architectural drift in the large, the wrong abstraction in the right shape, or the subtle interaction with another part of the codebase the bot has not indexed. Use the PR-bot to clear the floor so the human reviewer’s attention lands on the things that need senior judgement. Do not use it to replace the senior reviewer; that pattern produces a codebase that drifts in ways nobody notices for a quarter.

How small should PRs be in an AI-augmented codebase?
Smaller than they were pre-AI, with a sharper ceiling. The working number across the engagements where the review process is healthy is roughly 200 lines changed as the soft target and 500 as the hard ceiling that requires explicit justification to exceed. The reasoning: AI-generated code is denser than hand-written code in the specific sense that more decisions are packed into each line, and the cognitive load on the reviewer scales accordingly. A 500-line AI-generated PR is closer to a 1,500-line hand-written PR in the load it puts on the reviewer. The teams that ignore this and let PR size drift past 800 lines because the AI generated the code quickly are the teams whose review queue collapses six months later.

Where do the AI PR-bots reliably fail?
Three places, consistently. Cross-file architectural consistency — the bot is grounded in the diff and a shallow context window around it, and the change that looks correct locally but contradicts a pattern established three modules away gets through. Subtle correctness in unfamiliar domains — anything regulated, anything with non-obvious numerical-precision constraints, anything where the right answer requires domain context the bot has not been given. And the category of change that is technically correct but wrong for the codebase — the wrong-abstraction-in-the-right-shape pattern. These three failure modes are exactly the ones senior reviewers exist to catch, which is why the PR-bot is a complement to senior review rather than a substitute for it.