AI Testing Strategy: When Green Tests Lie
The PR review last month spent an extra two hours on a single test file. The author had used Cursor to generate the function under test and the test cases in one prompt, the test runner was green, the coverage number was 94 percent, and the senior reviewer who looked at it noticed within ninety seconds that two of the four test cases were testing the wrong thing entirely. The function was supposed to round a monetary amount to the nearest cent using banker’s rounding; the tests were checking that it rounded to the nearest cent using standard rounding, which is a different rule for half-cent values. Both the function and the tests had been generated in the same prompt, so both were wrong in the same way, and both agreed with each other in the test runner. The coverage metric was unaffected. The test report looked exemplary. The bug would have shipped to production except for a senior reviewer who happened to know that the codebase uses banker’s rounding and bothered to check the test cases against the spec rather than against the function. This is what a vibe-passing test looks like in the wild, and the pattern is now common enough that the testing discipline has to change shape to detect it.
Testing in an AI-augmented engineering organisation is the discipline shift most teams underinvest in proactively and the one whose underinvestment produces the slowest and most expensive consequences. The two genuine problems behind the shift are concrete enough to name. AI generates plausible-looking tests that pass without actually testing anything — the vibe-passing-test pattern, which is Goodhart’s law in its most direct form on a CI server. AI-generated code shifts the test-coverage failure modes — fewer bugs in happy-path code, more bugs in edge-case handling the model glossed. Both problems are real. Both are widespread. Both have practice-change responses that work. Most engineering organisations have not made the changes, and the consequences land in the incident postmortems six to twelve months downstream.
Problem one — vibe-passing tests, Goodhart at the CI server
Goodhart’s law is the social-science observation that when a measure becomes a target, it stops being a good measure. The standard formulation lands easily in policy contexts. The version that lands on a CI server is more direct. The measure is “tests pass.” The target is “tests pass.” The AI tool that generates both the function and the tests in the same prompt optimises directly for the target — it produces a function-and-test pair that satisfies the metric — without producing the underlying property the metric was meant to measure, which is “the function does what was intended.” The metric is satisfied. The property is not. The CI server reports green.
The pattern is not subtle once you know to look for it. The version above with banker’s rounding is one shape; there are several others.
Function and test generated in one prompt, both agreeing on a wrong interpretation of the spec. The version this page opened with. Both artifacts share the same misunderstanding because they came from the same generative process. The tests are green because they were generated to be green against the function, not because the function does what the spec requires.
Tests that exercise the function shallowly enough to pass anything that runs. The AI generates “tests” that call the function with one or two inputs and assert that the return is not null, or that the type is correct, or that no exception is raised. The coverage number is high — those lines are executed. The tests are useless — they would pass against any implementation that returned the right type.
Tests that mock everything interesting. The AI generates tests for a function with substantial external dependencies (a database call, an API call, a file-system interaction) by mocking all the dependencies and asserting that the mocks were called with the right arguments. The test passes; the function would not work in production. This is a known anti-pattern that pre-dates AI tooling, but the AI tools produce it at a higher rate than mid-level human engineers do, because the mocking pattern is what the training data shows when the prompt is “write tests for this function.”
Snapshot tests against AI-generated outputs. The AI generates a function and immediately writes a snapshot test against its output, capturing whatever the function produced on the first run as the expected value. The test will pass forever against that function and will fail the moment the function is corrected. The snapshot is locking in the current behaviour, which is exactly the wrong thing to lock in if the current behaviour is wrong.
The mechanism behind all four shapes is the same. The AI tool was given a target (a passing test) and produced an artifact that hits the target, without verifying that the artifact reflects the underlying property (a correctly tested function). This is not an AI-tool failure in the sense of a bug to be fixed; it is what AI tools do by construction. The engineering discipline has to accommodate it.
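The first shape is easy to reproduce in a few lines. What follows is a minimal sketch, not the code from the review: the function names and the case table are hypothetical, and it assumes amounts are held as Python `Decimal` values. It shows why a test written from the spec catches what a test written from the implementation cannot.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

CENT = Decimal("0.01")


def round_half_up(amount: Decimal) -> Decimal:
    """Hypothetical generated function: ties go away from zero."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def round_bankers(amount: Decimal) -> Decimal:
    """What the spec asks for: ties go to the nearest even cent."""
    return amount.quantize(CENT, rounding=ROUND_HALF_EVEN)


# Cases written from the spec, independent of either implementation.
# Only the half-cent rows can distinguish the two rounding rules.
SPEC_CASES = [
    (Decimal("2.344"), Decimal("2.34")),  # not a tie
    (Decimal("2.346"), Decimal("2.35")),  # not a tie
    (Decimal("2.345"), Decimal("2.34")),  # tie, 4 is even, stays at 2.34
    (Decimal("2.355"), Decimal("2.36")),  # tie, 5 is odd, moves to 2.36
]


def failures(round_fn):
    """Return (input, expected, actual) for every spec case the function misses."""
    return [
        (amount, expected, round_fn(amount))
        for amount, expected in SPEC_CASES
        if round_fn(amount) != expected
    ]


# A test generated alongside round_half_up would assert the value the
# function already returns, so it passes while contradicting the spec.
assert round_half_up(Decimal("2.345")) == Decimal("2.35")
```

Against this table `round_half_up` fails on exactly one row, the 2.345 tie, and `round_bankers` fails on none. Three of the four rows cannot tell the two rules apart, which is how a suite can look thorough and still miss the one property that matters.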
Four practice changes work, listed here roughly in order of the leverage I have watched them produce.
Test-first prompting, with an independence rule. The engineer writes the test spec first, in a form the AI can read but cannot have generated alongside the implementation, and verifies the test fails against an empty implementation before asking the AI to fill in the implementation. This forces the test to be an independent check on the implementation rather than a generated-together-with-it artifact. The pattern is the same TDD discipline that worked pre-AI, made explicit because the AI tools collapse the boundary that TDD relied on. Teams that adopt this rigidly produce noticeably fewer vibe-passing tests; teams that adopt it loosely produce the same vibe-passing tests with extra steps.
Structured test-spec language the AI can ground in. The test spec is written in a structured form (given/when/then, property-based specs, or a tabular test-case definition), and the AI tool generates only the implementation against it rather than generating the implementation and the tests together. The structure forces the engineer to articulate the property being tested before the AI produces anything. This is the testing-side analogue of the structured commit messages the code review page recommends: the structure is for the AI tool’s grounding as much as for the human reader’s clarity.
Mutation testing as a periodic check on the test suite’s actual sensitivity. Mutation testing introduces small changes to the production code (flip a comparison, swap a constant, invert a boolean) and runs the existing test suite to see if each change is caught. The share of mutants the suite catches, the kill rate, measures whether the tests exercise the code’s behaviour or just its execution. A test suite with a kill rate of, say, 60 percent is genuinely testing; a test suite with a kill rate of 20 percent is decorative. Mutation testing pre-dates AI tooling but is more useful in AI-augmented contexts because the vibe-passing-test pattern is what it catches by construction. Run it periodically (weekly is a reasonable cadence) on a sample of the codebase rather than in CI on every PR; the cost-versus-signal trade-off favours periodic sampling.
Senior-reviewer attention on the test cases specifically, not just the implementation. The review discipline shift described on the sibling code review page applies with extra force to test files. The reviewer should be reading the test cases against the spec, not against the implementation, and the PR-bots largely do not catch the misalignment between test cases and intent. This is human work and there is no shortcut.
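The kill-rate idea from the mutation-testing item above can be shown without any tooling. This is a hand-rolled sketch under stated assumptions: the `free_shipping` function, the mutants, and both suites are hypothetical, and the mutants are written by hand here, whereas a real tool such as Stryker, PIT, or mutmut generates them from the source.

```python
from typing import Callable, List


def free_shipping(total_cents: int) -> bool:
    """Original behaviour: orders of 5000 cents or more ship free."""
    return total_cents >= 5000


# Hand-written mutants, each a small change of the kind a mutation tool applies.
MUTANTS: List[Callable[[int], bool]] = [
    lambda total_cents: total_cents > 5000,          # >= became >
    lambda total_cents: total_cents <= 5000,         # comparison inverted
    lambda total_cents: total_cents >= 5001,         # constant nudged
    lambda total_cents: not (total_cents >= 5000),   # boolean negated
]


def shallow_suite(fn: Callable[[int], bool]) -> bool:
    """Executes the code but only checks the return type."""
    return isinstance(fn(100), bool) and isinstance(fn(9000), bool)


def boundary_suite(fn: Callable[[int], bool]) -> bool:
    """Checks behaviour on both sides of the threshold and at it."""
    return fn(4999) is False and fn(5000) is True and fn(5001) is True


def kill_rate(suite: Callable[[Callable[[int], bool]], bool],
              mutants: List[Callable[[int], bool]]) -> float:
    """Fraction of mutants for which the suite fails, i.e. is 'killed'."""
    killed = sum(1 for mutant in mutants if not suite(mutant))
    return killed / len(mutants)
```

Both suites pass against the original function and both execute its only line, so coverage cannot tell them apart. The kill rate can: the shallow suite kills none of the four mutants and the boundary suite kills all of them.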
Problem two — coverage failure modes shift, the test distribution should follow
The other genuine problem is less visible than the vibe-passing pattern but more pervasive. AI tools are systematically better at happy-path code than at edge-case code. The model has seen the happy-path version of any common function hundreds of thousands of times in its training data; it has seen the specific edge-case handling for your specific domain considerably fewer times. The output reflects this — the happy-path code is competent, often genuinely good, and the edge-case handling is glossed. Empty inputs, malformed inputs, boundary values, race conditions, timezone-aware datetimes, character encoding edges, locale-specific number formatting, the long list of “this is where the bugs actually live” categories — AI-generated code is uneven on all of them, with the unevenness varying by how well-represented the specific edge case is in the training data.
The test distribution should follow this. The pre-AI default, roughly proportional coverage across happy-path and edge-case code, is no longer the right distribution, because the failure-rate distribution has shifted. Happy-path code is more reliable now, so happy-path tests are doing less work. Edge-case code is relatively less reliable, because the AI tool produces it less carefully, so edge-case tests are doing more work. The investment should follow the failure rate.
The teams that have noticed this and adjusted do four things.
They write fewer happy-path tests and more edge-case tests. The shift is uncomfortable for teams that have been trained on lines-of-coverage metrics, because lines-of-coverage does not measure this. Two test suites with the same coverage number can have radically different failure-detection capability depending on how the tests are distributed across happy-path versus edge-case code. The teams that have made the shift have moved their test-quality conversation away from coverage percentage and toward edge-case enumeration — they keep a list of the edge categories that bite them in production and grade test suites on how thoroughly they exercise those categories.
They use property-based testing for the categories where it lands. Property-based testing — Hypothesis in Python, fast-check in TypeScript, QuickCheck in the Haskell-family languages — generates inputs the engineer would not have thought to write. This is exactly what catches the edge cases the AI-generated code glossed and the AI-generated tests did not exercise. The investment is modest (a property-based test takes longer to write than a unit test but covers more ground), and the return is concentrated on exactly the failure modes AI-augmented codebases produce most often.
They invest in fuzz testing on the surfaces where adversarial input matters. Anything taking external input (API surfaces, parsers, deserialisation paths, anything reading user-supplied data) gets a fuzz-testing harness. Fuzzing has been a security-engineering tool for decades; what changes in an AI-augmented codebase is that the AI tools produce parsers and deserialisers with shallower edge-case handling, and the fuzz harness catches what would otherwise become a CVE-shaped incident in production.
They re-weight CI runtime budgets toward the slower but more sensitive tests: integration tests, end-to-end tests, mutation testing samples, and property-based tests with larger search budgets. The pre-AI default of “keep CI fast by skewing toward unit tests” is the wrong default in an AI-augmented codebase, because the unit tests are exactly the category that has lost effectiveness. The teams that have noticed have accepted longer CI runs in exchange for a higher detection rate, and the cost-versus-signal trade is heavily favourable on the integration-test side.
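The property-based item above is the one that benefits most from a concrete shape. The sketch below is a standard-library stand-in, not Hypothesis itself: it draws random inputs from a seeded generator and checks properties rather than fixed examples, without the shrinking and input strategies a real property-based library provides. The bill-splitting functions are hypothetical.

```python
import random
from typing import Callable, List, Optional, Tuple


def split_naive(total_cents: int, parts: int) -> List[int]:
    """Happy-path version: correct when the total divides evenly, drops the remainder otherwise."""
    return [total_cents // parts] * parts


def split_fair(total_cents: int, parts: int) -> List[int]:
    """Spreads the remainder one cent at a time across the first shares."""
    base, extra = divmod(total_cents, parts)
    return [base + 1] * extra + [base] * (parts - extra)


def find_counterexample(
    split: Callable[[int, int], List[int]],
    trials: int = 500,
    seed: int = 0,
) -> Optional[Tuple[int, int]]:
    """Return the first (total, parts) that violates a property, or None."""
    rng = random.Random(seed)  # seeded so a failure is reproducible
    for _ in range(trials):
        total = rng.randint(0, 100_000)
        parts = rng.randint(1, 12)
        shares = split(total, parts)
        right_count = len(shares) == parts
        conserved = sum(shares) == total            # no cents lost or invented
        balanced = max(shares) - min(shares) <= 1   # no share more than a cent apart
        if not (right_count and conserved and balanced):
            return total, parts
    return None
```

An example-based test of `split_naive(900, 3)` passes, which is the happy path. The random search finds a total that does not divide evenly almost immediately, while `split_fair` survives all 500 trials. The properties (cents conserved, shares balanced) are the part the engineer has to state; the inputs are the part worth delegating.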
The four buying motions in AI test tooling
The AI test tooling category is less mature than the code-review tooling category, and the procurement decisions are correspondingly less settled. Four motions exist; the vendor landscape under each is moving fast enough that named-vendor verdicts age poorly. The motion descriptions are the part to internalise.
Autogen test stubs. The AI tool generates test scaffolding from the function signature or the spec. This is the most common motion, the lowest-leverage one, and the one most likely to produce vibe-passing tests if the engineer does not exercise the test-first discipline above. The tools in this motion (the test-generation features inside Cursor, Claude Code, and Copilot, and the dedicated test-gen products like Diffblue Cover and CodiumAI’s testing surface) are competent at the scaffolding and roughly silent on whether the scaffolded tests are meaningful. Treat this as a productivity surface, not as a quality surface. The engineer still has to write or audit the test cases.
Mutation testing with AI-augmented analysis. Mutation testing is the technique above; the AI-augmented version of it uses the model to prioritise which mutations to surface to the engineer (the ones likely to be meaningful as opposed to the ones that are trivially equivalent). Stryker (in the JavaScript family), PIT (in the JVM family), and mutmut (in the Python family) are the mature mutation-testing platforms; the AI-augmented analysis layer on top is newer and the vendor landscape is unsettled. Worth piloting, not yet worth a multi-year procurement.
AI-powered end-to-end testing. The motion that generates and maintains E2E test suites for web applications, using the AI to interpret the application’s UI and produce tests that exercise user flows. The leading vendors are Mabl, Testim, Functionize, and the AI-augmented work in the Playwright ecosystem. The motion is useful for organisations whose E2E test investment has been chronically underfunded; it is not a substitute for a thoughtful E2E test design, and the vendors that pitch it that way are pitching the wrong thing.
Eval-harness-as-CI. The newest motion and, in the long term, the most consequential one. The premise is that AI-augmented applications need a separate testing surface for the LLM outputs themselves — the prompts, the model outputs, the agentic-system traces — and that surface has to be tightly coupled to the CI/CD pipeline. The leading platforms are Langfuse’s evaluation surface, LangSmith’s eval features, Helicone’s evaluation work, and the dedicated eval-platform vendors (Braintrust, Patronus, Promptfoo for the open-source-tooling end). The build-versus-buy question on this is the one the platform engineering page covers in detail; the short version is that the production-side observability is mostly buy, the evaluation harness is mostly build-on-top-of-bought-primitives, and the integration with CI is bespoke per organisation.
The fourth motion deserves its own section because the category error it solves is large enough to be worth naming explicitly.
The LLM-output-as-test-target problem
You cannot ship an LLM output to a standard unit-test runner. The premise of unit testing is that the function under test produces deterministic outputs: same input, same output, every run. LLM outputs violate this premise by construction. A standard test runner against an LLM call is either flaky (sometimes passes, sometimes fails, depending on what the model sampled) or relies on a tolerance threshold so wide that it does not catch real regressions. Both failure modes are common. The third common one is the worst of the three: the test checks exact string equality against an LLM output and breaks on every model release.
The standard unit-test discipline does not transfer. The eval harness — the testing surface for non-deterministic outputs — is a distinct piece of engineering, and treating it as “just CI for LLMs” produces a test surface that fails silently for months. The eval harness assumes non-determinism, runs each prompt against a labelled dataset multiple times, computes aggregate metrics across the runs, and gates on metric thresholds rather than exact outputs. The metrics are typically a combination of programmatic checks (the output conforms to the expected schema, contains the required entities, satisfies length constraints), reference-comparison metrics (the output matches a reference output along some similarity dimension), and LLM-as-judge metrics (a separate, stronger model evaluates the output against the spec). The gating in CI is on the aggregate metric being above a threshold, not on any individual run passing.
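The gating logic described above fits in a short sketch. Everything named here is an assumption for illustration: `fake_model` is a seeded stand-in for a real LLM call, the dataset and the 10 percent failure rate are invented, and only the programmatic-check layer is shown, with no reference-comparison or LLM-as-judge metrics.

```python
import json
import random
from typing import Callable, List, Tuple


def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: usually valid JSON, sometimes prose."""
    if rng.random() < 0.1:
        return "Sure! The city is " + prompt
    return json.dumps({"city": prompt.title()})


def schema_check(output: str, expected_city: str) -> bool:
    """Programmatic check: output parses as a JSON object with the expected entity."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("city") == expected_city


def pass_rate(
    model: Callable[[str, random.Random], str],
    dataset: List[Tuple[str, str]],
    runs_per_case: int,
    seed: int = 0,
) -> float:
    """Run every labelled case several times and aggregate across all runs."""
    rng = random.Random(seed)
    results = [
        schema_check(model(prompt, rng), expected)
        for prompt, expected in dataset
        for _ in range(runs_per_case)
    ]
    return sum(results) / len(results)


def gate(rate: float, threshold: float) -> bool:
    """CI gate: the aggregate metric decides, not any single run."""
    return rate >= threshold


DATASET = [("paris", "Paris"), ("lima", "Lima"), ("oslo", "Oslo")]
```

A CI step would call something like `gate(pass_rate(fake_model, DATASET, runs_per_case=20), threshold=0.8)` and fail the build on `False`. The design choice that matters is that an individual malformed output does not fail the build, while a drift in the aggregate does.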
Building this surface from scratch is a quarter’s work for a competent platform team and is the right investment for any engineering organisation deploying LLM-augmented applications at any scale. Buying it is a more nuanced decision — the LLM observability vendors above all offer some version of the eval surface, and the dedicated eval platforms offer richer ones — and the platform engineering page covers the build-versus-buy split in detail. The key point for this page is that the eval harness is not optional, and the teams that ship LLM-augmented applications without one are running their production AI surface without test coverage in any meaningful sense.
The harder version of this problem is the agentic-system case. When the system under test is a multi-step agent rather than a single LLM call, the eval-harness design has to evaluate trajectories rather than outputs — the sequence of tool calls, the intermediate states, the eventual outcome — and the labelled-dataset construction becomes substantially more expensive. The agentic AI architecture patterns page covers the pattern vocabulary; the eval harness for an agentic system has to be designed against that vocabulary. The teams that have skipped this step and tried to evaluate agentic systems with single-output evals are the teams whose agentic systems fail in production in ways that nobody can reproduce.
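The difference between scoring an output and scoring a trajectory can be sketched as follows. This is an illustration under assumptions rather than a pattern taken from any particular platform: the `Step` record, the tool names, and the four checks are hypothetical, and a real harness would also score intermediate state and the final outcome against labelled data.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Set


@dataclass(frozen=True)
class Step:
    """One tool call in an agent run (hypothetical record shape)."""
    tool: str
    ok: bool


def is_subsequence(required: Sequence[str], actual: Sequence[str]) -> bool:
    """True if the required tools appear in order, with any other calls in between."""
    remaining = iter(actual)
    return all(tool in remaining for tool in required)


def score_trajectory(
    steps: List[Step],
    required_order: Sequence[str],
    forbidden: Set[str],
    max_steps: int,
) -> Dict[str, bool]:
    """Score the path the agent took, not just where it ended up."""
    tools = [step.tool for step in steps]
    return {
        "order_respected": is_subsequence(required_order, tools),
        "no_forbidden_tools": not any(tool in forbidden for tool in tools),
        "within_budget": len(steps) <= max_steps,
        "no_failed_calls": all(step.ok for step in steps),
    }


def trajectory_passes(scores: Dict[str, bool]) -> bool:
    return all(scores.values())
```

Two runs can end with the same final message while one of them issued a refund before looking up the order. A single-output eval scores them identically; the per-check dictionary above separates them and says which expectation was broken.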
What this connects to
The parent software-engineering hub covers the four discipline shifts of which testing is one. The code review page is adjacent — many of the testing failures named above are review failures in disguise, in the sense that a senior reviewer reading the test file against the spec would have caught them. The two pages together cover most of the pre-merge quality surface. The platform engineering page covers the platform-team-owned primitives — eval-harness infrastructure, observability integration, CI/CD gating — that the testing discipline depends on; the application-team-side investments on this page assume those primitives exist.
The maintenance and tech debt page covers what happens to the test suite over an eighteen-to-twenty-four-month horizon, and the eval-harness staleness pattern in that page’s diagnosis is one of the more expensive maintenance categories. The pages should be read together if you are scoping the test-discipline investment beyond a single quarter.
The AI coding tools hub is the procurement frame for the autogen-test-stub motion, which overlaps with the general AI coding tool procurement. The LLM observability cluster is the procurement frame for the eval-harness-as-CI motion, where the observability vendors are the natural starting point for the build-on-top-of-bought-primitives pattern this page recommends.
The Goodhart reading of all of this is the load-bearing one. The metric (“tests pass”) is not the property (“the code is correct”); the AI tools optimise the metric directly without producing the property; the discipline shift is the set of practice changes that re-couples the metric to the property. The teams that internalise this shift produce test suites that catch what they should catch. The teams that resist it on the grounds that “the tests are passing” produce, eventually, the incident postmortems where the tests were passing the whole time. The original Goodhart formulation is from 1975; the version that lands on a 2026 CI server is the same observation in a tighter loop.
Sources
- METR — Measuring the impact of AI on experienced open-source developer productivity, July 2025 — the perception-versus-measurement gap underlying the test-quality discussion
- GitClear — AI Copilot Code Quality 2024 / 2025 — code-churn and test-quality empirical data
- Stack Overflow Developer Survey 2025 — AI testing tool adoption and satisfaction data
- Anthropic — Building Effective Agents (December 2024) — the agentic-systems eval-harness argument’s foundational framing
- Brooks, F. (1986), “No Silver Bullet” — the essential-versus-accidental complexity distinction that underlies the eval-harness-as-distinct-discipline argument
- Goodhart, C. (1975), “Problems of Monetary Management” — the original formulation of the law that the vibe-passing-test section borrows
- Vendor product documentation (current as of Q2 2026): Langfuse, LangSmith, Helicone, Braintrust, Patronus, Promptfoo, Diffblue Cover, CodiumAI, Mabl, Stryker mutation testing
- Related: parent hub, platform engineering, code review, maintenance and tech debt, AI coding tools, LLM observability, agentic AI architecture patterns
Methodology: testing-discipline patterns drawn from engagements across roughly fifteen engineering organisations deploying AI-augmented application stacks in production (2024-2026). The vibe-passing-test pattern names are mine; the underlying mechanism (Goodhart applied to a generative test-and-implementation pair) has been observed widely across the engineering community and the naming is the only original contribution. The eval-harness build-versus-buy verdict is consistent with the pattern across the engagements; your organisation’s specific evaluation requirements may shift it, and disagreement is the most useful feedback this page receives.
