AI-Generated Tests Find Different Bugs — And Miss the Ones That Matter

LLMs can write test cases from a spec in seconds, but research shows they catch a different class of bug than human-written suites and routinely fail the oracle problem. Here is what engineering leaders should do about it.

Anystack Engineering

Every QA modernisation pitch in 2026 includes the same promise: point an LLM at your codebase, generate a thousand tests overnight, watch coverage climb. The coverage numbers are real. The confidence they create is not.

A growing body of empirical work — summarised well in Schäfer et al.'s study of LLM-based unit test generation and reinforced by follow-up evaluations from Meta, Google, and the SWE-bench community — shows that AI-generated tests catch a meaningfully different distribution of bugs than human-written suites. They are good at boundary conditions, null checks, and shallow input validation. They are bad at the things that actually take production down: incorrect business logic, broken invariants across calls, and subtle concurrency bugs. Worse, they routinely fabricate assertions that pass against the current implementation regardless of whether the implementation is correct.

This is the oracle problem in a new costume, and it deserves a place on every engineering leader's risk register before another quarter of AI-assisted test generation is approved.

What the research actually shows

Three findings recur across the literature and the internal post-mortems we have seen at enterprise clients.

First, LLM-generated tests have high coverage but low fault-detection power per test. Schäfer et al. reported that LLM-generated suites achieved coverage within a few percentage points of human-written ones, but mutation testing — where deliberately broken code variants are introduced to see which tests fail — showed the AI suites detecting noticeably fewer mutants. Coverage measures whether a line ran. Mutation score measures whether your tests would actually notice if that line were wrong. The gap between these two metrics is where false confidence lives.
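The gap between coverage and mutation score can be shown with a toy example. The sketch below is not how Stryker, PIT, or Mutmut work internally; it hand-writes two mutants of a hypothetical eligibility rule (`is_eligible`, an illustrative name) and compares a suite that merely executes the code with one that checks the boundary.

```python
import operator

def make_is_eligible(compare):
    """Build a hypothetical age-eligibility check from a comparison operator."""
    def is_eligible(age):
        return compare(age, 18)
    return is_eligible

# Original rule (>=) plus two hand-written mutants that alter the comparison.
original = make_is_eligible(operator.ge)
mutants = {
    "ge_to_gt": make_is_eligible(operator.gt),  # boundary mutant: 18 becomes ineligible
    "ge_to_lt": make_is_eligible(operator.lt),  # inverted rule
}

def weak_suite(fn):
    """Executes every line of the function but only checks the return type."""
    return isinstance(fn(30), bool) and isinstance(fn(5), bool)

def strong_suite(fn):
    """Checks behaviour at and around the boundary."""
    return fn(18) is True and fn(17) is False and fn(30) is True

def mutation_score(suite):
    """Fraction of mutants on which the suite fails, i.e. mutants killed."""
    assert suite(original), "suite must pass on the original code"
    killed = sum(1 for mutant in mutants.values() if not suite(mutant))
    return killed / len(mutants)

weak_score = mutation_score(weak_suite)
strong_score = mutation_score(strong_suite)
print(weak_score, strong_score)
```

Both suites run every line of the function, so a coverage tool would score them identically. Only the second one notices when the comparison is wrong.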

Second, assertion quality degrades as the function under test becomes more business-specific. For pure functions and standard library wrappers, LLMs write reasonable assertions. For domain logic such as pricing rules, eligibility checks, and settlement flows, the model has no way of knowing what the correct answer should be, so it effectively asks the code: it writes an assertion that matches whatever the implementation currently returns. These are change-detector tests: they fail when behaviour changes, regardless of whether the change is a bug fix or a regression. They add noise to CI without adding signal.
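A small illustration of the difference, assuming a hypothetical pricing rule (5% off per loyalty year, capped at 20%) and a deliberately buggy implementation that forgets the cap. The function and test names are invented for this sketch.

```python
from decimal import Decimal

def discounted_price(price, loyalty_years):
    """Hypothetical pricing rule. Intended: 5% off per loyalty year, capped at 20%.
    Contains a deliberate bug: the cap is never applied."""
    discount = Decimal("0.05") * loyalty_years
    return price * (1 - discount)

def change_detector_test():
    # Expected value was copied from the current output, so the bug is baked in.
    return discounted_price(Decimal("100"), 6) == Decimal("70.00")

def oracle_test():
    # Expected value comes from the stated rule: the discount is capped at 20%.
    return discounted_price(Decimal("100"), 6) == Decimal("80.00")

print("change detector passes:", change_detector_test())
print("oracle-based test passes:", oracle_test())
```

The change-detector test passes today and will fail on the day someone fixes the cap. The oracle-based test fails today, which is the useful signal.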

Third, flaky and order-dependent tests are over-represented in generated suites. Generated tests often share fixtures, hit the clock, or rely on iteration order that happens to be stable on the developer's laptop. Martin Fowler's canonical write-up on non-determinism in tests, together with Google's report that roughly 1.5% of its test runs produce a flaky result, makes clear that flakiness is already a heavy tax on CI signal. Generating tests faster than you can curate them makes the tax worse, not better.
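One common fix for clock-dependent tests is to make the current time an injectable parameter. The sketch below assumes a hypothetical `is_expired` helper; the point is only that the test pins `now` rather than reading the wall clock.

```python
from datetime import datetime, timedelta, timezone

def is_expired(issued_at, ttl, now=None):
    """Return True once `ttl` has elapsed since `issued_at`.
    `now` is injectable so tests do not depend on the wall clock."""
    now = now if now is not None else datetime.now(timezone.utc)
    return now - issued_at >= ttl

def deterministic_test():
    fixed_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    issued = fixed_now - timedelta(minutes=30)
    ttl = timedelta(minutes=30)
    # Check exactly at the boundary, and one second before it.
    at_boundary = is_expired(issued, ttl, now=fixed_now)
    just_before = is_expired(issued, ttl, now=fixed_now - timedelta(seconds=1))
    return at_boundary and not just_before

print(deterministic_test())
```

A generated test that calls `is_expired(issued, ttl)` without `now` would give results that depend on when CI happens to run it.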

What to do about it this week

These findings do not mean you should abandon LLM test generation. They mean you should treat the output as a draft, not a deliverable. Three actions are worth taking immediately.

Add mutation testing to the CI path for any module where AI-generated tests now dominate. Tools like Stryker, PIT, or Mutmut will tell you within a single run whether your shiny new 85% coverage is detecting real faults or rubber-stamping current behaviour. Treat a mutation score below roughly 60% on critical modules as a failed quality gate, not a nice-to-have report.

Split your test suite by provenance and track them separately. Generated tests, human-written tests, and human-curated-from-generated tests should live in distinguishable folders or be tagged in your runner. When a test flakes or a regression slips through, you want to know which bucket failed you. Without this, you cannot tell whether AI-assisted testing is a net win or a net liability.

Make the oracle explicit before generation, not after. For any non-trivial function, write the property the test should verify in plain language (invariants, postconditions, examples) and feed that to the model as the spec. This is the same discipline that makes property-based testing work. It is the single highest-leverage change we see teams make when moving from generated-tests-as-toy to generated-tests-as-asset.
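To make "write the property first" concrete, here is a minimal sketch for a hypothetical settlement helper, `split_cents`. The plain-language oracle is: no cent is created or lost, shares differ by at most one cent, and no share is negative. In practice a property-based testing library such as Hypothesis would usually generate the inputs; this version uses only the standard library with a fixed seed to stay dependency-free.

```python
import random

def split_cents(total_cents, parties):
    """Hypothetical settlement helper: split an amount across parties in whole cents."""
    base, remainder = divmod(total_cents, parties)
    # The first `remainder` parties each receive one extra cent.
    return [base + 1 if i < remainder else base for i in range(parties)]

def check_properties(trials=1000, seed=0):
    """Check the plain-language oracle against random inputs.
    Returns the first counterexample found, or None if every trial holds."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(trials):
        total = rng.randint(0, 1_000_000)
        parties = rng.randint(1, 50)
        shares = split_cents(total, parties)
        conserved = sum(shares) == total        # no cent created or lost
        fair = max(shares) - min(shares) <= 1   # shares differ by at most one cent
        non_negative = all(s >= 0 for s in shares)
        right_count = len(shares) == parties
        if not (conserved and fair and non_negative and right_count):
            return (total, parties, shares)
    return None

print(check_properties())
```

The three properties double as the spec handed to the model. Any generated test can then be judged against them rather than against whatever the implementation happens to return.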

There is a deeper organisational point underneath all of this. The bottleneck in QA modernisation has never been the typing speed of test code. It has been the codification of what "correct" means for systems whose specifications live in tribal knowledge, Confluence pages from 2019, and the heads of three senior engineers. LLMs do not solve that problem. They make it more urgent, because they will happily generate thousands of tests that encode the wrong oracle at scale.

The pod approach to this problem

When a 3-person AI-augmented pod takes on a QA modernisation engagement, the first two weeks are almost never spent generating tests. They are spent identifying the highest-value oracles in the codebase — the business rules whose violation would actually matter — and writing those as executable specifications. Only then does generation start, with the spec as the prompt and mutation testing as the acceptance criterion. The output is usually a smaller test suite than the one a generate-everything approach would produce, but with materially higher fault-detection power and a fraction of the flake rate.
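An acceptance criterion of this kind can be expressed as a small check over per-bucket results. The sketch below is an assumption about how such a gate might look, not a description of any particular tool: the bucket names, field names, and numbers are invented, and the 0.60 default simply mirrors the rough threshold suggested earlier.

```python
from dataclasses import dataclass

@dataclass
class BucketStats:
    """Results for one provenance bucket (all field names are illustrative)."""
    mutants_killed: int
    mutants_total: int
    flaky_runs: int
    total_runs: int

    @property
    def mutation_score(self):
        return self.mutants_killed / self.mutants_total if self.mutants_total else 0.0

    @property
    def flake_rate(self):
        return self.flaky_runs / self.total_runs if self.total_runs else 0.0

def failing_buckets(stats_by_bucket, min_mutation_score=0.60):
    """Return the names of buckets whose mutation score is below the gate."""
    return sorted(
        name for name, stats in stats_by_bucket.items()
        if stats.mutation_score < min_mutation_score
    )

# Made-up numbers for illustration only.
example = {
    "generated": BucketStats(mutants_killed=45, mutants_total=100, flaky_runs=12, total_runs=400),
    "human": BucketStats(mutants_killed=72, mutants_total=100, flaky_runs=2, total_runs=400),
    "curated": BucketStats(mutants_killed=66, mutants_total=100, flaky_runs=3, total_runs=400),
}

print(failing_buckets(example))
```

Keeping the buckets separate is what lets the gate say which source of tests is failing, rather than reporting a single blended score.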

The headline metrics for AI-assisted testing should never be tests-generated-per-hour or coverage-percentage-delta. They should be the mutation score on the modules that matter, the flake rate trend over the last 30 days, and the number of production incidents caught by a test that was added in the last quarter. If your AI testing programme is not reporting on those three numbers, you are buying coverage theatre. Get the oracles right, instrument the suite honestly, and the generation tools become genuinely useful: not as a replacement for test design, but as a faster way to execute it.