LLMs Can Write Your Tests — But Not the Assertions That Matter

Research shows AI-generated test suites catch different bugs than human-written ones, but stumble on the oracle problem. Here's how engineering leaders should actually deploy LLM test generation in 2026.

Anystack Engineering

The promise sounds irresistible: point an LLM at your codebase, get back a test suite. The reality, documented in a growing body of empirical research, is messier. One of the most-cited studies, Schäfer et al.'s "An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation", found that LLMs can generate compilable, executable tests at roughly human-comparable coverage rates, but that the assertions they produce are frequently wrong, tautological, or simply restate the implementation.

Follow-up studies have sharpened the picture. Across multiple benchmarks, LLM-generated tests catch a meaningfully *different* set of bugs than human-written suites — useful as a complement, dangerous as a replacement. And the failure mode that matters most for engineering leaders is the one the research calls the oracle problem: the model can generate inputs and execute code, but it cannot reliably decide what the correct output should be without a specification it doesn't have.

If your QA strategy for 2026 assumes Copilot-style tools will generate your regression suite while your engineers focus on features, you are about to learn this the expensive way.

Finding 1: Coverage is not correctness

The Schäfer study, which targeted JavaScript packages, and several follow-ups on Java and Python show LLMs hitting roughly 70–80% line coverage on standard benchmarks, competitive with search-based generators such as EvoSuite and Pynguin. But when researchers manually audited the assertions, a substantial fraction were either trivially true (assertNotNull(result) on a method that cannot return null), circular (asserting the output equals what the implementation just computed), or wrong in subtle ways that would pass on the current buggy code and fail on a corrected version.
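
The three assertion categories are easier to recognise side by side. The sketch below is illustrative only: `apply_discount` and its rules are hypothetical, not taken from any of the studies, and the tests are written pytest-style but run as plain functions.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test: price after a percentage discount."""
    return price_cents * (100 - percent) // 100


def test_tautological():
    # Trivially true: a function that returns an int can never return None.
    assert apply_discount(1000, 10) is not None


def test_circular():
    # Restates the implementation, so it still passes if the formula is wrong.
    price, percent = 1000, 10
    assert apply_discount(price, percent) == price * (100 - percent) // 100


def test_meaningful():
    # Expected values come from the specification ("10% off 10.00 is 9.00"),
    # worked out independently of the code.
    assert apply_discount(1000, 10) == 900
    assert apply_discount(1000, 0) == 1000
    assert apply_discount(1000, 100) == 0
```

All three tests pass and all three execute the same line, so a coverage report cannot tell them apart. Only the last one would fail if the discount formula were wrong.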

This is the worst possible outcome for a test suite. It locks in current behaviour, including current bugs, while reporting green. Your CI signal degrades from "this code is correct" to "this code is unchanged" — and you won't notice until a regression slips through and the post-mortem reveals the test that should have caught it was asserting the wrong thing all along.

Finding 2: AI-generated tests catch different bugs

Mutation testing studies, in which researchers deliberately introduce small faults and measure which tests detect them, consistently find that LLM-generated suites and human-written suites overlap only partially in their mutation kill sets. CodaMosa, a paper co-authored with Microsoft Research that uses an LLM to help search-based test generation escape coverage plateaus, and subsequent work suggest that LLMs are good at generating edge cases human engineers forget (empty strings, Unicode, boundary integers) but poor at testing business logic invariants that require domain understanding.

The practical implication: treating LLM test generation as a *supplement* to human-authored tests can meaningfully increase real defect detection. Treating it as a *replacement* removes exactly the tests that protect your domain logic.
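
The supplement-versus-replacement argument comes down to set arithmetic on kill sets. A minimal sketch, assuming made-up mutant IDs and kill sets rather than data from any study:

```python
# Hypothetical data: ten mutants seeded into one module, and the subset
# each test suite detects ("kills").
all_mutants = {f"m{i}" for i in range(1, 11)}
human_kills = {"m1", "m2", "m3", "m4", "m5", "m6"}
llm_kills = {"m4", "m5", "m6", "m7", "m8"}


def score(killed: set) -> float:
    """Fraction of all seeded mutants that a suite kills."""
    return len(killed) / len(all_mutants)


only_human = human_kills - llm_kills   # lost if the LLM suite replaces humans
only_llm = llm_kills - human_kills     # gained if the LLM suite supplements
combined = human_kills | llm_kills

print(f"human only:    {score(human_kills):.0%}")
print(f"LLM only:      {score(llm_kills):.0%}")
print(f"both together: {score(combined):.0%}")
print(f"dropped by replacing humans: {sorted(only_human)}")
```

With these invented numbers, supplementing lifts the score from 60% to 80%, while replacing drops it to 50% and abandons the three mutants only the human suite detected.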

Finding 3: Flakiness compounds the problem

LLM-generated tests are also more prone to flakiness than well-engineered human suites. They reach for wall-clock timestamps, unseeded randomness, and order-dependent collection iteration without the discipline a senior engineer applies. Martin Fowler's canonical piece on non-determinism in tests catalogues these failure modes; LLMs reproduce all of them at scale. Google has reported that about 1.5% of its test runs produce a flaky result and that almost 16% of its tests show some level of flakiness, and that's with human-authored tests. Generating tests with an LLM and merging them without review is a direct route to a CI signal nobody trusts.
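
Each of the three flakiness sources named above has a well-known deterministic counterpart: inject the clock, seed the random generator, and compare collections without relying on iteration order. A minimal sketch, in which `is_expired` and `make_token` are hypothetical helpers invented for illustration:

```python
import random
from datetime import datetime, timedelta, timezone


def is_expired(expires_at: datetime, now: datetime) -> bool:
    """Hypothetical helper: the caller passes `now` instead of reading the clock."""
    return now >= expires_at


def make_token(rng: random.Random) -> str:
    """Hypothetical helper: 8 hex characters drawn from an injected RNG."""
    return f"{rng.getrandbits(32):08x}"


def test_expiry_with_fixed_clock():
    # A fixed instant replaces datetime.now(), so the result never depends
    # on when the test happens to run.
    now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    assert not is_expired(now + timedelta(seconds=1), now)
    assert is_expired(now, now)


def test_token_with_seeded_rng():
    # The same seed yields the same token on every run.
    assert make_token(random.Random(42)) == make_token(random.Random(42))
    assert len(make_token(random.Random(42))) == 8


def test_tags_order_independent():
    # Sets have no guaranteed iteration order, so sort before comparing.
    tags = {"beta", "alpha", "gamma"}
    assert sorted(tags) == ["alpha", "beta", "gamma"]
```

The design choice in all three cases is the same: anything non-deterministic becomes an explicit input the test controls, rather than something the code under test fetches for itself.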

What to do this week

Audit your assertion quality, not your coverage number. Pick 30 recently merged tests, whether AI-assisted or not, and have a senior engineer classify each assertion as meaningful, tautological, or wrong. If more than 20% fall into the latter two categories, your coverage metric is lying to you. This is a 2-hour exercise that will reframe every QA conversation for the next quarter.
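
Tallying the audit takes a few lines once the labels exist. A minimal sketch, assuming a hand-labelled sample; the counts below are invented to show the arithmetic, and only the 20% threshold comes from the text above:

```python
from collections import Counter

# Hypothetical audit result: one label per reviewed assertion (30 in total).
labels = ["meaningful"] * 21 + ["tautological"] * 6 + ["wrong"] * 3

THRESHOLD = 0.20  # the 20% cut-off suggested above


def weak_fraction(labels: list) -> float:
    """Share of assertions classified as tautological or wrong."""
    counts = Counter(labels)
    return (counts["tautological"] + counts["wrong"]) / len(labels)


fraction = weak_fraction(labels)
verdict = "over" if fraction > THRESHOLD else "within"
print(f"{fraction:.0%} weak assertions, {verdict} the threshold")
```

Here 9 of 30 assertions are weak, which is 30% and over the threshold.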

Separate generation from acceptance in your workflow. LLMs are excellent at proposing test cases. They are poor at deciding which ones should be merged. Require a human-authored assertion review step for any AI-generated test, and reject any test where the assertion was generated from the same prompt as the implementation. The model cannot serve as both author and oracle.

Run mutation testing on a critical module. Tools like PIT (Java), Stryker (JavaScript), and mutmut (Python) introduce synthetic bugs and report which ones your tests catch. Run one of them on a single high-stakes service this sprint. The mutation score will tell you whether your suite, AI-generated or not, is actually protecting you. Many teams discover their kill rate is closer to 40–60% than to the 80%+ their coverage report implied.
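
To see why a mutation score can sit far below a coverage number, here is a hand-rolled miniature of the idea. It is a sketch only: real tools generate mutants automatically from your source, whereas the function, the four mutants, and both suites below are written by hand and entirely hypothetical.

```python
def is_adult(age: int) -> bool:
    """Hypothetical function under test."""
    return age >= 18


# Hand-written mutants: small deliberate faults in the original logic.
MUTANTS = {
    "gt_instead_of_ge": lambda age: age > 18,
    "off_by_one": lambda age: age >= 19,
    "flipped_comparison": lambda age: age <= 18,
    "always_true": lambda age: True,
}

# Each suite is a list of (input, expected output) pairs.
WEAK_SUITE = [(30, True), (5, False)]                 # executes every line
STRONG_SUITE = WEAK_SUITE + [(18, True), (17, False)]  # adds the boundary


def passes(fn, suite) -> bool:
    return all(fn(arg) == expected for arg, expected in suite)


def mutation_score(suite) -> float:
    """Fraction of mutants that make at least one test in the suite fail."""
    assert passes(is_adult, suite), "suite must pass on the original code"
    killed = [name for name, mutant in MUTANTS.items() if not passes(mutant, suite)]
    return len(killed) / len(MUTANTS)


print(f"weak suite:   {mutation_score(WEAK_SUITE):.0%}")
print(f"strong suite: {mutation_score(STRONG_SUITE):.0%}")
```

Both suites give full line coverage of `is_adult`, yet the weak one kills only half the mutants because it never probes the boundary at 18.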

Treat flakiness as a P1 the day it appears. A single flaky test costs more in eroded trust than the feature that introduced it delivered in value. If a test fails intermittently, quarantine it within 24 hours and fix or delete it within a week. Flakiness has compound interest.
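
Quarantining within 24 hours presupposes you can spot flakiness quickly. One common working definition is a test that both passed and failed on the same commit. A minimal sketch under that assumption, with an invented run history standing in for whatever your CI system exports:

```python
from collections import defaultdict

# Hypothetical CI history: (test name, commit, passed?)
runs = [
    ("test_checkout", "abc123", True),
    ("test_checkout", "abc123", False),   # same code, different outcome
    ("test_login", "abc123", True),
    ("test_login", "abc123", True),
    ("test_refund", "abc123", False),     # consistently failing, not flaky
    ("test_refund", "abc123", False),
]


def find_flaky(runs) -> list:
    """Return tests that both passed and failed on an identical commit."""
    outcomes = defaultdict(set)
    for test, commit, passed in runs:
        outcomes[(test, commit)].add(passed)
    return sorted({test for (test, _), seen in outcomes.items() if len(seen) == 2})


print(find_flaky(runs))
```

A consistently failing test such as `test_refund` is deliberately excluded: it signals a real defect or a broken test, which is a different problem from non-determinism.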

How Anystack approaches this

When we run a QA modernisation engagement with a 3-person AI-augmented pod, the first two weeks are almost never about generating new tests. They're about characterising what the existing suite actually proves — assertion audits, mutation scores, flake rates, signal-to-noise on CI. Only then does LLM-assisted generation enter the picture, and always as a proposer with a human acceptance gate. The teams that get value from AI in QA are the ones that already know what a good test looks like; the teams that don't are the ones generating thousands of green assertions that will silently rot their codebase. The research is clear that LLMs change the economics of test generation. It is equally clear they do not change the economics of test *design* — and conflating the two is the most expensive mistake engineering leaders are making in 2026.