18 May 2026 · 5 min read

QA & Testing, CI/CD, Developer Productivity

Flaky Tests: The Hidden Tax Slowing Your Release Cycle

Google sees over 1.5 million flaky test runs per day. Most enterprise CI pipelines are paying a similar tax in lost engineer hours, eroded trust, and delayed releases. Here's how to quantify and cut it.

Anystack Engineering

Martin Fowler's long-standing essay Eradicating Non-Determinism in Tests remains the clearest treatment of why non-deterministic tests are corrosive, and the data behind it has only got worse. Google engineers have publicly reported peaks of over 1.5 million flaky test executions per day across their internal CI, with roughly 16% of their tests showing some level of flakiness. Microsoft Research has reported similar patterns: a small number of tests generate the majority of false failures, but the cost is paid by every engineer who waits, retries, and learns to ignore red builds.

For a CTO, flaky tests aren't a QA hygiene issue. They are a delivery throughput tax — one that compounds silently until your release cadence stalls and your best engineers spend their mornings hitting the Retry button.

What the research actually shows

Three findings deserve attention from anyone running a non-trivial CI estate.

First, flakiness is concentrated, not uniform. Across multiple studies — including Microsoft Research's work on flaky test localisation and Google's internal data — roughly 1–4% of tests account for the majority of flaky failures. The implication is that a small, targeted intervention can recover the bulk of the lost signal. You don't need a quality transformation programme. You need a triage list.

Second, flakiness destroys signal value, not just time. Once developers learn that red builds are often false, they begin re-running pipelines reflexively. Fowler's essay describes this as the test suite losing its ability to act as a regression detector. The cost isn't the 12 minutes a re-run takes — it's that a genuine regression now hides inside a population of dismissed failures. Google's data shows engineers re-running pipelines without investigation in the majority of flaky failure cases.

Third, the most common root causes are well-understood and fixable. Studies categorising flaky test causes consistently find the same top offenders: async/timing assumptions, test order dependencies, shared mutable state (often database fixtures), network and external service calls, and time-of-day or timezone-sensitive assertions. None of these are exotic. All of them are addressable with patterns that have been documented for over a decade.
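To make the first of those offenders concrete, here is a minimal sketch of replacing a fixed sleep with bounded polling. The helper name wait_until, its defaults, and the timer-based "background job" are illustrative assumptions, not part of any particular framework.

```python
import threading
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # Final check so a condition that turned true during the last sleep still counts.
    return condition()


def test_job_completes() -> None:
    done = threading.Event()
    # Hypothetical background job that finishes after an unpredictable delay.
    threading.Timer(0.1, done.set).start()
    # Instead of `time.sleep(0.2); assert done.is_set()`, wait for the actual
    # condition with a generous upper bound.
    assert wait_until(done.is_set, timeout=2.0)
```

The test now passes as soon as the condition holds and fails only if it never holds within the bound, which removes the hidden assumption about how fast the CI runner is.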

The gap between "well-understood" and "actually fixed in your repo" is where most enterprise engineering orgs are losing weeks of throughput every quarter.

What engineering leaders can do this week

Three concrete actions, in priority order.

1. Measure the tax before you try to cut it. Instrument your CI to record, per test, the rate of pass-on-retry within a 24-hour window. That metric, the flake rate, is the single number that matters. Most CI platforms (GitHub Actions, Buildkite, CircleCI, GitLab) can emit this with a small wrapper or existing plugin. Once you have it, two reports are immediately useful: the top 20 flakiest tests by absolute failure count, and the total engineer-minutes spent on re-runs per week. The second number is what you take to the exec team. In a 200-engineer org with a 15-minute pipeline and a 5% flake rate, you are typically looking at 40–80 engineer-hours lost per week.
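A rough sketch of how those two reports might be computed from raw CI results follows. The record shape (RunRecord), the definition of a flaky failure as "failed, then passed on the same commit within the window", and the cost model of one full re-run per flaky pipeline are all assumptions made for illustration; your CI export will look different.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RunRecord:
    """One execution of one test (hypothetical export format)."""
    test: str
    commit: str
    at: datetime
    passed: bool


def flake_rates(runs, window=timedelta(hours=24)):
    """Return {test: (flaky_failures, total_runs, rate)}.

    A failure counts as flaky when the same test later passes on the same
    commit within `window` (pass-on-retry).
    """
    by_key = defaultdict(list)
    for r in runs:
        by_key[(r.test, r.commit)].append(r)

    flaky = defaultdict(int)
    total = defaultdict(int)
    for (test, _commit), group in by_key.items():
        group.sort(key=lambda r: r.at)
        total[test] += len(group)
        for i, r in enumerate(group):
            if r.passed:
                continue
            if any(later.passed and later.at - r.at <= window for later in group[i + 1:]):
                flaky[test] += 1
    return {t: (flaky[t], total[t], flaky[t] / total[t]) for t in total}


def top_flaky(rates, n=20):
    """Rank tests by absolute flaky-failure count, highest first."""
    offenders = [(test, stats) for test, stats in rates.items() if stats[0] > 0]
    offenders.sort(key=lambda kv: kv[1][0], reverse=True)
    return offenders[:n]


def rerun_hours_per_week(engineers, runs_per_engineer_per_day, pipeline_minutes,
                         flake_rate, workdays=5):
    """Engineer-hours spent waiting on re-runs per week.

    Assumes `flake_rate` is the share of pipeline runs that fail flakily and
    that each such failure costs exactly one full re-run.
    """
    weekly_runs = engineers * runs_per_engineer_per_day * workdays
    return weekly_runs * flake_rate * pipeline_minutes / 60
```

Under this cost model, the 40–80 hour range quoted above corresponds to roughly three to six pipeline runs per engineer per day: rerun_hours_per_week(200, 3.2, 15, 0.05) comes out at about 40, and doubling the run count doubles the figure. Time lost to context switching is not included, so treat the result as a lower bound.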

2. Quarantine aggressively, fix on a clock. The Google playbook is straightforward: any test exceeding a flake threshold (commonly 1% over a rolling window) is automatically moved to a quarantine suite that does not block merges. The owning team gets a deadline, typically 7 to 14 days, to either fix or delete it. This is uncomfortable culturally and trivial technically. The cultural part is the bottleneck: engineering leadership has to publicly back the policy that a flaky test in main is worse than no test, because a flaky test trains the team to ignore failures. Without that backing, quarantine becomes a graveyard.
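The "trivial technically" part can be sketched in a few lines. The threshold and deadline below come from the figures above; the TrackedTest record, the status names, and the apply_policy function are hypothetical and stand in for whatever your CI tooling provides.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

FLAKE_THRESHOLD = 0.01             # 1% over the rolling window
FIX_DEADLINE = timedelta(days=14)  # upper end of the 7 to 14 day range


@dataclass
class TrackedTest:
    """Hypothetical per-test record kept by the CI tooling."""
    name: str
    flake_rate: float                      # measured over the rolling window
    quarantined_on: Optional[date] = None  # None means the test still blocks merges


def apply_policy(test: TrackedTest, today: date) -> str:
    """Return 'blocking', 'quarantined', or 'overdue' and update the record."""
    if test.quarantined_on is None:
        if test.flake_rate > FLAKE_THRESHOLD:
            test.quarantined_on = today  # moved out of the merge-blocking suite
            return "quarantined"
        return "blocking"
    if today - test.quarantined_on > FIX_DEADLINE:
        return "overdue"  # deadline passed: escalate to fix or delete
    return "quarantined"
```

The sketch deliberately omits the path back out of quarantine. In practice that usually means requiring a run of consecutive clean passes before a test blocks merges again.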

3. Treat the top five root causes as platform work, not team work. If your top flaky tests share root causes (shared database state, async race conditions in a particular framework, a fragile end-to-end harness), the fix belongs to a platform team, not the feature teams whose tests happen to expose the pattern. A reusable test fixture pattern, a deterministic clock injection, or a properly isolated database-per-test setup will eliminate categories of flakiness in a way that whack-a-mole bug fixing cannot.
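As one illustration of a platform-level fix, deterministic clock injection might look like the sketch below. Session and FakeClock are hypothetical names; the point is only that production code receives its notion of "now" as a dependency, so tests control time instead of racing it.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

Clock = Callable[[], datetime]


def system_clock() -> datetime:
    """Default clock used in production."""
    return datetime.now(timezone.utc)


class Session:
    """Hypothetical domain object whose behaviour depends on the current time."""

    def __init__(self, ttl: timedelta, clock: Clock = system_clock):
        self._clock = clock
        self._expires_at = clock() + ttl

    def is_expired(self) -> bool:
        return self._clock() >= self._expires_at


class FakeClock:
    """Test double: time moves only when the test says so."""

    def __init__(self, start: datetime):
        self.now = start

    def __call__(self) -> datetime:
        return self.now

    def advance(self, delta: timedelta) -> None:
        self.now += delta


def test_session_expires_across_midnight() -> None:
    # Start one minute before midnight UTC, a classic time-of-day trap.
    clock = FakeClock(datetime(2026, 1, 1, 23, 59, tzinfo=timezone.utc))
    session = Session(ttl=timedelta(minutes=30), clock=clock)
    assert not session.is_expired()
    clock.advance(timedelta(minutes=30))
    assert session.is_expired()
```

Because the fake clock is shipped once by the platform team, every feature team gets time-of-day and timezone determinism without having to patch individual tests.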

Where AI-generated tests fit — and where they don't

It is tempting to point an LLM at the problem. Recent work on LLM-based test generation shows that models can produce plausible test cases from specifications, but they tend to inherit and sometimes amplify the same flakiness patterns — particularly around async assertions and oracle correctness. AI-generated tests are useful for expanding coverage in well-isolated unit-test territory. They are not, in 2026, a substitute for fixing the structural causes of flakiness in your existing suite. If anything, generating more tests on top of a flaky harness multiplies the problem.

The more interesting use of LLMs in this space is root-cause classification: feeding flaky test logs into a model to bucket failures by likely cause (timing, data, environment, ordering). That is a tractable, well-bounded problem and one we have seen produce real triage time savings.
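A minimal sketch of that classification step is shown below. The model call is deliberately left as an injected function (ask_model) because the choice of provider and client library is outside the scope of this article; the prompt wording, the log truncation limit, and the 'unknown' fallback are assumptions for illustration.

```python
from typing import Callable

CATEGORIES = ("timing", "data", "environment", "ordering")
MAX_LOG_CHARS = 4000  # assumption: the tail of the log carries the failure


def build_prompt(log: str) -> str:
    """Build a single-label classification prompt from the end of a failure log."""
    tail = log[-MAX_LOG_CHARS:]
    return (
        "Classify the likely root cause of this flaky test failure.\n"
        f"Answer with exactly one word from: {', '.join(CATEGORIES)}.\n\n"
        f"Log:\n{tail}"
    )


def classify(log: str, ask_model: Callable[[str], str]) -> str:
    """Return one of CATEGORIES, or 'unknown' if the reply is not a valid label.

    `ask_model` is a hypothetical callable that sends a prompt to whichever
    model you use and returns its text reply.
    """
    reply = ask_model(build_prompt(log)).strip().lower()
    return reply if reply in CATEGORIES else "unknown"
```

Constraining the reply to a closed label set and rejecting anything else keeps the output easy to aggregate and makes it simple to spot-check the model against a hand-labelled sample before trusting the buckets.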


How Anystack approaches this

Flaky test remediation is a textbook brief for a 3-person AI-augmented pod: it requires senior judgement on what to quarantine versus rewrite, deep familiarity with CI internals, and the discipline to ship platform-level fixes rather than patch individual tests. A typical 90-day engagement on this looks like instrumenting flake metrics in week one, quarantining the top offenders by week two, and spending the remaining time on the structural fixes — deterministic fixtures, isolation patterns, async primitives — that prevent the next wave. The output is usually a 60–80% reduction in re-run rate and, more importantly, a CI signal that the team trusts again.

If you want to dig further, our notes on QA modernisation and delivery velocity cover the patterns we see most often in enterprise estates.
