The Real Cost of Flaky Tests: Why Your CI Signal Is Lying to You

Flaky tests don't just waste compute — they erode trust in CI, mask real regressions, and quietly add weeks to release cycles. Here's what the data says and what to do about it.

Anystack Engineering

At peak, Google's internal CI ran more than 1.5 million flaky test executions per day, meaning runs of tests that pass and fail non-deterministically against the same code. Martin Fowler's canonical write-up on non-determinism in tests frames the problem bluntly: a flaky test is worse than no test, because it trains engineers to ignore red builds. Microsoft Research's 2020 study of the Chrome and Firefox test suites found that between 12% and 16% of test failures across major OSS projects were flaky rather than real regressions. Spotify has publicly reported that flakiness was the single largest contributor to CI pipeline duration before they invested in detection infrastructure.

The second-order effects are where the real money goes. Each re-run consumes CI minutes, blocks merges, and — most expensively — teaches engineers that a red build is probably nothing. By the time a real regression slips through, the signal is already dead.

What the research actually shows

Three findings consistently show up across the academic and industry literature on flaky tests.

First, flakiness is dominated by a small number of root causes. The Luo et al. study at the University of Illinois (An Empirical Analysis of Flaky Tests) categorised 201 commits that fix flaky tests across 51 open-source projects and found that async wait, concurrency, and test order dependency accounted for over 75% of cases. This matters because it means flakiness is tractable: you do not need a hundred different fixes, you need a handful of patterns applied rigorously.
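Async wait, the first category in that breakdown, usually comes down to a test sleeping for a fixed interval and hoping the work has finished. A minimal sketch of the usual remedy, polling for the condition with an upper time bound, follows. The helper name `wait_until` and the simulated background job are hypothetical, chosen only for illustration.

```python
import threading
import time


def wait_until(condition, timeout=5.0, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Illustrative helper, not taken from any particular framework.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # One last check so a condition that turned true at the deadline still counts.
    return condition()


def test_background_job_completes():
    done = threading.Event()

    def job():
        # Simulated async work whose duration the test does not control.
        time.sleep(0.05)
        done.set()

    worker = threading.Thread(target=job)
    worker.start()

    # Flaky version: time.sleep(0.05) followed by assert done.is_set()
    # Sturdier version: wait on the condition itself, bounded by a timeout.
    assert wait_until(done.is_set, timeout=2.0)
    worker.join()
```

The timeout only bounds the failure case; a passing run returns as soon as the condition holds, so the test is also faster on average than a fixed sleep.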

Second, flaky tests cluster. Microsoft's data on Chrome showed that a small percentage of test files generates the majority of flaky failures. The distribution follows a power law rather than a uniform spread, so fixing the top 20 flaky tests in most enterprise codebases typically reclaims more than half of the wasted CI cycles.

Third, and this is the finding most engineering leaders underestimate, flakiness has a compounding cultural cost. A 2022 Meta engineering post described internal surveys in which flaky tests were the most frequently cited reason engineers lost trust in CI. Once trust erodes, engineers start merging on red builds, disabling tests locally, and dismissing failures as 'probably flaky' without investigation. That is how real regressions reach production.

Three actions for this week

These are not transformation programmes. They are things a Head of Engineering can put in motion in the next sprint.

Instrument flake rate as a first-class metric. Most teams track pipeline duration and pass rate. Add a third: percentage of failing builds that pass on retry without code change. If that number is above 5%, you have a trust problem, not a code problem. Surface it in the same dashboard as deployment frequency and change failure rate.
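A minimal sketch of that third metric is below, assuming your CI can export one record per pipeline attempt. The `BuildAttempt` record and its field names are hypothetical, not any CI vendor's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildAttempt:
    commit_sha: str  # commit the pipeline ran against
    attempt: int     # 1 for the first run, 2 or more for retries
    passed: bool


def flake_rate(attempts):
    """Share of initially failing builds that later passed on the same commit."""
    by_commit = {}
    for a in attempts:
        by_commit.setdefault(a.commit_sha, []).append(a)

    failed_first = 0
    passed_on_retry = 0
    for runs in by_commit.values():
        runs.sort(key=lambda a: a.attempt)
        if runs[0].passed:
            continue  # first attempt was green, so not a failing build
        failed_first += 1
        if any(r.passed for r in runs[1:]):
            passed_on_retry += 1  # same commit, no code change, now green

    return passed_on_retry / failed_first if failed_first else 0.0


history = [
    BuildAttempt("a1", 1, False), BuildAttempt("a1", 2, True),   # flaky
    BuildAttempt("b2", 1, False), BuildAttempt("b2", 2, False),  # real failure
    BuildAttempt("c3", 1, True),                                 # clean pass
    BuildAttempt("d4", 1, False),                                # failed, never retried
]
print(f"flake rate: {flake_rate(history):.0%}")  # flake rate: 33%
```

Grouping by commit is what enforces the "without code change" condition: a build that goes green only after a new push is counted under a different commit and never as a retry pass.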

Quarantine, don't delete. When a test is identified as flaky, move it out of the blocking pipeline into a non-blocking quarantine suite within 24 hours. Set a hard policy: tests in quarantine for more than 14 days are either fixed or deleted. This stops the slow accumulation that kills CI hygiene.
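The 14-day rule is easy to enforce mechanically. The sketch below assumes a registry that maps each quarantined test to the date it entered quarantine; the registry format and the test IDs are hypothetical.

```python
from datetime import date, timedelta

QUARANTINE_LIMIT = timedelta(days=14)  # the 14-day policy described above


def overdue_quarantined(quarantined, today):
    """Return test IDs that have sat in quarantine longer than the limit.

    `quarantined` maps a test ID to the date it entered quarantine.
    """
    return sorted(
        test_id
        for test_id, since in quarantined.items()
        if today - since > QUARANTINE_LIMIT
    )


# Hypothetical registry; in practice this might live in a checked-in file.
registry = {
    "tests/test_checkout.py::test_payment_timeout": date(2024, 3, 1),
    "tests/test_search.py::test_reindex": date(2024, 3, 20),
}

overdue = overdue_quarantined(registry, today=date(2024, 3, 25))
for test_id in overdue:
    print(f"fix or delete: {test_id}")
```

Running a check like this on a schedule, and failing it when the list is non-empty, turns the policy into something the pipeline enforces rather than something people have to remember.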

Audit your top 20. Pull the failure logs from the last 30 days, rank tests by retry-pass count, and assign owners to the top 20. The Luo et al. data suggests three patterns will cover most of them: timing assumptions in async code, concurrency and shared state, and order dependencies. None require novel engineering; they require focused attention.
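The ranking step can be sketched in a few lines, assuming the failure log can be reduced to one record per failed test execution with a flag for whether a retry on the same commit passed. The log format and test names here are hypothetical.

```python
from collections import Counter


def top_flaky_tests(failures, n=20):
    """Rank tests by how often a failure was followed by a pass on retry.

    `failures` is an iterable of (test_id, passed_on_retry) pairs, one per
    failed test execution in the audit window.
    """
    counts = Counter(
        test_id for test_id, passed_on_retry in failures if passed_on_retry
    )
    return counts.most_common(n)


# Hypothetical excerpt from a 30-day failure log.
log = [
    ("test_upload", True), ("test_upload", True), ("test_upload", True),
    ("test_login", True), ("test_login", False),
    ("test_export", False),
]
for test_id, retry_passes in top_flaky_tests(log, n=2):
    print(test_id, retry_passes)
```

Failures that never passed on retry are left out on purpose: they are more likely to be real regressions and belong in a different queue.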

Why LLM-generated tests won't save you (yet)

There is growing enthusiasm for using LLMs to generate test suites from specifications, and the research on LLM-based test generation shows genuine gains in coverage for boilerplate cases. But the same research highlights a critical limitation: LLMs are particularly bad at the oracle problem, that is, knowing what the *correct* output should be, and they routinely produce assertions that are either too loose to catch real bugs or too tight to survive minor refactors. That second category turns into flakiness when the over-specified detail is non-deterministic, such as ordering or timing.

In other words, naive AI test generation can make your flake problem worse, not better. The teams getting real value are using LLMs for property identification, mutation testing, and assertion strengthening on existing tests — not for greenfield test authoring against production code.
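To make the loose-versus-tight distinction concrete, here is a small sketch built around a toy function. `order_summary` and the three tests are hypothetical and are not drawn from any cited study.

```python
def order_summary(prices):
    """Toy function under test: counts and totals a list of prices."""
    return {"count": len(prices), "total": round(sum(prices), 2)}


def test_too_loose():
    # Passes for almost any implementation, including a broken one.
    assert order_summary([1.10, 2.20]) is not None


def test_too_tight():
    # Pins the exact string form. Adding a key or reordering keys fails
    # this test even though the behaviour callers rely on is unchanged.
    assert str(order_summary([1.10, 2.20])) == "{'count': 2, 'total': 3.3}"


def test_behavioural():
    # Asserts the properties callers depend on, and nothing more.
    result = order_summary([1.10, 2.20])
    assert result["count"] == 2
    assert result["total"] == 3.30
```

All three tests pass today. The difference only shows up later: the first keeps passing after a bug is introduced, and the second starts failing after a harmless refactor.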

The economic case in concrete numbers

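For a 200-engineer organisation running CI on every PR, take conservative assumptions: 8 PRs per engineer per week, a 15% flake rate on failing builds, 12 minutes average pipeline time, and $0.08 per CI minute. At one pipeline run per PR, those inputs come to about 998,400 CI minutes per year, or roughly $80,000 of total pipeline spend (200 × 8 × 52 × 12 × $0.08). The estimate of roughly $180,000 per year in pure compute waste therefore rests on inputs not listed here, chiefly how many pipeline runs each PR triggers and what share of runs fail. That is before counting engineer context-switching cost, which research from the SPACE framework authors suggests is 5–10x larger than the direct compute cost.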
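The arithmetic is worth running against your own numbers, because the result is very sensitive to two inputs the paragraph does not state: pipeline runs per PR and the overall build failure rate. The sketch below uses the stated inputs where they exist and marks the rest as placeholders. The function names and the one-extra-run-per-flaky-failure model are assumptions made for illustration.

```python
def annual_ci_spend(engineers, prs_per_week, pipeline_minutes,
                    cost_per_minute, runs_per_pr=1.0, weeks=52):
    """Total yearly pipeline cost in dollars."""
    minutes = engineers * prs_per_week * weeks * runs_per_pr * pipeline_minutes
    return minutes * cost_per_minute


def annual_flake_waste(total_spend, failure_rate, flake_share):
    """Spend on re-running builds whose failure was flaky.

    Assumes each flaky failure triggers exactly one extra full pipeline run.
    """
    return total_spend * failure_rate * flake_share


# Inputs stated in the text, at one pipeline run per PR.
baseline = annual_ci_spend(engineers=200, prs_per_week=8,
                           pipeline_minutes=12, cost_per_minute=0.08)
print(f"baseline spend: ${baseline:,.0f}")  # baseline spend: $79,872

# failure_rate is a placeholder, not a figure from the text; substitute your own.
waste = annual_flake_waste(baseline, failure_rate=0.25, flake_share=0.15)
print(f"flake re-run waste: ${waste:,.0f}")
```

Raising `runs_per_pr` to reflect multiple pushes per PR scales both figures linearly, which is usually the largest single lever in the estimate.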

Double both numbers if your CI is gated on a long integration suite. Triple them if your release process requires manual reverification when builds go red.

How Anystack approaches this

When we deploy a 3-person AI-augmented pod into a CI hygiene engagement, the first two weeks are pure instrumentation and triage: flake rate by service, quarantine policy, ownership graph for the top failing tests. Weeks three through eight are pattern-based fixes — async waits, shared fixtures, order dependencies — typically clearing 60–80% of flake volume. Weeks nine through twelve focus on durable guardrails: pre-merge flake detection, retry budgets, and integrating AI-assisted assertion review into the PR flow without handing test authorship to the model. The pod model works here because flakiness is a focused, finite engineering problem — exactly the shape of work that suits a small senior team rather than a large programme. If you want to see what that looks like end-to-end, our work in QA modernisation and test automation covers the full playbook.

Flaky tests are not a glamorous problem. They will not appear on a board slide, and no one is going to get promoted for fixing them. But the teams that ignore them pay the cost twice — once in wasted compute, and once in the slow erosion of every other engineering metric that depends on CI signal being trustworthy.