Flaky Tests: The CI Tax No One Budgets For

Google reports 1.5M flaky test runs per day at peak. The real cost isn't compute — it's the eroded trust in CI signals that quietly slows every release. Here's what engineering leaders can do this quarter.

Anystack Engineering

Martin Fowler's essay on eradicating non-determinism in tests remains the canonical reference on a problem most enterprise engineering orgs still under-invest in. The headline data point usually quoted alongside it, Google reporting over 1.5 million flaky test runs per day at peak across its internal CI, gets repeated often, but the underlying lesson rarely lands with the people who fund engineering: flaky tests are not a hygiene issue. They are a trust issue, and they compound.

The pattern repeats at every company we've worked with at scale. A test fails. An engineer reruns the pipeline. It passes. The team learns, quietly and over months, that red builds are sometimes real and sometimes noise. From that point on, every CI signal is discounted. Real regressions slip through because nobody trusts the alarm. Releases slow down because merges queue behind reruns. The most expensive engineers in your org spend their afternoons clicking 'retry'.

What the research actually says

Three findings from Fowler's work, the Google Testing Blog's follow-ups, and academic studies on test flakiness in large monorepos are worth pulling out for engineering leaders.

First, flakiness is rarely the test's fault in isolation. It's usually a property of the system under test and the environment the test runs in: shared state, async timing, test order dependencies, network calls to services that shouldn't be in the test path. Treating the symptom (quarantining the test) without treating the cause (architectural coupling, missing test doubles, leaky fixtures) just moves the problem.

Second, flaky tests have a Gresham's Law effect on test suites. Bad tests drive out good ones. Once a suite is known to be unreliable, engineers stop writing new tests against it, stop investigating failures rigorously, and start adding @Retry(3) annotations as a coping mechanism. The suite degrades faster than it's maintained.
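The retry coping mechanism is worth a few lines of arithmetic. The sketch below assumes that @Retry(3) means up to three attempts in total and that attempts fail independently. Both are assumptions: retry semantics vary by framework, and real flakes are often correlated. The function names are illustrative.

```python
def masked_failure_rate(p_fail: float, attempts: int) -> float:
    """Probability that every attempt fails, assuming independent attempts."""
    return p_fail ** attempts


def expected_attempts(p_fail: float, attempts: int) -> float:
    """Expected executions per test when retrying until first pass or the limit.

    Attempt k (counting from 0) only runs if the k attempts before it all failed.
    """
    return sum(p_fail ** k for k in range(attempts))


if __name__ == "__main__":
    p = 0.10  # assumed per-attempt failure probability of one flaky test
    print(f"visible failure rate: {masked_failure_rate(p, 3):.4f}")
    print(f"executions per run:   {expected_attempts(p, 3):.2f}")
```

Under those assumptions, a test that fails one attempt in ten shows up red roughly one run in a thousand, at a cost of about 1.11 executions per run. The dashboard turns green and the cause stays in place. A real intermittent defect in production code is masked in exactly the same way.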

Third, the dominant cost is human, not machine. Google's published analyses suggest that compute spend on reruns is a rounding error compared to the engineer-hours lost to context-switching, investigation, and the slower release cadence that flakiness forces. Internal data from clients we've worked with shows median time-to-merge increases of 30–60% in suites whose flake rate exceeds 2%.

Why most remediation programmes fail

The usual response is a quarterly 'test stability sprint'. A squad is pulled off feature work for two weeks, they fix the loudest 50 tests, dashboards turn green, and within a quarter the flake rate is back where it started. This fails for a structural reason: flakiness is generated continuously by the way the org writes tests. It is not a one-off backlog to drain.

The orgs that actually fix this treat flakiness as a flow problem, not a debt problem. They instrument the rate at which new flakes appear, set a budget (flakes-per-thousand-runs), and block PRs that push the suite over budget. They invest in the test infrastructure — proper hermetic fixtures, deterministic clocks, contract tests instead of end-to-end soup — that makes flaky tests structurally harder to write.
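A deterministic clock is the easiest of those infrastructure investments to picture. The sketch below is one possible shape, not a prescribed implementation: `FakeClock` and `Session` are hypothetical names, and the only idea being shown is that code under test asks an injected clock for the time instead of reading the wall clock.

```python
from datetime import datetime, timedelta, timezone


class FakeClock:
    """Test double: time only moves when the test says so."""

    def __init__(self, start: datetime):
        self._now = start

    def now(self) -> datetime:
        return self._now

    def advance(self, seconds: float) -> None:
        self._now += timedelta(seconds=seconds)


class Session:
    """Hypothetical unit under test: expires ttl_seconds after creation."""

    def __init__(self, clock, ttl_seconds: float):
        self._clock = clock
        self._expires_at = clock.now() + timedelta(seconds=ttl_seconds)

    def is_expired(self) -> bool:
        return self._clock.now() >= self._expires_at


def test_session_expires_exactly_at_ttl():
    clock = FakeClock(datetime(2024, 1, 1, tzinfo=timezone.utc))
    session = Session(clock, ttl_seconds=30)

    clock.advance(29)
    assert not session.is_expired()  # one second before the boundary

    clock.advance(1)
    assert session.is_expired()  # exactly on the boundary
```

The equivalent test written with real sleeps would take 30 seconds and could still fail on a loaded CI runner. With the injected clock it runs instantly and the boundary condition is exact.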

Three actions for this quarter

Measure flake rate as a first-class CI metric. Not pass rate but flake rate: the share of tests that produced both a pass and a fail on the same commit within a rolling window. If you don't have this number per service, you cannot manage it. Most CI platforms (Buildkite, CircleCI, GitHub Actions with the right plugins) expose the data; what's missing is the dashboard and the owner.
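The definition is simple enough to compute from raw run records. A minimal sketch follows, assuming the records have already been exported from the CI platform and filtered to the rolling window. The `Run` record shape and the function names are assumptions made for illustration, not any platform's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One test execution (hypothetical record shape)."""
    test_id: str
    commit: str
    passed: bool


def flaky_tests(runs):
    """Tests that both passed and failed on the same commit."""
    outcomes = defaultdict(set)
    for run in runs:
        outcomes[(run.test_id, run.commit)].add(run.passed)
    return {
        test_id
        for (test_id, _commit), seen in outcomes.items()
        if seen == {True, False}
    }


def flake_rate(runs):
    """Share of distinct tests in the window that were flaky at least once."""
    all_tests = {run.test_id for run in runs}
    if not all_tests:
        return 0.0
    return len(flaky_tests(runs)) / len(all_tests)
```

Keying on the (test, commit) pair is what separates a flake from a regression: a test that fails on one commit and passes on the next was probably fixed, while one that does both on the same commit was not testing a code change at all.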

Set a flake budget and enforce it at the PR gate. Pick a number — 0.5% is aggressive, 2% is realistic for legacy suites — and treat breaches the same way you'd treat an SLO breach. New flakes block merges to the affected service until they're either fixed or explicitly quarantined with an expiry date. Quarantine without an expiry is just hiding the problem.
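As a sketch of what the gate might look like, the function below counts a flaky test against the budget unless it has a quarantine entry that has not yet expired. That reading of the policy is our assumption, as are the function name, the argument shapes, and the 2% default.

```python
from datetime import date


def check_flake_budget(flaky, total_tests, quarantine, today, budget=0.02):
    """Decide whether a merge may proceed.

    flaky:       set of test ids currently classified as flaky
    total_tests: number of tests in the suite
    quarantine:  dict of test id -> expiry date
    Returns (allowed, rate, counted), where counted lists the flaky tests
    that are unquarantined or whose quarantine has lapsed.
    """
    live = {test_id for test_id, expires in quarantine.items() if expires >= today}
    counted = sorted(set(flaky) - live)
    rate = len(counted) / total_tests if total_tests else 0.0
    return rate <= budget, rate, counted


if __name__ == "__main__":
    allowed, rate, counted = check_flake_budget(
        flaky={"checkout_e2e", "login_retry", "search_paging"},
        total_tests=200,
        quarantine={
            "checkout_e2e": date(2024, 6, 1),  # still quarantined
            "login_retry": date(2024, 1, 1),   # quarantine has lapsed
        },
        today=date(2024, 3, 1),
    )
    print(allowed, f"{rate:.3f}", counted)
```

Because a lapsed quarantine entry stops shielding its test, the expiry date does real work: the flake returns to the budget automatically and someone has to either fix it or justify renewing the exemption.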

Audit your test pyramid, starting with the end-to-end layer. The single biggest source of flakes in most enterprise suites is over-reliance on end-to-end tests that should have been contract tests or component tests. If more than 20% of your test runtime is end-to-end, you have a structural problem that no amount of retry logic will fix.
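The 20% check is a one-liner once test durations are tagged by tier. A rough sketch, assuming you can export total runtime per tier from your CI platform; the tier labels and function names here are illustrative.

```python
def runtime_share(seconds_by_tier):
    """Fraction of total test runtime spent in each tier."""
    total = sum(seconds_by_tier.values())
    if total == 0:
        return {tier: 0.0 for tier in seconds_by_tier}
    return {tier: secs / total for tier, secs in seconds_by_tier.items()}


def e2e_over_threshold(seconds_by_tier, threshold=0.20, tier="e2e"):
    """True when the end-to-end tier exceeds the given share of runtime."""
    return runtime_share(seconds_by_tier).get(tier, 0.0) > threshold


if __name__ == "__main__":
    suite = {"unit": 300, "component": 300, "e2e": 400}  # made-up numbers
    print(runtime_share(suite)["e2e"], e2e_over_threshold(suite))
```

The hard part in practice is usually the tagging rather than the arithmetic. If tests are not labelled by tier, a directory or naming convention is often a workable first approximation.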

What 90 days looks like in practice

A 3-person AI-augmented pod typically approaches flakiness as a 90-day engagement structured in three arcs.

The first 30 days are diagnostic. The pod instruments flake rate across the top 5–10 pipelines, classifies the existing flakes by root cause (async timing, shared state, external dependency, order dependency), and builds the dashboards the platform team will own afterwards. LLM-assisted code analysis accelerates the classification step significantly — what used to take a team a quarter of manual triage now takes a senior engineer with the right tooling about two weeks.

Days 31–60 focus on remediating the highest-leverage failure modes. Not all flakes are equal. The pod fixes the ones gating the most merges and rewrites the test infrastructure (fixtures, doubles, deterministic clocks) that's generating new flakes downstream. AI-generated test scaffolding helps here too, though with an important caveat: research on LLM-generated tests, including recent work on test generation from specifications, consistently shows that LLMs produce tests with weak oracles. They check that code runs, not that it's correct. Senior review of every generated assertion is non-negotiable.
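A weak oracle is easier to recognise with an example in front of you. The sketch below is constructed for illustration and is not taken from the research mentioned above. `apply_discount_buggy` contains a deliberate bug, and the two checks show the difference between asserting that code ran and asserting that it was right.

```python
def apply_discount_buggy(price: float, percent: float) -> float:
    # Deliberate bug for illustration: adds the discount instead of subtracting it.
    return price + price * percent / 100


def apply_discount_fixed(price: float, percent: float) -> float:
    return price - price * percent / 100


def weak_oracle(fn) -> bool:
    """Only checks that the call returns a number."""
    result = fn(200.0, 10)
    return isinstance(result, float)


def strong_oracle(fn) -> bool:
    """Checks the value against an independently known answer."""
    return fn(200.0, 10) == 180.0


if __name__ == "__main__":
    for fn in (apply_discount_buggy, apply_discount_fixed):
        print(fn.__name__, weak_oracle(fn), strong_oracle(fn))
```

The weak check passes for both implementations, so it adds coverage without adding protection. That is what a reviewer should look for in generated assertions: whether the expected value came from the specification or was simply read back from the code.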

Days 61–90 are about making the gains stick: PR-gate enforcement, flake budgets wired into the deployment pipeline, runbooks for the on-call engineer who'll inherit the dashboards. The pod hands over to the internal platform team with the metrics, the tooling, and a documented bar that the org now holds itself to.

The compounding effect

The reason this work pays back disproportionately is that CI trust is upstream of nearly every other delivery metric. Lead time, change failure rate, and mean time to recovery all improve when engineers believe their tests. The DORA research has shown this repeatedly: high-performing teams are not the ones with the most tests; they're the ones that actually trust the tests they have.

For an engineering leader trying to decide where to spend the next quarter of platform budget, the calculation is usually straightforward. If your flake rate is above 2% and your median PR takes more than a day to merge, fixing the test suite will return more velocity per pound than almost any other investment — including, in many cases, the AI coding tools your board is asking about. Faster CI is a precondition for those tools paying off anyway; an LLM that can write code ten times faster doesn't help if the test suite gates merges for hours.

Anystack runs QA modernisation engagements as senior-only pods. We don't ship junior engineers, we don't run discovery phases that bill for six weeks before any code changes, and we measure success in flake rate and merge time, not deliverables. If your CI signal is no longer trusted, that's the problem worth fixing first.