The Compounding Tax of Flaky Tests: What Google's 1.5M Daily Reruns Should Teach Your Org

Flaky tests don't just waste compute — they corrode the trust that makes CI valuable. Here's what the data says and what engineering leaders should do this quarter.

Anystack Engineering

At peak, Google's internal infrastructure records over 1.5 million flaky test runs per day — tests that pass and fail without any code change. The figure has been quoted for years, but the operational consequence is usually understated: once your CI signal becomes probabilistic, every downstream engineering decision degrades. Reviewers stop reading failures. Release managers retry pipelines reflexively. Incidents get re-classified as "probably flaky" until one of them isn't.

The research on test non-determinism is now mature enough to move from awareness to action. Martin Fowler's canonical write-up on eradicating non-determinism in tests, the DORA 2024 research on what separates elite delivery teams, and a decade of internal data from FAANG-scale CI systems converge on the same conclusion: flakiness is not a QA problem, it is a delivery economics problem. And it compounds.

The real cost is not compute, it is trust

The naïve framing of flaky tests is wasted CI minutes. A 2,000-engineer org with a 20-minute pipeline and a 3% flake-induced retry rate burns roughly 200 engineer-hours per week on reruns alone, assuming each engineer triggers about ten pipeline runs a week and waits on every rerun. That's real money, but it isn't the headline number.
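
The 200-hour estimate can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the figure of ten pipeline runs per engineer per week is an assumption chosen so the numbers reconcile, and the function name is hypothetical.

```python
def weekly_rerun_hours(engineers, runs_per_engineer_per_week,
                       retry_rate, pipeline_minutes):
    """Engineer-hours per week spent waiting on flake-induced reruns."""
    total_runs = engineers * runs_per_engineer_per_week
    reruns = total_runs * retry_rate
    return reruns * pipeline_minutes / 60


# 2,000 engineers, 20-minute pipeline, 3% retry rate come from the text.
# 10 runs per engineer per week is an assumed value, not a measured one.
hours = weekly_rerun_hours(2000, 10, 0.03, 20)
print(round(hours))  # 200
```

Plug in your own run counts before quoting a figure internally; the result scales linearly with every input.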

The headline number is the trust collapse. Once engineers learn that roughly 1 in 30 runs fails for no reason, they start treating every failure as noise. Google's internal post-mortems on flaky tests repeatedly surface the same pattern: real regressions sit in "flaky" buckets for days before someone notices the test was actually catching a bug. DORA's 2021 cohort data shows elite performers restoring service 6,570× faster than low performers, and a meaningful share of that gap plausibly traces back to whether your CI signal is trustworthy enough to bisect against.

The second-order effect is worse. Teams that lose faith in CI build workarounds: long-lived feature branches, manual QA gates before merge, release freezes ahead of demos. Each workaround slows delivery and re-introduces the integration risk CI was meant to eliminate. You end up paying twice — once for the broken CI, once for the manual processes that route around it.

Three findings worth acting on this quarter

The research literature converges on three patterns that engineering leaders should treat as operational rules, not opinions.

First, flake rate is a leading indicator of delivery health, not a lagging one. Teams that allow flake rates above 1% see deploy frequency degrade within two quarters, even when the underlying code quality is unchanged. The mechanism is behavioural: engineers batch larger changes because each merge is more expensive to validate. Larger changes mean longer review cycles, more merge conflicts, and higher change failure rates — the exact opposite of what DORA's elite cohort looks like.

Second, most flakiness clusters in a tiny subset of tests. Microsoft's research on test flakiness, and similar internal analyses at Meta and Spotify, consistently find that 60–80% of flake events come from under 5% of tests. This is good news: you do not need to fix everything. You need a flake-tracking system that surfaces the top 50 offenders and a policy that quarantines or deletes them within a fixed SLA.
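
Finding that small subset is a simple ranking exercise once flake events are logged. The following is a minimal sketch, assuming each flake event is recorded as a test identifier; the function name and the sample data are hypothetical.

```python
from collections import Counter


def top_offenders(flake_events, share=0.8):
    """Smallest set of tests, most flaky first, covering `share` of all flake events."""
    counts = Counter(flake_events)
    total = sum(counts.values())
    covered = 0
    offenders = []
    for test_id, n in counts.most_common():
        if covered >= share * total:
            break
        offenders.append((test_id, n))
        covered += n
    return offenders


# Hypothetical log: one entry per observed flake.
events = ["test_checkout"] * 6 + ["test_login"] * 3 + ["test_search"]
print(top_offenders(events))  # [('test_checkout', 6), ('test_login', 3)]
```

If the returned list is a small fraction of your suite, the concentration described above holds for your codebase and the quarantine list is short.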

Third, AI-generated tests do not solve flakiness — they often make it worse. Recent work on LLM-based test generation shows that while LLMs are competent at producing test scaffolding from specs, they frequently generate brittle assertions: timing-dependent checks, hardcoded environment assumptions, and over-specified mocks. AI-generated tests catch different bugs than human-written suites, which is genuinely useful — but their flake rate tends to be higher unless paired with deterministic fixtures and an oracle strategy. Throwing a copilot at your test suite without modernising your test infrastructure first will reliably increase your flake rate.
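
To make "deterministic fixtures" concrete, here is one common way to remove a timing-dependent check: inject a clock the test controls instead of sleeping on the real one. This is a sketch under stated assumptions; `FakeClock` and `TokenCache` are hypothetical names invented for illustration, not part of any library.

```python
class FakeClock:
    """Hypothetical test double: time only moves when the test says so."""

    def __init__(self, start=0.0):
        self.now = start

    def time(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class TokenCache:
    """Hypothetical unit under test: holds a value for `ttl` seconds."""

    def __init__(self, ttl, clock):
        self.ttl = ttl
        self.clock = clock
        self._value = None
        self._stored_at = None

    def put(self, value):
        self._value = value
        self._stored_at = self.clock.time()

    def get(self):
        if self._stored_at is None:
            return None
        if self.clock.time() - self._stored_at >= self.ttl:
            return None  # entry has expired
        return self._value


def test_entry_expires_after_ttl():
    # A brittle version would call time.sleep() and read the wall clock,
    # which makes the result depend on scheduler load on the CI worker.
    clock = FakeClock()
    cache = TokenCache(ttl=30, clock=clock)
    cache.put("abc")
    clock.advance(29.9)
    assert cache.get() == "abc"
    clock.advance(0.2)
    assert cache.get() is None
```

The test runs in microseconds and gives the same answer on a loaded worker as on a laptop, which is the property generated tests tend to lack.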

What to do this week

Three concrete actions, in order of leverage.

Instrument flake rate as a first-class CI metric. Track per-test pass/fail history across reruns of the same commit. If you cannot answer "what is our org-wide flake rate this week" in under five minutes, you cannot manage it. Most modern CI systems (Buildkite, CircleCI, GitHub Actions with third-party plugins) expose this; if yours does not, a 200-line script against your CI database will.
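
As a rough illustration of what such a script computes, the sketch below treats a test as flaky on a commit when it both passed and failed on that same commit. The record layout (commit, test id, pass flag) is an assumption; adapt it to whatever your CI database actually stores.

```python
from collections import defaultdict


def flake_report(runs):
    """runs: iterable of (commit_sha, test_id, passed), one per test execution.

    Returns (per_test_rate, org_wide_rate), where a rate is the fraction of
    (commit, test) pairs that showed both a pass and a fail.
    """
    outcomes = defaultdict(set)  # (commit, test) -> set of observed outcomes
    for commit, test_id, passed in runs:
        outcomes[(commit, test_id)].add(bool(passed))

    seen = defaultdict(int)   # test -> number of commits it ran on
    flaky = defaultdict(int)  # test -> number of commits with mixed results
    for (_commit, test_id), results in outcomes.items():
        seen[test_id] += 1
        if len(results) == 2:
            flaky[test_id] += 1

    per_test = {t: flaky[t] / seen[t] for t in seen}
    org_wide = sum(flaky.values()) / sum(seen.values()) if seen else 0.0
    return per_test, org_wide


# Hypothetical records for two tests across two commits.
runs = [
    ("c1", "t_login", True), ("c1", "t_login", False),  # mixed: flaky
    ("c2", "t_login", True),
    ("c1", "t_pay", True),
    ("c2", "t_pay", False), ("c2", "t_pay", False),     # consistent fail: not flaky
]
per_test, org_wide = flake_report(runs)
print(per_test)   # {'t_login': 0.5, 't_pay': 0.0}
print(org_wide)   # 0.25
```

Note that a test failing consistently on a commit is counted as a real failure, not a flake; only disagreement between reruns of the same commit counts.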

Establish a quarantine SLA. Any test that flakes more than three times in a rolling 14-day window gets auto-quarantined and assigned to the owning team with a 10-business-day deadline to fix or delete. No exceptions for "important" tests — an important test that flakes is worse than no test, because it teaches engineers to ignore failures.
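
The policy is mechanical enough to automate. A minimal sketch, assuming flake events are available as dates per test; the function names are hypothetical, and the business-day calculation skips weekends only, not public holidays.

```python
from datetime import date, timedelta


def should_quarantine(flake_dates, today, window_days=14, threshold=3):
    """True if the test flaked more than `threshold` times in the rolling window."""
    window_start = today - timedelta(days=window_days)
    recent = [d for d in flake_dates if window_start < d <= today]
    return len(recent) > threshold


def fix_deadline(quarantined_on, business_days=10):
    """Date that falls `business_days` working days after quarantine (weekends skipped)."""
    current = quarantined_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


today = date(2024, 6, 3)  # a Monday
flakes = [date(2024, 5, 22), date(2024, 5, 28), date(2024, 6, 1), date(2024, 6, 3)]
if should_quarantine(flakes, today):
    print(fix_deadline(today))  # 2024-06-17
```

Keeping the threshold and window as parameters makes it easy to tighten the policy later without rewriting the job that enforces it.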

Audit your AI-generated test suite separately. If your team has adopted Copilot, Cursor, or Claude-assisted test generation in the past 12 months, segment those tests in your flake-rate dashboard. If their flake rate is materially higher than human-written tests (it usually is), pause AI-driven test generation until you have deterministic fixtures and a clear oracle strategy.
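
Segmenting the dashboard requires only an origin label on each test. The sketch below assumes tests are already tagged with a hypothetical `origin` field (for example via a marker or file-level metadata) and that flaky and total run counts are known; what counts as "materially higher" is left to the reader.

```python
def flake_rate_by_origin(tests):
    """tests: iterable of dicts with 'origin', 'flaky_runs' and 'total_runs' keys."""
    totals = {}
    for t in tests:
        flaky, total = totals.get(t["origin"], (0, 0))
        totals[t["origin"]] = (flaky + t["flaky_runs"], total + t["total_runs"])
    # Skip segments with no runs to avoid dividing by zero.
    return {origin: flaky / total
            for origin, (flaky, total) in totals.items() if total}


# Hypothetical counts, not measurements.
tests = [
    {"origin": "ai", "flaky_runs": 6, "total_runs": 200},
    {"origin": "ai", "flaky_runs": 4, "total_runs": 300},
    {"origin": "human", "flaky_runs": 5, "total_runs": 1000},
]
rates = flake_rate_by_origin(tests)
print(rates)  # {'ai': 0.02, 'human': 0.005}
```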

Why this matters at the platform level

Flakiness rarely gets fixed by the team that owns the test. The root cause is usually shared infrastructure: a database fixture that doesn't reset cleanly, a message bus with non-deterministic ordering, a shared staging environment with race conditions, or a test runner that doesn't isolate processes properly. Asking individual feature teams to fix flakes one by one is asking them to debug someone else's platform.

This is why QA modernisation and CI/CD acceleration almost always need to be tackled together. A test suite that runs in 4 minutes with a 0.1% flake rate enables trunk-based development, small batch sizes, and the deploy frequency that DORA's elite performers exhibit. A 40-minute suite with a 3% flake rate guarantees you will never get there, regardless of how many engineers you hire.

How Anystack approaches this

When a 3-person AI-augmented pod takes on a flaky-CI engagement, the first two weeks are diagnostic: instrumenting flake rate by test, by service, and by shared dependency; identifying the 5% of tests producing most of the noise; and mapping which platform components (fixtures, message buses, time-dependent code, shared environments) are the actual root causes. The next 60 days are remediation: deterministic fixtures, hermetic test environments, quarantine policy, and a flake-rate dashboard the engineering org actually looks at. The outcome we target is a pipeline that engineers trust enough to merge against without a second thought — because that is the precondition for every other delivery improvement you want to make.