Flaky Tests: Retrying Isn't a Fix

A pull request landed in my review queue: half the diff was tests, and a model had written them. They passed locally. In CI they failed once, passed on a re-run, then failed again two days later under load. Nobody had touched the code.

That test wasn’t wrong. It was flaky — and flaky is its own kind of broken.

AI didn’t invent flakiness — it mass-produced it

Writing a test used to cost something, so you wrote fewer and thought harder about each. A model writes them in seconds, from the code’s current shape alone. It can’t see the concurrency, the resource limits, or the real timing of an async call — so it reaches for the patterns that look right and are quietly non-deterministic:

A fixed sleep to “wait” for async work: time.Sleep(2 * time.Second), await page.waitForTimeout(2000), cy.wait(2000).
An unseeded random source whose value differs every run: Math.random(), rand.Intn(100), a faker with no fixed seed.
A brittle selector pinned to DOM structure — nth-child, an auto-generated class name — that breaks the moment the markup shifts.
A mock for everything, so the test verifies an interaction, not a result — the same empty green I wrote about in chasing coverage with tests that do nothing.

Each looks reasonable in isolation. At scale they’re a flakiness factory.

The suite gets less reliable as it grows

A single test with a 0.5% chance of a spurious failure is invisible. Put 800 of those in a suite and, assuming they fail independently, the chance that at least one fails on a clean run is 1 − 0.995^800 ≈ 98%. In practice the failures aren’t independent either: CI runs on shared, throttled hardware, where the async test that finishes in 50 ms on your laptop times out under load.

As the model keeps adding tests, that count climbs and the suite’s odds of a false red march toward certainty. The pass you care about drowns in noise you don’t.
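
A minimal sketch of that arithmetic in Python, to show how fast the odds climb with suite size. The function name is illustrative, and it assumes every test flakes independently at the same rate, which real suites only approximate.

```python
def p_false_red(per_test_flake_rate: float, n_tests: int) -> float:
    """Chance that at least one of n_tests fails spuriously on a clean run.

    Assumes each test flakes independently at the same rate. That is a
    simplification: real CI failures tend to cluster when the runner is loaded.
    """
    return 1.0 - (1.0 - per_test_flake_rate) ** n_tests


if __name__ == "__main__":
    # Same 0.5% per-test rate as in the text, at a few suite sizes.
    for n in (50, 200, 800, 2000):
        print(f"{n:>5} tests -> {p_false_red(0.005, n):.1%} chance of a false red")
```

The per-test rate never changes in this model; only the count does. That is the whole problem.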

Retrying is a treadmill, not a cure

The reflex is to absorb the noise: auto-retry the job, mark the test flaky, quarantine it, open a ticket. Every one of these is reactive — it acts after the test has already broken CI, usually more than once.

Retries burn real CI minutes and stretch the feedback loop; you pay, twice, to learn nothing new.
Quarantine grows a silent pile of disabled tests. That pile is debt, and it compounds — each skipped test is a check nobody runs anymore.
The ticket ages into backlog noise and quietly stops mattering.

None of it makes the test deterministic. It hides the symptom and bills you for the privilege.
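
To put a rough number on that bill, here is a back-of-the-envelope sketch in Python. It assumes a job-level auto-retry that re-runs the whole suite and treats each run as independent; both are simplifications, and the function names are made up for illustration.

```python
def expected_runs(p_red: float, max_retries: int) -> float:
    """Expected number of full job runs when a red job is re-run automatically.

    p_red is the chance a single run goes red for spurious reasons. Runs are
    treated as independent, which is an assumption.
    """
    # Run k+1 only happens if the first k runs were all red.
    return sum(p_red ** k for k in range(max_retries + 1))


def p_still_red(p_red: float, max_retries: int) -> float:
    """Chance the job is still red after every retry has been used."""
    return p_red ** (max_retries + 1)


if __name__ == "__main__":
    # 0.98 is the suite-level false-red figure from the 800-test example.
    print(f"expected runs: {expected_runs(0.98, 2):.2f}")
    print(f"still red after 2 retries: {p_still_red(0.98, 2):.1%}")
```

With a 98% false-red rate and two retries, the job runs about 2.94 times on average and still ends red roughly 94% of the time. Per-test retries do better than this, but they buy a green without making anything deterministic.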

Catch the anti-pattern, not the failure

A flaky test rarely breaks for a subtle reason. It breaks because of a small set of well-known anti-patterns — and those are visible in the diff, before the test ever runs. A fixed sleep, an unseeded RNG, a wall-clock assertion, a structural selector: each one is detectable statically and deterministically, at the moment the test is written.

I run that check at the commit boundary: a static pass over the added lines of changed test files, flagging the known anti-patterns. Same diff, same result, no model, no network. It won’t catch a clever race, and it isn’t meant to. The point is narrower and worth a lot: stop the obvious flake from ever reaching CI, where catching it costs far more in build minutes, plus a human’s afternoon.
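
A rough Python sketch of what such a pass could look like. The rule names, the regexes, and the test-file path convention are illustrative assumptions, not CommitBrief's actual rules. The parser handles only the common shape of a unified diff.

```python
import re
from typing import Iterator, NamedTuple

# Hypothetical rule set for illustration; deliberately narrow.
RULES = {
    "fixed-sleep": re.compile(
        r"\btime\.Sleep\(|\btime\.sleep\(|\bwaitForTimeout\(|\bcy\.wait\(\s*\d"
    ),
    "unseeded-random": re.compile(r"\bMath\.random\(|\brand\.Intn\("),
    "structural-selector": re.compile(r"nth-child\("),
}

# Hypothetical convention for recognising test files by path.
TEST_FILE = re.compile(r"(_test\.go|\.(test|spec|cy)\.[jt]sx?|(^|/)test_[^/]*\.py)$")


class Finding(NamedTuple):
    path: str
    line: int
    rule: str


def scan_diff(diff_text: str) -> Iterator[Finding]:
    """Yield findings for added lines in changed test files of a unified diff."""
    path, new_line = None, 0
    for raw in diff_text.splitlines():
        if raw.startswith("+++ "):
            # "+++ b/path/to/file" names the new version of the file.
            target = raw[4:].strip()
            path = target[2:] if target.startswith("b/") else target
        elif raw.startswith("@@"):
            # Hunk header: "@@ -old,count +new,count @@"
            m = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", raw)
            new_line = int(m.group(1)) if m else 0
        elif raw.startswith("+"):
            if path and TEST_FILE.search(path):
                for rule, pattern in RULES.items():
                    if pattern.search(raw[1:]):
                        yield Finding(path, new_line, rule)
            new_line += 1
        elif raw.startswith(" "):
            new_line += 1  # context lines advance the new-file line counter
        # Removed lines ("-") do not exist in the new file, so they are skipped.
```

Fed the output of something like git diff --cached from a pre-commit hook, this yields the same findings for the same diff every time. Note what the rules leave alone: cy.wait('@alias') waits on a fact, so only the numeric form is flagged.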

Design the test so there’s nothing to detect

Detection is the backstop. The real fix is to write the test deterministically in the first place:

Replace fixed waits with condition-based waiting: poll until the expected state appears, or use the framework’s Eventually/waitFor. Wait on a fact, not a clock.
Seed every random source, and inject a deterministic clock and IDs instead of reading now() and Math.random() mid-test.
Select by role or test id, never by DOM position.
Mock the boundary, not the logic — and keep at least one real integration path, or the suite verifies a world that doesn’t exist.
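
A small Python sketch of the first two points: a polling helper in place of a fixed sleep, and a seeded RNG plus an injected clock in place of ambient randomness and wall-clock time. Every name here (wait_until, FakeClock, make_token, is_expired) is hypothetical and exists only for illustration. Most frameworks ship their own equivalent of the polling helper.

```python
import random
import threading
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0,
               interval: float = 0.01) -> None:
    """Poll until condition() is true; raise if it never becomes true.

    The timeout is an upper bound on failure, not the expected wait.
    """
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval)


class FakeClock:
    """A clock the test controls, injected instead of reading the wall clock."""

    def __init__(self, start: float = 0.0) -> None:
        self._now = start

    def now(self) -> float:
        return self._now

    def advance(self, seconds: float) -> None:
        self._now += seconds


def make_token(rng: random.Random, clock: FakeClock, ttl: float = 60.0) -> dict:
    """Hypothetical code under test: randomness and time arrive as arguments."""
    return {"id": rng.randrange(10**6), "expires_at": clock.now() + ttl}


def is_expired(token: dict, clock: FakeClock) -> bool:
    return clock.now() >= token["expires_at"]


def test_worker_signals_done() -> None:
    done = threading.Event()
    threading.Timer(0.05, done.set).start()  # stand-in for async work
    wait_until(done.is_set, timeout=2.0)     # returns as soon as the fact holds
    assert done.is_set()


def test_token_expires() -> None:
    rng, clock = random.Random(1234), FakeClock()
    token = make_token(rng, clock)
    assert not is_expired(token, clock)
    clock.advance(61)  # time moves because the test says so; nothing sleeps
    assert is_expired(token, clock)
    # Same seed, same id, on every run and every machine.
    assert make_token(random.Random(1234), FakeClock())["id"] == token["id"]
```

The generous timeout in wait_until costs nothing on the happy path, because the helper returns the moment the condition holds. A fixed sleep pays its full duration every run and still loses the race on a slow runner.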

When does this stop being the right answer?

A static gate is precision-first: it catches the obvious anti-patterns, not a race condition buried in your own code. Keep its rules conservative — a noisy gate gets ignored, and an ignored gate is worse than none. It will miss the clever flake; don’t let it guess.

And none of this substitutes for the harder read: a test that’s flaky because the system under test is non-deterministic isn’t telling you about the test. It’s telling you about the system.

A green CI you have to re-run to trust isn’t green. The cheapest place to kill a flaky test is the line where it’s written — before it ever costs you a build.

See also: the commit-boundary check described here is what I shipped in CommitBrief — its flaky-test detector statically catches fixed sleeps and unseeded randomness in changed test files, deterministically, before the model.