Flaky Tests: Root Cause Analysis

Flaky tests are the single most corrosive force in a CI pipeline. Not because individual flakes are expensive — they're often just a few seconds of rerun time — but because they systematically erode the team's relationship with the test suite. Once engineers internalize that "red sometimes doesn't mean broken," they start ignoring CI signals more broadly. By the time a real regression shows up as a CI failure, the noise-to-signal ratio has trained everyone to assume it's another flake.

The first step in fixing flaky tests is refusing to treat "flakiness" as a single category. It isn't. Lumping all non-deterministic failures together prevents systematic root cause analysis. There are four distinct root causes, and each requires a different treatment.

Root Cause 1: Timing and Asynchrony

This is the most common root cause, and the one most frequently misdiagnosed as "the test runner being slow." The pattern is: a test step fires before the application has finished processing the previous action. The classic example is clicking a submit button and immediately asserting on the success message before the async HTTP request has returned.

The wrong fix is adding a fixed delay such as time.sleep(2). It doesn't remove the underlying race condition; it only makes the race less likely to trigger by padding the timeline. On a slow CI runner under load, 2 seconds may not be enough. On a fast local machine, every run wastes most of those 2 seconds.

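The right fix is deterministic waiting: explicit wait conditions tied to observable application state. In Playwright terms, that means page.waitForSelector('.success-message') or, better, a web-first assertion such as expect(page.locator('.success-message')).toBeVisible(), which retries until the element appears or the timeout expires. Be cautious with page.waitForLoadState('networkidle'): Playwright's own documentation discourages it for tests, because a quiet network doesn't guarantee the UI has finished rendering. In Selenium terms, use WebDriverWait with an expected condition rather than implicit waits. Implicit waits are notorious for making timing flakiness worse, not better, and the Selenium documentation warns that mixing them with explicit waits produces unpredictable wait times.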
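To make the difference from a fixed sleep concrete, here is a minimal, framework-independent sketch of a polling wait. The names wait_until and FakeApp are hypothetical and exist only for illustration; a real suite would normally rely on the wait primitives its test framework already provides.

```python
import threading
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0,
               interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeApp:
    """Hypothetical stand-in for an app that handles a submit asynchronously."""

    def __init__(self) -> None:
        self.success_visible = False

    def submit(self, latency: float) -> None:
        # Simulate an async HTTP round trip that completes after `latency` seconds.
        threading.Timer(latency, self._finish).start()

    def _finish(self) -> None:
        self.success_visible = True


app = FakeApp()
app.submit(latency=0.2)

# Returns as soon as the state is observable instead of after a fixed delay.
appeared = wait_until(lambda: app.success_visible, timeout=2.0)
```

The wait ends shortly after the simulated response arrives, and it fails with a clear timeout if the response never arrives, so the test is neither padded on fast machines nor racing on slow ones.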

The diagnostic question for timing flakiness: does the test fail more often on slower CI runners or under higher CI load? If yes, it's timing. Inspect the failure screenshots as well. They typically show the UI in a transitional state (spinner still visible, form not yet cleared) rather than an incorrect final state.

Root Cause 2: Test Isolation Failures

A test that depends on the side effects of a previously run test is not a unit — it's a fragment. Isolation failures are common in suites that share a single test user account across all test cases, or that run against a shared staging database without cleaning up between runs.

The diagnostic signature here is ordering dependence: the test passes when run in isolation but fails when run as part of the full suite. To confirm, run the test alone and then as the tenth test in the suite. If the failure rate changes, you have an isolation problem.
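The isolation-versus-suite comparison can be automated. The sketch below assumes a toy suite whose tests are plain functions returning True on pass; shared_cart, the two test functions, and find_polluters are all hypothetical names invented for this example. It runs each test alone, then directly after each other test, and reports every pair where the outcome changes.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shared state that leaks between tests.
shared_cart: List[str] = []


def test_add_item() -> bool:
    shared_cart.append("widget")
    return len(shared_cart) >= 1


def test_cart_starts_empty() -> bool:
    # Implicitly assumes no earlier test touched the cart.
    return shared_cart == []


TESTS: Dict[str, Callable[[], bool]] = {
    "test_add_item": test_add_item,
    "test_cart_starts_empty": test_cart_starts_empty,
}


def run_sequence(names: List[str]) -> Dict[str, bool]:
    """Run the named tests in order, starting from clean shared state."""
    shared_cart.clear()
    return {name: TESTS[name]() for name in names}


def find_polluters() -> List[Tuple[str, str]]:
    """Return (polluter, victim) pairs where running the polluter first
    changes the victim's result compared with running the victim alone."""
    pairs = []
    for victim in TESTS:
        alone = run_sequence([victim])[victim]
        for other in TESTS:
            if other == victim:
                continue
            after_other = run_sequence([other, victim])[victim]
            if after_other != alone:
                pairs.append((other, victim))
    return pairs


polluters = find_polluters()
```

The pairwise check is quadratic in suite size, so on a large suite it is more practical to apply it only to tests already suspected of ordering dependence.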

Fixes: use separate test user accounts per test (or per test class), tear down created test data at the end of each test, and use database transactions that roll back after each test case when the architecture allows it. For E2E tests against a real staging environment, maintain a dedicated test-data cleanup step that runs before the test class, not just after.

We're not saying shared test state is always wrong — some flows genuinely need to test sequences, and setting up initial state from scratch for every test case is expensive. The key is making the dependencies explicit and ordered rather than implicit and fragile.

Root Cause 3: Environment Inconsistency

Tests that pass in one environment and fail in another — or that pass 9 out of 10 times in the same environment — are often responding to environmental non-determinism: different browser versions between local and CI, inconsistent CDN cache states affecting asset load times, or staging environment database replication lag creating eventual-consistency windows that tests hit intermittently.

A practical example: a growing SaaS team ran their Cypress suite against a staging environment backed by a read replica. About 8% of their test runs failed with "element not found" errors on a dashboard page — but only in CI. Investigation revealed the dashboard query ran against the read replica, and a ~200ms replication lag meant recently created test data wasn't always visible when the test loaded the page. The fix was routing staging queries to the primary during test runs, not adding more waits.

Environment flakiness is the hardest to diagnose because the failure signals look similar to timing issues. The distinguishing factor: timing flakiness tracks runner speed and load, getting worse on slower or busier runners and fading on faster ones; environment flakiness persists even on faster hardware and often correlates with specific times of day or specific test data patterns.
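One way to check which pattern applies is to group historical failures by runner class and by hour of day. The sketch below uses made-up records purely to show the shape of the analysis; the Run fields and the failure_rate_by helper are hypothetical, and real data would come from your CI system's run history.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, NamedTuple


class Run(NamedTuple):
    runner: str   # hypothetical runner class, e.g. "fast" or "slow"
    hour: int     # hour of day the run started (0-23)
    passed: bool


def failure_rate_by(runs: List[Run], key: Callable[[Run], Hashable]) -> Dict[Hashable, float]:
    """Failure rate of `runs`, grouped by the value `key` extracts."""
    totals: Dict[Hashable, int] = defaultdict(int)
    failures: Dict[Hashable, int] = defaultdict(int)
    for run in runs:
        totals[key(run)] += 1
        if not run.passed:
            failures[key(run)] += 1
    return {group: failures[group] / totals[group] for group in totals}


# Made-up data: 10 runs per (runner, hour) cell, failures only at 03:00.
records: List[Run] = []
for runner in ("fast", "slow"):
    for hour, failure_count in ((3, 2), (14, 0)):
        for i in range(10):
            records.append(Run(runner, hour, passed=(i >= failure_count)))

by_runner = failure_rate_by(records, key=lambda r: r.runner)
by_hour = failure_rate_by(records, key=lambda r: r.hour)
```

In this invented data set the failure rate is identical across runner classes but concentrated in one hour, which is the environment signature described above. A rate that instead rose sharply on the slow runners would point to timing.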

Root Cause 4: Application-Level Non-Determinism

Sometimes the test isn't flaky — the application is. Features that depend on random ordering (a "recommended items" list not sorted deterministically), time-sensitive displays (a "last seen X minutes ago" field), or A/B test bucket assignment can cause tests to fail not because of test authoring issues but because the application genuinely produces different outputs on different runs.

This is actually valuable information: it means the application has behavior that isn't fully specified. A "recommended items" list that can appear in any order is untestable without either seeding the recommendation algorithm or loosening the assertion to check membership rather than exact order. Tests that surface these ambiguities are doing their job — they're telling you the application behavior isn't deterministic and the spec needs clarifying.

The fix here is either to make the relevant application behavior deterministic for test runs (by injecting a fixed seed, disabling A/B assignments in the test environment, or freezing time with a test clock) or to make the assertion match the actual guaranteed behavior rather than the order-specific output you happened to observe when writing the test.
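Both options can be sketched in a few lines. The example below assumes a hypothetical recommended_items function that accepts an injectable random generator and a hypothetical last_seen_label function that accepts an injectable clock; the catalog contents are invented.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional

CATALOG = ["lamp", "desk", "chair", "shelf"]


def recommended_items(rng: Optional[random.Random] = None) -> List[str]:
    """Return three catalog items in an order decided by `rng`."""
    rng = rng or random.Random()
    items = CATALOG[:]
    rng.shuffle(items)
    return items[:3]


def last_seen_label(last_seen: datetime, now: Callable[[], datetime]) -> str:
    """Format a relative timestamp using an injectable clock."""
    minutes = int((now() - last_seen).total_seconds() // 60)
    return f"last seen {minutes} minutes ago"


# Option 1: inject a fixed seed so the order is reproducible across runs.
run_a = recommended_items(random.Random(42))
run_b = recommended_items(random.Random(42))

# Option 2: assert only what is guaranteed (size and membership, not order).
unseeded = recommended_items()
membership_ok = len(unseeded) == 3 and set(unseeded) <= set(CATALOG)

# Frozen clock: the label no longer depends on when the test runs.
frozen_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
label = last_seen_label(frozen_now - timedelta(minutes=7), now=lambda: frozen_now)
```

The seeded variant pins behavior the application does not promise, so it suits regression checks. The membership assertion encodes the actual contract and keeps passing if the ordering logic changes.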

Building a Flakiness Tracking System

You cannot systematically fix flaky tests without first quantifying them. Every CI run should log, at minimum: test name, pass/fail, and run duration. With that data, you can compute a flakiness rate per test over a rolling window. Tests with a pass rate between 60% and 95% are your first-order targets — highly flaky tests (below 60%) should be quarantined rather than fixed immediately, as they're actively corrupting the signal. Stable tests (above 95%) don't need intervention.
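A rough sketch of that computation, using the thresholds above, might look like this. The TestResult record and the sample data are invented for illustration; the fields mirror the minimum log described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class TestResult(NamedTuple):
    name: str
    passed: bool
    duration_s: float


def pass_rates(results: Iterable[TestResult]) -> Dict[str, float]:
    """Fraction of passing runs per test name over the given window."""
    runs: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for result in results:
        runs[result.name] += 1
        if result.passed:
            passes[result.name] += 1
    return {name: passes[name] / runs[name] for name in runs}


def classify(rate: float) -> str:
    """Map a pass rate to an action using the 60% and 95% thresholds."""
    if rate < 0.60:
        return "quarantine"
    if rate <= 0.95:
        return "fix"
    return "stable"


def sample(name: str, passes: int, total: int) -> List[TestResult]:
    # Invented history: `passes` passing runs out of `total`.
    return [TestResult(name, i < passes, 1.0) for i in range(total)]


history = sample("test_login", 20, 20) + sample("test_checkout", 17, 20) \
    + sample("test_export", 8, 20)

inventory = {name: classify(rate) for name, rate in pass_rates(history).items()}
```

For the rates to measure flakiness rather than real regressions, the window should ideally exclude runs on commits that were genuinely broken; that filtering is left out of this sketch.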

Most CI systems expose this data through their API. A simple script pulling run results over the last 30 days and computing pass rates per test name will give you a flakiness inventory in an afternoon. Prioritize by frequency multiplied by impact: a test that fails 15% of the time and sits on the critical path to deploy is a more urgent problem than one that fails 40% of the time but covers an isolated edge case.
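The prioritization rule can be expressed as a simple score. In the sketch below, the impact weights (10 for a test on the critical path to deploy, 1 for an isolated edge case) are arbitrary assumptions chosen for illustration, as are the test names; calibrate the weights to your own pipeline.

```python
from typing import List, NamedTuple


class FlakyTest(NamedTuple):
    name: str
    failure_rate: float  # fraction of runs that fail, 0.0 to 1.0
    impact: float        # assumed weight: higher means more blocking


def priority(test: FlakyTest) -> float:
    """Frequency multiplied by impact."""
    return test.failure_rate * test.impact


candidates: List[FlakyTest] = [
    FlakyTest("edge_case_export", failure_rate=0.40, impact=1.0),
    FlakyTest("checkout_flow", failure_rate=0.15, impact=10.0),
]

# Highest score first: this is the order in which to schedule fixes.
ranked = sorted(candidates, key=priority, reverse=True)
```

With these weights the critical-path test scores roughly 1.5 against 0.4 for the edge case, so it is ranked first despite failing less often.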

Once you have the inventory, assign ownership. Flaky tests that belong to no team get ignored indefinitely. Flaky tests with a named owner and a sprint-level deadline get fixed. The cultural side of flakiness management — making someone responsible for each test's reliability — is as important as the technical root cause analysis.

When to Quarantine vs. Fix

Quarantining (marking tests as non-blocking in CI) is appropriate in two situations: when a test is so flaky it's generating more noise than signal and you don't have bandwidth to fix it this sprint; or when the root cause is environmental and fixing it requires infrastructure changes that will take weeks. Quarantined tests should run in a separate CI job that doesn't block the pipeline, with results sent to a tracking dashboard. A quarantined test that stays quarantined for more than 30 days should be deleted — at that point it isn't providing coverage, it's accumulating technical debt.
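The 30-day rule is easy to enforce mechanically. This sketch assumes you record the date each test entered quarantine; the quarantine registry, the test names, and the overdue helper are hypothetical.

```python
from datetime import date
from typing import Dict, List

MAX_QUARANTINE_DAYS = 30


def overdue(quarantined_since: Dict[str, date], today: date) -> List[str]:
    """Names of tests quarantined for more than MAX_QUARANTINE_DAYS."""
    return sorted(
        name
        for name, since in quarantined_since.items()
        if (today - since).days > MAX_QUARANTINE_DAYS
    )


# Invented registry of quarantined tests and their quarantine dates.
quarantine = {
    "test_dashboard_totals": date(2024, 2, 15),  # 45 days before `today`
    "test_invoice_pdf": date(2024, 3, 1),        # exactly 30 days
    "test_avatar_upload": date(2024, 3, 20),     # 11 days
}

to_delete = overdue(quarantine, today=date(2024, 3, 31))
```

Running a check like this in the quarantine job and failing it when the list is non-empty turns the deadline into something the pipeline enforces rather than something people have to remember.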

The goal is a CI signal you can trust: when the pipeline is red, it means something is broken, not that a runner was slow or a shared fixture leaked state. Reaching that state requires categorizing flakiness by root cause, not just retrying until green.