Sequential Testing

An experimental method that allows continuous monitoring of results and valid early stopping, without inflating false positive rates the way traditional peeking does.

Sequential testing is an approach to running experiments that lets you check results at any point during a test — and make a valid decision to stop — without inflating your false positive rate. It solves one of the most common ways A/B tests go wrong in practice: peeking.

The Peeking Problem It Solves

In a standard fixed-horizon A/B test, you commit to a sample size upfront and are supposed to wait until you reach it before drawing conclusions. In practice, almost nobody does this. Teams check dashboards daily, stop tests early when they see p < 0.05, and inadvertently inflate their false positive rate well beyond the nominal 5%.

The reason: every time you check a running test against a significance threshold, you give yourself another opportunity to declare a result. The more times you check, the higher the probability that at least one look crosses the threshold by chance, even when the true effect is zero.
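To see how quickly the error rate inflates, here is a small Monte Carlo sketch of A/A tests (true effect of zero) checked naively at 20 interim points versus once at the end. The sample sizes and number of looks are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SIMS = 2000        # simulated A/A tests (no true effect)
N_PER_ARM = 10_000   # final sample size per arm
CHECKS = 20          # interim looks at the running test

peeking_rejections = 0   # "significant" at any interim look
final_rejections = 0     # "significant" only at the final look

for _ in range(N_SIMS):
    a = rng.normal(0.0, 1.0, N_PER_ARM)
    b = rng.normal(0.0, 1.0, N_PER_ARM)
    looks = np.linspace(N_PER_ARM // CHECKS, N_PER_ARM, CHECKS, dtype=int)

    # Peeking: apply a naive p < 0.05 check at every interim look
    for n in looks:
        diff = b[:n].mean() - a[:n].mean()
        se = np.sqrt(a[:n].var(ddof=1) / n + b[:n].var(ddof=1) / n)
        if abs(diff / se) > 1.96:
            peeking_rejections += 1
            break

    # Fixed-horizon: apply the same check once, at the full sample size
    diff = b.mean() - a.mean()
    se = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / N_PER_ARM)
    final_rejections += abs(diff / se) > 1.96

print(f"false positive rate with 20 peeks:   {peeking_rejections / N_SIMS:.3f}")
print(f"false positive rate, one final look: {final_rejections / N_SIMS:.3f}")
```

The peeking rate typically comes out several times larger than the nominal 5%, while the single-look rate stays near 5%.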

Sequential testing addresses this by spending the significance budget over time rather than consuming it all in a single final check.

How Sequential Testing Works

The key idea is the always-valid p-value (or its equivalent, the confidence sequence). Instead of a fixed threshold applied once at the end, sequential methods define a boundary that adjusts based on how much data you've seen so far.

Two common approaches:

  • mSPRT (mixture Sequential Probability Ratio Test): Produces a test statistic you can evaluate at any sample size. Valid stopping is built in — if the statistic crosses a threshold, you can stop and claim significance
  • Sequential confidence sequences: Instead of tracking a p-value, you track a confidence interval that is valid at all sample sizes simultaneously. If it excludes zero (or your equivalence bound), you can stop
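As a concrete sketch of the first approach, here is a minimal normal-mixture mSPRT for a stream of treatment-minus-control differences with known variance. The function name, the default mixing variance tau2, and the simulated data are illustrative assumptions, not any platform's API.

```python
import numpy as np

def msprt_always_valid_p(diffs, sigma2, tau2=1.0):
    """Always-valid p-values for a stream of per-observation treatment-minus-control
    differences, using a normal-mixture SPRT with mixing distribution N(0, tau2).

    diffs  : 1-D array of observed differences (assumed Normal with known variance)
    sigma2 : variance of a single difference
    tau2   : variance of the mixing distribution over the effect size
    """
    diffs = np.asarray(diffs, dtype=float)
    n = np.arange(1, len(diffs) + 1)
    mean_n = np.cumsum(diffs) / n

    # Mixture likelihood ratio Lambda_n against H0: effect = 0
    lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
        (n**2 * tau2 * mean_n**2) / (2 * sigma2 * (sigma2 + n * tau2))
    )

    # Always-valid p-value: running minimum of 1 / Lambda_n, capped at 1
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / lam))

# Usage: stop the first time the always-valid p-value drops below alpha.
rng = np.random.default_rng(1)
stream = rng.normal(0.2, 1.0, 5000)          # simulated true lift of 0.2, unit variance
p_values = msprt_always_valid_p(stream, sigma2=1.0)
stop = np.argmax(p_values < 0.05) if (p_values < 0.05).any() else None
print("can stop at observation:", stop)
```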

Both approaches trade some statistical efficiency for monitoring flexibility. Compared to a properly run fixed-horizon test, sequential tests require somewhat more data on average to reach a decision. But compared to peeking at fixed-horizon tests — which is the real alternative — sequential testing is dramatically more reliable.
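The cost of that flexibility shows up in interval widths. Under the same normal-mixture assumptions as the sketch above, the confidence-sequence half-width (valid at every n) can be compared with the usual fixed-horizon interval (valid only at one pre-committed n); tau2 and the sample sizes below are illustrative.

```python
import numpy as np

def cs_half_width(n, sigma2, tau2=1.0, alpha=0.05):
    """Half-width of the normal-mixture confidence sequence (dual to the mSPRT),
    valid simultaneously at every sample size n."""
    v = sigma2 + n * tau2
    return np.sqrt(sigma2 * v / (n**2 * tau2) * np.log(v / (alpha**2 * sigma2)))

def fixed_half_width(n, sigma2):
    """Half-width of the standard fixed-horizon interval, valid at a single n."""
    return 1.96 * np.sqrt(sigma2 / n)   # z-quantile for alpha = 0.05, two-sided

for n in (1_000, 10_000, 100_000):
    print(n, round(float(cs_half_width(n, 1.0)), 4), round(float(fixed_half_width(n, 1.0)), 4))
```

In this configuration the sequence is roughly twice as wide as the fixed interval at each n shown; that gap is the extra data the method charges in exchange for the right to look whenever you want.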

Fixed-Horizon vs. Sequential

| | Fixed-Horizon Test | Sequential Test |
| --- | --- | --- |
| Valid early stopping | No | Yes |
| Valid continuous monitoring | No | Yes |
| False positive control | At final check only | Throughout |
| Average sample to decision | Lower (if run correctly) | Higher |
| Practical peeking behavior | Invalidates the test | Built in |
| Implementation complexity | Low | Medium |

When to Use Sequential Testing

Sequential testing is a good fit when:

  • Your team monitors dashboards during tests and will inevitably peek — sequential testing makes that safe
  • Decisions are time-sensitive and stopping early for a clear winner or loser has real business value
  • You're running low-traffic tests where early stopping for a losing variant prevents wasted time
  • You want to build a culture of valid experimentation without enforcing strict no-peeking rules

Most modern experimentation platforms (Statsig, Eppo, Optimizely Stats Engine) offer sequential testing options, often framed as 'continuous testing' or 'always-valid inference.'

Limitations

  • Lower average efficiency than fixed-horizon. If your team is disciplined enough to commit to a sample size and not peek, a fixed-horizon test will typically reach a decision with less data
  • More complex to implement from scratch. Correct mSPRT or confidence sequence implementations require statistical care. Use a platform that handles it rather than rolling your own
  • Not a fix for underpowered tests. Sequential testing gives you flexibility on when to stop, but it doesn't make a fundamentally underpowered test useful. If your MDE is too small to detect at any reasonable sample size, no stopping rule helps
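To make the last point concrete, the usual two-sample rule of thumb (roughly 80% power at a two-sided 5% level) already shows how fast the required sample grows as the minimum detectable effect shrinks; the effect sizes below are illustrative.

```python
def n_per_arm(delta, sigma=1.0, z_alpha=1.96, z_beta=0.84):
    """Rough fixed-horizon sample size per arm: 2 * (z_a + z_b)^2 * sigma^2 / delta^2."""
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

for delta in (0.2, 0.05, 0.01):
    print(f"MDE of {delta} standard deviations: ~{n_per_arm(delta):,.0f} per arm")
```

No stopping rule changes those numbers by more than a modest factor; if the smallest effect you care about puts the required sample out of reach for your traffic, the test is underpowered however you monitor it.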

Paired with CUPED for variance reduction, sequential testing is one of the most practical combinations for teams that need to move faster without sacrificing statistical validity.