Sequential Testing

An experimental method that allows continuous monitoring of results and valid early stopping, without inflating false positive rates the way traditional peeking does.

Sequential testing is an approach to running experiments that lets you check results at any point during a test — and make a valid decision to stop — without inflating your false positive rate. It solves one of the most common ways A/B tests go wrong in practice: peeking.

The Peeking Problem It Solves

In a standard fixed-horizon A/B test, you commit to a sample size upfront and are supposed to wait until you reach it before drawing conclusions. In practice, almost nobody does this. Teams check dashboards daily, stop tests early when they see p < 0.05, and inadvertently inflate their false positive rate well beyond the nominal 5%.

The reason: every time you check a running test against a significance threshold, you give yourself another opportunity to declare a result. The more times you check, the higher the probability that at least one look crosses the threshold by chance, even when the true effect is zero.
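To see how quickly the error rate inflates, here is a small Monte Carlo sketch of A/A tests (true effect of zero) checked naively at 20 interim points versus once at the end. The sample sizes and number of looks are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SIMS = 2000        # simulated A/A tests (no true effect)
N_PER_ARM = 10_000   # final sample size per arm
CHECKS = 20          # interim looks at the running test

peeking_rejections = 0   # "significant" at any interim look
final_rejections = 0     # "significant" only at the final look

for _ in range(N_SIMS):
    a = rng.normal(0.0, 1.0, N_PER_ARM)
    b = rng.normal(0.0, 1.0, N_PER_ARM)
    looks = np.linspace(N_PER_ARM // CHECKS, N_PER_ARM, CHECKS, dtype=int)

    # Peeking: apply a naive p < 0.05 check at every interim look
    for n in looks:
        diff = b[:n].mean() - a[:n].mean()
        se = np.sqrt(a[:n].var(ddof=1) / n + b[:n].var(ddof=1) / n)
        if abs(diff / se) > 1.96:
            peeking_rejections += 1
            break

    # Fixed-horizon: apply the same check once, at the full sample size
    diff = b.mean() - a.mean()
    se = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / N_PER_ARM)
    final_rejections += abs(diff / se) > 1.96

print(f"false positive rate with 20 peeks:   {peeking_rejections / N_SIMS:.3f}")
print(f"false positive rate, one final look: {final_rejections / N_SIMS:.3f}")
```

The peeking rate typically comes out several times larger than the nominal 5%, while the single-look rate stays near 5%.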

Sequential testing addresses this by spending the significance budget over time rather than consuming it all in a single final check.

How Sequential Testing Works

The key idea is the always-valid p-value (or its equivalent, the confidence sequence). Instead of a fixed threshold applied once at the end, sequential methods define a boundary that adjusts based on how much data you've seen so far.

Two common approaches:

  • mSPRT (mixture Sequential Probability Ratio Test): Produces a test statistic you can evaluate at any sample size. Valid stopping is built in — if the statistic crosses a threshold, you can stop and claim significance
  • Sequential confidence sequences: Instead of tracking a p-value, you track a confidence interval that is valid at all sample sizes simultaneously. If it excludes zero (or your equivalence bound), you can stop
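As a concrete sketch of the first approach, here is a minimal normal-mixture mSPRT for a stream of treatment-minus-control differences with known variance. The function name, the default mixing variance tau2, and the simulated data are illustrative assumptions, not any platform's API.

```python
import numpy as np

def msprt_always_valid_p(diffs, sigma2, tau2=1.0):
    """Always-valid p-values for a stream of per-observation treatment-minus-control
    differences, using a normal-mixture SPRT with mixing distribution N(0, tau2).

    diffs  : 1-D array of observed differences (assumed Normal with known variance)
    sigma2 : variance of a single difference
    tau2   : variance of the mixing distribution over the effect size
    """
    diffs = np.asarray(diffs, dtype=float)
    n = np.arange(1, len(diffs) + 1)
    mean_n = np.cumsum(diffs) / n

    # Mixture likelihood ratio Lambda_n against H0: effect = 0
    lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
        (n**2 * tau2 * mean_n**2) / (2 * sigma2 * (sigma2 + n * tau2))
    )

    # Always-valid p-value: running minimum of 1 / Lambda_n, capped at 1
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / lam))

# Usage: stop the first time the always-valid p-value drops below alpha.
rng = np.random.default_rng(1)
stream = rng.normal(0.2, 1.0, 5000)          # simulated true lift of 0.2, unit variance
p_values = msprt_always_valid_p(stream, sigma2=1.0)
stop = np.argmax(p_values < 0.05) if (p_values < 0.05).any() else None
print("can stop at observation:", stop)
```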

Both approaches trade some statistical efficiency for monitoring flexibility. Compared to a properly run fixed-horizon test, sequential tests require somewhat more data on average to reach a decision. But compared to peeking at fixed-horizon tests — which is the real alternative — sequential testing is dramatically more reliable.
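The cost of that flexibility shows up in interval widths. Under the same normal-mixture assumptions as the sketch above, the confidence-sequence half-width (valid at every n) can be compared with the usual fixed-horizon interval (valid only at one pre-committed n); tau2 and the sample sizes below are illustrative.

```python
import numpy as np

def cs_half_width(n, sigma2, tau2=1.0, alpha=0.05):
    """Half-width of the normal-mixture confidence sequence (dual to the mSPRT),
    valid simultaneously at every sample size n."""
    v = sigma2 + n * tau2
    return np.sqrt(sigma2 * v / (n**2 * tau2) * np.log(v / (alpha**2 * sigma2)))

def fixed_half_width(n, sigma2):
    """Half-width of the standard fixed-horizon interval, valid at a single n."""
    return 1.96 * np.sqrt(sigma2 / n)   # z-quantile for alpha = 0.05, two-sided

for n in (1_000, 10_000, 100_000):
    print(n, round(float(cs_half_width(n, 1.0)), 4), round(float(fixed_half_width(n, 1.0)), 4))
```

In this configuration the sequence is roughly twice as wide as the fixed interval at each n shown; that gap is the extra data the method charges in exchange for the right to look whenever you want.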

Fixed-Horizon vs. Sequential

| | Fixed-Horizon Test | Sequential Test |
| --- | --- | --- |
| Valid early stopping | No | Yes |
| Valid continuous monitoring | No | Yes |
| False positive control | At final check only | Throughout |
| Average sample to decision | Lower (if run correctly) | Higher |
| Practical peeking behavior | Invalidates the test | Built in |
| Implementation complexity | Low | Medium |

When to Use Sequential Testing

Sequential testing is a good fit when:

  • Your team monitors dashboards during tests and will inevitably peek — sequential testing makes that safe
  • Decisions are time-sensitive and stopping early for a clear winner or loser has real business value
  • You're running low-traffic tests where early stopping for a losing variant prevents wasted time
  • You want to build a culture of valid experimentation without enforcing strict no-peeking rules

Most modern experimentation platforms (Statsig, Eppo, Optimizely Stats Engine) offer sequential testing options, often framed as 'continuous testing' or 'always-valid inference.'

Limitations

  • Lower average efficiency than fixed-horizon. If your team is disciplined enough to commit to a sample size and not peek, a fixed-horizon test will typically reach a decision with less data
  • More complex to implement from scratch. Correct mSPRT or confidence sequence implementations require statistical care. Use a platform that handles it rather than rolling your own
  • Not a fix for underpowered tests. Sequential testing gives you flexibility on when to stop, but it doesn't make a fundamentally underpowered test useful. If your MDE is too small to detect at any reasonable sample size, no stopping rule helps
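To make the last point concrete, the usual two-sample rule of thumb (roughly 80% power at a two-sided 5% level) already shows how fast the required sample grows as the minimum detectable effect shrinks; the effect sizes below are illustrative.

```python
def n_per_arm(delta, sigma=1.0, z_alpha=1.96, z_beta=0.84):
    """Rough fixed-horizon sample size per arm: 2 * (z_a + z_b)^2 * sigma^2 / delta^2."""
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

for delta in (0.2, 0.05, 0.01):
    print(f"MDE of {delta} standard deviations: ~{n_per_arm(delta):,.0f} per arm")
```

No stopping rule changes those numbers by more than a modest factor; if the smallest effect you care about puts the required sample out of reach for your traffic, the test is underpowered however you monitor it.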

Paired with CUPED for variance reduction, sequential testing is one of the most practical combinations for teams that need to move faster without sacrificing statistical validity.