Peeking Problem

The peeking problem is the inflation of false positive rates that occurs when experimenters check results during a running test and stop early upon seeing significance.

In day-to-day terms, peeking means computing a test's p-value repeatedly while the test runs and stopping the experiment as soon as the value crosses the significance threshold. The practice is common and intuitively harmless, but it dramatically inflates your false positive rate.

A fixed-horizon test run to completion at α = 0.05 has a 5% chance of a false positive. The same test checked daily and stopped at the first significant result can have a false positive rate of 25–30%, even though the p-value on screen still reads 0.05 at the moment you stop.

Why Peeking Inflates False Positives

Statistical significance thresholds assume you collect all data, then run one test. Each time you peek and compute a p-value, you're running another test on the same data. Multiple comparisons inflate the chance of seeing an extreme result by chance.

Think of it like flipping a fair coin 100 times and declaring victory the moment you see six heads in a row: any specific six-flip window has only a 1/64 (~1.6%) chance of being all heads, but the chance of such a run appearing somewhere in 100 flips is better than 50%.
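The analogy can itself be checked by simulation. This sketch (function name and parameters are illustrative) estimates how often a run of six heads appears somewhere in 100 fair flips, for comparison with the 1/64 chance of any one pre-chosen window:

```python
import random

def prob_run_of_heads(n_flips=100, run_len=6, n_sims=20_000, seed=7):
    """Fraction of simulated fair-coin sequences containing at least one
    run of `run_len` consecutive heads."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        streak = 0
        for _ in range(n_flips):
            if rng.random() < 0.5:   # heads
                streak += 1
                if streak == run_len:
                    hits += 1
                    break            # stop at the first qualifying run
            else:
                streak = 0           # tails resets the streak
    return hits / n_sims

# Any pre-chosen six-flip window: (1/2) ** 6, about 0.016.
# A run somewhere in 100 flips: prob_run_of_heads() comes out above 0.5.
```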

How Often Peeking Happens

The false positive rate scales with how often you peek:

Peeks at intermediate points | True false positive rate (nominal α = 0.05)
0 (run to completion)        | 5%
5                            | ~14%
10                           | ~19%
20+                          | ~25%+
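Figures like these can be reproduced by simulation. The sketch below runs repeated A/A tests (no true effect) on standard normal data, applies a two-sided z-test at evenly spaced checkpoints, and counts how often any checkpoint looks significant. Function names and parameters are illustrative, and the exact rates depend on how the peeks are spaced:

```python
import random
from statistics import NormalDist

def false_positive_rate(n_obs, n_peeks, n_sims=2_000, alpha=0.05, seed=42):
    """Simulate A/A tests (true effect = 0) and return the fraction that
    a peeking experimenter would stop early as a false positive."""
    rng = random.Random(seed)
    norm = NormalDist()
    # n_peeks evenly spaced interim looks, plus the final analysis
    checkpoints = sorted({round(n_obs * (i + 1) / (n_peeks + 1))
                          for i in range(n_peeks + 1)})
    rejections = 0
    for _ in range(n_sims):
        total, next_cp = 0.0, 0
        for i in range(1, n_obs + 1):
            total += rng.gauss(0.0, 1.0)
            if next_cp < len(checkpoints) and i == checkpoints[next_cp]:
                next_cp += 1
                z = total / i ** 0.5
                p_value = 2.0 * (1.0 - norm.cdf(abs(z)))
                if p_value < alpha:   # "significant" by chance alone
                    rejections += 1
                    break
    return rejections / n_sims

# Example sweep (rates climb with the number of peeks, as in the table):
# for peeks in (0, 5, 10, 20):
#     print(peeks, false_positive_rate(500, peeks))
```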

Solutions

Sequential testing / always-valid p-values — Methods like mSPRT (mixture Sequential Probability Ratio Test) or e-values are designed to be checked at any time without inflating error rates. Several modern platforms use these by default.
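As a concrete sketch: for i.i.d. normal observations with known variance, the mSPRT mixture likelihood ratio has a closed form, and its reciprocal (capped at 1 and made monotone over time) yields an always-valid p-value. The function below assumes H0: mean = 0, known sigma, and a Normal(0, tau²) mixing distribution over the alternative mean; the names and defaults are my own:

```python
import math

def always_valid_p_values(xs, sigma=1.0, tau=1.0):
    """Always-valid p-value sequence for H0: mean = 0, computed from the
    mSPRT mixture likelihood ratio with a Normal(0, tau**2) mixture over
    the alternative mean. Safe to inspect after every observation."""
    p, total, out = 1.0, 0.0, []
    for n, x in enumerate(xs, start=1):
        total += x
        mean = total / n
        v = sigma ** 2 + n * tau ** 2
        # log of the mixture likelihood ratio Lambda_n
        log_lr = (0.5 * math.log(sigma ** 2 / v)
                  + (n ** 2 * tau ** 2 * mean ** 2) / (2 * sigma ** 2 * v))
        # p_n = running minimum of 1 / Lambda_n, capped at 1
        p = min(p, 1.0, math.exp(-log_lr))
        out.append(p)
    return out
```

Because the sequence only ever decreases, stopping the first time it drops below α does not inflate the Type I error the way repeated fixed-horizon p-values do.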

Pre-commitment — Decide your sample size and runtime upfront. Write them down. Don't stop the test until the planned endpoint, regardless of interim results.
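Pre-commitment starts with a power calculation. A minimal sketch for a two-sided two-sample z-test on means, using only the standard library (the function name and defaults are illustrative):

```python
import math
from statistics import NormalDist

def sample_size_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Observations needed per arm for a two-sided two-sample z-test to
    detect a mean difference `delta` when each arm has std dev `sigma`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, ~1.96
    z_beta = NormalDist().inv_cdf(power)            # power quantile, ~0.84
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2)

# e.g. detecting a 0.1-sigma effect at 80% power needs roughly
# 1,570 observations per arm: sample_size_per_arm(0.1, 1.0)
```

Write the resulting number down before launch; it is the planned endpoint you commit not to stop before.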

Bayesian A/B testing — Bayesian methods produce probabilities (e.g., "85% chance variant beats control") that can be interpreted at any point without the same peeking inflation, though they have their own assumptions.
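A minimal Beta-Binomial sketch of that probability, using uniform Beta(1, 1) priors and Monte Carlo draws from the two posteriors (the function name, priors, and defaults are illustrative choices):

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, n_draws=20_000, seed=1):
    """Monte Carlo estimate of P(rate_B > rate_A) from conversion counts,
    under independent Beta(1, 1) priors on each arm's conversion rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        # Posterior of a Binomial rate under a Beta(1, 1) prior is
        # Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / n_draws

# e.g. 100/1000 vs 130/1000 conversions gives a probability well above 0.9
```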

Alpha spending — Formal interim analysis procedures that "spend" portions of the Type I error budget at each checkpoint, maintaining the overall rate at α.
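One widely used spending function is the Lan-DeMets approximation to O'Brien-Fleming bounds, under which the cumulative error spent by information fraction t is α(t) = 2(1 − Φ(z_{α/2}/√t)). A sketch (the function name is my own):

```python
from statistics import NormalDist

def obf_alpha_spent(t, alpha=0.05):
    """Cumulative Type I error spent by information fraction t (0 < t <= 1)
    under a Lan-DeMets O'Brien-Fleming-style spending function."""
    norm = NormalDist()
    z = norm.inv_cdf(1 - alpha / 2)
    return 2 * (1 - norm.cdf(z / t ** 0.5))

# Early looks spend almost nothing; the full alpha is available only at
# t = 1.0, which is what keeps the overall error rate at the nominal level.
```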

The easiest fix is also the most discipline-demanding: set your runtime before you start, and don't look at the results until it's done.