Sample size is the minimum number of visitors (or conversions) required for an A/B test to produce a statistically reliable result. Running a test with too few visitors leads to inconclusive or misleading results. Running with too many wastes time you could spend on the next test.
Why Sample Size Matters
Small sample sizes produce noisy data. If you flip a coin 10 times and get 7 heads, that doesn't prove the coin is rigged — you just didn't flip enough. A/B testing works the same way. Early results often look dramatic but are unreliable.
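The coin example is easy to check directly. A minimal sketch using Python's standard library shows how often a fair coin produces 7 or more heads in 10 flips by pure chance:

```python
from math import comb

# Probability of 7 or more heads in 10 flips of a FAIR coin:
# sum the binomial probabilities for k = 7, 8, 9, 10 heads.
p = sum(comb(10, k) for k in range(7, 11)) / 2**10
print(round(p, 3))  # 0.172 — about 1 run in 6, from a perfectly fair coin
```

A result that occurs 17% of the time under the null is nowhere near evidence of a rigged coin, which is exactly the trap early A/B test readouts set.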
A properly calculated sample size ensures:
- Enough statistical power to detect a real improvement if one exists (typically 80% power)
- Low false positive rate (typically 5%, corresponding to 95% confidence)
- Sensitivity to the effect size you care about
How to Calculate Sample Size
The key inputs are:
- Baseline conversion rate — Your current rate (e.g., 3%)
- Minimum detectable effect (MDE) — The smallest improvement worth detecting (e.g., 10% relative lift)
- Significance level — Usually 95% confidence (alpha = 0.05)
- Statistical power — Usually 80% (beta = 0.20)
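These four inputs plug into the standard normal-approximation formula for a two-sided, two-proportion z-test. A minimal sketch (the function name is ours; the formula is the textbook one):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    # Normal-approximation sample size for a two-sided two-proportion z-test.
    p1 = baseline                          # control conversion rate
    p2 = baseline * (1 + relative_mde)     # variant rate at the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 3% baseline, 10% relative MDE -> roughly 53,000 visitors per variant
print(sample_size_per_variant(0.03, 0.10))
```

Exact numbers vary slightly between calculators depending on whether they use pooled variance or a continuity correction, but they agree to within a few percent.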
As a rough guide for a 3% baseline conversion rate (at 95% confidence and 80% power):
| MDE (Relative) | Sample Size Per Variant |
|---|---|
| 20% | ~13,000 |
| 10% | ~52,000 |
| 5% | ~205,000 |
Smaller effects require dramatically more traffic: because required sample size scales with the inverse square of the effect, halving the MDE roughly quadruples the visitors you need. This is why many teams focus on testing bigger, bolder changes rather than minor tweaks.
Common Mistakes
- Peeking at results early and stopping when you see a "winner" — this inflates your false positive rate
- Not accounting for multiple variants — Testing 5 variants means collecting the per-variant sample size five times over, and comparing several variants against control inflates false positives unless you correct for multiple comparisons
- Ignoring business cycles — Always run tests for at least one full week (ideally two) to account for day-of-week effects, regardless of sample size
Sample Size and Test Velocity
Sample size requirements are the biggest constraint on experimentation velocity. A page with 1,000 visitors per week simply cannot run the same volume of tests as a page with 100,000 visitors per week. This is one reason adaptive methods like bandit testing — which require fewer visitors to converge — are becoming popular for lower-traffic sites.
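The traffic constraint reduces to simple arithmetic. A back-of-the-envelope sketch (all numbers illustrative, using the ~52,000-per-variant figure from the table above):

```python
import math

def weeks_to_run(sample_per_variant, n_variants, weekly_visitors):
    # Round up to whole weeks so day-of-week cycles are fully covered.
    return max(1, math.ceil(sample_per_variant * n_variants / weekly_visitors))

print(weeks_to_run(52_000, 2, 100_000))  # 2 weeks on a high-traffic page
print(weeks_to_run(52_000, 2, 1_000))    # 104 weeks: effectively infeasible
```

At 1,000 visitors per week, a two-variant test targeting a 10% lift on a 3% baseline would take two years, which is why lower-traffic sites reach for larger MDEs or adaptive methods instead.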