Sample size is the minimum number of visitors (or conversions) required for an A/B test to produce a statistically reliable result. Running a test with too few visitors leads to inconclusive or misleading results. Running with too many wastes time you could spend on the next test.
Why Sample Size Matters
Small sample sizes produce noisy data. If you flip a coin 10 times and get 7 heads, that doesn't prove the coin is rigged — you just didn't flip enough. A/B testing works the same way. Early results often look dramatic but are unreliable.
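The coin example is easy to check directly. A minimal sketch using Python's standard library shows how often a fair coin produces 7 or more heads in 10 flips by pure chance:

```python
from math import comb

# Probability of 7 or more heads in 10 flips of a FAIR coin:
# sum the binomial probabilities for k = 7, 8, 9, 10 heads.
p = sum(comb(10, k) for k in range(7, 11)) / 2**10
print(round(p, 3))  # 0.172 — about 1 run in 6, from a perfectly fair coin
```

A result that occurs 17% of the time under the null is nowhere near evidence of a rigged coin, which is exactly the trap early A/B test readouts set.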
A properly calculated sample size ensures:
- Enough statistical power to detect a real improvement if one exists (typically 80% power)
- Low false positive rate (typically 5%, corresponding to 95% confidence)
- Sensitivity to the effect size you care about
How to Calculate Sample Size
The key inputs are:
- Baseline conversion rate — Your current rate (e.g., 3%)
- Minimum detectable effect (MDE) — The smallest improvement worth detecting (e.g., 10% relative lift)
- Significance level — Usually 95% confidence (alpha = 0.05)
- Statistical power — Usually 80% (beta = 0.20)
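These four inputs plug into the standard normal-approximation formula for a two-sided, two-proportion z-test. A minimal sketch (the function name is ours; the formula is the textbook one):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    # Normal-approximation sample size for a two-sided two-proportion z-test.
    p1 = baseline                          # control conversion rate
    p2 = baseline * (1 + relative_mde)     # variant rate at the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 3% baseline, 10% relative MDE -> roughly 53,000 visitors per variant
print(sample_size_per_variant(0.03, 0.10))
```

Exact numbers vary slightly between calculators depending on whether they use pooled variance or a continuity correction, but they agree to within a few percent.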
As a rough guide for a 3% baseline conversion rate (at 95% confidence and 80% power):
| MDE (Relative) | Sample Size Per Variant |
|---|---|
| 20% | ~13,000 |
| 10% | ~52,000 |
| 5% | ~205,000 |
Smaller effects require dramatically more traffic: because required sample size scales with the inverse square of the effect, halving the MDE roughly quadruples the visitors you need. This is why many teams focus on testing bigger, bolder changes rather than minor tweaks.
Common Mistakes
- Peeking at results early and stopping when you see a "winner" — this inflates your false positive rate
- Not accounting for multiple variants — Testing 5 variants means collecting the per-variant sample size five times over, and comparing several variants against control inflates false positives unless you correct for multiple comparisons
- Ignoring business cycles — Always run tests for at least one full week (ideally two) to account for day-of-week effects, regardless of sample size
Sample Size and Test Velocity
Sample size requirements are the biggest constraint on experimentation velocity. A page with 1,000 visitors per week simply cannot run the same volume of tests as a page with 100,000 visitors per week. This is one reason adaptive methods like bandit testing — which require fewer visitors to converge — are becoming popular for lower-traffic sites.
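The traffic constraint reduces to simple arithmetic. A back-of-the-envelope sketch (all numbers illustrative, using the ~52,000-per-variant figure from the table above):

```python
import math

def weeks_to_run(sample_per_variant, n_variants, weekly_visitors):
    # Round up to whole weeks so day-of-week cycles are fully covered.
    return max(1, math.ceil(sample_per_variant * n_variants / weekly_visitors))

print(weeks_to_run(52_000, 2, 100_000))  # 2 weeks on a high-traffic page
print(weeks_to_run(52_000, 2, 1_000))    # 104 weeks: effectively infeasible
```

At 1,000 visitors per week, a two-variant test targeting a 10% lift on a 3% baseline would take two years, which is why lower-traffic sites reach for larger MDEs or adaptive methods instead.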