Statistical significance is a threshold used in A/B testing and other experiments to determine whether an observed difference in performance between variants is likely due to the change you made — or simply random variation in the data.
In practical terms, it answers: "How confident are we that this result is real?"
The 95% Confidence Standard
The most commonly used threshold in CRO is 95% confidence (p < 0.05). Strictly, this means that if the change had no real effect, a difference at least as large as the one you observed would appear by chance less than 5% of the time.
Some teams use 90% confidence to move faster, accepting a higher false positive rate. High-stakes decisions — major site redesigns, pricing changes — often warrant 99% confidence.
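As a sketch of how the p-value behind these thresholds is computed, here is a two-sided two-proportion z-test using only the Python standard library (the visitor and conversion counts are made up for illustration):

```python
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))          # two-sided p-value

# Hypothetical test: 500/10,000 conversions (A) vs 580/10,000 (B)
p = two_proportion_p_value(500, 10_000, 580, 10_000)
print(f"p = {p:.4f}")
```

With these made-up numbers the lift clears the 95% bar (p < 0.05) but not the 99% bar, so a team holding a high-stakes change to 99% confidence would keep the test running.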
Why Statistical Significance Matters
Without statistical significance, you cannot reliably tell the difference between a genuine lift and random noise. The two most common errors:
False positive (Type I error): Declaring a winner when the difference was actually due to chance. Leads to shipping a change that has no real effect.
False negative (Type II error): Failing to detect a real difference because the sample size was too small. Leads to discarding a change that actually works.
Both errors cost money. Running tests to the required sample size is the primary defense against both.
Sample Size and Duration
Statistical significance is tied to sample size. The formula for minimum sample size depends on:
- Baseline conversion rate — Your current performance
- Minimum detectable effect (MDE) — How small a difference you need to detect
- Confidence level — Typically 95%
- Statistical power — Typically 80%, meaning an 80% chance of detecting a real effect if one exists
| Baseline Rate | MDE (relative) | Approx. Visitors Needed (per variant) |
|---|---|---|
| 2% | 20% | ~21,000 |
| 5% | 20% | ~8,200 |
| 10% | 20% | ~3,800 |
Higher baseline rates and larger effects require fewer visitors to reach significance.
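The figures in the table come from a standard closed-form approximation for a two-sided two-proportion z-test. A minimal sketch, assuming the 95% confidence and 80% power defaults listed above:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_relative, confidence=0.95, power=0.80):
    """Minimum visitors per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)                        # rate to detect
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # ~1.96 at 95%
    z_beta = NormalDist().inv_cdf(power)                      # ~0.84 at 80%
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(sample_size_per_variant(0.05, 0.20))   # ~8,200 visitors per variant
```

Plugging in the other table rows (2% and 10% baselines, 20% relative MDE) reproduces the remaining figures.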
Common Mistakes
Stopping early — Peeking at results and stopping as soon as you see a winning number sharply inflates your false positive rate. Commit to a sample size before the test starts.
Running too many metrics — Testing for significance across 10 metrics at once increases the probability of a false positive on at least one of them. Pre-specify your primary metric.
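The arithmetic behind the multiple-metrics problem is simple: at a 5% false positive rate per metric, ten metrics give roughly a 40% chance of at least one spurious "winner" (treating the metrics as independent is a simplifying assumption):

```python
# Chance of at least one false positive across k independent metrics,
# each tested at alpha = 0.05.
alpha, k = 0.05, 10
family_wise = 1 - (1 - alpha) ** k
print(f"P(at least one false positive across {k} metrics) = {family_wise:.1%}")

# Bonferroni correction: a stricter per-metric threshold that keeps the
# family-wise rate near the original 5%.
bonferroni = alpha / k
print(f"Bonferroni-adjusted per-metric threshold: {bonferroni}")
```

A Bonferroni-style correction is one common remedy, though pre-specifying a single primary metric avoids the problem entirely.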
Ignoring seasonality — Run tests for at least one full business cycle (typically 7–14 days) to avoid results skewed by day-of-week traffic patterns.