In A/B testing, the p-value answers a specific question: if your control and variant actually performed identically, how likely would you be to see a difference at least as large as the one you observed? A lower p-value means stronger evidence against that "no real difference" explanation.
The industry standard threshold is p < 0.05, meaning a difference this large would show up less than 5% of the time if nothing were really going on. When a test crosses this threshold, it's considered statistically significant.
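To make that definition concrete, here is a minimal sketch of the two-proportion z-test commonly used for conversion-rate comparisons. The conversion counts are made up, and ab_test_p_value is a helper name invented for this example.

```python
# A minimal sketch: two-sided p-value for an A/B test via a
# two-proportion z-test. The counts below are hypothetical.
from math import sqrt
from scipy.stats import norm

def ab_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of "no real difference".
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a difference at least this extreme under the null.
    return 2 * norm.sf(abs(z))

# Control: 500 conversions from 10,000 visitors; variant: 560 from 10,000.
print(ab_test_p_value(500, 10_000, 560, 10_000))  # ~0.058, just misses p < 0.05
```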
How to Interpret P-Values
- p = 0.01 — a result this extreme would occur only 1% of the time under pure chance. Strong evidence of a real difference.
- p = 0.05 — a result this extreme would occur 5% of the time under pure chance. The standard threshold for significance.
- p = 0.20 — a result this extreme would occur 20% of the time under pure chance. Not significant; you need more data, or the effect may not exist.
Common Misconceptions
"A p-value of 0.05 means there's a 95% chance the variant is better." Not quite. It means that if there were no real difference, you'd see a result this extreme only 5% of the time. The distinction is subtle but important.
"A non-significant p-value means the variant had no effect." No — it means you don't have enough evidence to conclude there's an effect. It could mean the sample size was too small, or the real effect is smaller than your test could detect.
P-Value and Sample Size
P-values are heavily influenced by sample size. With enough traffic, even a tiny, meaningless difference (like a 0.1% lift) can produce a statistically significant p-value. This is why experienced teams look at confidence intervals and practical significance alongside p-values.
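As a concrete illustration (with made-up traffic numbers, reusing the earlier helper): at a million visitors per arm, a 0.1 percentage point lift clears the significance bar easily, while the confidence interval shows how small the practical gain really is.

```python
# Huge sample, tiny lift: statistically significant, practically marginal.
# Hypothetical numbers: 1,000,000 visitors per arm, 5.0% vs 5.1% conversion.
from math import sqrt

n = 1_000_000
conv_a, conv_b = 50_000, 51_000
p_a, p_b = conv_a / n, conv_b / n

# Assumes the ab_test_p_value helper from the earlier sketch.
print(ab_test_p_value(conv_a, n, conv_b, n))  # ~0.001: comfortably "significant"

# A 95% confidence interval for the lift tells the practical story.
se = sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
lift = p_b - p_a
print(f"lift between {lift - 1.96 * se:.4%} and {lift + 1.96 * se:.4%}")
# ~0.04% to ~0.16% absolute: real, but possibly not worth acting on
```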
When P-Values Slow You Down
Traditional A/B tests split traffic evenly and wait until the result reaches p < 0.05, which can take weeks on low-traffic pages. This is one reason adaptive methods like multi-armed bandit testing are gaining popularity: they shift traffic toward the winner continuously rather than waiting for a single significance threshold.
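As a rough sketch of the bandit idea, here is Thompson sampling for two variants: each visitor is routed to whichever arm looks best according to a random draw from that arm's posterior, so traffic shifts toward the winner as evidence accumulates, with no single significance gate. The conversion rates and visitor counts are invented for illustration.

```python
# A minimal Thompson-sampling sketch for a two-arm test.
# In production the counts would be updated from live traffic.
import numpy as np

rng = np.random.default_rng(42)
true_rates = [0.05, 0.06]   # unknown in practice; hypothetical here
successes = np.ones(2)      # Beta(1, 1) prior for each arm
failures = np.ones(2)

for _ in range(10_000):     # one iteration per visitor
    # Draw a plausible conversion rate for each arm from its posterior,
    # then show this visitor the arm with the highest draw.
    arm = int(np.argmax(rng.beta(successes, failures)))
    converted = rng.random() < true_rates[arm]
    successes[arm] += converted
    failures[arm] += 1 - converted

# Most traffic ends up on the better arm, long before a fixed-horizon
# test would have declared significance.
print(successes + failures - 2)  # visitors routed to each arm
```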