ab-testing · statistics · cro

How to Calculate Sample Size for an A/B Test

A plain-English guide to calculating A/B test sample size — the four inputs that drive it, a worked example, why undersized tests fail, and how bandits sidestep fixed sizing.

June 9, 2026 · 5 min read · Sean Quigley, CEO, Surface AI

The single most common reason A/B tests produce misleading results is that they're too small. Calling a winner before you have enough data is how teams ship "improvements" that don't replicate. Calculating the right sample size before you start tells you how much traffic a test needs to detect a real effect — and whether the test is even worth running given your traffic.

This guide covers the four inputs that drive sample size, how they fit together, a worked example, and what to do when you don't have the volume.

The Four Inputs

Every sample-size calculation comes down to four numbers. Three are decisions you make; one you measure.

1. Baseline conversion rate

Your current conversion rate for the metric you're testing — the rate of the control. You measure this from existing data. A page converting at 3% needs a very different sample than one converting at 30%, because lower base rates produce noisier relative comparisons.

2. Minimum detectable effect (MDE)

The smallest improvement you care about detecting, expressed as a relative or absolute lift. This is the input people get wrong most often, and it has the biggest impact. A minimum detectable effect of 2% relative lift requires far more traffic than a 20% lift, because small effects are harder to distinguish from noise.

Set the MDE to the smallest lift that would actually change your decision — not an optimistic guess. Setting it too high lets you "finish" fast but blinds you to real, smaller improvements.
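Because the MDE can be quoted either way, it helps to convert a relative lift into the absolute difference it implies before you enter it into a calculator. A minimal sketch of that conversion (the function names are illustrative, not from any particular tool):

```python
def relative_to_absolute(baseline: float, relative_lift: float) -> float:
    """Absolute lift (a difference in conversion rate) implied by a relative lift."""
    return baseline * relative_lift


def target_rate(baseline: float, relative_lift: float) -> float:
    """Conversion rate the variant must reach to achieve the relative lift."""
    return baseline * (1 + relative_lift)


# A 10% relative lift on a 5% baseline is only half a percentage point.
baseline = 0.05
print(f"absolute lift: {relative_to_absolute(baseline, 0.10):.4f}")
print(f"target rate:   {target_rate(baseline, 0.10):.4f}")
```

Mixing the two up is an easy way to mis-size a test: "10%" entered as an absolute lift describes a far larger effect than "10%" relative on a low baseline.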

3. Statistical significance (alpha)

Your tolerance for false positives: concluding there's an effect when there isn't. The convention is 5% (α = 0.05), corresponding to 95% confidence. Lowering it to 1% makes the test stricter and requires more data. For background, see statistical significance.

4. Statistical power (1 − beta)

Your tolerance for false negatives — missing a real effect. The convention is 80% power, meaning that if a true effect of your MDE size exists, you'll detect it 80% of the time. Raising power to 90% increases the sample required.
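To get a feel for how much alpha and power cost, note that in the usual normal-approximation formula the required sample is proportional to the square of the sum of two z-scores, one for significance and one for power. The sketch below assumes a two-sided test and uses Python's standard library; `z_multiplier` is a hypothetical helper name.

```python
from statistics import NormalDist


def z_multiplier(alpha: float, power: float) -> float:
    """(z for two-sided alpha + z for power) squared; sample size scales with this."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) ** 2


# Compare stricter settings against the conventional 5% / 80% choice.
base = z_multiplier(alpha=0.05, power=0.80)
for alpha, power in [(0.05, 0.80), (0.01, 0.80), (0.05, 0.90)]:
    ratio = z_multiplier(alpha, power) / base
    print(f"alpha={alpha:.2f} power={power:.2f} -> {ratio:.2f}x the conventional sample")
```

Under these assumptions, tightening alpha to 1% costs roughly half again as much traffic, and raising power to 90% costs roughly a third more.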

How They Fit Together

The relationships are intuitive once you see them:

Change this input...                  | ...and required sample size
Smaller MDE (detect tinier effects)   | Increases sharply
Lower baseline conversion rate        | Increases
Stricter significance (1% vs 5%)      | Increases
Higher power (90% vs 80%)             | Increases

The dominant lever is MDE: because sample size scales roughly with the inverse square of the effect you want to detect, halving your MDE roughly quadruples the traffic you need.
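One common textbook form of the two-proportion calculation makes the inverse-square relationship visible. This is the normal-approximation version for a two-sided test; calculators differ in details such as pooled variance or continuity corrections, so treat it as a sketch rather than the exact formula behind any particular tool. Here p_1 is the baseline rate, p_2 is the rate after the MDE lift, and n is the sample per variant:

```latex
n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left[\, p_1(1-p_1) + p_2(1-p_2) \,\right]}{(p_2 - p_1)^2}
```

The effect size p_2 - p_1 appears squared in the denominator, so halving it multiplies n by about four, while alpha and power enter only through the z-scores in the numerator.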

A Worked Example

Suppose your landing page converts at 5% and you want to detect a 10% relative lift (i.e., 5% → 5.5%), at 95% confidence and 80% power.

Plugging those into a standard two-proportion sample-size formula gives roughly 31,000 visitors per variant, or about 62,000 total for a two-variant test.

Now change one thing: you decide you only need to detect a 20% relative lift (5% → 6%). The required sample drops to roughly 8,200 per variant. Same page, same confidence, but a less ambitious MDE cut the traffic need by roughly three quarters. That's the inverse-square relationship in action, and it's why the MDE decision dominates everything else.

You don't compute this by hand in practice — a sample-size calculator does it. The point of understanding the inputs is to set them honestly.
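If you want to see what such a calculator does under the hood, here is a minimal sketch using only Python's standard library. It assumes a two-sided test, the normal approximation, and unpooled variance; `sample_size_per_variant` is an illustrative name, and real calculators may use slightly different variants of the formula.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant (two-sided test, normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate the variant must reach
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# The two scenarios from the worked example: 5% baseline, 10% and 20% relative lift.
for lift in (0.10, 0.20):
    n = sample_size_per_variant(baseline=0.05, relative_mde=lift)
    print(f"{lift:.0%} relative lift: {n:,} per variant")
```

Under these assumptions the two scenarios come out at a little over 31,000 and a little under 8,200 per variant. Expect other tools to differ by a few percent depending on the variance term they use.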

Why Undersized Tests Fail

Running a test without enough traffic doesn't just mean "no result." It actively misleads:

  • False winners. With small samples, random noise can look like a large effect. You ship it, and it regresses — a textbook case of regression to the mean.
  • Missed real wins. An underpowered test that comes back "not significant" doesn't mean there's no effect — it means you didn't have the data to see it. Teams discard genuine improvements this way.
  • The peeking trap. Watching an undersized test and stopping when it briefly hits significance compounds the problem — see the peeking problem.
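The "missed real wins" problem is easy to quantify. The sketch below estimates the power of a test, meaning the chance it detects a lift that really exists, for a given sample size. It assumes a two-sided z-test with the normal approximation; `approx_power` is a hypothetical helper, and the 5% → 5.5% lift is just an illustration.

```python
from statistics import NormalDist


def approx_power(p1, p2, n_per_variant, alpha=0.05):
    """Approximate chance a two-sided z-test detects a true change from p1 to p2."""
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant) ** 0.5
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of a "significant" result in the wrong direction.
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)


# A real 5% -> 5.5% lift, tested at three different sample sizes.
for n in (2_000, 10_000, 31_000):
    print(f"n={n:>6,} per variant -> power ~ {approx_power(0.05, 0.055, n):.0%}")
```

With 2,000 visitors per variant, a genuine lift of that size is detected only about one time in ten, so "not significant" is by far the most likely outcome even though the improvement is real.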

Always run a guardrail check for sample ratio mismatch too: even a correctly sized test is invalid if the actual traffic split drifts from what you configured.
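A sample ratio mismatch check is usually a chi-square goodness-of-fit test on the visitor counts. Here is a minimal two-variant sketch using only the standard library. The function name and the 0.001 alert threshold are assumptions for illustration; teams pick their own cutoff.

```python
import math


def srm_p_value(observed_a, observed_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = observed_a + observed_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed alert level, not a universal standard

# A configured 50/50 split that actually delivered 10,300 vs 9,700 visitors.
p = srm_p_value(10_300, 9_700)
print(f"p = {p:.2g}; SRM suspected: {p < SRM_THRESHOLD}")
```

A split of 10,300 to 9,700 looks harmless to the eye, but it is very unlikely to arise by chance from a true 50/50 allocation, which is why this check is worth automating.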

What If You Don't Have the Traffic?

If the calculation says you need 60,000 visitors per variant and you get 5,000 a month, a clean fixed-horizon A/B test simply isn't viable on a reasonable timeline. You have a few options:

  1. Test bigger swings. Bold changes have larger effects, which need less traffic to detect. Don't waste a low-traffic test on a button-color tweak.
  2. Test higher up the funnel where volume is greater (more visitors reach the hero than reach checkout).
  3. Use adaptive methods. Bandit testing doesn't require a fixed sample size — it reallocates traffic continuously, which makes better use of limited volume and avoids the all-or-nothing endpoint.
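The viability check behind these options is simple arithmetic: divide the total sample by the traffic you can send into the test. A small sketch, assuming the monthly figure is total visitors and that all of them enter the experiment (`months_to_finish` is an illustrative name):

```python
import math


def months_to_finish(n_per_variant, monthly_visitors, variants=2):
    """Months of traffic needed if every visitor enters the test."""
    return math.ceil(n_per_variant * variants / monthly_visitors)


# The scenario above: 60,000 per variant at 5,000 visitors a month.
print(months_to_finish(60_000, 5_000), "months")
```

That works out to 24 months for a two-variant test, which is why a bigger swing, a higher-volume page, or an adaptive method is the realistic choice.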

For a fuller treatment of the low-volume case, see CRO for low-traffic sites.

How Autonomous Systems Sidestep Fixed Sizing

Fixed sample-size calculation assumes a fixed-horizon experiment: decide the size, run, stop, conclude. Autonomous optimization works differently. Instead of committing traffic to a predetermined split for a predetermined duration, it continuously reallocates based on accumulating evidence, so it's always making the best use of whatever traffic exists rather than waiting to reach a magic number.
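To make "continuously reallocates" concrete, here is a toy simulation of Thompson sampling, one common bandit algorithm. It is an illustration of the general idea only, not a description of how Surface AI or any other platform allocates traffic. The true conversion rates are hypothetical and deliberately exaggerated so the shift in traffic is easy to see.

```python
import random


def thompson_sampling(true_rates, visitors, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; returns visitors sent to each arm."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw a plausible conversion rate for each arm from its Beta posterior
        # (uniform Beta(1, 1) prior), then show the arm with the highest draw.
        draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical true rates for a control and a variant.
pulls = thompson_sampling(true_rates=[0.05, 0.15], visitors=5_000)
print(f"visitors per arm: {pulls}")
```

Early on the two arms receive similar traffic; as conversions accumulate, the stronger arm wins more of the draws and ends up with most of the visitors, without anyone fixing a sample size in advance.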

That doesn't repeal statistics — you still need enough total volume to learn anything. But it removes the brittle "did we hit our sample size?" gate. Platforms like Surface AI handle this allocation automatically, deciding when there's enough evidence to favor a variant, so teams don't have to run a sample-size calculation before every test.