CUPED stands for Controlled-experiment Using Pre-Experiment Data. It's a technique developed at Microsoft that reduces the variance in your test metric by removing noise that pre-experiment behavior already explains — making your A/B tests more sensitive without requiring more traffic.
The core insight: much of the variation in user behavior during an experiment is predictable from how those same users behaved before the experiment started. If you can measure and remove that pre-existing noise, you end up with a tighter, cleaner signal from the experiment itself.
## Standard A/B vs. CUPED
| | Standard A/B Test | CUPED |
|---|---|---|
| Variance reduction | None | 20–60% typical |
| Sample size impact | Baseline | 20–60% smaller |
| Data required | Experiment period only | Pre-experiment period + experiment |
| Implementation complexity | Low | Medium |
| Works best for | Any metric | High-autocorrelation metrics |
## How It Works

1. Collect pre-experiment data. Before the test starts, record the same metric you plan to measure, typically over the past 1–4 weeks, for each user who will enter the experiment.
2. Compute the adjustment coefficient. Fit a regression of the in-experiment outcome on the pre-experiment covariate. The slope is θ = Cov(outcome, pre_metric) / Var(pre_metric), which captures how much of the outcome is predicted by prior behavior.
3. Adjust each user's outcome. For each user in the experiment, compute
   `adjusted_outcome = outcome - θ × (pre_metric - mean(pre_metric))`
   where θ is the regression coefficient from step 2; it scales how much weight to give the pre-experiment correction.
4. Run your test on adjusted outcomes. Use the adjusted values exactly as you would the raw metric: compute means per variant, run a t-test or z-test, report confidence intervals. The sketch below walks through all four steps.
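Here is a minimal sketch of the four steps in Python, using NumPy and SciPy on simulated data. All names and numbers are illustrative, not from a real experiment:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate 10,000 users whose in-experiment metric is partly
# explained by their pre-experiment metric, plus a small lift.
n = 10_000
pre = rng.gamma(2.0, 10.0, size=n)                       # pre-experiment metric
treat = rng.integers(0, 2, size=n)                       # 0 = control, 1 = treatment
y = 0.8 * pre + rng.normal(0, 8, size=n) + 0.5 * treat   # in-experiment metric

# Step 2: theta is the OLS slope of the outcome on the covariate.
theta = np.cov(y, pre)[0, 1] / np.var(pre, ddof=1)

# Step 3: subtract the part of y that pre-experiment behavior explains.
y_cuped = y - theta * (pre - pre.mean())

# Step 4: run the identical t-test on raw vs. adjusted outcomes.
for label, metric in (("raw", y), ("CUPED", y_cuped)):
    t, p = stats.ttest_ind(metric[treat == 1], metric[treat == 0])
    diff = metric[treat == 1].mean() - metric[treat == 0].mean()
    print(f"{label:>5}: diff={diff:+.3f}  t={t:+.2f}  p={p:.4f}")
```

Fitting θ pooled across both variants is the common choice; because the covariate was measured before random assignment, it is independent of treatment, so the adjustment cannot bias the comparison.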
The adjustment doesn't change the expected value of the metric; it only reduces the variance. This means your point estimates stay the same, but your confidence intervals shrink — which is equivalent to needing less data to reach the same level of confidence.
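To see why, note that the correction term is centered, so its expectation is zero, and with θ fit as in step 2 the adjustment removes exactly the share of variance that the covariate explains:

`E[adjusted_outcome] = E[outcome]`
`Var(adjusted_outcome) = (1 − ρ²) × Var(outcome)`

where ρ is the correlation between the pre-experiment and in-experiment metric. A correlation of 0.7, for instance, cuts variance by roughly 49%, consistent with the 20–60% range in the table above.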
## When to Use CUPED
CUPED is most valuable when:
- A stable pre-experiment period exists (users have prior sessions or transaction history)
- Your metric has high autocorrelation, meaning past behavior predicts future behavior. Revenue per user and session engagement typically qualify; one-time signup events typically don't (the check after this list makes this concrete)
- You're traffic-constrained and need to run faster experiments
- Your testing platform supports it (Statsig, Eppo, and the in-house experimentation platforms at large companies often have it built in)
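You can check the autocorrelation point before committing. A sketch, under the assumption that you can pull the same metric for the same users over two adjacent pre-launch windows (the array names are hypothetical):

```python
import numpy as np

def expected_variance_reduction(window_1: np.ndarray, window_2: np.ndarray) -> float:
    """Fraction of variance CUPED can remove: the squared correlation
    of the metric across two adjacent time windows for the same users."""
    rho = np.corrcoef(window_1, window_2)[0, 1]
    return rho ** 2
```

A result of 0.5 means roughly half the variance, and therefore roughly half the required sample size, would go away; a result near zero means CUPED isn't worth the plumbing for that metric.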
## Limitations
- Requires historical data. New users or new products without prior behavioral data can't benefit from CUPED. You need a lookback window long enough to be meaningful
- Adds implementation complexity. You need to join pre-experiment data to experiment assignments, which requires an engineering step beyond standard A/B test setup
- Metric-dependent gains. If your outcome metric has low autocorrelation, meaning past behavior doesn't predict the metric well, CUPED provides little benefit. Test it empirically, for example with the correlation check above, before relying on it
- Not a substitute for sample size. CUPED reduces the sample you need, but doesn't eliminate the need for sufficient traffic. At very low volumes, you still need to combine it with other strategies like Bayesian testing or qualitative research
For teams running experiments at the margins of statistical viability, CUPED is one of the highest-ROI investments available — especially when paired with sequential testing for valid continuous monitoring.