Epsilon-Greedy | Surface AI Hub

Epsilon-greedy is the most intuitive way to solve the multi-armed bandit problem. You serve the current best-performing variant most of the time (this is the "greedy" part), but for a small fixed fraction of visitors — epsilon, often written ε — you pick a variant at random to keep learning.

If ε = 0.1, then 90% of visitors see whichever variant currently has the highest conversion rate, and 10% are randomly assigned across all variants to make sure no option is neglected.

How It Works

For each visitor:

Flip a weighted coin. With probability ε, explore; otherwise, exploit.
Explore — Choose a variant uniformly at random.
Exploit — Choose the variant with the highest observed conversion rate so far.
Update the chosen variant's running conversion estimate with the outcome.

The single parameter ε controls the exploration–exploitation tradeoff directly: higher ε learns faster but spends more traffic on losers; lower ε commits faster but risks getting stuck on a variant that only looked best by chance early on.

Variations

Decaying epsilon — Start with a high ε and shrink it over time, so the algorithm explores aggressively at first and exploits more as confidence grows.
Epsilon-first — Spend an initial pure-exploration phase (a fixed ε share of total traffic) before switching to pure exploitation. This is essentially an A/B test followed by a hard cutover.

Epsilon-Greedy vs. Thompson Sampling

	Epsilon-Greedy	Thompson Sampling
Exploration	Fixed rate ε (or scheduled)	Adaptive, proportional to uncertainty
Tuning	Must choose ε	None
Uncertainty-aware	No — explores uniformly	Yes — explores promising arms more
Implementation	Trivial	Moderate

The key weakness of basic epsilon-greedy is that its exploration is "dumb": during exploration it spreads traffic evenly across all alternatives, including ones already shown to be clearly worse. Uncertainty-aware methods like Thompson sampling concentrate exploration on variants that still might be the best.

When to Use It

You want a simple, transparent baseline that is easy to implement and explain.
You're prototyping a bandit system and want a sanity check before adopting a more sophisticated allocator.
Your variants have clearly separated performance where crude exploration is good enough.

Limitations

Wastes exploration traffic on variants already known to be poor.
The right ε is hard to pick and the best value changes as the test matures — which is why decaying schedules exist.
Not uncertainty-aware, so it converges more slowly than Bayesian approaches on close calls.