Epsilon-greedy is the most intuitive way to solve the multi-armed bandit problem. You serve the current best-performing variant most of the time (this is the "greedy" part), but for a small fixed fraction of visitors — epsilon, often written ε — you pick a variant at random to keep learning.
If ε = 0.1, then 90% of visitors see whichever variant currently has the highest conversion rate, and 10% are randomly assigned across all variants to make sure no option is neglected.
How It Works
For each visitor:
- Flip a weighted coin. With probability ε, explore; otherwise, exploit.
- Explore — Choose a variant uniformly at random.
- Exploit — Choose the variant with the highest observed conversion rate so far.
- Update the chosen variant's running conversion estimate with the outcome.
The single parameter ε controls the exploration–exploitation tradeoff directly: higher ε learns faster but spends more traffic on losers; lower ε commits faster but risks getting stuck on a variant that only looked best by chance early on.
Variations
- Decaying epsilon — Start with a high ε and shrink it over time, so the algorithm explores aggressively at first and exploits more as confidence grows.
- Epsilon-first — Spend an initial pure-exploration phase (a fixed ε share of total traffic) before switching to pure exploitation. This is essentially an A/B test followed by a hard cutover.
Epsilon-Greedy vs. Thompson Sampling
| Epsilon-Greedy | Thompson Sampling | |
|---|---|---|
| Exploration | Fixed rate ε (or scheduled) | Adaptive, proportional to uncertainty |
| Tuning | Must choose ε | None |
| Uncertainty-aware | No — explores uniformly | Yes — explores promising arms more |
| Implementation | Trivial | Moderate |
The key weakness of basic epsilon-greedy is that its exploration is "dumb": during exploration it spreads traffic evenly across all alternatives, including ones already shown to be clearly worse. Uncertainty-aware methods like Thompson sampling concentrate exploration on variants that still might be the best.
When to Use It
- You want a simple, transparent baseline that is easy to implement and explain.
- You're prototyping a bandit system and want a sanity check before adopting a more sophisticated allocator.
- Your variants have clearly separated performance where crude exploration is good enough.
Limitations
- Wastes exploration traffic on variants already known to be poor.
- The right ε is hard to pick and the best value changes as the test matures — which is why decaying schedules exist.
- Not uncertainty-aware, so it converges more slowly than Bayesian approaches on close calls.