Epsilon-Greedy

A simple multi-armed bandit strategy that serves the best-performing variant most of the time but reserves a fixed fraction of traffic (epsilon) for random exploration of the alternatives.

Epsilon-greedy is the most intuitive way to solve the multi-armed bandit problem. You serve the current best-performing variant most of the time (this is the "greedy" part), but for a small fixed fraction of visitors — epsilon, often written ε — you pick a variant at random to keep learning.

If ε = 0.1, then 90% of visitors see whichever variant currently has the highest conversion rate, and 10% are randomly assigned across all variants to make sure no option is neglected.

How It Works

For each visitor:

  1. Flip a weighted coin. With probability ε, explore; otherwise, exploit.
  2. Explore — Choose a variant uniformly at random.
  3. Exploit — Choose the variant with the highest observed conversion rate so far.
  4. Update the chosen variant's running conversion estimate with the outcome.

The single parameter ε controls the exploration–exploitation tradeoff directly: higher ε learns faster but spends more traffic on losers; lower ε commits faster but risks getting stuck on a variant that only looked best by chance early on.

Variations

  • Decaying epsilon — Start with a high ε and shrink it over time, so the algorithm explores aggressively at first and exploits more as confidence grows.
  • Epsilon-first — Spend an initial pure-exploration phase (a fixed ε share of total traffic) before switching to pure exploitation. This is essentially an A/B test followed by a hard cutover.

Epsilon-Greedy vs. Thompson Sampling

Epsilon-GreedyThompson Sampling
ExplorationFixed rate ε (or scheduled)Adaptive, proportional to uncertainty
TuningMust choose εNone
Uncertainty-awareNo — explores uniformlyYes — explores promising arms more
ImplementationTrivialModerate

The key weakness of basic epsilon-greedy is that its exploration is "dumb": during exploration it spreads traffic evenly across all alternatives, including ones already shown to be clearly worse. Uncertainty-aware methods like Thompson sampling concentrate exploration on variants that still might be the best.

When to Use It

  • You want a simple, transparent baseline that is easy to implement and explain.
  • You're prototyping a bandit system and want a sanity check before adopting a more sophisticated allocator.
  • Your variants have clearly separated performance where crude exploration is good enough.

Limitations

  • Wastes exploration traffic on variants already known to be poor.
  • The right ε is hard to pick and the best value changes as the test matures — which is why decaying schedules exist.
  • Not uncertainty-aware, so it converges more slowly than Bayesian approaches on close calls.