Thompson Sampling

A Bayesian algorithm for multi-armed bandit problems that allocates traffic to each variant in proportion to the probability that it is the best — balancing exploration and exploitation automatically.

Thompson sampling is a strategy for solving the multi-armed bandit problem: given several variants with unknown conversion rates, how do you allocate traffic so that you maximize total conversions while still learning which variant is best?

Instead of splitting traffic evenly (as an A/B test does) or always serving the current leader (which risks locking in a false winner), Thompson sampling serves each variant in proportion to the probability that it is the best one. As evidence accumulates, that probability shifts, and traffic follows.
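
The "probability that it is the best" can be made concrete with a small simulation. The sketch below is illustrative only: it assumes binary conversions, a uniform Beta(1, 1) prior for every variant, and made-up counts. The function name prob_best is hypothetical. It estimates each variant's chance of being best by repeatedly drawing a conversion rate from every variant's posterior and counting how often each one comes out on top.

```python
import random


def prob_best(stats, draws=10_000, seed=0):
    """Estimate P(variant is best) by Monte Carlo over Beta posteriors.

    stats maps a variant name to (conversions, visitors).
    Assumption: a uniform Beta(1, 1) prior on every conversion rate.
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in stats}
    for _ in range(draws):
        # One plausible conversion rate per variant, drawn from its posterior.
        sampled = {
            name: rng.betavariate(1 + conv, 1 + visits - conv)
            for name, (conv, visits) in stats.items()
        }
        wins[max(sampled, key=sampled.get)] += 1
    return {name: count / draws for name, count in wins.items()}


# Made-up counts: (conversions, visitors) per variant.
shares = prob_best({"A": (12, 200), "B": (20, 200), "C": (9, 200)})
print(shares)
```

Under these assumptions, the returned shares approximate the long-run traffic split Thompson sampling would produce if the posteriors were frozen at this point.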

How It Works

Thompson sampling is a Bayesian method. For each variant it maintains a probability distribution over the variant's true conversion rate (for binary convert/no-convert outcomes, typically a Beta distribution), then repeats a simple loop for every visitor:

  1. Sample — Draw one random conversion rate from each variant's current distribution.
  2. Serve — Show the visitor the variant with the highest sampled value.
  3. Observe — Record whether the visitor converted.
  4. Update — Revise that variant's distribution using the new observation.
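
The four steps above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it assumes binary conversions modeled with a Beta distribution per variant, a uniform Beta(1, 1) starting prior, and simulated visitors whose "true" conversion rates are invented for the example. The class name BetaBernoulliSampler is hypothetical.

```python
import random


class BetaBernoulliSampler:
    """Thompson sampling for binary outcomes (sketch, Beta(1, 1) prior assumed)."""

    def __init__(self, variants, seed=None):
        self.rng = random.Random(seed)
        # [alpha, beta] per variant: 1 + conversions, 1 + non-conversions.
        self.params = {name: [1.0, 1.0] for name in variants}

    def choose(self):
        # Steps 1 and 2: sample a rate per variant, serve the highest sample.
        samples = {
            name: self.rng.betavariate(a, b)
            for name, (a, b) in self.params.items()
        }
        return max(samples, key=samples.get)

    def update(self, variant, converted):
        # Step 4: fold the observation into the served variant's posterior.
        self.params[variant][0 if converted else 1] += 1.0


# Simulated traffic; these true rates are invented for illustration.
true_rates = {"A": 0.03, "B": 0.05, "C": 0.12}
sampler = BetaBernoulliSampler(true_rates, seed=42)
world = random.Random(7)
served = {name: 0 for name in true_rates}

for _ in range(5000):
    variant = sampler.choose()
    converted = world.random() < true_rates[variant]  # Step 3: observe.
    sampler.update(variant, converted)
    served[variant] += 1

print(served)
```

In a run like this, most of the 5000 simulated visitors should end up on the variant with the highest true rate, while the weaker variants receive a small and shrinking share.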

Early on, the distributions are wide and uncertain, so sampling produces a lot of variety — that is the exploration. As data accumulates the distributions sharpen, the best variant gets drawn most often, and traffic concentrates on it — that is the exploitation. The balance is automatic; there is no exploration rate to tune, unlike epsilon-greedy.
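
How quickly the distributions sharpen can be seen from the spread of the posterior. The sketch below assumes a Beta posterior with a Beta(1, 1) prior and a hypothetical observed conversion rate of 10%. It prints the posterior standard deviation at increasing sample sizes, using the standard variance formula for the Beta distribution.

```python
import math


def beta_std(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return math.sqrt(alpha * beta / (total ** 2 * (total + 1)))


widths = []
for visitors in (10, 100, 1_000, 10_000):
    conversions = visitors // 10  # hypothetical 10% observed rate
    # Posterior under the assumed Beta(1, 1) prior.
    alpha, beta = 1 + conversions, 1 + visitors - conversions
    widths.append(beta_std(alpha, beta))
    print(visitors, round(widths[-1], 4))
```

The spread falls roughly with the square root of the number of visitors, which is why samples from a well-observed variant rarely stray far from its estimated rate.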

Thompson Sampling vs. Other Allocation Strategies

                      A/B Test        Epsilon-Greedy              Thompson Sampling
Traffic allocation    Fixed (even)    Mostly leader + random ε    Proportional to P(best)
Exploration           None (fixed)    Constant rate ε             Adaptive, decays naturally
Tuning required       Sample size     Choose ε                    None
Handles uncertainty   No              Crudely                     Explicitly (Bayesian)
Opportunity cost      High            Medium                      Low
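
For contrast with Thompson sampling, here is a minimal sketch of the epsilon-greedy rule from the comparison above. The function name and the (conversions, visitors) data layout are assumptions made for illustration. Note that the exploration rate epsilon is a fixed parameter that someone has to choose, and that random exploration ignores how uncertain each estimate is.

```python
import random


def epsilon_greedy_choice(stats, epsilon=0.1, rng=random):
    """Pick a variant: random with probability epsilon, else the current leader.

    stats maps a variant name to (conversions, visitors).
    """
    if rng.random() < epsilon:
        return rng.choice(list(stats))  # explore uniformly at random

    def observed_rate(name):
        conversions, visitors = stats[name]
        return conversions / visitors if visitors else 0.0

    return max(stats, key=observed_rate)  # exploit the observed leader


example_stats = {"A": (12, 200), "B": (20, 200), "C": (9, 200)}
print(epsilon_greedy_choice(example_stats, epsilon=0.1, rng=random.Random(0)))
```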

When to Use Thompson Sampling

  • You want to minimize the cost of testing — fewer visitors are sent to losing variants than in a fixed split.
  • You are running continuous optimization rather than a one-time experiment with a fixed endpoint.
  • You have three or more variants and don't want to manually manage exploration.
  • You value an approach with few knobs to tune — it works well out of the box.

Limitations

  • Less interpretable than a clean A/B test. Because allocation adapts to the observed data, each variant's sample size depends on its own outcomes, so standard fixed-sample confidence intervals and significance tests do not apply directly at the end.
  • Sensitive to non-stationarity. If conversion rates drift (seasonality, a campaign ending), a naïve implementation can over-commit to a variant that was best in the past. Production systems add decay or windowing to stay adaptive.
  • Needs reasonable volume. With very little traffic the distributions stay wide and allocation stays noisy.
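
One common way to add the decay mentioned under non-stationarity is to shrink every variant's accumulated evidence slightly before each update, so old observations fade. The sketch below is one possible approach under stated assumptions, not a description of any particular platform: it assumes Beta posteriors stored as [alpha, beta] pairs with a Beta(1, 1) prior, and the function name and the decay value are hypothetical.

```python
def decayed_update(params, served, converted, decay=0.99):
    """Discount old evidence toward the Beta(1, 1) prior, then add one observation.

    params maps a variant name to [alpha, beta]. With a constant decay, the
    evidence a variant can hold is capped at about 1 / (1 - decay) observations.
    """
    for name, (alpha, beta) in list(params.items()):
        params[name] = [
            1.0 + decay * (alpha - 1.0),
            1.0 + decay * (beta - 1.0),
        ]
    params[served][0 if converted else 1] += 1.0
    return params


# Illustration: variant A converts on every one of 10,000 visits.
params = {"A": [1.0, 1.0], "B": [1.0, 1.0]}
for _ in range(10_000):
    decayed_update(params, "A", converted=True, decay=0.99)

print(params["A"][0] - 1.0)  # accumulated evidence for A, bounded by the decay
```

Because the evidence is capped, the posteriors never become so narrow that a change in the underlying rates goes unnoticed. The trade-off is a permanently higher level of exploration.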

Thompson sampling is one of the workhorse algorithms behind autonomous optimization platforms, where the goal is to keep earning conversions while learning, rather than pausing to run discrete experiments.