Contextual Bandit

An adaptive testing algorithm that selects the best variant for each individual user based on their context — like device, location, or behavior — rather than allocating traffic uniformly.

A contextual bandit is a machine learning-based experimentation method that learns which variant to show each user based on observable features about that user — their device type, location, referral source, behavioral history, or any other signal available at the time of the decision.

The name comes from the 'multi-armed bandit' problem in statistics: you have multiple options (arms), each with unknown reward distributions, and you want to maximize total reward by balancing exploration (trying different arms) with exploitation (using the best-known arm). A contextual bandit extends this by making the optimal arm depend on the context of each individual decision.
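The explore/exploit trade-off can be made concrete with a minimal epsilon-greedy learner. This is an illustrative sketch, not a production implementation: the class name, the arms, and the epsilon value are all assumptions chosen for the example.

```python
import random

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy multi-armed bandit (illustrative sketch)."""

    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {a: 0 for a in self.arms}   # pulls per arm
        self.values = {a: 0.0 for a in self.arms}  # running mean reward per arm

    def select(self):
        # Explore with probability epsilon; otherwise exploit the best-known arm.
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        return max(self.arms, key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental update of the running mean reward for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

Over many pulls, the arm with the higher true reward rate accumulates both a better estimate and most of the traffic, while the small epsilon keeps a trickle of exploration going so the estimates never go stale.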

How It Differs from Multi-Armed Bandit

A standard multi-armed bandit assumes the best variant is the same for all users and adapts traffic allocation globally. A contextual bandit learns a different best variant for different user segments — it personalizes the decision.
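The simplest way to see the difference is a sketch that keeps a separate epsilon-greedy learner per discrete context key. Real systems generalize across contexts with a model rather than tabulating every combination, but this shows the core idea: the chosen arm depends on the context. All names and values here are illustrative assumptions.

```python
import random
from collections import defaultdict

class ContextualEpsilonGreedy:
    """Sketch: independent epsilon-greedy statistics per (context, arm) pair."""

    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        # (context, arm) -> [pull count, running mean reward]
        self.stats = defaultdict(lambda: [0, 0.0])

    def select(self, context):
        # Same explore/exploit rule as a plain bandit, but the "best-known
        # arm" is evaluated within this user's context.
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        return max(self.arms, key=lambda a: self.stats[(context, a)][1])

    def update(self, context, arm, reward):
        count, mean = self.stats[(context, arm)]
        count += 1
        mean += (reward - mean) / count
        self.stats[(context, arm)] = [count, mean]
```

Note the cost this sketch makes visible: each context key learns from scratch, which is why contextual bandits need substantially more data than a global bandit.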

A/B Test vs. Multi-Armed Bandit vs. Contextual Bandit

                            A/B Test         Multi-Armed Bandit   Contextual Bandit
Traffic allocation          Fixed (50/50)    Adaptive (global)    Adaptive (per user)
Personalization             None             None                 Yes
Learning signal             Post-experiment  Continuous           Continuous
Data required               Moderate         Moderate             High
Implementation complexity   Low              Medium               High
Interpretability            High             Medium               Low

A Real Example

Suppose you're testing two CTAs: 'Start free trial' vs. 'Book a demo'. A standard A/B test might find 'Start free trial' wins by 8% overall. But a contextual bandit might learn:

  • Mobile users convert better on 'Start free trial'
  • Users arriving from high-intent search terms ('pricing', 'alternatives') convert better on 'Book a demo'
  • Returning visitors who've already seen the pricing page convert better on 'Book a demo'

The bandit serves each user the variant that's predicted to be best for them specifically, rather than forcing everyone through the globally optimal option.
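A common way to implement this with real-valued context features is a LinUCB-style algorithm: each arm keeps a linear model of reward given the context vector, plus an upper-confidence bonus that drives exploration where the model is still uncertain. The sketch below is a simplified version of the disjoint LinUCB idea; the feature encoding (mobile vs. high-intent search) mirrors the example above and is purely illustrative.

```python
import numpy as np

class LinUCB:
    """Simplified disjoint LinUCB sketch: a ridge-regression reward model
    per arm, scored with an upper confidence bound for exploration."""

    def __init__(self, arms, n_features, alpha=0.5):
        self.alpha = alpha
        self.A = {a: np.eye(n_features) for a in arms}    # per-arm design matrix
        self.b = {a: np.zeros(n_features) for a in arms}  # per-arm reward vector

    def select(self, x):
        # x is the context feature vector, e.g. [is_mobile, is_high_intent].
        best, best_score = None, -np.inf
        for arm, A in self.A.items():
            A_inv = np.linalg.inv(A)
            theta = A_inv @ self.b[arm]                   # estimated reward weights
            score = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            if score > best_score:
                best, best_score = arm, score
        return best

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

Because the confidence bonus shrinks as an arm accumulates observations in a region of context space, the policy explores under-served segments early and settles into per-context exploitation, which is exactly the behavior described in the example.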

When to Use a Contextual Bandit

Contextual bandits are worth considering when:

  • You have sufficient traffic to train a reliable model — typically tens of thousands of users per context segment, not hundreds
  • You have meaningful feature signals available at decision time (device, session behavior, acquisition source, account attributes)
  • Your goal is ongoing personalization, not a one-time experiment conclusion
  • You're already running multi-armed bandits and want to add a personalization layer

Limitations

  • Requires ML infrastructure. You need a feature pipeline, a model training loop, and a serving layer that can make real-time predictions at the point of variant assignment
  • More data to train reliably. The model needs to learn separate reward functions for different context combinations. Sparse data in some segments means poor decisions in those segments
  • Low interpretability. Unlike an A/B test with a clear winner and a confidence interval, a contextual bandit produces a model. Understanding why it makes the decisions it does requires additional analysis
  • Harder to audit. If a bandit is serving different users different experiences, detecting errors or bias in the allocation is more complex than inspecting a simple A/B split

For most teams, the right progression is: A/B tests → multi-armed bandit → contextual bandit. Each step adds power and personalization at the cost of complexity and data requirements. Jump ahead only when the prior step has hit its ceiling and you have the infrastructure to support the next one.