A contextual bandit is a machine learning-based experimentation method that learns which variant to show each user based on observable features of that user: their device type, location, referral source, behavioral history, or any other signal available at the time of the decision.
The name comes from the 'multi-armed bandit' problem in statistics: you have multiple options (arms), each with unknown reward distributions, and you want to maximize total reward by balancing exploration (trying different arms) with exploitation (using the best-known arm). A contextual bandit extends this by making the optimal arm depend on the context of each individual decision.
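To make the exploration/exploitation trade-off concrete, here's a minimal epsilon-greedy sketch, one common bandit policy (Thompson sampling and UCB are standard alternatives). The arm names and the 10% exploration rate are illustrative, not a recommendation:

```python
import random

# A minimal epsilon-greedy bandit (illustrative; arm names are hypothetical).
arms = ['start_free_trial', 'book_a_demo']
counts = {arm: 0 for arm in arms}    # times each arm has been pulled
values = {arm: 0.0 for arm in arms}  # running mean reward per arm
epsilon = 0.1                        # fraction of traffic spent exploring

def choose_arm():
    if random.random() < epsilon:
        return random.choice(arms)             # explore: try a random arm
    return max(arms, key=lambda a: values[a])  # exploit: best-known arm

def record_reward(arm, reward):
    counts[arm] += 1
    # Incremental mean update: nudge the estimate toward the new observation
    values[arm] += (reward - values[arm]) / counts[arm]
```

The policy spends a small, fixed fraction of traffic exploring and routes the rest to the current best estimate, so traffic allocation adapts as rewards come in.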
How It Differs from Multi-Armed Bandit
A standard multi-armed bandit assumes the best variant is the same for all users and adapts traffic allocation globally. A contextual bandit learns a different best variant for different user segments — it personalizes the decision.
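In code, the difference is where the reward estimates live. A minimal tabular sketch, with hypothetical segment keys (a production system would use a feature vector and a model rather than a lookup table):

```python
from collections import defaultdict

arms = ['start_free_trial', 'book_a_demo']

# Multi-armed bandit: one reward estimate per arm, shared by every user.
global_values = {arm: 0.0 for arm in arms}

def global_best():
    return max(arms, key=lambda arm: global_values[arm])

# Contextual bandit, simplest tabular form: one estimate per (segment, arm),
# so different segments can converge on different winners.
segment_values = defaultdict(float)  # keyed by (segment, arm)

def contextual_best(segment):
    return max(arms, key=lambda arm: segment_values[(segment, arm)])

# contextual_best('mobile') and contextual_best('desktop') are learned
# independently and may disagree; global_best() gives one answer for everyone.
```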
A/B Test vs. Multi-Armed Bandit vs. Contextual Bandit
| | A/B Test | Multi-Armed Bandit | Contextual Bandit |
|---|---|---|---|
| Traffic allocation | Fixed (e.g., 50/50) | Adaptive (global) | Adaptive (per user) |
| Personalization | None | None | Yes |
| Learning signal | Post-experiment | Continuous | Continuous |
| Data required | Moderate | Moderate | High |
| Implementation complexity | Low | Medium | High |
| Interpretability | High | Medium | Low |
A Real Example
Suppose you're testing two CTAs: 'Start free trial' vs. 'Book a demo'. A standard A/B test might find 'Start free trial' wins by 8% overall. But a contextual bandit might learn:
- Mobile users convert better on 'Start free trial'
- Users arriving from high-intent search terms ('pricing', 'alternatives') convert better on 'Book a demo'
- Returning visitors who've already seen the pricing page convert better on 'Book a demo'
The bandit serves each user the variant that's predicted to be best for them specifically, rather than forcing everyone through the globally optimal option.
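As a sketch of how such a policy can be implemented, here is a minimal version of disjoint LinUCB (Li et al., 2010), a standard contextual bandit algorithm. The three binary features and the exploration parameter are illustrative assumptions, not part of the example above:

```python
import numpy as np

arms = ['start_free_trial', 'book_a_demo']
d = 3        # context: [is_mobile, high_intent_search, seen_pricing_page]
alpha = 1.0  # exploration strength (higher = more exploration)

A = {arm: np.eye(d) for arm in arms}    # per-arm Gram matrix (ridge prior)
b = {arm: np.zeros(d) for arm in arms}  # per-arm accumulated reward vector

def choose(x):
    """Pick the arm with the highest upper confidence bound for context x."""
    def ucb(arm):
        A_inv = np.linalg.inv(A[arm])
        theta = A_inv @ b[arm]  # ridge-regression estimate of the arm's payoff
        return theta @ x + alpha * np.sqrt(x @ A_inv @ x)
    return max(arms, key=ucb)

def update(arm, x, reward):
    A[arm] += np.outer(x, x)
    b[arm] += reward * x

# Returning mobile visitor who has already seen the pricing page:
x = np.array([1.0, 0.0, 1.0])
variant = choose(x)  # serve it, observe a conversion (1) or not (0), then:
update(variant, x, reward=1.0)
```

Each arm gets its own linear model of reward as a function of context, plus a confidence bonus that shrinks as that arm accumulates data in similar contexts. That bonus is the same exploration/exploitation balance from earlier, now applied per context.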
When to Use a Contextual Bandit
Contextual bandits are worth considering when:
- You have sufficient traffic to train a reliable model — typically tens of thousands of users per context segment, not hundreds
- You have meaningful feature signals available at decision time (device, session behavior, acquisition source, account attributes)
- Your goal is ongoing personalization, not a one-time experiment conclusion
- You're already running multi-armed bandits and want to add a personalization layer
Limitations
- Requires ML infrastructure. You need a feature pipeline, a model training loop, and a serving layer that can make real-time predictions at the point of variant assignment (a rough sketch of this serving path follows this list)
- More data to train reliably. The model needs to learn separate reward functions for different context combinations. Sparse data in some segments means poor decisions in those segments
- Low interpretability. Unlike an A/B test with a clear winner and a confidence interval, a contextual bandit produces a model. Understanding why it makes the decisions it does requires additional analysis
- Harder to audit. If a bandit is serving different users different experiences, detecting errors or bias in the allocation is more complex than inspecting a simple A/B split
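To make the first limitation concrete, the request path has to look roughly like the sketch below. Every name here is a hypothetical stub standing in for a real feature store and trained policy:

```python
# Hypothetical shape of the serving path; every name here is an
# illustrative stub, not a real API.
def lookup_features(user_id):
    # Stand-in for a feature pipeline / feature store lookup.
    return {'is_mobile': hash(user_id) % 2 == 0}

def policy_choose(features):
    # Stand-in for a trained contextual bandit policy.
    return 'start_free_trial' if features['is_mobile'] else 'book_a_demo'

def assign_variant(user_id):
    features = lookup_features(user_id)  # context must exist at decision time
    variant = policy_choose(features)    # real-time prediction per request
    # A real system would log (user, context, variant) here so the eventual
    # reward can be joined back in for the training loop.
    return variant

print(assign_variant('user-123'))
```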
For most teams, the right progression is: A/B tests → multi-armed bandit → contextual bandit. Each step adds power and personalization at the cost of complexity and data requirements. Jump ahead only when the prior step has hit its ceiling and you have the infrastructure to support the next one.