A contextual bandit is a machine learning-based experimentation method that learns which variant to show each user based on observable features of that user: their device type, location, referral source, behavioral history, or any other signal available at the time of the decision.
The name comes from the 'multi-armed bandit' problem in statistics: you have multiple options (arms), each with unknown reward distributions, and you want to maximize total reward by balancing exploration (trying different arms) with exploitation (using the best-known arm). A contextual bandit extends this by making the optimal arm depend on the context of each individual decision.
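To make the exploration/exploitation trade-off concrete, here's a minimal epsilon-greedy sketch, one common bandit policy (Thompson sampling and UCB are standard alternatives). The arm names and the 10% exploration rate are illustrative, not a recommendation:

```python
import random

# A minimal epsilon-greedy bandit (illustrative; arm names are hypothetical).
arms = ['start_free_trial', 'book_a_demo']
counts = {arm: 0 for arm in arms}    # times each arm has been pulled
values = {arm: 0.0 for arm in arms}  # running mean reward per arm
epsilon = 0.1                        # fraction of traffic spent exploring

def choose_arm():
    if random.random() < epsilon:
        return random.choice(arms)             # explore: try a random arm
    return max(arms, key=lambda a: values[a])  # exploit: best-known arm

def record_reward(arm, reward):
    counts[arm] += 1
    # Incremental mean update: nudge the estimate toward the new observation
    values[arm] += (reward - values[arm]) / counts[arm]
```

The policy spends a small, fixed fraction of traffic exploring and routes the rest to the current best estimate, so traffic allocation adapts as rewards come in.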
How It Differs from Multi-Armed Bandit
A standard multi-armed bandit assumes the best variant is the same for all users and adapts traffic allocation globally. A contextual bandit learns a different best variant for different user segments — it personalizes the decision.
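In code, the difference is where the reward estimates live. A minimal tabular sketch, with hypothetical segment keys (a production system would use a feature vector and a model rather than a lookup table):

```python
from collections import defaultdict

arms = ['start_free_trial', 'book_a_demo']

# Multi-armed bandit: one reward estimate per arm, shared by every user.
global_values = {arm: 0.0 for arm in arms}

def global_best():
    return max(arms, key=lambda arm: global_values[arm])

# Contextual bandit, simplest tabular form: one estimate per (segment, arm),
# so different segments can converge on different winners.
segment_values = defaultdict(float)  # keyed by (segment, arm)

def contextual_best(segment):
    return max(arms, key=lambda arm: segment_values[(segment, arm)])

# contextual_best('mobile') and contextual_best('desktop') are learned
# independently and may disagree; global_best() gives one answer for everyone.
```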
A/B Test vs. Multi-Armed Bandit vs. Contextual Bandit
| | A/B Test | Multi-Armed Bandit | Contextual Bandit |
|---|---|---|---|
| Traffic allocation | Fixed (e.g., 50/50) | Adaptive (global) | Adaptive (per user) |
| Personalization | None | None | Yes |
| Learning signal | Post-experiment | Continuous | Continuous |
| Data required | Moderate | Moderate | High |
| Implementation complexity | Low | Medium | High |
| Interpretability | High | Medium | Low |
A Real Example
Suppose you're testing two CTAs: 'Start free trial' vs. 'Book a demo'. A standard A/B test might find 'Start free trial' wins by 8% overall. But a contextual bandit might learn:
- Mobile users convert better on 'Start free trial'
- Users arriving from high-intent search terms ('pricing', 'alternatives') convert better on 'Book a demo'
- Returning visitors who've already seen the pricing page convert better on 'Book a demo'
The bandit serves each user the variant that's predicted to be best for them specifically, rather than forcing everyone through the globally optimal option.
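As a sketch of how such a policy can be implemented, here is a minimal version of disjoint LinUCB (Li et al., 2010), a standard contextual bandit algorithm. The three binary features and the exploration parameter are illustrative assumptions, not part of the example above:

```python
import numpy as np

arms = ['start_free_trial', 'book_a_demo']
d = 3        # context: [is_mobile, high_intent_search, seen_pricing_page]
alpha = 1.0  # exploration strength (higher = more exploration)

A = {arm: np.eye(d) for arm in arms}    # per-arm Gram matrix (ridge prior)
b = {arm: np.zeros(d) for arm in arms}  # per-arm accumulated reward vector

def choose(x):
    """Pick the arm with the highest upper confidence bound for context x."""
    def ucb(arm):
        A_inv = np.linalg.inv(A[arm])
        theta = A_inv @ b[arm]  # ridge-regression estimate of the arm's payoff
        return theta @ x + alpha * np.sqrt(x @ A_inv @ x)
    return max(arms, key=ucb)

def update(arm, x, reward):
    A[arm] += np.outer(x, x)
    b[arm] += reward * x

# Returning mobile visitor who has already seen the pricing page:
x = np.array([1.0, 0.0, 1.0])
variant = choose(x)  # serve it, observe a conversion (1) or not (0), then:
update(variant, x, reward=1.0)
```

Each arm gets its own linear model of reward as a function of context, plus a confidence bonus that shrinks as that arm accumulates data in similar contexts. That bonus is the same exploration/exploitation balance from earlier, now applied per context.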
When to Use a Contextual Bandit
Contextual bandits are worth considering when:
- You have sufficient traffic to train a reliable model — typically tens of thousands of users per context segment, not hundreds
- You have meaningful feature signals available at decision time (device, session behavior, acquisition source, account attributes)
- Your goal is ongoing personalization, not a one-time experiment conclusion
- You're already running multi-armed bandits and want to add a personalization layer
Limitations
- Requires ML infrastructure. You need a feature pipeline, a model training loop, and a serving layer that can make real-time predictions at the point of variant assignment (a rough sketch of this serving path follows this list)
- More data to train reliably. The model needs to learn separate reward functions for different context combinations. Sparse data in some segments means poor decisions in those segments
- Low interpretability. Unlike an A/B test with a clear winner and a confidence interval, a contextual bandit produces a model. Understanding why it makes the decisions it does requires additional analysis
- Harder to audit. If a bandit is serving different users different experiences, detecting errors or bias in the allocation is more complex than inspecting a simple A/B split
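To make the first limitation concrete, the request path has to look roughly like the sketch below. Every name here is a hypothetical stub standing in for a real feature store and trained policy:

```python
# Hypothetical shape of the serving path; every name here is an
# illustrative stub, not a real API.
def lookup_features(user_id):
    # Stand-in for a feature pipeline / feature store lookup.
    return {'is_mobile': hash(user_id) % 2 == 0}

def policy_choose(features):
    # Stand-in for a trained contextual bandit policy.
    return 'start_free_trial' if features['is_mobile'] else 'book_a_demo'

def assign_variant(user_id):
    features = lookup_features(user_id)  # context must exist at decision time
    variant = policy_choose(features)    # real-time prediction per request
    # A real system would log (user, context, variant) here so the eventual
    # reward can be joined back in for the training loop.
    return variant

print(assign_variant('user-123'))
```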
For most teams, the right progression is: A/B tests → multi-armed bandit → contextual bandit. Each step adds power and personalization at the cost of complexity and data requirements. Jump ahead only when the prior step has hit its ceiling and you have the infrastructure to support the next one.