Quick Comparison
| Algorithm | Strategy | Best For | Complexity |
|---|---|---|---|
| Epsilon-Greedy | Fixed exploration rate | Stable environments | Simple |
| UCB1 | Confidence bounds | Decreasing exploration over time | Moderate |
| Thompson Sampling | Bayesian sampling | Most scenarios | Moderate |
| Contextual (Linear UCB) | Context-aware selection | Personalization | Advanced |
Epsilon-Greedy
The simplest bandit algorithm. Exploits the best-known arm most of the time and explores randomly the rest of the time.
How it works
With probability (1 - epsilon), pick the arm with the highest average reward. With probability epsilon, pick a random arm uniformly. Default epsilon = 0.1 (10% exploration). This creates a fixed balance between trying new things and sticking with what works.
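The selection rule above can be sketched in a few lines. This is an illustrative implementation, not the library's internal code; the function name and signature are hypothetical.

```python
import random

def epsilon_greedy(avg_rewards, epsilon=0.1, rng=random):
    """Pick an arm index: exploit the best-known arm with probability
    (1 - epsilon), otherwise explore a uniformly random arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(avg_rewards))  # explore: any arm, uniformly
    # exploit: arm with the highest observed average reward
    return max(range(len(avg_rewards)), key=lambda i: avg_rewards[i])
```

With `epsilon=0.0` the function always exploits; with `epsilon=1.0` it always explores, so the default of 0.1 sits near the exploit end of that spectrum.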
When to use
- You want predictable behavior
- Your environment is stable
- You have a known good default
Parameters
| Name | Range | Default | Description |
|---|---|---|---|
| epsilon | 0 - 1 | 0.1 | Exploration rate. Higher values explore more. |
Confidence formula
confidence = 1 - epsilon

Fixed confidence based on the exploration parameter.
UCB1 (Upper Confidence Bound)
An optimistic algorithm that favors under-explored arms by adding a confidence bonus.
How it works
For each arm, calculate: average_reward + sqrt(2 * ln(total_pulls) / arm_pulls). Pick the arm with the highest value. Arms that haven't been tried much get a large bonus, ensuring exploration. As an arm is pulled more, the bonus shrinks, naturally shifting toward exploitation.
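The scoring rule described above can be sketched as follows. This is a minimal illustration (the function name is hypothetical); it gives never-pulled arms an infinite bonus so each arm is tried at least once before the formula applies.

```python
import math

def ucb1_select(avg_rewards, pulls):
    """Pick the arm maximizing average_reward + sqrt(2 * ln(total) / n).
    Arms with zero pulls get priority (treated as an infinite bonus)."""
    total = sum(pulls)
    best, best_score = 0, float("-inf")
    for i, (avg, n) in enumerate(zip(avg_rewards, pulls)):
        if n == 0:
            score = float("inf")  # force initial exploration of untried arms
        else:
            score = avg + math.sqrt(2 * math.log(total) / n)
        if score > best_score:
            best, best_score = i, score
    return best
```

Note how a barely-tried arm can outrank a well-performing one: with pulls of [100, 1], an arm averaging 0.1 still wins over one averaging 0.9 because its confidence bonus is so large.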
When to use
- You want exploration to naturally decrease over time
- You don't want to set exploration parameters manually
- You have many arms to evaluate
Parameters
None -- UCB1 is parameter-free.
Confidence formula
confidence = min(0.95, bestUCB / (bestUCB + 0.1))

Confidence grows as the best arm's UCB value increases.
Confidence bonus shrinks with more data
Less-explored arms get a larger confidence bonus, encouraging exploration.
Thompson Sampling
A Bayesian algorithm that samples from the posterior distribution of each arm's reward probability.
How it works
Maintain a Beta(successes + 1, failures + 1) distribution for each arm. Sample a value from each arm's distribution and pick the arm with the highest sampled value. As more data comes in, the distributions narrow and the best arm wins more often, automatically balancing exploration and exploitation.
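The sample-and-pick step above is short enough to show directly. This is an illustrative sketch (the function name is hypothetical), using Python's built-in Beta sampler.

```python
import random

def thompson_select(successes, failures, rng=random):
    """Sample once from each arm's Beta(successes + 1, failures + 1)
    posterior and pick the arm with the highest sampled value."""
    samples = [rng.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])
```

With no data every arm is Beta(1, 1), i.e. uniform, so selection starts out random; as one arm accumulates successes its posterior concentrates near its true rate and it wins nearly every draw.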
When to use
- Binary outcomes (click/no-click, convert/don't-convert)
- You want the theoretically optimal exploration/exploitation balance
- Most general-purpose use cases
Parameters
None -- adapts automatically from data.
Confidence formula
sample from Beta(successes + 1, trials - successes + 1)

Confidence emerges naturally from the posterior distribution.
Distributions narrow with more data
Contextual Bandit (Linear UCB)
Uses user context (device, location, behavior) to personalize treatment selection.
How it works
Extracts numeric features from context. Maintains a linear model per arm. Uses UCB-style confidence bounds on the linear predictions. Arms are selected based on predicted reward + confidence bonus, personalized to each user's context. This means different users can receive different treatments based on who they are.
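The per-arm linear model plus confidence bonus can be sketched in pure Python. This is a simplified illustration of the disjoint LinUCB idea, not the library's implementation; class and function names are hypothetical, and the matrix inverse is maintained incrementally via the Sherman-Morrison identity to avoid a linear-algebra dependency.

```python
class LinUCBArm:
    """One arm's ridge-regression model with a UCB-style bonus."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha  # width of the confidence bonus
        # Inverse of A = I + sum(x x^T); starts as the identity (ridge prior)
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # sum of reward * x

    def _mat_vec(self, M, v):
        return [sum(m * vj for m, vj in zip(row, v)) for row in M]

    def ucb(self, x):
        """Predicted reward for context x plus a confidence bonus."""
        theta = self._mat_vec(self.A_inv, self.b)            # ridge estimate
        Ax = self._mat_vec(self.A_inv, x)
        width = sum(xi * axi for xi, axi in zip(x, Ax)) ** 0.5
        return sum(t * xi for t, xi in zip(theta, x)) + self.alpha * width

    def update(self, x, reward):
        """Fold one (context, reward) observation into the model."""
        # Sherman-Morrison: (A + x x^T)^-1 from the current A^-1
        Ax = self._mat_vec(self.A_inv, x)
        denom = 1.0 + sum(xi * axi for xi, axi in zip(x, Ax))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]

def linucb_select(arms, x):
    """Pick the arm with the highest predicted reward + bonus for context x."""
    return max(range(len(arms)), key=lambda i: arms[i].ucb(x))
```

For example, with a two-dimensional context like [is_mobile, is_desktop], one arm can learn to win for mobile users while another wins for desktop users, which is exactly the "different users get different treatments" behavior described above.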
When to use
- Different users respond to different treatments
- You have useful contextual signals (device type, location, time of day)
- You want personalization, not just global optimization
Parameters
| Name | Type | Description |
|---|---|---|
| context | object | Contextual features passed at assignment time (e.g. device, location, time of day). |
Confidence formula
context-dependent linear UCB bound

Confidence depends on both the linear model fit and the user's context features.
Context features inform arm selection
Different users get different predictions based on their context.
Choosing an Algorithm
Not sure which algorithm to pick? Here's a quick guide based on your situation.
Just starting?
Use Thompson Sampling -- it works well in almost all scenarios and requires no parameter tuning. It's the best general-purpose choice.
Need predictability?
Use Epsilon-Greedy -- you control exactly how much exploration happens. Set epsilon to 0.1 for 10% exploration, or lower if you want to be more conservative.
Want hands-off exploration?
Use UCB1 -- exploration decreases naturally as confidence grows. No parameters to tune, and it automatically explores less as it gathers more data.
Have user context?
Use Contextual Linear -- personalize assignments based on user attributes like device type, location, or time of day. Best when different users respond to different treatments.