Algorithms

Understanding the multi-armed bandit algorithms that power Bandit's optimization engine.

Quick Comparison

| Algorithm | Strategy | Best For | Complexity |
| --- | --- | --- | --- |
| Epsilon-Greedy | Fixed exploration rate | Stable environments | Simple |
| UCB1 | Confidence bounds | Decreasing exploration over time | Moderate |
| Thompson Sampling | Bayesian sampling | Most scenarios | Moderate |
| Contextual (Linear UCB) | Context-aware selection | Personalization | Advanced |

Epsilon-Greedy

The simplest bandit algorithm. Exploits the best-known arm most of the time, explores randomly the rest.

How it works

With probability (1 - epsilon), pick the arm with the highest average reward. With probability epsilon, pick a random arm uniformly. Default epsilon = 0.1 (10% exploration). This creates a fixed balance between trying new things and sticking with what works.
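The selection rule above can be sketched in a few lines of Python. The `(total_reward, pulls)` state layout is a hypothetical representation for illustration; Bandit's internals may differ.

```python
import random

def epsilon_greedy(arms, epsilon=0.1, rng=random):
    """Pick an arm index: with probability epsilon explore uniformly at
    random; otherwise exploit the arm with the highest average reward.

    arms: list of (total_reward, pulls) tuples (hypothetical layout).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(arms))  # explore: uniform random arm
    # exploit: highest average reward (never-pulled arms default to 0.0)
    averages = [reward / pulls if pulls else 0.0 for reward, pulls in arms]
    return max(range(len(arms)), key=averages.__getitem__)
```

With `epsilon=0.1` this exploits roughly 90% of the time, matching the default described above.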

When to use

  • You want predictable behavior
  • Your environment is stable
  • You have a known good default

Parameters

| Name | Range | Default | Description |
| --- | --- | --- | --- |
| epsilon | 0 - 1 | 0.1 | Exploration rate. Higher values explore more. |

Confidence formula

confidence = 1 - epsilon

Fixed confidence based on the exploration parameter.

How decisions are made

[Diagram: with the default epsilon of 0.1, 90% of decisions exploit (pick the highest-reward arm) and 10% explore (pick a random arm).]

UCB1 (Upper Confidence Bound)

An optimistic algorithm that favors under-explored arms by adding a confidence bonus.

How it works

For each arm, calculate: average_reward + sqrt(2 * ln(total_pulls) / arm_pulls). Pick the arm with the highest value. Arms that haven't been tried much get a large bonus, ensuring exploration. As an arm is pulled more, the bonus shrinks, naturally shifting toward exploitation.
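The scoring rule above can be sketched as follows. As with the other examples, the `(total_reward, pulls)` state layout is hypothetical; unpulled arms are given priority, since their bonus is effectively infinite.

```python
import math

def ucb1_select(arms):
    """Pick the arm with the highest UCB1 score:
    average_reward + sqrt(2 * ln(total_pulls) / arm_pulls).

    arms: list of (total_reward, pulls) tuples (hypothetical layout).
    """
    total_pulls = sum(pulls for _, pulls in arms)
    # An arm that has never been pulled is tried first.
    for i, (_, pulls) in enumerate(arms):
        if pulls == 0:
            return i
    scores = [
        reward / pulls + math.sqrt(2 * math.log(total_pulls) / pulls)
        for reward, pulls in arms
    ]
    return max(range(len(arms)), key=scores.__getitem__)
```

Note how an arm with a slightly lower average but far fewer pulls can still win on its confidence bonus.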

When to use

  • You want exploration to naturally decrease over time
  • You don't want to set exploration parameters manually
  • You have many arms to evaluate

Parameters

None -- UCB1 is parameter-free.

Confidence formula

confidence = min(0.95, bestUCB / (bestUCB + 0.1))

Confidence grows as the best arm's UCB value increases.

Confidence bonus shrinks with more data

[Diagram: Arm A (5 pulls), Arm B (25 pulls), and Arm C (100 pulls), each bar showing average reward plus a confidence bonus that shrinks as pulls increase.]

Less-explored arms get a larger confidence bonus, encouraging exploration.


Thompson Sampling

A Bayesian algorithm that samples from the posterior distribution of each arm's reward probability.

How it works

Maintain a Beta(successes + 1, failures + 1) distribution for each arm. Sample a value from each arm's distribution and pick the arm with the highest sampled value. As more data comes in, the distributions narrow and the best arm wins more often, automatically balancing exploration and exploitation.
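The sample-and-pick step above can be sketched directly with the standard library's Beta sampler. The `(successes, failures)` state layout is a hypothetical representation for illustration.

```python
import random

def thompson_select(arms, rng=random):
    """Sample one value from each arm's Beta(successes + 1, failures + 1)
    posterior and pick the arm with the highest sample.

    arms: list of (successes, failures) tuples (hypothetical layout).
    """
    samples = [rng.betavariate(s + 1, f + 1) for s, f in arms]
    return max(range(len(arms)), key=samples.__getitem__)
```

Early on, wide posteriors mean any arm can win a draw (exploration); as trials accumulate, the posteriors narrow and the best arm wins almost every draw (exploitation).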

When to use

  • Binary outcomes (click/no-click, convert/don't-convert)
  • You want the theoretically optimal exploration/exploitation balance
  • Most general-purpose use cases

Parameters

None -- adapts automatically from data.

Confidence formula

sample from Beta(successes + 1, trials - successes + 1)

Confidence emerges naturally from the posterior distribution.

Distributions narrow with more data

[Diagram: Beta posteriors for Arm A (5 trials), Arm B (25 trials), and Arm C (100 trials) plotted over reward probability from 0 to 1; distributions narrow as trials increase.]

Contextual Bandit (Linear UCB)

Uses user context (device, location, behavior) to personalize treatment selection.

How it works

Extracts numeric features from context. Maintains a linear model per arm. Uses UCB-style confidence bounds on the linear predictions. Arms are selected based on predicted reward + confidence bonus, personalized to each user's context. This means different users can receive different treatments based on who they are.
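The prediction-plus-bonus step above can be sketched for a two-feature context vector. Everything here is illustrative: the `(A, b)` per-arm state (a ridge-regularized design matrix and reward vector), the fixed dimension, and the `alpha` exploration weight are assumptions, not Bandit's actual internals.

```python
import math

def linucb_scores(arms_state, x, alpha=1.0):
    """Linear UCB score for each arm given context vector x (length 2):
    predicted reward (theta . x) plus alpha * sqrt(x^T A^-1 x).

    Each arm's state is (A, b): A is a 2x2 matrix starting as the
    identity (ridge regularization), b is a length-2 reward vector.
    """
    def inv2(m):  # inverse of a 2x2 matrix
        (a, b), (c, d) = m
        det = a * d - b * c
        return [[d / det, -b / det], [-c / det, a / det]]

    scores = []
    for A, b in arms_state:
        A_inv = inv2(A)
        theta = [sum(A_inv[i][j] * b[j] for j in range(2)) for i in range(2)]
        mean = sum(theta[i] * x[i] for i in range(2))  # predicted reward
        Ax = [sum(A_inv[i][j] * x[j] for j in range(2)) for i in range(2)]
        bonus = alpha * math.sqrt(sum(x[i] * Ax[i] for i in range(2)))
        scores.append(mean + bonus)
    return scores

def linucb_update(arm_state, x, reward):
    """After observing reward for context x: A += x x^T, b += reward * x."""
    A, b = arm_state
    for i in range(2):
        for j in range(2):
            A[i][j] += x[i] * x[j]
        b[i] += reward * x[i]
```

Because the score depends on `x`, two users with different contexts can rank the same arms differently, which is exactly what enables personalization.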

When to use

  • Different users respond to different treatments
  • You have useful contextual signals (device type, location, time of day)
  • You want personalization, not just global optimization

Parameters

| Name | Type | Description |
| --- | --- | --- |
| context | object | Contextual features passed at assignment time (e.g. device, location, time of day). |

Confidence formula

context-dependent linear UCB bound

Confidence depends on both the linear model fit and the user's context features.

Advanced: This algorithm requires thoughtful feature selection. Choose context features that genuinely influence user behavior -- adding irrelevant features can reduce performance.

Context features inform arm selection

[Diagram: context features (device: mobile, location: US, time: evening) feed the linear model, which scores Arm A 0.72, Arm B 0.89, Arm C 0.54.]

Different users get different predictions based on their context.

Choosing an Algorithm

Not sure which algorithm to pick? Here's a quick guide based on your situation.

1. Just starting? Use Thompson Sampling -- it works well in almost all scenarios and requires no parameter tuning. It's the best general-purpose choice.

2. Need predictability? Use Epsilon-Greedy -- you control exactly how much exploration happens. Set epsilon to 0.1 for 10% exploration, or lower if you want to be more conservative.

3. Want hands-off exploration? Use UCB1 -- exploration decreases naturally as confidence grows. No parameters to tune, and it automatically explores less as it gathers more data.

4. Have user context? Use Contextual Linear -- personalize assignments based on user attributes like device type, location, or time of day. Best when different users respond to different treatments.
