Quick Comparison
| Algorithm | Strategy | Best For | Complexity |
|---|---|---|---|
| Epsilon-Greedy | Fixed exploration rate | Stable environments | Simple |
| UCB1 | Confidence bounds | Decreasing exploration over time | Moderate |
| Thompson Sampling | Bayesian sampling | Most scenarios | Moderate |
| Contextual (Linear UCB) | Context-aware selection | Personalization | Advanced |
Epsilon-Greedy
The simplest bandit algorithm. Exploits the best-known arm most of the time and explores randomly the rest of the time.
How it works
With probability (1 - epsilon), pick the arm with the highest average reward. With probability epsilon, pick a random arm uniformly. Default epsilon = 0.1 (10% exploration). This creates a fixed balance between trying new things and sticking with what works.
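The selection rule above can be sketched in a few lines. This is an illustrative implementation, not the library's internal code; the function name and signature are hypothetical.

```python
import random

def epsilon_greedy(avg_rewards, epsilon=0.1, rng=random):
    """Pick an arm index: exploit the best-known arm with probability
    (1 - epsilon), otherwise explore a uniformly random arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(avg_rewards))  # explore: any arm, uniformly
    # exploit: arm with the highest observed average reward
    return max(range(len(avg_rewards)), key=lambda i: avg_rewards[i])
```

With `epsilon=0.0` the function always exploits; with `epsilon=1.0` it always explores, so the default of 0.1 sits near the exploit end of that spectrum.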
When to use
- You want predictable behavior
- Your environment is stable
- You have a known good default
Parameters
| Name | Range | Default | Description |
|---|---|---|---|
| epsilon | 0 - 1 | 0.1 | Exploration rate. Higher values explore more. |
Confidence formula
confidence = 1 - epsilon

Fixed confidence based on the exploration parameter.
UCB1 (Upper Confidence Bound)
An optimistic algorithm that favors under-explored arms by adding a confidence bonus.
How it works
For each arm, calculate: average_reward + sqrt(2 * ln(total_pulls) / arm_pulls). Pick the arm with the highest value. Arms that haven't been tried much get a large bonus, ensuring exploration. As an arm is pulled more, the bonus shrinks, naturally shifting toward exploitation.
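The scoring rule described above can be sketched as follows. This is a minimal illustration (the function name is hypothetical); it gives never-pulled arms an infinite bonus so each arm is tried at least once before the formula applies.

```python
import math

def ucb1_select(avg_rewards, pulls):
    """Pick the arm maximizing average_reward + sqrt(2 * ln(total) / n).
    Arms with zero pulls get priority (treated as an infinite bonus)."""
    total = sum(pulls)
    best, best_score = 0, float("-inf")
    for i, (avg, n) in enumerate(zip(avg_rewards, pulls)):
        if n == 0:
            score = float("inf")  # force initial exploration of untried arms
        else:
            score = avg + math.sqrt(2 * math.log(total) / n)
        if score > best_score:
            best, best_score = i, score
    return best
```

Note how a barely-tried arm can outrank a well-performing one: with pulls of [100, 1], an arm averaging 0.1 still wins over one averaging 0.9 because its confidence bonus is so large.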
When to use
- You want exploration to naturally decrease over time
- You don't want to set exploration parameters manually
- You have many arms to evaluate
Parameters
None -- UCB1 is parameter-free.
Confidence formula
confidence = min(0.95, bestUCB / (bestUCB + 0.1))

Confidence grows as the best arm's UCB value increases.
Confidence bonus shrinks with more data
Less-explored arms get a larger confidence bonus, encouraging exploration.
Thompson Sampling
A Bayesian algorithm that samples from the posterior distribution of each arm's reward probability.
How it works
Maintain a Beta(successes + 1, failures + 1) distribution for each arm. Sample a value from each arm's distribution and pick the arm with the highest sampled value. As more data comes in, the distributions narrow and the best arm wins more often, automatically balancing exploration and exploitation.
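The sample-and-pick step above is short enough to show directly. This is an illustrative sketch (the function name is hypothetical), using Python's built-in Beta sampler.

```python
import random

def thompson_select(successes, failures, rng=random):
    """Sample once from each arm's Beta(successes + 1, failures + 1)
    posterior and pick the arm with the highest sampled value."""
    samples = [rng.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])
```

With no data every arm is Beta(1, 1), i.e. uniform, so selection starts out random; as one arm accumulates successes its posterior concentrates near its true rate and it wins nearly every draw.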
When to use
- Binary outcomes (click/no-click, convert/don't-convert)
- You want the theoretically optimal exploration/exploitation balance
- Most general-purpose use cases
Parameters
None -- adapts automatically from data.
Confidence formula
sample from Beta(successes + 1, trials - successes + 1)

Confidence emerges naturally from the posterior distribution.
Distributions narrow with more data
Contextual Bandit (Linear UCB)
Uses user context (device, location, behavior) to personalize treatment selection.
How it works
Extracts numeric features from context. Maintains a linear model per arm. Uses UCB-style confidence bounds on the linear predictions. Arms are selected based on predicted reward + confidence bonus, personalized to each user's context. This means different users can receive different treatments based on who they are.
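The per-arm linear model plus confidence bonus can be sketched in pure Python. This is a simplified illustration of the disjoint LinUCB idea, not the library's implementation; class and function names are hypothetical, and the matrix inverse is maintained incrementally via the Sherman-Morrison identity to avoid a linear-algebra dependency.

```python
class LinUCBArm:
    """One arm's ridge-regression model with a UCB-style bonus."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha  # width of the confidence bonus
        # Inverse of A = I + sum(x x^T); starts as the identity (ridge prior)
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # sum of reward * x

    def _mat_vec(self, M, v):
        return [sum(m * vj for m, vj in zip(row, v)) for row in M]

    def ucb(self, x):
        """Predicted reward for context x plus a confidence bonus."""
        theta = self._mat_vec(self.A_inv, self.b)            # ridge estimate
        Ax = self._mat_vec(self.A_inv, x)
        width = sum(xi * axi for xi, axi in zip(x, Ax)) ** 0.5
        return sum(t * xi for t, xi in zip(theta, x)) + self.alpha * width

    def update(self, x, reward):
        """Fold one (context, reward) observation into the model."""
        # Sherman-Morrison: (A + x x^T)^-1 from the current A^-1
        Ax = self._mat_vec(self.A_inv, x)
        denom = 1.0 + sum(xi * axi for xi, axi in zip(x, Ax))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]

def linucb_select(arms, x):
    """Pick the arm with the highest predicted reward + bonus for context x."""
    return max(range(len(arms)), key=lambda i: arms[i].ucb(x))
```

For example, with a two-dimensional context like [is_mobile, is_desktop], one arm can learn to win for mobile users while another wins for desktop users, which is exactly the "different users get different treatments" behavior described above.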
When to use
- Different users respond to different treatments
- You have useful contextual signals (device type, location, time of day)
- You want personalization, not just global optimization
Parameters
| Name | Type | Description |
|---|---|---|
| context | object | Contextual features passed at assignment time (e.g. device, location, time of day). |
Confidence formula
context-dependent linear UCB bound

Confidence depends on both the linear model fit and the user's context features.
Context features inform arm selection
Different users get different predictions based on their context.
Choosing an Algorithm
Not sure which algorithm to pick? Here's a quick guide based on your situation.
Just starting?
Use Thompson Sampling -- it works well in almost all scenarios and requires no parameter tuning. It's the best general-purpose choice.
Need predictability?
Use Epsilon-Greedy -- you control exactly how much exploration happens. Set epsilon to 0.1 for 10% exploration, or lower if you want to be more conservative.
Want hands-off exploration?
Use UCB1 -- exploration decreases naturally as confidence grows. No parameters to tune, and it automatically explores less as it gathers more data.
Have user context?
Use Contextual Linear -- personalize assignments based on user attributes like device type, location, or time of day. Best when different users respond to different treatments.