Thompson Sampling: From Slot Machines to Serious Science
Author
Justin M. Jones, PhD
Published
March 14, 2026
Show code
```js
ARM_COLORS = ["#4e79a7", "#f28e2b", "#59a14f", "#e15759"]

// Lanczos log-gamma approximation
function lgamma(x) {
  if (x < 0.5) return Math.log(Math.PI / Math.sin(Math.PI * x)) - lgamma(1 - x);
  x -= 1;
  const g = 7;
  const c = [
    0.99999999999980993, 676.5203681218851, -1259.1392167224028,
    771.32342877765313, -176.61502916214059, 12.507343278686905,
    -0.13857109526572012, 9.9843695780195716e-6, 1.5056327351493116e-7
  ];
  let a = c[0];
  const t = x + g + 0.5;
  for (let i = 1; i < g + 2; i++) a += c[i] / (x + i);
  return 0.5 * Math.log(2 * Math.PI) + (x + 0.5) * Math.log(t) - t + Math.log(a);
}

function lbeta(a, b) { return lgamma(a) + lgamma(b) - lgamma(a + b); }

function betaPDF(x, a, b) {
  if (x <= 0 || x >= 1) return 0;
  const logp = (a - 1) * Math.log(x) + (b - 1) * Math.log(1 - x) - lbeta(a, b);
  return Math.exp(logp);
}

function normalPDF(x, mu, sigma) {
  return Math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * Math.sqrt(2 * Math.PI));
}

// Box-Muller standard normal sample
function stdNormal() {
  const u1 = Math.random(), u2 = Math.random();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Marsaglia-Tsang Gamma sampler
function gammaSample(shape) {
  if (shape < 1) return gammaSample(shape + 1) * Math.pow(Math.random(), 1 / shape);
  const d = shape - 1 / 3, c = 1 / Math.sqrt(9 * d);
  while (true) {
    let z, v;
    do { z = stdNormal(); v = 1 + c * z; } while (v <= 0);
    v = v * v * v;
    const u = Math.random();
    if (u < 1 - 0.0331 * z * z * z * z) return d * v;
    if (Math.log(u) < 0.5 * z * z + d * (1 - v + Math.log(v))) return d * v;
  }
}

function betaSample(a, b) {
  const x = gammaSample(a), y = gammaSample(b);
  return x / (x + y);
}
```
The Multi-Armed Bandit Problem
Picture this: you’re running a study on onboarding interventions. You’ve got five different training modules, and you want to figure out which one does the best job improving 30-day retention. The classic play is a clean randomized experiment — split people equally across conditions, run it for a few months, analyze. Done.
Except there’s a catch. While that experiment is running, you’re constantly assigning people to training modules you already have some reason to suspect are worse. Every person you send to a suboptimal module is a person whose retention you could have improved more. In a lab setting, that cost is a mild annoyance. At scale, with millions of users and real outcomes on the line, it adds up.
This is the multi-armed bandit problem. The name comes from imagining a row of slot machines — a.k.a. “one-armed bandits” — each with a different and unknown payout probability. You want to maximize your winnings, but you don’t know which machine is best. Every pull you spend figuring that out is a pull you’re not spending on the winner.
The problem shows up everywhere in disguise: A/B testing, recommendation systems, clinical trials, ad auctions, hyperparameter search. Anywhere you’re making sequential decisions under uncertainty, you’re running a bandit problem whether you call it that or not.
The Setup
Formally: you have \(K\) arms. At each time step \(t = 1, 2, \ldots, T\), you choose an arm \(A_t \in \{1, \ldots, K\}\) and receive a reward \(R_t \sim P_{A_t}\), where \(P_k\) is the (unknown) reward distribution for arm \(k\). The expected reward for arm \(k\) is \(\mu_k\), and the best arm has expected reward \(\mu^* = \max_k \mu_k\).
The goal: maximize \(\sum_{t=1}^T R_t\). Or equivalently, minimize how much reward you lose by not always pulling the best arm.
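To make the setup concrete, here's a minimal sketch of the interaction loop for Bernoulli arms in plain JavaScript. The `runBandit` and `randomPolicy` names and the example probabilities are ours for illustration, not part of the notebook's cells:

```js
// A K-armed Bernoulli bandit: each pull of arm k returns 1 with
// probability probs[k], else 0. A policy is any function mapping
// the interaction history (counts, successes, t) to an arm index.
function runBandit(probs, policy, T) {
  const counts = new Array(probs.length).fill(0);
  const successes = new Array(probs.length).fill(0);
  let totalReward = 0;
  for (let t = 0; t < T; t++) {
    const arm = policy(counts, successes, t);
    const reward = Math.random() < probs[arm] ? 1 : 0;
    counts[arm]++;
    successes[arm] += reward;
    totalReward += reward;
  }
  return { totalReward, counts, successes };
}

// Uniform-random policy: ignores the history entirely.
const randomPolicy = (counts) => Math.floor(Math.random() * counts.length);

const out = runBandit([0.2, 0.4, 0.6, 0.8], randomPolicy, 1000);
```

Every algorithm in this post is just a smarter `policy` plugged into this loop.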
The Exploration-Exploitation Tradeoff
Every bandit algorithm has to deal with this tension:
Exploitation: Pull the arm you currently believe is best. Collect reward now.
Exploration: Try arms you’re uncertain about. Gather information that might improve future decisions.
These goals are in direct conflict. An algorithm that only exploits gets stuck on the first arm that seems decent — if it had bad luck early on, it’ll stay stuck there. An algorithm that only explores never commits and just collects mediocre rewards indefinitely.
Here’s an interactive demo. We have four arms with unknown success rates. A random exploration policy pulls each arm uniformly at random — watch how slowly it converges on the best arm, and how much regret it accumulates in the process:
Show code
```js
intro_sim = {
  const K = 4;
  const true_probs = [0.2, 0.4, 0.6, 0.8];
  const T = 500;
  const counts = [0, 0, 0, 0];
  const successes = [0, 0, 0, 0];
  const mu_star = Math.max(...true_probs);
  const pull_history = [];
  const est_history = [];
  const regret_history = [];
  let cum_reg = 0;
  for (let t = 0; t < T; t++) {
    const arm = Math.floor(Math.random() * K);
    const reward = Math.random() < true_probs[arm] ? 1 : 0;
    counts[arm]++;
    successes[arm] += reward;
    cum_reg += mu_star - true_probs[arm];
    pull_history.push(arm);
    est_history.push(counts.map((c, i) => (c > 0 ? successes[i] / c : 0.5)));
    regret_history.push(cum_reg);
  }
  return { pull_history, est_history, regret_history, true_probs, T, K };
}
```
Show code
```js
viewof intro_T = Inputs.range([1, 500], { value: 50, step: 1, label: "Number of pulls (random policy)" })
```
Even with 500 pulls, a random policy hasn’t reliably identified Arm 4 as the winner — and the regret keeps climbing linearly. That’s the problem we’re trying to solve.
Regret and Pseudoregret
Before getting into solutions, let’s be precise about how we measure failure. The standard measure is cumulative regret:
\[R_T = T\mu^* - \sum_{t=1}^{T} \mu_{A_t}\]
In words: the total expected reward you would have collected always pulling the best arm, minus what you actually expected to collect with your policy. Every time you pull a suboptimal arm, you eat the difference \(\Delta_k = \mu^* - \mu_k\), called the suboptimality gap for arm \(k\).
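Computed from a log of pulls, cumulative regret is a one-liner; a sketch (the `cumulativeRegret` helper is ours, not one of the notebook's cells):

```js
// Pseudoregret of a pull sequence: the sum of suboptimality gaps
// Delta_k = mu* - mu_k over every arm that was pulled.
function cumulativeRegret(trueMeans, pulls) {
  const muStar = Math.max(...trueMeans);
  let regret = 0;
  for (const arm of pulls) regret += muStar - trueMeans[arm];
  return regret;
}
```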
Pseudoregret averages over the randomness in the algorithm itself:
\[\bar{R}_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{A_t}\right]\]
The distinction is subtle. Regret is a realized quantity — it depends on which random samples happened to come up. Pseudoregret smooths over that noise and asks about expected performance. In theoretical work you’ll see proofs about pseudoregret; in practice people often use the terms interchangeably. Worth knowing the difference.
Why Regret Bounds Matter
A random policy has pseudoregret that grows linearly: \(\bar{R}_T = \Theta(T)\). That means it never stops losing at a constant rate. Good bandit algorithms achieve logarithmic regret: \(\bar{R}_T = O(\log T)\). That’s a qualitative difference — the regret curve flattens out as the algorithm locks in on the best arm.
Thompson Sampling achieves logarithmic regret. In fact, it achieves the asymptotically optimal bound, matching the theoretical lower limit proved by Lai and Robbins (1985). That’s not just good engineering — it’s optimal in a rigorous sense.
The comparison chart in the final section will make this concrete. For now, just keep in mind: linear regret means you never really learn. Logarithmic regret means you do.
Basic Thompson Sampling: Binary Rewards
Let’s start with the simplest useful case: binary rewards. Each arm either gives you a 1 (success) or a 0 (failure). Click / no click. Conversion / no conversion. This is the most common setup in industry A/B testing.
For arm \(k\), the true success probability is \(\theta_k \in [0, 1]\), which we don’t know. Thompson Sampling’s approach: maintain a probability distribution over what \(\theta_k\) might be, and use that uncertainty to guide exploration.
The Beta Distribution as a Prior
The natural prior for an unknown probability is the Beta distribution: \(\theta \sim \text{Beta}(\alpha, \beta)\), where \(\alpha\) and \(\beta\) are shape parameters. The mean is \(\alpha / (\alpha + \beta)\) and the total \(\alpha + \beta\) controls how concentrated (certain) the distribution is.
A few reference points:
\(\text{Beta}(1, 1)\): flat uniform — no information whatsoever
\(\text{Beta}(10, 10)\): centered at 0.5, fairly confident it’s near there
\(\text{Beta}(30, 5)\): mean ≈ 0.86, very confident it’s high
The reason Beta is so useful here is conjugacy: if the prior is \(\text{Beta}(\alpha, \beta)\) and you observe \(s\) successes and \(f\) failures, the posterior is exactly \(\text{Beta}(\alpha + s, \beta + f)\). No numerical integration, no MCMC. The math just works.
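The conjugate update is literally two additions; a sketch (helper names are ours):

```js
// Conjugate Beta-Bernoulli update: prior Beta(alpha, beta) plus
// s observed successes and f observed failures gives the posterior
// Beta(alpha + s, beta + f) exactly -- no integration required.
function betaPosterior(alpha, beta, s, f) {
  return { alpha: alpha + s, beta: beta + f };
}

// The mean of Beta(alpha, beta) is alpha / (alpha + beta).
function betaMean({ alpha, beta }) {
  return alpha / (alpha + beta);
}
```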
Notice how a flat \(\text{Beta}(1,1)\) prior becomes increasingly concentrated as you add observations (increase \(\alpha\) and \(\beta\)). That concentration is the algorithm learning.
The Algorithm
Thompson Sampling is almost embarrassingly simple:
Initialize each arm \(k\) with prior \(\text{Beta}(\alpha_k, \beta_k) = \text{Beta}(1, 1)\)
At each time step \(t\):
(a) For each arm \(k\), sample \(\tilde{\theta}_k \sim \text{Beta}(\alpha_k, \beta_k)\)
(b) Pull the arm with the largest sample: \(A_t = \arg\max_k \tilde{\theta}_k\)
(c) Observe the reward \(R_t \in \{0, 1\}\) and update the pulled arm’s posterior: \(\alpha_{A_t} \leftarrow \alpha_{A_t} + R_t\), \(\beta_{A_t} \leftarrow \beta_{A_t} + (1 - R_t)\)
That’s it. The key move is step (a): you’re sampling a plausible value of each arm’s success probability from your current beliefs, then acting as if those samples were the ground truth.
The reason this works: an arm that you’re uncertain about (wide Beta distribution) will occasionally sample high, which drives exploration automatically — no explicit bonus required. Once you’ve pulled an arm enough times, its posterior concentrates and it only wins the sample competition if it’s genuinely good. Exploitation falls out naturally too.
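Putting the steps together, here is a self-contained sketch of one Thompson Sampling step for Bernoulli rewards. It repeats the notebook's Marsaglia-Tsang sampler so the block runs on its own; `thompsonStep` and `drawReward` are illustrative names:

```js
// Beta sampler via two Gamma draws (Marsaglia-Tsang method).
function stdNormal() {
  const u1 = Math.random(), u2 = Math.random();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
function gammaSample(shape) {
  if (shape < 1) return gammaSample(shape + 1) * Math.pow(Math.random(), 1 / shape);
  const d = shape - 1 / 3, c = 1 / Math.sqrt(9 * d);
  for (;;) {
    let z, v;
    do { z = stdNormal(); v = 1 + c * z; } while (v <= 0);
    v = v * v * v;
    const u = Math.random();
    if (u < 1 - 0.0331 * z ** 4) return d * v;
    if (Math.log(u) < 0.5 * z * z + d * (1 - v + Math.log(v))) return d * v;
  }
}
function betaSample(a, b) {
  const x = gammaSample(a), y = gammaSample(b);
  return x / (x + y);
}

// One Thompson Sampling step: sample a plausible theta for each arm,
// pull the argmax, then apply the conjugate Beta update to that arm.
// alpha and beta are mutated in place; drawReward(arm) returns 0 or 1.
function thompsonStep(alpha, beta, drawReward) {
  const samples = alpha.map((a, k) => betaSample(a, beta[k]));
  const arm = samples.indexOf(Math.max(...samples));
  const reward = drawReward(arm);
  alpha[arm] += reward;
  beta[arm] += 1 - reward;
  return arm;
}
```

Running `thompsonStep` in a loop with `drawReward = k => Math.random() < probs[k] ? 1 : 0` reproduces the behavior in the demo below: uncertain arms occasionally sample high and get explored; confident bad arms stop winning the sample competition.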
Watch it run on four arms with true success probabilities \([0.2, 0.4, 0.6, 0.8]\). Scrub through time and watch the posteriors sharpen:
A few things worth noticing as you scrub through time: (1) early on, all four posteriors are wide and overlap a lot — the algorithm is still genuinely uncertain; (2) around T=100–200, the posteriors start separating and Arm 4 starts dominating pull counts; (3) the regret curve bends over as the algorithm locks in. That bend is the difference between \(O(\log T)\) and \(O(T)\).
Regret Comparison: Thompson Sampling vs. the Field
Before moving on to continuous rewards, let’s see how Thompson Sampling actually stacks up against other common bandit algorithms. The contenders:
Random: pull a uniform random arm every time. The baseline floor.
ε-Greedy (ε = 0.1): with probability 0.9, pull the arm with the best current empirical mean; with probability 0.1, pull a random arm. Simple and widely used.
UCB1: pull arm \(k\) if it maximizes \(\hat{\mu}_k + \sqrt{2 \ln t / n_k}\), where \(n_k\) is the number of times arm \(k\) has been pulled. Optimistic in the face of uncertainty.
Thompson Sampling: what we just covered.
All four use the same arm setup: true probabilities \([0.2, 0.4, 0.6, 0.8]\).
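For reference, the UCB1 selection rule from the list above can be sketched as follows (helper names are ours; each arm is pulled once before the index applies):

```js
// UCB1 index for an arm: empirical mean plus an optimism bonus
// sqrt(2 ln t / n_k) that shrinks as the arm accumulates pulls.
function ucb1Index(meanK, nK, t) {
  return meanK + Math.sqrt((2 * Math.log(t)) / nK);
}

// Pick the arm maximizing the index. sums[k] is total reward from
// arm k, counts[k] its number of pulls; untried arms go first.
function ucb1Pick(sums, counts, t) {
  const untried = counts.indexOf(0);
  if (untried !== -1) return untried;
  const indices = counts.map((n, k) => ucb1Index(sums[k] / n, n, t));
  return indices.indexOf(Math.max(...indices));
}
```

Note how the bonus makes the algorithm "optimistic in the face of uncertainty": with equal empirical means, the less-pulled arm always wins.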
Show code
```js
reg_sim = {
  const true_probs = [0.2, 0.4, 0.6, 0.8];
  const K = 4;
  const T = 1000;
  const mu_star = Math.max(...true_probs);

  function runRandom() {
    let cum = 0;
    return Array.from({ length: T }, () => {
      const arm = Math.floor(Math.random() * K);
      cum += mu_star - true_probs[arm];
      return cum;
    });
  }

  function runEpsilonGreedy(eps = 0.1) {
    const alpha = new Array(K).fill(1), beta = new Array(K).fill(1);
    let cum = 0;
    return Array.from({ length: T }, () => {
      const arm = Math.random() < eps
        ? Math.floor(Math.random() * K)
        : alpha
            .map((a, i) => a / (a + beta[i]))
            .reduce((best, v, i, arr) => (v > arr[best] ? i : best), 0);
      const reward = Math.random() < true_probs[arm] ? 1 : 0;
      cum += mu_star - true_probs[arm];
      alpha[arm] += reward;
      beta[arm] += 1 - reward;
      return cum;
    });
  }

  function runUCB1() {
    const n = new Array(K).fill(0);
    const sum = new Array(K).fill(0);
    let cum = 0;
    return Array.from({ length: T }, (_, t) => {
      let arm;
      if (t < K) {
        arm = t;
      } else {
        arm = n
          .map((ni, i) => sum[i] / ni + Math.sqrt((2 * Math.log(t + 1)) / ni))
          .reduce((best, v, i, arr) => (v > arr[best] ? i : best), 0);
      }
      const reward = Math.random() < true_probs[arm] ? 1 : 0;
      cum += mu_star - true_probs[arm];
      n[arm]++;
      sum[arm] += reward;
      return cum;
    });
  }

  function runThompson() {
    const alpha = new Array(K).fill(1), beta = new Array(K).fill(1);
    let cum = 0;
    return Array.from({ length: T }, () => {
      const samples = alpha.map((a, i) => betaSample(a, beta[i]));
      const arm = samples.reduce((best, v, i, arr) => (v > arr[best] ? i : best), 0);
      const reward = Math.random() < true_probs[arm] ? 1 : 0;
      cum += mu_star - true_probs[arm];
      alpha[arm] += reward;
      beta[arm] += 1 - reward;
      return cum;
    });
  }

  return { random: runRandom(), egreedy: runEpsilonGreedy(), ucb1: runUCB1(), thompson: runThompson(), T };
}
```
The separation becomes stark by T=500. Random and ε-Greedy have roughly linear regret — they keep paying the same price per step indefinitely. UCB1 and Thompson Sampling bend over logarithmically. Between the two, Thompson tends to come out slightly ahead on binary problems because its posterior sampling adapts exploration to the observed data, whereas UCB1’s worst-case confidence bonus tends to over-explore.
Thompson Sampling: Continuous Rewards
Binary rewards are a special case. A lot of the time rewards are continuous — revenue per click, time-on-site, task completion scores, satisfaction ratings. The question isn’t “which arm succeeds more often” but “which arm produces the highest value on average.”
The math changes but the logic doesn’t. We still want a posterior over each arm’s mean reward \(\mu_k\), and we still sample from that posterior to pick an arm.
The Normal-Normal Model
The standard conjugate setup for continuous rewards: assume rewards have a known (or estimated) variance \(\sigma^2\), and put a Normal prior on the mean, \(\mu_k \sim \mathcal{N}(\mu_0, \tau_0^2)\). After \(n\) pulls of arm \(k\) with sample mean \(\bar{r}_k\), the posterior is \(\mathcal{N}(\mu_n, \tau_n^2)\) with
\[\tau_n^2 = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1}, \qquad \mu_n = \tau_n^2 \left(\frac{\mu_0}{\tau_0^2} + \frac{n \bar{r}_k}{\sigma^2}\right)\]
As \(n \to \infty\), \(\tau_n^2 \to 0\) and \(\mu_n \to \bar{r}_k\) — your posterior concentrates on the sample mean. The prior washes out and data takes over. That’s the behavior you want.
The algorithm is otherwise unchanged: in step (a) you sample \(\tilde{\mu}_k \sim \mathcal{N}(\mu_n, \tau_n^2)\) from each arm’s Normal posterior instead of from a Beta.
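The conjugate Normal update can be sketched as follows (assuming known reward variance; the `normalPosterior` name is ours):

```js
// Normal-Normal posterior for an arm's mean with known reward
// variance sigmaSq: prior N(mu0, tau0sq), then n observations
// with sample mean rbar. Precisions (inverse variances) add.
function normalPosterior(mu0, tau0sq, sigmaSq, n, rbar) {
  const precision = 1 / tau0sq + n / sigmaSq; // posterior precision
  const tauNsq = 1 / precision;               // posterior variance
  const muN = tauNsq * (mu0 / tau0sq + (n * rbar) / sigmaSq);
  return { muN, tauNsq };
}
```

With `n = 0` this returns the prior untouched; as `n` grows, `muN` is pulled toward the sample mean and `tauNsq` shrinks toward zero, which is exactly the washing-out behavior described above.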
Four arms with true means \([0, 0.5, 1.5, 2.5]\) and reward \(\text{SD} = 1\):
Notice how the posteriors for Arm 1 (true mean = 0) and Arm 2 (true mean = 0.5) start wide, overlap with the better arms early on, but gradually get pulled away from the upper end. The algorithm explores them early but eventually stops bothering. That’s the right behavior.
Here’s where it gets a little interesting. What if your rewards are continuous but bounded between 0 and 1? Satisfaction scores, normalized engagement rates, click-through values you’ve capped at 1 — anything that lives on the unit interval.
You’re not in binary land (rewards are continuous), and you’re not cleanly in Normal land (Normal distributions produce values outside \([0, 1]\), which creates an awkward mismatch). The natural distribution for continuous data on \([0, 1]\) is the Beta distribution — but now we’re using it to model the rewards themselves, not just the prior over a success probability.
The Fractional Update
Here’s a clean and practical way to handle this. Think of a continuous reward \(r \in [0, 1]\) as a “soft” Bernoulli outcome: it’s evidence of success with magnitude \(r\) and failure with magnitude \((1 - r)\). A reward of \(0.8\) is strong evidence this arm is good. A reward of \(0.1\) is weak evidence it’s useful at all.
Under this framing, the Beta-Bernoulli update generalizes immediately:
\[\alpha_k \leftarrow \alpha_k + r, \qquad \beta_k \leftarrow \beta_k + (1 - r)\]
This is called the fractional (or soft) Beta update. It’s not a standard Bayesian update from a Beta likelihood — that would require integrating over an intractable normalizing constant. But it’s well-motivated, it works well in practice, and it’s widely used in production systems. The intuition is solid: you’re scaling the strength of your update by the magnitude of the reward, which is exactly what you’d want.
As a sanity check: if rewards are always exactly 0 or 1, this reduces to the standard Beta-Bernoulli update. And if rewards are i.i.d. uniform on \([0, 1]\), the update averages to \(+0.5\) for both \(\alpha\) and \(\beta\), which is neutral — also sensible.
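The fractional update itself is a one-liner; a sketch (the helper name is ours):

```js
// Fractional (soft) Beta update for a continuous reward r in [0, 1]:
// r counts as partial evidence of success, (1 - r) as partial
// evidence of failure. Total pseudo-count grows by exactly 1 per pull.
function fractionalUpdate(alpha, beta, r) {
  return { alpha: alpha + r, beta: beta + (1 - r) };
}
```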
Same four-arm setup, but now true reward distributions are \(\text{Beta}(2, 8)\), \(\text{Beta}(4, 6)\), \(\text{Beta}(6, 4)\), \(\text{Beta}(8, 2)\), giving true mean rewards of \([0.2, 0.4, 0.6, 0.8]\):
Show code
```js
bcts_sim = {
  const K = 4;
  const true_alphas = [2, 4, 6, 8];
  const true_betas = [8, 6, 4, 2];
  const true_means = true_alphas.map((a, i) => a / (a + true_betas[i]));
  const mu_star = Math.max(...true_means);
  const T = 1000;
  const alpha = [1, 1, 1, 1];
  const beta = [1, 1, 1, 1];
  const history = [{ alphas: [1, 1, 1, 1], betas: [1, 1, 1, 1] }];
  const arms_pulled = [];
  const cumulative_regret = [];
  let cum_reg = 0;
  for (let t = 0; t < T; t++) {
    const samples = alpha.map((a, i) => betaSample(a, beta[i]));
    const arm = samples.reduce((best, v, i, arr) => (v > arr[best] ? i : best), 0);
    // Continuous reward drawn from the arm's true Beta distribution
    const reward = betaSample(true_alphas[arm], true_betas[arm]);
    // Fractional update
    alpha[arm] += reward;
    beta[arm] += 1 - reward;
    cum_reg += mu_star - true_means[arm];
    history.push({ alphas: [...alpha], betas: [...beta] });
    arms_pulled.push(arm);
    cumulative_regret.push(cum_reg);
  }
  return { history, arms_pulled, cumulative_regret, true_means, true_alphas, true_betas, K, T };
}
```
Compare the posterior shapes here to the binary TS case earlier. Notice how they concentrate faster — because each pull provides a continuous amount of information (a real-valued reward) rather than a single bit. The fractional update efficiently uses every piece of information you observe.
Let’s do a direct comparison across all three variants on structurally equivalent problems (same true mean rewards \([0.2, 0.4, 0.6, 0.8]\), different reward types):
All three variants achieve logarithmic regret. The differences in the curves reflect the different amounts of information each reward type provides per pull — continuous rewards carry more signal than binary ones, so the continuous variants converge faster.
Practical Notes
On the known-variance assumption. The Normal-Normal model assumes you know or have a reasonable estimate of \(\sigma^2\). In practice, overestimating it is conservative (slower but safer convergence); underestimating it leads to overconfident posteriors and potentially bad decisions early on. If you need to be principled about unknown variance, use the Normal-Inverse-Gamma conjugate model — same logic, more moving parts.
On the fractional update. It’s not a formal Bayesian update from a Beta likelihood (the normalizing constant doesn’t cooperate). But it has a coherent interpretation, good empirical behavior, and is cheap to compute. For production systems with unit-interval rewards, it’s a reasonable default. If you want to go fully Bayesian on Beta-distributed rewards, you’re looking at variational inference or MCMC — almost certainly overkill unless you have strong reasons.
On initialization. \(\text{Beta}(1, 1)\) is a safe default, but if you have domain knowledge — say, historical conversion rates cluster around 3% — bake it in. An informative prior speeds up early learning and reduces regret at small \(T\). Just don’t make it so strong that it takes too many observations to override.
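One convenient way to encode such domain knowledge is a target mean \(p\) plus an equivalent sample size \(m\); a hypothetical helper:

```js
// Build an informative Beta prior from a historical rate p and an
// "equivalent sample size" m (how many past observations the prior
// is worth). Example: p = 0.03, m = 100 gives roughly Beta(3, 97),
// centered at 0.03 and overridden after a few hundred real pulls.
function informativePrior(p, m) {
  return { alpha: p * m, beta: (1 - p) * m };
}
```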
On the regret bound. Thompson Sampling achieves the Lai-Robbins lower bound asymptotically:
\[\lim_{T \to \infty} \frac{\bar{R}_T}{\log T} = \sum_{k : \Delta_k > 0} \frac{\Delta_k}{\mathrm{KL}(P_k \,\|\, P_{k^*})}\]
where KL is the KL-divergence between arm \(k\)’s distribution and the optimal arm’s distribution. This is the best any consistent algorithm can do — Thompson Sampling is optimal, not just good.
On extensions. Everything here covers the stationary bandit — reward distributions don’t change over time. In practice, conversion rates drift, seasonal effects kick in, the world changes. Contextual bandits (where you observe covariates before choosing an arm) and non-stationary bandits (where you discount old observations) are the natural next steps. Thompson Sampling generalizes to both, but those are different tutorials.