Thompson Sampling: From Slot Machines to Serious Science

Author

Justin M. Jones, PhD

Published

March 14, 2026


The Multi-Armed Bandit Problem

Picture this: you’re running a study on onboarding interventions. You’ve got five different training modules, and you want to figure out which one does the best job improving 30-day retention. The classic play is a clean randomized experiment — split people equally across conditions, run it for a few months, analyze. Done.

Except there’s a catch. While that experiment is running, you’re constantly assigning people to training modules you already have some reason to suspect are worse. Every person you send to a suboptimal module is a person whose retention you could have improved more. In a lab setting, that cost is a mild annoyance. At scale, with millions of users and real outcomes on the line, it adds up.

This is the multi-armed bandit problem. The name comes from imagining a row of slot machines — a.k.a. “one-armed bandits” — each with a different and unknown payout probability. You want to maximize your winnings, but you don’t know which machine is best. Every pull you spend figuring that out is a pull you’re not spending on the winner.

The problem shows up everywhere in disguise: A/B testing, recommendation systems, clinical trials, ad auctions, hyperparameter search. Anywhere you’re making sequential decisions under uncertainty, you’re running a bandit problem whether you call it that or not.

The Setup

Formally: you have \(K\) arms. At each time step \(t = 1, 2, \ldots, T\), you choose an arm \(A_t \in \{1, \ldots, K\}\) and receive a reward \(R_t \sim P_{A_t}\), where \(P_k\) is the (unknown) reward distribution for arm \(k\). The expected reward for arm \(k\) is \(\mu_k\), and the best arm has expected reward \(\mu^* = \max_k \mu_k\).

The goal: maximize \(\sum_{t=1}^T R_t\). Or equivalently, minimize how much reward you lose by not always pulling the best arm.

The Exploration-Exploitation Tradeoff

Every bandit algorithm has to deal with this tension:

  • Exploitation: Pull the arm you currently believe is best. Collect reward now.
  • Exploration: Try arms you’re uncertain about. Gather information that might improve future decisions.

These goals are in direct conflict. An algorithm that only exploits gets stuck on the first arm that seems decent — if it had bad luck early on, it’ll stay stuck there. An algorithm that only explores never commits and just collects mediocre rewards indefinitely.

Here’s an interactive demo. We have four arms with unknown success rates. A random exploration policy pulls each arm uniformly at random — watch how slowly it converges on the best arm, and how much regret it accumulates in the process:

Even with 500 pulls, a random policy hasn’t reliably identified Arm 4 as the winner — and the regret keeps climbing linearly. That’s the problem we’re trying to solve.
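If you want to reproduce that linear climb in R rather than the widget, here is a minimal sketch. The function name and seed are mine; the arm probabilities are assumed to match the four-arm setup used throughout the post:

```r
# A uniformly random policy: pull a random arm each step and track regret
random_policy <- function(true_probs, T = 500) {
  arms   <- sample(seq_along(true_probs), T, replace = TRUE)
  regret <- cumsum(max(true_probs) - true_probs[arms])
  data.frame(t = 1:T, arm = arms, regret = regret)
}

set.seed(42)
rand_result <- random_policy(c(0.2, 0.4, 0.6, 0.8), T = 500)
tail(rand_result$regret, 1)  # about 0.3 * T: the average gap, paid every step
```

The per-step regret never shrinks because the policy never learns: in expectation it pays the average suboptimality gap, here \((0.6 + 0.4 + 0.2 + 0)/4 = 0.3\), on every single pull.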


Regret and Pseudoregret

Before getting into solutions, let’s be precise about how we measure failure. The standard measure is cumulative regret:

\[R_T = T\mu^* - \sum_{t=1}^{T} \mu_{A_t}\]

In words: the total expected reward you would have collected always pulling the best arm, minus what you actually expected to collect with your policy. Every time you pull a suboptimal arm, you eat the difference \(\Delta_k = \mu^* - \mu_k\), called the suboptimality gap for arm \(k\).

Pseudoregret averages over the randomness in the algorithm itself:

\[\bar{R}_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^T \mu_{A_t}\right]\]

The distinction is subtle. Regret is a realized quantity — it depends on which random samples happened to come up. Pseudoregret smooths over that noise and asks about expected performance. In theoretical work you’ll see proofs about pseudoregret; in practice people often use the terms interchangeably. Worth knowing the difference.

Why Regret Bounds Matter

A random policy has pseudoregret that grows linearly: \(\bar{R}_T = \Theta(T)\). That means it never stops losing at a constant rate. Good bandit algorithms achieve logarithmic regret: \(\bar{R}_T = O(\log T)\). That’s a qualitative difference — the regret curve flattens out as the algorithm locks in on the best arm.

Thompson Sampling achieves logarithmic regret. In fact, it achieves the asymptotically optimal bound, matching the theoretical lower limit proved by Lai and Robbins (1985). That’s not just good engineering — it’s optimal in a rigorous sense.

The comparison chart in the final section will make this concrete. For now, just keep in mind: linear regret means you never really learn. Logarithmic regret means you do.


Basic Thompson Sampling: Binary Rewards

Let’s start with the simplest useful case: binary rewards. Each arm either gives you a 1 (success) or a 0 (failure). Click / no click. Conversion / no conversion. This is the most common setup in industry A/B testing.

For arm \(k\), the true success probability is \(\theta_k \in [0, 1]\), which we don’t know. Thompson Sampling’s approach: maintain a probability distribution over what \(\theta_k\) might be, and use that uncertainty to guide exploration.

The Beta Distribution as a Prior

The natural prior for an unknown probability is the Beta distribution: \(\theta \sim \text{Beta}(\alpha, \beta)\), where \(\alpha\) and \(\beta\) are shape parameters. The mean is \(\alpha / (\alpha + \beta)\) and the total \(\alpha + \beta\) controls how concentrated (certain) the distribution is.

A few reference points:

  • \(\text{Beta}(1, 1)\): flat uniform — no information whatsoever
  • \(\text{Beta}(10, 10)\): centered at 0.5, fairly confident it’s near there
  • \(\text{Beta}(30, 5)\): mean ≈ 0.86, very confident it’s high

The reason Beta is so useful here is conjugacy: if the prior is \(\text{Beta}(\alpha, \beta)\) and you observe \(s\) successes and \(f\) failures, the posterior is exactly \(\text{Beta}(\alpha + s, \beta + f)\). No numerical integration, no MCMC. The math just works.

Play with it:

Notice how a flat \(\text{Beta}(1,1)\) prior becomes increasingly concentrated as you add observations (increase \(\alpha\) and \(\beta\)). That concentration is the algorithm learning.
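The conjugate update is easy to check numerically. A sketch in R, assuming a flat \(\text{Beta}(1,1)\) prior and a made-up run of 20 successes and 5 failures:

```r
# Beta-Bernoulli conjugacy: flat prior, then 20 successes and 5 failures
alpha0 <- 1; beta0 <- 1
s <- 20; f <- 5

post_alpha <- alpha0 + s   # 21
post_beta  <- beta0 + f    # 6
post_mean  <- post_alpha / (post_alpha + post_beta)
post_sd    <- sqrt(post_alpha * post_beta /
                   ((post_alpha + post_beta)^2 * (post_alpha + post_beta + 1)))

round(post_mean, 3)  # 0.778, pulled toward the empirical rate 20/25 = 0.8
round(post_sd, 3)    # 0.079, far tighter than the flat prior's SD of ~0.289
```

No sampling, no integration: the posterior is available in closed form after simple addition, which is exactly why the Beta prior is the default choice here.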

The Algorithm

Thompson Sampling is almost embarrassingly simple:

  1. Initialize each arm \(k\) with prior \(\text{Beta}(\alpha_k, \beta_k) = \text{Beta}(1, 1)\)
  2. At each time step \(t\):
    1. For each arm \(k\), sample \(\tilde{\theta}_k \sim \text{Beta}(\alpha_k, \beta_k)\)
    2. Pull arm \(A_t = \arg\max_k \tilde{\theta}_k\)
    3. Observe reward \(R_t \in \{0, 1\}\)
    4. Update: \(\alpha_{A_t} \mathrel{+}= R_t\), \(\beta_{A_t} \mathrel{+}= (1 - R_t)\)

That’s it. The key move is the sampling step: you’re sampling a plausible value of each arm’s success probability from your current beliefs, then acting as if those samples were the ground truth.

The reason this works: an arm that you’re uncertain about (wide Beta distribution) will occasionally sample high, which drives exploration automatically — no explicit bonus required. Once you’ve pulled an arm enough times, its posterior concentrates and it only wins the sample competition if it’s genuinely good. Exploitation falls out naturally too.

Watch it run on four arms with true success probabilities \([0.2, 0.4, 0.6, 0.8]\). Scrub through time and watch the posteriors sharpen:

A few things worth noticing as you scrub through time: (1) early on, all four posteriors are wide and overlap a lot — the algorithm is still genuinely uncertain; (2) around T=100–200, the posteriors start separating and Arm 4 starts dominating pull counts; (3) the regret curve bends over as the algorithm locks in. That bend is the difference between \(O(\log T)\) and \(O(T)\).

Implementation in R

library(tidyverse)  # provides tibble(), %>%, and ggplot2

thompson_binary <- function(true_probs, T = 1000, alpha0 = 1, beta0 = 1) {
  K <- length(true_probs)
  alpha <- rep(alpha0, K)
  beta  <- rep(beta0,  K)

  arms    <- integer(T)
  rewards <- numeric(T)

  for (t in 1:T) {
    samples   <- rbeta(K, alpha, beta)
    chosen    <- which.max(samples)
    arms[t]   <- chosen
    rewards[t] <- rbinom(1, 1, true_probs[chosen])
    alpha[chosen] <- alpha[chosen] + rewards[t]
    beta[chosen]  <- beta[chosen]  + (1 - rewards[t])
  }

  tibble(
    t       = 1:T,
    arm     = arms,
    reward  = rewards,
    regret  = cumsum(max(true_probs) - true_probs[arms])
  )
}

set.seed(42)
bts_result <- thompson_binary(c(0.2, 0.4, 0.6, 0.8), T = 1000)

bts_result %>%
  ggplot(aes(x = t, y = regret)) +
  geom_line(color = "#4e79a7", linewidth = 1) +
  labs(title = "Binary Thompson Sampling — Cumulative Regret",
       x = "Time step", y = "Cumulative regret") +
  theme_minimal(base_size = 13)


Regret Comparison: Thompson Sampling vs. the Field

Before moving on to continuous rewards, let’s see how Thompson Sampling actually stacks up against other common bandit algorithms. The contenders:

  • Random: pull a uniform random arm every time. The baseline floor.
  • ε-Greedy (ε = 0.1): with probability 0.9, pull the arm with the best current empirical mean; with probability 0.1, pull a random arm. Simple and widely used.
  • UCB1: pull the arm \(k\) that maximizes \(\hat{\mu}_k + \sqrt{2 \ln t / n_k}\), where \(\hat{\mu}_k\) is arm \(k\)’s empirical mean and \(n_k\) is the number of times it has been pulled. Optimistic in the face of uncertainty.
  • Thompson Sampling: what we just covered.

All four use the same arm setup: true probabilities \([0.2, 0.4, 0.6, 0.8]\).

The separation becomes stark by T=500. Random and ε-Greedy have roughly linear regret — they keep paying the same price per step indefinitely. UCB1 and Thompson Sampling bend over logarithmically. Between the two, Thompson tends to do a bit better in practice on binary problems: its randomized exploration adapts to the full posterior rather than to a fixed-form confidence bonus.
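ε-Greedy and UCB1 are short enough to sketch in R for anyone who wants to reproduce the comparison. The helper names (run_bandit, eps_greedy, ucb1) are mine, not standard APIs:

```r
# Generic bandit loop: `choose(t, n, s)` picks an arm from pull counts n
# and success counts s; returns the cumulative regret path
run_bandit <- function(true_probs, T, choose) {
  K <- length(true_probs)
  n <- rep(0, K)    # pulls per arm
  s <- rep(0, K)    # successes per arm
  gap <- max(true_probs) - true_probs
  regret <- numeric(T)
  for (t in 1:T) {
    k <- choose(t, n, s)
    r <- rbinom(1, 1, true_probs[k])
    n[k] <- n[k] + 1
    s[k] <- s[k] + r
    regret[t] <- gap[k]
  }
  cumsum(regret)
}

# eps-greedy: explore uniformly with probability eps, else exploit
eps_greedy <- function(eps = 0.1) function(t, n, s) {
  if (runif(1) < eps || all(n == 0)) sample(length(n), 1)
  else which.max(ifelse(n > 0, s / n, 0))
}

# UCB1: empirical mean plus an optimism bonus that shrinks with pulls
ucb1 <- function(t, n, s) {
  if (any(n == 0)) return(which(n == 0)[1])   # pull each arm once first
  which.max(s / n + sqrt(2 * log(t) / n))
}

set.seed(42)
probs <- c(0.2, 0.4, 0.6, 0.8)
eg_regret  <- run_bandit(probs, 1000, eps_greedy(0.1))
ucb_regret <- run_bandit(probs, 1000, ucb1)
```

Note the structural difference: ε-Greedy pays a fixed exploration tax forever (roughly \(\varepsilon \cdot T \cdot \bar{\Delta}\)), while UCB1’s bonus term decays as each arm accumulates pulls, which is what bends its regret curve over.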


Thompson Sampling: Continuous Rewards

Binary rewards are a special case. A lot of the time rewards are continuous — revenue per click, time-on-site, task completion scores, satisfaction ratings. The question isn’t “which arm succeeds more often” but “which arm produces the highest value on average.”

The math changes but the logic doesn’t. We still want a posterior over each arm’s mean reward \(\mu_k\), and we still sample from that posterior to pick an arm.

The Normal-Normal Model

The standard conjugate setup for continuous rewards: assume rewards have a known (or estimated) variance \(\sigma^2\), and put a Normal prior on the mean:

\[R \mid \mu_k \sim \mathcal{N}(\mu_k, \sigma^2), \qquad \mu_k \sim \mathcal{N}(\mu_0, \tau_0^2)\]

After observing \(n\) rewards from arm \(k\) with sum \(S_k = \sum r_i\), the posterior is:

\[\mu_k \mid \text{data} \sim \mathcal{N}\!\left(\mu_n, \tau_n^2\right)\]

where:

\[\tau_n^2 = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1}, \qquad \mu_n = \tau_n^2\left(\frac{\mu_0}{\tau_0^2} + \frac{S_k}{\sigma^2}\right)\]

As \(n \to \infty\), \(\tau_n^2 \to 0\) and \(\mu_n \to \bar{r}_k\) — your posterior concentrates on the sample mean. The prior washes out and data takes over. That’s the behavior you want.
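A quick numerical check of these update formulas, with made-up numbers (prior and true mean are arbitrary choices for illustration):

```r
# Check the Normal-Normal update: prior N(0, 3^2), known reward SD sigma = 1
mu0 <- 0; tau0_sq <- 9; sigma_sq <- 1

set.seed(1)
r <- rnorm(50, mean = 2.5, sd = 1)   # 50 rewards from an arm with true mean 2.5

tau_n_sq <- 1 / (1 / tau0_sq + length(r) / sigma_sq)
mu_n     <- tau_n_sq * (mu0 / tau0_sq + sum(r) / sigma_sq)

c(posterior_mean = mu_n, sample_mean = mean(r), posterior_sd = sqrt(tau_n_sq))
# posterior mean tracks the sample mean; posterior SD is near 1/sqrt(50)
```

After just 50 observations, the prior contributes almost nothing: the precision-weighted average is dominated by the data term \(S_k / \sigma^2\).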

The algorithm is the same as before, except that each \(\tilde{\mu}_k\) is sampled from arm \(k\)’s current posterior \(\mathcal{N}(\mu_n, \tau_n^2)\) instead of from a Beta.

Four arms with true means \([0, 0.5, 1.5, 2.5]\) and reward \(\text{SD} = 1\):

Notice how the posteriors for Arm 1 (true mean = 0) and Arm 2 (true mean = 0.5) start wide, overlap with the better arms early on, but gradually get pulled away from the upper end. The algorithm explores them early but eventually stops bothering. That’s the right behavior.

Implementation in R

thompson_normal <- function(true_means, T = 1000, sigma = 1,
                             prior_mu = mean(true_means), prior_tau = 3) {
  # NB: defaulting prior_mu to mean(true_means) is a demo convenience only.
  # A deployed system can't peek at the true means; use a fixed weak prior.
  K        <- length(true_means)
  mus      <- rep(prior_mu, K)
  tau_sqs  <- rep(prior_tau^2, K)
  sigma_sq <- sigma^2

  arms    <- integer(T)
  rewards <- numeric(T)

  for (t in 1:T) {
    samples   <- rnorm(K, mus, sqrt(tau_sqs))
    chosen    <- which.max(samples)
    arms[t]   <- chosen
    rewards[t] <- rnorm(1, true_means[chosen], sigma)

    # Bayesian update
    new_tau_sq      <- 1 / (1/tau_sqs[chosen] + 1/sigma_sq)
    mus[chosen]     <- new_tau_sq * (mus[chosen]/tau_sqs[chosen] + rewards[t]/sigma_sq)
    tau_sqs[chosen] <- new_tau_sq
  }

  tibble(
    t      = 1:T,
    arm    = arms,
    reward = rewards,
    regret = cumsum(max(true_means) - true_means[arms])
  )
}

set.seed(42)
nts_result <- thompson_normal(c(0, 0.5, 1.5, 2.5), T = 1000, sigma = 1)

nts_result %>%
  ggplot(aes(x = t, y = regret)) +
  geom_line(color = "#f28e2b", linewidth = 1) +
  labs(title = "Normal Thompson Sampling — Cumulative Regret",
       x = "Time step", y = "Cumulative regret") +
  theme_minimal(base_size = 13)


Thompson Sampling: Continuous Rewards on [0, 1]

Here’s where it gets a little interesting. What if your rewards are continuous but bounded between 0 and 1? Satisfaction scores, normalized engagement rates, click-through values you’ve capped at 1 — anything that lives on the unit interval.

You’re not in binary land (rewards are continuous), and you’re not cleanly in Normal land (Normal distributions produce values outside \([0, 1]\), which creates an awkward mismatch). The natural distribution for continuous data on \([0, 1]\) is the Beta distribution — but now we’re using it to model the rewards themselves, not just the prior over a success probability.

The Fractional Update

Here’s a clean and practical way to handle this. Think of a continuous reward \(r \in [0, 1]\) as a “soft” Bernoulli outcome: it’s evidence of success with magnitude \(r\) and failure with magnitude \((1 - r)\). A reward of \(0.8\) is strong evidence this arm is good. A reward of \(0.1\) is weak evidence it’s useful at all.

Under this framing, the Beta-Bernoulli update generalizes immediately:

\[\alpha_k \mathrel{+}= r_t, \qquad \beta_k \mathrel{+}= (1 - r_t)\]

This is called the fractional (or soft) Beta update. It’s not a standard Bayesian update from a Beta likelihood — that would require integrating over an intractable normalizing constant. But it’s well-motivated, it works well in practice, and it’s widely used in production systems. The intuition is solid: you’re scaling the strength of your update by the magnitude of the reward, which is exactly what you’d want.

As a sanity check: if rewards are always exactly 0 or 1, this reduces to the standard Beta-Bernoulli update. And if rewards are i.i.d. uniform on \([0, 1]\), the update averages to \(+0.5\) for both \(\alpha\) and \(\beta\), which is neutral — also sensible.
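Both sanity checks take a couple of lines of R to verify (the sample sizes are arbitrary):

```r
set.seed(42)

# Uniform rewards: alpha and beta grow at the same rate, so the update is neutral
r_unif <- runif(10000)
a <- 1 + sum(r_unif)
b <- 1 + sum(1 - r_unif)
round(a / (a + b), 3)   # close to 0.5

# Beta(8, 2) rewards (true mean 0.8): the fractional posterior mean tracks it
r_beta <- rbeta(10000, 8, 2)
round((1 + sum(r_beta)) / (2 + length(r_beta)), 3)   # close to 0.8
```

The second check also illustrates why the heuristic works: the implied posterior mean \((\alpha_0 + \sum r_i) / (\alpha_0 + \beta_0 + n)\) converges to the arm’s true mean reward, which is the quantity Thompson Sampling needs to rank arms correctly.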

Same four-arm setup, but now true reward distributions are \(\text{Beta}(2, 8)\), \(\text{Beta}(4, 6)\), \(\text{Beta}(6, 4)\), \(\text{Beta}(8, 2)\), giving true mean rewards of \([0.2, 0.4, 0.6, 0.8]\):

Compare the posterior shapes here to the binary TS case earlier. Notice how they concentrate faster — because each pull provides a continuous amount of information (a real-valued reward) rather than a single bit. The fractional update efficiently uses every piece of information you observe.

Implementation in R

thompson_beta_continuous <- function(true_alphas, true_betas, T = 1000,
                                      alpha0 = 1, beta0 = 1) {
  K          <- length(true_alphas)
  alpha      <- rep(alpha0, K)
  beta_par   <- rep(beta0, K)
  true_means <- true_alphas / (true_alphas + true_betas)

  arms    <- integer(T)
  rewards <- numeric(T)

  for (t in 1:T) {
    samples     <- rbeta(K, alpha, beta_par)
    chosen      <- which.max(samples)
    arms[t]     <- chosen
    rewards[t]  <- rbeta(1, true_alphas[chosen], true_betas[chosen])

    # Fractional update
    alpha[chosen]    <- alpha[chosen]    + rewards[t]
    beta_par[chosen] <- beta_par[chosen] + (1 - rewards[t])
  }

  tibble(
    t      = 1:T,
    arm    = arms,
    reward = rewards,
    regret = cumsum(max(true_means) - true_means[arms])
  )
}

set.seed(42)
bcts_result <- thompson_beta_continuous(
  true_alphas = c(2, 4, 6, 8),
  true_betas  = c(8, 6, 4, 2),
  T = 1000
)

bcts_result %>%
  ggplot(aes(x = t, y = regret)) +
  geom_line(color = "#7b2d8b", linewidth = 1) +
  labs(title = "Beta Continuous Thompson Sampling — Cumulative Regret",
       x = "Time step", y = "Cumulative regret") +
  theme_minimal(base_size = 13)


Putting It All Together

Let’s do a direct comparison across all three variants on analogous four-arm problems. The binary and \([0,1]\) variants share true mean rewards \([0.2, 0.4, 0.6, 0.8]\); the Normal variant keeps its own scale, with true means \([0, 0.5, 1.5, 2.5]\) and reward SD 1:

set.seed(42)

r_binary   <- thompson_binary(c(0.2, 0.4, 0.6, 0.8), T = 1000) %>% mutate(variant = "Binary (Beta-Bernoulli)")
r_normal   <- thompson_normal(c(0, 0.5, 1.5, 2.5),   T = 1000, sigma = 1) %>% mutate(variant = "Continuous (Normal-Normal)")
r_beta_cts <- thompson_beta_continuous(c(2, 4, 6, 8), c(8, 6, 4, 2), T = 1000) %>% mutate(variant = "Continuous [0,1] (Fractional Beta)")

bind_rows(r_binary, r_normal, r_beta_cts) %>%
  ggplot(aes(x = t, y = regret, color = variant)) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = c("#4e79a7", "#f28e2b", "#7b2d8b")) +
  labs(
    title  = "Cumulative Regret — Thompson Sampling Variants",
    x      = "Time step",
    y      = "Cumulative regret",
    color  = "Variant"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom", legend.title = element_blank())

All three variants achieve logarithmic regret. The curves aren’t directly comparable in magnitude — the Normal variant’s rewards live on a larger scale, so its per-step regret is bigger — but the shapes tell the story: continuous rewards carry more information per pull than a single bit, so the continuous variants’ posteriors concentrate faster.


Practical Notes

On the known-variance assumption. The Normal-Normal model assumes you know or have a reasonable estimate of \(\sigma^2\). In practice, overestimating it is conservative (slower but safer convergence); underestimating it leads to overconfident posteriors and potentially bad decisions early on. If you need to be principled about unknown variance, use the Normal-Inverse-Gamma conjugate model — same logic, more moving parts.

On the fractional update. It’s not a formal Bayesian update from a Beta likelihood (the normalizing constant doesn’t cooperate). But it has a coherent interpretation, good empirical behavior, and is cheap to compute. For production systems with unit-interval rewards, it’s a reasonable default. If you want to go fully Bayesian on Beta-distributed rewards, you’re looking at variational inference or MCMC — almost certainly overkill unless you have strong reasons.

On initialization. \(\text{Beta}(1, 1)\) is a safe default, but if you have domain knowledge — say, historical conversion rates cluster around 3% — bake it in. An informative prior speeds up early learning and reduces regret at small \(T\). Just don’t make it so strong that it takes too many observations to override.
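For example, historical rates clustering around 3% can be encoded as pseudo-observations. The prior strength of 100 below is an arbitrary choice you’d tune to how much you trust the history:

```r
# Encode "conversion rates cluster around 3%" as a Beta prior
prior_strength <- 100                         # pseudo-observations; tune to taste
prior_mean     <- 0.03
alpha0 <- prior_mean * prior_strength         # 3 successes' worth of belief
beta0  <- (1 - prior_mean) * prior_strength   # 97 failures' worth

# Posterior mean after observing, say, 10 conversions in 200 new trials:
round((alpha0 + 10) / (prior_strength + 200), 3)  # 0.043, between prior and data
```

The prior acts like 100 already-observed trials: strong enough to stop the algorithm chasing early noise, weak enough that a few hundred real observations can override it.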

On the regret bound. Thompson Sampling achieves the Lai-Robbins lower bound asymptotically:

\[\bar{R}_T \sim \sum_{k: \mu_k < \mu^*} \frac{\Delta_k}{\text{KL}(\mu_k \| \mu^*)} \log T\]

where KL is the KL-divergence between arm \(k\)’s distribution and the optimal arm’s distribution. This is the best any consistent algorithm can do — Thompson Sampling is optimal, not just good.
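For Bernoulli arms, the KL term has a closed form:

\[\text{KL}(p \,\|\, q) = p \log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}\]

So for an arm with \(\mu_k = 0.6\) against \(\mu^* = 0.8\), \(\text{KL} \approx 0.105\), and that arm’s coefficient on \(\log T\) is \(\Delta_k / \text{KL} = 0.2 / 0.105 \approx 1.9\).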

On extensions. Everything here covers the stationary bandit — reward distributions don’t change over time. In practice, conversion rates drift, seasonal effects kick in, the world changes. Contextual bandits (where you observe covariates before choosing an arm) and non-stationary bandits (where you discount old observations) are the natural next steps. Thompson Sampling generalizes to both, but those are different tutorials.


Built with Quarto. Visualizations use Observable Plot.