Thompson Sampling: From Slot Machines to Serious Science
Author
Justin M. Jones, PhD
Published
March 14, 2026
Show code
```js
ARM_COLORS = ["#4e79a7", "#f28e2b", "#59a14f", "#e15759"]

// Lanczos log-gamma approximation
function lgamma(x) {
  if (x < 0.5) return Math.log(Math.PI / Math.sin(Math.PI * x)) - lgamma(1 - x);
  x -= 1;
  const g = 7;
  const c = [
    0.99999999999980993, 676.5203681218851, -1259.1392167224028,
    771.32342877765313, -176.61502916214059, 12.507343278686905,
    -0.13857109526572012, 9.9843695780195716e-6, 1.5056327351493116e-7
  ];
  let a = c[0];
  const t = x + g + 0.5;
  for (let i = 1; i < g + 2; i++) a += c[i] / (x + i);
  return 0.5 * Math.log(2 * Math.PI) + (x + 0.5) * Math.log(t) - t + Math.log(a);
}

function lbeta(a, b) { return lgamma(a) + lgamma(b) - lgamma(a + b); }

function betaPDF(x, a, b) {
  if (x <= 0 || x >= 1) return 0;
  const logp = (a - 1) * Math.log(x) + (b - 1) * Math.log(1 - x) - lbeta(a, b);
  return Math.exp(logp);
}

function normalPDF(x, mu, sigma) {
  return Math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * Math.sqrt(2 * Math.PI));
}

// Box-Muller standard normal sample
function stdNormal() {
  const u1 = Math.random(), u2 = Math.random();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Marsaglia-Tsang Gamma sampler
function gammaSample(shape) {
  if (shape < 1) return gammaSample(shape + 1) * Math.pow(Math.random(), 1 / shape);
  const d = shape - 1 / 3, c = 1 / Math.sqrt(9 * d);
  while (true) {
    let z, v;
    do { z = stdNormal(); v = 1 + c * z; } while (v <= 0);
    v = v * v * v;
    const u = Math.random();
    if (u < 1 - 0.0331 * z * z * z * z) return d * v;
    if (Math.log(u) < 0.5 * z * z + d * (1 - v + Math.log(v))) return d * v;
  }
}

function betaSample(a, b) {
  const x = gammaSample(a), y = gammaSample(b);
  return x / (x + y);
}
```
The Multi-Armed Bandit Problem
Picture this: you’re running a study on onboarding interventions. You’ve got five different training modules, and you want to figure out which one does the best job improving 30-day retention. The classic play is a clean randomized experiment — split people equally across conditions, run it for a few months, analyze. Done.
Except there’s a catch. While that experiment is running, you’re constantly assigning people to training modules you already have some reason to suspect are worse. Every person you send to a suboptimal module is a person whose retention you could have improved more. In a lab setting, that cost is a mild annoyance. At scale, with millions of users and real outcomes on the line, it adds up.
This is the multi-armed bandit problem. The name comes from imagining a row of slot machines — a.k.a. “one-armed bandits” — each with a different and unknown payout probability. You want to maximize your winnings, but you don’t know which machine is best. Every pull you spend figuring that out is a pull you’re not spending on the winner.
The problem shows up everywhere in disguise: A/B testing, recommendation systems, clinical trials, ad auctions, hyperparameter search. Anywhere you’re making sequential decisions under uncertainty, you’re running a bandit problem whether you call it that or not.
The Setup
Formally: you have \(K\) arms. At each time step \(t = 1, 2, \ldots, T\), you choose an arm \(A_t \in \{1, \ldots, K\}\) and receive a reward \(R_t \sim P_{A_t}\), where \(P_k\) is the (unknown) reward distribution for arm \(k\). The expected reward for arm \(k\) is \(\mu_k\), and the best arm has expected reward \(\mu^* = \max_k \mu_k\).
The goal: maximize \(\sum_{t=1}^T R_t\). Or equivalently, minimize how much reward you lose by not always pulling the best arm.
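To make the setup concrete, here's a minimal sketch of the interaction loop for Bernoulli arms in plain JavaScript. The `runBandit` and `randomPolicy` names and the example probabilities are ours for illustration, not part of the notebook's cells:

```js
// A K-armed Bernoulli bandit: each pull of arm k returns 1 with
// probability probs[k], else 0. A policy is any function mapping
// the interaction history (counts, successes, t) to an arm index.
function runBandit(probs, policy, T) {
  const counts = new Array(probs.length).fill(0);
  const successes = new Array(probs.length).fill(0);
  let totalReward = 0;
  for (let t = 0; t < T; t++) {
    const arm = policy(counts, successes, t);
    const reward = Math.random() < probs[arm] ? 1 : 0;
    counts[arm]++;
    successes[arm] += reward;
    totalReward += reward;
  }
  return { totalReward, counts, successes };
}

// Uniform-random policy: ignores the history entirely.
const randomPolicy = (counts) => Math.floor(Math.random() * counts.length);

const out = runBandit([0.2, 0.4, 0.6, 0.8], randomPolicy, 1000);
```

Every algorithm in this post is just a smarter `policy` plugged into this loop.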
The Exploration-Exploitation Tradeoff
Every bandit algorithm has to deal with this tension:
Exploitation: Pull the arm you currently believe is best. Collect reward now.
Exploration: Try arms you’re uncertain about. Gather information that might improve future decisions.
These goals are in direct conflict. An algorithm that only exploits gets stuck on the first arm that seems decent — if it had bad luck early on, it’ll stay stuck there. An algorithm that only explores never commits and just collects mediocre rewards indefinitely.
Here’s an interactive demo. We have four arms with unknown success rates. A random exploration policy pulls each arm uniformly at random — watch how slowly it converges on the best arm, and how much regret it accumulates in the process:
Show code
```js
intro_sim = {
  const K = 4;
  const true_probs = [0.2, 0.4, 0.6, 0.8];
  const T = 500;
  const counts = [0, 0, 0, 0];
  const successes = [0, 0, 0, 0];
  const mu_star = Math.max(...true_probs);
  const pull_history = [];
  const est_history = [];
  const regret_history = [];
  let cum_reg = 0;
  for (let t = 0; t < T; t++) {
    const arm = Math.floor(Math.random() * K);
    const reward = Math.random() < true_probs[arm] ? 1 : 0;
    counts[arm]++;
    successes[arm] += reward;
    cum_reg += mu_star - true_probs[arm];
    pull_history.push(arm);
    est_history.push(counts.map((c, i) => (c > 0 ? successes[i] / c : 0.5)));
    regret_history.push(cum_reg);
  }
  return { pull_history, est_history, regret_history, true_probs, T, K };
}
```
Show code
```js
viewof intro_T = Inputs.range([1, 500], { value: 50, step: 1, label: "Number of pulls (random policy)" })
```
Even with 500 pulls, a random policy hasn’t reliably identified Arm 4 as the winner — and the regret keeps climbing linearly. That’s the problem we’re trying to solve.
Regret and Pseudoregret
Before getting into solutions, let’s be precise about how we measure failure. The standard measure is cumulative regret:
\[R_T = T\mu^* - \sum_{t=1}^{T} \mu_{A_t}\]
In words: the total expected reward you would have collected always pulling the best arm, minus what you actually expected to collect with your policy. Every time you pull a suboptimal arm, you eat the difference \(\Delta_k = \mu^* - \mu_k\), called the suboptimality gap for arm \(k\).
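Computed from a log of pulls, cumulative regret is a one-liner; a sketch (the `cumulativeRegret` helper is ours, not one of the notebook's cells):

```js
// Pseudoregret of a pull sequence: the sum of suboptimality gaps
// Delta_k = mu* - mu_k over every arm that was pulled.
function cumulativeRegret(trueMeans, pulls) {
  const muStar = Math.max(...trueMeans);
  let regret = 0;
  for (const arm of pulls) regret += muStar - trueMeans[arm];
  return regret;
}
```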
Pseudoregret averages over the randomness in the algorithm itself:
\[\bar{R}_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{A_t}\right]\]
The distinction is subtle. Regret is a realized quantity — it depends on which random samples happened to come up. Pseudoregret smooths over that noise and asks about expected performance. In theoretical work you’ll see proofs about pseudoregret; in practice people often use the terms interchangeably. Worth knowing the difference.
Why Regret Bounds Matter
A random policy has pseudoregret that grows linearly: \(\bar{R}_T = \Theta(T)\). That means it never stops losing at a constant rate. Good bandit algorithms achieve logarithmic regret: \(\bar{R}_T = O(\log T)\). That’s a qualitative difference — the regret curve flattens out as the algorithm locks in on the best arm.
Thompson Sampling achieves logarithmic regret. In fact, it achieves the asymptotically optimal bound, matching the theoretical lower limit proved by Lai and Robbins (1985). That’s not just good engineering — it’s optimal in a rigorous sense.
The comparison chart in the final section will make this concrete. For now, just keep in mind: linear regret means you never really learn. Logarithmic regret means you do.
Basic Thompson Sampling: Binary Rewards
Let’s start with the simplest useful case: binary rewards. Each arm either gives you a 1 (success) or a 0 (failure). Click / no click. Conversion / no conversion. This is the most common setup in industry A/B testing.
For arm \(k\), the true success probability is \(\theta_k \in [0, 1]\), which we don’t know. Thompson Sampling’s approach: maintain a probability distribution over what \(\theta_k\) might be, and use that uncertainty to guide exploration.
The Beta Distribution as a Prior
The natural prior for an unknown probability is the Beta distribution: \(\theta \sim \text{Beta}(\alpha, \beta)\), where \(\alpha\) and \(\beta\) are shape parameters. The mean is \(\alpha / (\alpha + \beta)\) and the total \(\alpha + \beta\) controls how concentrated (certain) the distribution is.
A few reference points:
\(\text{Beta}(1, 1)\): flat uniform — no information whatsoever
\(\text{Beta}(10, 10)\): centered at 0.5, fairly confident it’s near there
\(\text{Beta}(30, 5)\): mean ≈ 0.86, very confident it’s high
The reason Beta is so useful here is conjugacy: if the prior is \(\text{Beta}(\alpha, \beta)\) and you observe \(s\) successes and \(f\) failures, the posterior is exactly \(\text{Beta}(\alpha + s, \beta + f)\). No numerical integration, no MCMC. The math just works.
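The conjugate update is literally two additions; a sketch (helper names are ours):

```js
// Conjugate Beta-Bernoulli update: prior Beta(alpha, beta) plus
// s observed successes and f observed failures gives the posterior
// Beta(alpha + s, beta + f) exactly -- no integration required.
function betaPosterior(alpha, beta, s, f) {
  return { alpha: alpha + s, beta: beta + f };
}

// The mean of Beta(alpha, beta) is alpha / (alpha + beta).
function betaMean({ alpha, beta }) {
  return alpha / (alpha + beta);
}
```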
Notice how a flat \(\text{Beta}(1,1)\) prior becomes increasingly concentrated as you add observations (increase \(\alpha\) and \(\beta\)). That concentration is the algorithm learning.
The Algorithm
Thompson Sampling is almost embarrassingly simple:
Initialize each arm \(k\) with prior \(\text{Beta}(\alpha_k, \beta_k) = \text{Beta}(1, 1)\)
At each time step \(t\):
(a) For each arm \(k\), sample \(\tilde{\theta}_k \sim \text{Beta}(\alpha_k, \beta_k)\)
(b) Pull the arm with the largest sample: \(A_t = \arg\max_k \tilde{\theta}_k\)
(c) Observe the reward \(R_t \in \{0, 1\}\) and update the pulled arm’s posterior: \(\alpha_{A_t} \leftarrow \alpha_{A_t} + R_t\), \(\beta_{A_t} \leftarrow \beta_{A_t} + (1 - R_t)\)
That’s it. The key move is step (a): you’re sampling a plausible value of each arm’s success probability from your current beliefs, then acting as if those samples were the ground truth.
The reason this works: an arm that you’re uncertain about (wide Beta distribution) will occasionally sample high, which drives exploration automatically — no explicit bonus required. Once you’ve pulled an arm enough times, its posterior concentrates and it only wins the sample competition if it’s genuinely good. Exploitation falls out naturally too.
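Putting the steps together, here is a self-contained sketch of one Thompson Sampling step for Bernoulli rewards. It repeats the notebook's Marsaglia-Tsang sampler so the block runs on its own; `thompsonStep` and `drawReward` are illustrative names:

```js
// Beta sampler via two Gamma draws (Marsaglia-Tsang method).
function stdNormal() {
  const u1 = Math.random(), u2 = Math.random();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
function gammaSample(shape) {
  if (shape < 1) return gammaSample(shape + 1) * Math.pow(Math.random(), 1 / shape);
  const d = shape - 1 / 3, c = 1 / Math.sqrt(9 * d);
  for (;;) {
    let z, v;
    do { z = stdNormal(); v = 1 + c * z; } while (v <= 0);
    v = v * v * v;
    const u = Math.random();
    if (u < 1 - 0.0331 * z ** 4) return d * v;
    if (Math.log(u) < 0.5 * z * z + d * (1 - v + Math.log(v))) return d * v;
  }
}
function betaSample(a, b) {
  const x = gammaSample(a), y = gammaSample(b);
  return x / (x + y);
}

// One Thompson Sampling step: sample a plausible theta for each arm,
// pull the argmax, then apply the conjugate Beta update to that arm.
// alpha and beta are mutated in place; drawReward(arm) returns 0 or 1.
function thompsonStep(alpha, beta, drawReward) {
  const samples = alpha.map((a, k) => betaSample(a, beta[k]));
  const arm = samples.indexOf(Math.max(...samples));
  const reward = drawReward(arm);
  alpha[arm] += reward;
  beta[arm] += 1 - reward;
  return arm;
}
```

Running `thompsonStep` in a loop with `drawReward = k => Math.random() < probs[k] ? 1 : 0` reproduces the behavior in the demo below: uncertain arms occasionally sample high and get explored; confident bad arms stop winning the sample competition.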
Watch it run on four arms with true success probabilities \([0.2, 0.4, 0.6, 0.8]\). Scrub through time and watch the posteriors sharpen:
A few things worth noticing as you scrub through time: (1) early on, all four posteriors are wide and overlap a lot — the algorithm is still genuinely uncertain; (2) around T=100–200, the posteriors start separating and Arm 4 starts dominating pull counts; (3) the regret curve bends over as the algorithm locks in. That bend is the difference between \(O(\log T)\) and \(O(T)\).
Regret Comparison: Thompson Sampling vs. the Field
Before moving on to continuous rewards, let’s see how Thompson Sampling actually stacks up against other common bandit algorithms. The contenders:
Random: pull a uniform random arm every time. The baseline floor.
ε-Greedy (ε = 0.1): with probability 0.9, pull the arm with the best current empirical mean; with probability 0.1, pull a random arm. Simple and widely used.
UCB1: pull arm \(k\) if it maximizes \(\hat{\mu}_k + \sqrt{2 \ln t / n_k}\), where \(n_k\) is the number of times arm \(k\) has been pulled. Optimistic in the face of uncertainty.
Thompson Sampling: what we just covered.
All four use the same arm setup: true probabilities \([0.2, 0.4, 0.6, 0.8]\).
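For reference, the UCB1 selection rule from the list above can be sketched as follows (helper names are ours; each arm is pulled once before the index applies):

```js
// UCB1 index for an arm: empirical mean plus an optimism bonus
// sqrt(2 ln t / n_k) that shrinks as the arm accumulates pulls.
function ucb1Index(meanK, nK, t) {
  return meanK + Math.sqrt((2 * Math.log(t)) / nK);
}

// Pick the arm maximizing the index. sums[k] is total reward from
// arm k, counts[k] its number of pulls; untried arms go first.
function ucb1Pick(sums, counts, t) {
  const untried = counts.indexOf(0);
  if (untried !== -1) return untried;
  const indices = counts.map((n, k) => ucb1Index(sums[k] / n, n, t));
  return indices.indexOf(Math.max(...indices));
}
```

Note how the bonus makes the algorithm "optimistic in the face of uncertainty": with equal empirical means, the less-pulled arm always wins.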
Show code
```js
reg_sim = {
  const true_probs = [0.2, 0.4, 0.6, 0.8];
  const K = 4;
  const T = 1000;
  const mu_star = Math.max(...true_probs);

  function runRandom() {
    let cum = 0;
    return Array.from({ length: T }, () => {
      const arm = Math.floor(Math.random() * K);
      cum += mu_star - true_probs[arm];
      return cum;
    });
  }

  function runEpsilonGreedy(eps = 0.1) {
    const alpha = new Array(K).fill(1), beta = new Array(K).fill(1);
    let cum = 0;
    return Array.from({ length: T }, () => {
      const arm = Math.random() < eps
        ? Math.floor(Math.random() * K)
        : alpha
            .map((a, i) => a / (a + beta[i]))
            .reduce((best, v, i, arr) => (v > arr[best] ? i : best), 0);
      const reward = Math.random() < true_probs[arm] ? 1 : 0;
      cum += mu_star - true_probs[arm];
      alpha[arm] += reward;
      beta[arm] += 1 - reward;
      return cum;
    });
  }

  function runUCB1() {
    const n = new Array(K).fill(0);
    const sum = new Array(K).fill(0);
    let cum = 0;
    return Array.from({ length: T }, (_, t) => {
      let arm;
      if (t < K) {
        arm = t;
      } else {
        arm = n
          .map((ni, i) => sum[i] / ni + Math.sqrt((2 * Math.log(t + 1)) / ni))
          .reduce((best, v, i, arr) => (v > arr[best] ? i : best), 0);
      }
      const reward = Math.random() < true_probs[arm] ? 1 : 0;
      cum += mu_star - true_probs[arm];
      n[arm]++;
      sum[arm] += reward;
      return cum;
    });
  }

  function runThompson() {
    const alpha = new Array(K).fill(1), beta = new Array(K).fill(1);
    let cum = 0;
    return Array.from({ length: T }, () => {
      const samples = alpha.map((a, i) => betaSample(a, beta[i]));
      const arm = samples.reduce((best, v, i, arr) => (v > arr[best] ? i : best), 0);
      const reward = Math.random() < true_probs[arm] ? 1 : 0;
      cum += mu_star - true_probs[arm];
      alpha[arm] += reward;
      beta[arm] += 1 - reward;
      return cum;
    });
  }

  return { random: runRandom(), egreedy: runEpsilonGreedy(), ucb1: runUCB1(), thompson: runThompson(), T };
}
```
The separation becomes stark by T=500. Random and ε-Greedy have roughly linear regret — they keep paying the same price per step indefinitely. UCB1 and Thompson Sampling bend over logarithmically. Between the two, Thompson tends to come out slightly ahead on binary problems because its posterior sampling adapts exploration to the observed data, whereas UCB1’s worst-case confidence bonus tends to over-explore.
Thompson Sampling: Continuous Rewards
Binary rewards are a special case. A lot of the time rewards are continuous — revenue per click, time-on-site, task completion scores, satisfaction ratings. The question isn’t “which arm succeeds more often” but “which arm produces the highest value on average.”
The math changes but the logic doesn’t. We still want a posterior over each arm’s mean reward \(\mu_k\), and we still sample from that posterior to pick an arm.
The Normal-Normal Model
The standard conjugate setup for continuous rewards: assume rewards have a known (or estimated) variance \(\sigma^2\), and put a Normal prior on the mean, \(\mu_k \sim \mathcal{N}(\mu_0, \tau_0^2)\). After \(n\) pulls of arm \(k\) with sample mean \(\bar{r}_k\), the posterior is \(\mathcal{N}(\mu_n, \tau_n^2)\) with
\[\tau_n^2 = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1}, \qquad \mu_n = \tau_n^2 \left(\frac{\mu_0}{\tau_0^2} + \frac{n \bar{r}_k}{\sigma^2}\right)\]
As \(n \to \infty\), \(\tau_n^2 \to 0\) and \(\mu_n \to \bar{r}_k\) — your posterior concentrates on the sample mean. The prior washes out and data takes over. That’s the behavior you want.
The algorithm is otherwise unchanged: in step (a) you sample \(\tilde{\mu}_k \sim \mathcal{N}(\mu_n, \tau_n^2)\) from each arm’s Normal posterior instead of from a Beta.
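The conjugate Normal update can be sketched as follows (assuming known reward variance; the `normalPosterior` name is ours):

```js
// Normal-Normal posterior for an arm's mean with known reward
// variance sigmaSq: prior N(mu0, tau0sq), then n observations
// with sample mean rbar. Precisions (inverse variances) add.
function normalPosterior(mu0, tau0sq, sigmaSq, n, rbar) {
  const precision = 1 / tau0sq + n / sigmaSq; // posterior precision
  const tauNsq = 1 / precision;               // posterior variance
  const muN = tauNsq * (mu0 / tau0sq + (n * rbar) / sigmaSq);
  return { muN, tauNsq };
}
```

With `n = 0` this returns the prior untouched; as `n` grows, `muN` is pulled toward the sample mean and `tauNsq` shrinks toward zero, which is exactly the washing-out behavior described above.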
Four arms with true means \([0, 0.5, 1.5, 2.5]\) and reward \(\text{SD} = 1\):
Notice how the posteriors for Arm 1 (true mean = 0) and Arm 2 (true mean = 0.5) start wide, overlap with the better arms early on, but gradually get pulled away from the upper end. The algorithm explores them early but eventually stops bothering. That’s the right behavior.
Here’s where it gets a little interesting. What if your rewards are continuous but bounded between 0 and 1? Satisfaction scores, normalized engagement rates, click-through values you’ve capped at 1 — anything that lives on the unit interval.
You’re not in binary land (rewards are continuous), and you’re not cleanly in Normal land (Normal distributions produce values outside \([0, 1]\), which creates an awkward mismatch). The natural distribution for continuous data on \([0, 1]\) is the Beta distribution — but now we’re using it to model the rewards themselves, not just the prior over a success probability.
The Fractional Update
Here’s a clean and practical way to handle this. Think of a continuous reward \(r \in [0, 1]\) as a “soft” Bernoulli outcome: it’s evidence of success with magnitude \(r\) and failure with magnitude \((1 - r)\). A reward of \(0.8\) is strong evidence this arm is good. A reward of \(0.1\) is weak evidence it’s useful at all.
Under this framing, the Beta-Bernoulli update generalizes immediately:
\[\alpha_k \leftarrow \alpha_k + r, \qquad \beta_k \leftarrow \beta_k + (1 - r)\]
This is called the fractional (or soft) Beta update. It’s not a standard Bayesian update from a Beta likelihood — that would require integrating over an intractable normalizing constant. But it’s well-motivated, it works well in practice, and it’s widely used in production systems. The intuition is solid: you’re scaling the strength of your update by the magnitude of the reward, which is exactly what you’d want.
As a sanity check: if rewards are always exactly 0 or 1, this reduces to the standard Beta-Bernoulli update. And if rewards are i.i.d. uniform on \([0, 1]\), the update averages to \(+0.5\) for both \(\alpha\) and \(\beta\), which is neutral — also sensible.
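The fractional update itself is a one-liner; a sketch (the helper name is ours):

```js
// Fractional (soft) Beta update for a continuous reward r in [0, 1]:
// r counts as partial evidence of success, (1 - r) as partial
// evidence of failure. Total pseudo-count grows by exactly 1 per pull.
function fractionalUpdate(alpha, beta, r) {
  return { alpha: alpha + r, beta: beta + (1 - r) };
}
```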
Same four-arm setup, but now true reward distributions are \(\text{Beta}(2, 8)\), \(\text{Beta}(4, 6)\), \(\text{Beta}(6, 4)\), \(\text{Beta}(8, 2)\), giving true mean rewards of \([0.2, 0.4, 0.6, 0.8]\):
Show code
```js
bcts_sim = {
  const K = 4;
  const true_alphas = [2, 4, 6, 8];
  const true_betas = [8, 6, 4, 2];
  const true_means = true_alphas.map((a, i) => a / (a + true_betas[i]));
  const mu_star = Math.max(...true_means);
  const T = 1000;
  const alpha = [1, 1, 1, 1];
  const beta = [1, 1, 1, 1];
  const history = [{ alphas: [1, 1, 1, 1], betas: [1, 1, 1, 1] }];
  const arms_pulled = [];
  const cumulative_regret = [];
  let cum_reg = 0;
  for (let t = 0; t < T; t++) {
    const samples = alpha.map((a, i) => betaSample(a, beta[i]));
    const arm = samples.reduce((best, v, i, arr) => (v > arr[best] ? i : best), 0);
    // Continuous reward drawn from the arm's true Beta distribution
    const reward = betaSample(true_alphas[arm], true_betas[arm]);
    // Fractional update
    alpha[arm] += reward;
    beta[arm] += 1 - reward;
    cum_reg += mu_star - true_means[arm];
    history.push({ alphas: [...alpha], betas: [...beta] });
    arms_pulled.push(arm);
    cumulative_regret.push(cum_reg);
  }
  return { history, arms_pulled, cumulative_regret, true_means, true_alphas, true_betas, K, T };
}
```
Compare the posterior shapes here to the binary TS case earlier. Notice how they concentrate faster — because each pull provides a continuous amount of information (a real-valued reward) rather than a single bit. The fractional update efficiently uses every piece of information you observe.
Let’s do a direct comparison across all three variants on structurally equivalent problems (same true mean rewards \([0.2, 0.4, 0.6, 0.8]\), different reward types):
All three variants achieve logarithmic regret. The differences in the curves reflect the different amounts of information each reward type provides per pull — continuous rewards carry more signal than binary ones, so the continuous variants converge faster.
Practical Notes
On the known-variance assumption. The Normal-Normal model assumes you know or have a reasonable estimate of \(\sigma^2\). In practice, overestimating it is conservative (slower but safer convergence); underestimating it leads to overconfident posteriors and potentially bad decisions early on. If you need to be principled about unknown variance, use the Normal-Inverse-Gamma conjugate model — same logic, more moving parts.
On the fractional update. It’s not a formal Bayesian update from a Beta likelihood (the normalizing constant doesn’t cooperate). But it has a coherent interpretation, good empirical behavior, and is cheap to compute. For production systems with unit-interval rewards, it’s a reasonable default. If you want to go fully Bayesian on Beta-distributed rewards, you’re looking at variational inference or MCMC — almost certainly overkill unless you have strong reasons.
On initialization. \(\text{Beta}(1, 1)\) is a safe default, but if you have domain knowledge — say, historical conversion rates cluster around 3% — bake it in. An informative prior speeds up early learning and reduces regret at small \(T\). Just don’t make it so strong that it takes too many observations to override.
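One convenient way to encode such domain knowledge is a target mean \(p\) plus an equivalent sample size \(m\); a hypothetical helper:

```js
// Build an informative Beta prior from a historical rate p and an
// "equivalent sample size" m (how many past observations the prior
// is worth). Example: p = 0.03, m = 100 gives roughly Beta(3, 97),
// centered at 0.03 and overridden after a few hundred real pulls.
function informativePrior(p, m) {
  return { alpha: p * m, beta: (1 - p) * m };
}
```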
On the regret bound. Thompson Sampling achieves the Lai-Robbins lower bound asymptotically:
\[\lim_{T \to \infty} \frac{\bar{R}_T}{\log T} = \sum_{k : \Delta_k > 0} \frac{\Delta_k}{\mathrm{KL}(P_k \,\|\, P_{k^*})}\]
where KL is the KL-divergence between arm \(k\)’s distribution and the optimal arm’s distribution. This is the best any consistent algorithm can do — Thompson Sampling is optimal, not just good.
On extensions. Everything here covers the stationary bandit — reward distributions don’t change over time. In practice, conversion rates drift, seasonal effects kick in, the world changes. Contextual bandits (where you observe covariates before choosing an arm) and non-stationary bandits (where you discount old observations) are the natural next steps. Thompson Sampling generalizes to both, but those are different tutorials.