Thompson Sampling

Thompson Sampling

Thompson Sampling is a probabilistic algorithm used in the field of reinforcement learning for balancing the exploration-exploitation trade-off. It is a Bayesian approach that provides a practical solution to the multi-armed bandit problem, where an agent must choose between multiple options (arms) with uncertain rewards.


Thompson Sampling, also known as Posterior Sampling or Probability Matching, is an algorithm for decision making under uncertainty. It was first proposed by William R. Thompson in 1933. The algorithm uses a Bayesian approach to model the uncertainty of the reward distribution for each option. It then samples from these distributions to decide which option to choose next. This approach allows the algorithm to balance exploration (trying out less certain options) and exploitation (choosing the best-known option) in an optimal way.

How it Works

The Thompson Sampling algorithm works by maintaining a probability distribution over the expected reward of each option. Initially, these distributions are set to a prior, which reflects our initial beliefs about the rewards. As the algorithm interacts with the environment and receives rewards, it updates these distributions using Bayes' theorem.

For each decision, the algorithm samples a random value from each distribution and chooses the option with the highest sampled value. This means that options with higher expected rewards are more likely to be chosen, but options with high uncertainty also have a chance of being selected. This is how Thompson Sampling balances exploration and exploitation.


Thompson Sampling has been successfully applied in various fields, including online advertising, recommendation systems, and clinical trials. In online advertising, for example, Thompson Sampling can be used to decide which ads to show to a user, based on the past click-through rates of the ads. In recommendation systems, it can be used to recommend items that the user is likely to enjoy, based on their past behavior.


Thompson Sampling has several advantages over other reinforcement learning algorithms. First, it is simple to implement and understand. Second, it has strong theoretical guarantees: under certain conditions, it can be shown to achieve near-optimal performance. Third, it is robust to changes in the environment, as it continuously updates its beliefs based on the latest data.


Despite its advantages, Thompson Sampling also has some limitations. It assumes that the reward distributions are independent and identically distributed, which may not be the case in some real-world scenarios. It also requires the ability to sample from the posterior distributions, which can be computationally expensive or even infeasible for complex models.

Further Reading

For a more detailed understanding of Thompson Sampling, the following resources are recommended:

  1. Thompson Sampling for Contextual Bandits with Linear Payoffs
  2. A Tutorial on Thompson Sampling
  3. Thompson Sampling: An Asymptotically Optimal Finite Time Analysis

Thompson Sampling is a powerful tool for decision making under uncertainty. By balancing exploration and exploitation, it allows us to learn effectively from our interactions with the environment.