Adaptations of the Thompson Sampling Algorithm for Multi-Armed Bandits
Ciara Pike-Burke
Supervisor: David Leslie
24th April 2015
Introduction: Motivation
In many real-life problems there is a trade-off to be made between exploration and exploitation. For example:
◮ In clinical trials.
◮ In portfolio optimization.
◮ In website optimization.
◮ Choosing a restaurant.
Introduction: Multi-Armed Bandits
One of the best ways to model the exploration vs. exploitation trade-off is with multi-armed bandits.
◮ Each of k slot machines (arms) has an unknown reward distribution.
◮ We want to maximize cumulative reward, or equivalently minimize regret.
◮ Regret is the accumulated difference between the expected reward of the optimal arm and that of the arms we played (see the expression below).
Figure: A multi-armed bandit (image from research.microsoft.com)
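For concreteness, the cumulative regret after T plays can be written as follows; the notation (μ_i for the mean reward of arm i, μ* for the optimal mean, a_t for the arm played at time t) is standard but chosen here for illustration rather than taken from the slides.

```latex
R(T) = \sum_{t=1}^{T} \bigl( \mu^{*} - \mu_{a_t} \bigr),
\qquad \mu^{*} = \max_{1 \le i \le k} \mu_i .
```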
Thompson Sampling
For the case of Bernoulli rewards, the Thompson Sampling algorithm is:
1. Initialize with uniform (Beta(1, 1)) priors on the reward probability of each arm.
2. At each time step t:
◮ Sample θ_i from Beta(s_i(t − 1) + 1, f_i(t − 1) + 1) for each arm i.
◮ Play the arm that corresponds to the largest θ_i.
◮ Update s_i(t) and f_i(t) for all i.
Here s_i(t) is the number of successes from playing arm i in the first t plays of the algorithm, and f_i(t) is the number of failures.
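A minimal sketch of this Bernoulli Thompson Sampling loop in Python with NumPy; the function name, the fixed horizon, and the seeded random generator are illustrative choices rather than part of the algorithm description above.

```python
import numpy as np

def thompson_sampling(true_probs, horizon, seed=0):
    """Bernoulli Thompson Sampling with Beta(1, 1) priors on each arm."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    successes = np.zeros(k)   # s_i(t): successes observed on arm i
    failures = np.zeros(k)    # f_i(t): failures observed on arm i
    total_reward = 0
    for t in range(horizon):
        # Sample theta_i ~ Beta(s_i(t-1) + 1, f_i(t-1) + 1) for each arm
        theta = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(theta))            # play the arm with the largest theta_i
        reward = rng.binomial(1, true_probs[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Example: the 2-armed Bernoulli bandit with p = (0.35, 0.8)
print(thompson_sampling([0.35, 0.8], horizon=1000))
```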
Thompson Sampling: Sampling Distributions
Figure: Thompson Sampling for the 2-armed Bernoulli bandit with p = (0.35, 0.8).
Optimistic Bayesian Sampling (OBS): Motivation
◮ If the variance of the better arm's sampling distribution is too large, Thompson Sampling will often end up playing the inferior arm.
◮ May et al. (2012) propose a new method, Optimistic Bayesian Sampling, to combat this.
OBS: Outline
◮ Optimistic Bayesian Sampling is the same as Thompson Sampling except for the decision rule.
◮ At each time step t, play the arm that maximizes q_i = max{θ_i, μ_i}, where θ_i ∼ Beta(s_i(t − 1) + 1, f_i(t − 1) + 1) and μ_i is the mean of this distribution.
Optimistic Bayesian Sampling has been shown, empirically and theoretically, to perform better than Thompson Sampling.
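A minimal sketch of the OBS decision rule, again in Python with NumPy: only the choice of arm changes relative to Thompson Sampling, using q_i = max{θ_i, μ_i}. The function name, horizon argument, and returned values are illustrative choices.

```python
import numpy as np

def optimistic_bayesian_sampling(true_probs, horizon, seed=0):
    """Thompson Sampling with the optimistic decision rule q_i = max{theta_i, mu_i}."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    successes, failures = np.zeros(k), np.zeros(k)
    for t in range(horizon):
        a, b = successes + 1, failures + 1
        theta = rng.beta(a, b)          # posterior sample for each arm
        mu = a / (a + b)                # posterior mean of Beta(a, b)
        q = np.maximum(theta, mu)       # optimistic value q_i = max{theta_i, mu_i}
        arm = int(np.argmax(q))
        reward = rng.binomial(1, true_probs[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
    return successes, failures
```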
Optimistic Bayesian Sampling using Rejection Sampling: Motivation
Figure: Histograms (frequency against prob) of the values used in the decision rule under Thompson Sampling (left) and Optimistic Bayesian Sampling (right).
Optimistic Bayesian Sampling using Rejection Sampling
We can use rejection sampling to obtain samples from the truncated Beta distribution.
Figure: Histogram (frequency against prob) of samples from the truncated Beta distribution obtained by rejection sampling.
◮ The algorithm is the same as Thompson Sampling, but samples θ_i from the truncated Beta(s_i(t − 1) + 1, f_i(t − 1) + 1) distribution (a sketch follows below).
◮ Any proposal distribution can be chosen; the simplest is the Beta distribution itself.
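One possible sketch of this variant in Python. It assumes the truncation point is the posterior mean μ_i, so samples are drawn from Beta(s_i + 1, f_i + 1) restricted to [μ_i, 1], and it uses the untruncated Beta itself as the proposal, accepting only draws above the truncation point. The truncation point, the capped number of proposal attempts, and the fallback value are assumptions made for illustration.

```python
import numpy as np

def truncated_beta_sample(a, b, lower, rng, max_tries=1000):
    """Rejection sampling from Beta(a, b) restricted to [lower, 1],
    using the untruncated Beta(a, b) as the proposal distribution."""
    for _ in range(max_tries):
        theta = rng.beta(a, b)
        if theta >= lower:        # accept proposals that land in the truncation region
            return theta
    return lower                  # assumed fallback if acceptance is extremely rare

def obs_rejection_sampling(true_probs, horizon, seed=0):
    """OBS variant sampling from the Beta posterior truncated below at its mean (assumed)."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    successes, failures = np.zeros(k), np.zeros(k)
    for t in range(horizon):
        a, b = successes + 1, failures + 1
        mu = a / (a + b)          # posterior means, used here as the truncation points
        theta = np.array([truncated_beta_sample(a[i], b[i], mu[i], rng) for i in range(k)])
        arm = int(np.argmax(theta))
        reward = rng.binomial(1, true_probs[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
    return successes, failures
```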
Simulation Study
The three methods (Thompson Sampling, Optimistic Bayesian Sampling, and Optimistic Bayesian Sampling using Rejection Sampling) were tested on four simulations with Bernoulli rewards.
Simulation 1: The 2-armed bandit with randomly generated probabilities p = (0.34, 0.92).
Simulation 2: The 5-armed bandit with p = (0.45, 0.45, 0.45, 0.55, 0.45).
Simulation 3: The 10-armed bandit with p = (0.9, 0.8, ..., 0.8).
Simulation 4: The 20-armed bandit with randomly generated probabilities p = (0.56, 0.09, 0.68, 0.69, 0.19, 0.45, 0.77, 0.29, 0.58, 0.11, 0.91, 0.17, 0.29, 0.95, 0.90, 0.39, 0.38, 0.53, 0.84, 0.03).
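As an illustration of how such a comparison can be run, here is a small self-contained harness that plays a Bernoulli bandit with a supplied value-sampling rule and records cumulative regret; the function names, horizon, and the Thompson Sampling rule used in the example call are illustrative choices, not the exact setup used for the results reported here.

```python
import numpy as np

def run_bandit(true_probs, horizon, sample_values, seed=0):
    """Play a Bernoulli bandit for `horizon` steps and return cumulative regret.
    `sample_values(successes, failures, rng)` returns one value per arm;
    the arm with the largest value is played."""
    rng = np.random.default_rng(seed)
    true_probs = np.asarray(true_probs)
    best = true_probs.max()
    k = len(true_probs)
    successes, failures = np.zeros(k), np.zeros(k)
    regret = np.zeros(horizon)
    for t in range(horizon):
        arm = int(np.argmax(sample_values(successes, failures, rng)))
        reward = rng.binomial(1, true_probs[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret[t] = best - true_probs[arm]
    return regret.cumsum()

# Thompson Sampling decision rule, used here as the example policy
ts_rule = lambda s, f, rng: rng.beta(s + 1, f + 1)

# Simulation 2: the 5-armed bandit with p = (0.45, 0.45, 0.45, 0.55, 0.45)
print(run_bandit([0.45, 0.45, 0.45, 0.55, 0.45], 1000, ts_rule)[-1])
```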
Simulation Study: Results
Conclusion
◮ Both adaptations of the Thompson Sampling algorithm seem to perform better than the original in simulations.
◮ However, Optimistic Bayesian Sampling using Rejection Sampling can be slow.
◮ The theoretical regret bound of OBS is better than that of Thompson Sampling; no regret bound has yet been proved for OBS using Rejection Sampling.
Future Work
◮ More careful consideration of the proposal distribution for Optimistic Bayesian Sampling using Rejection Sampling.
◮ Theoretical results for OBS using Rejection Sampling.
◮ Further simulations with:
  ◮ more arms,
  ◮ more complex reward distributions,
  ◮ contextual bandits,
  ◮ addition or removal of arms mid-way through the algorithm.
References
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285-294.
Agrawal, S. and Goyal, N. (2011). Analysis of Thompson sampling for the multi-armed bandit problem. arXiv preprint arXiv:1111.1797.
May, B. C., Korda, N., Lee, A., and Leslie, D. S. (2012). Optimistic Bayesian sampling in contextual-bandit problems. The Journal of Machine Learning Research, 13(1):2069-2106.
Thank you for listening, any questions?