Module 13: Bayesian Bandits
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
Multi-Armed Bandits
• Problem:
  – k bandits (arms) with unknown average reward R(a)
  – Which arm a should we play at each time step?
  – Exploitation/exploration tradeoff
• Common frequentist approaches:
  – ε-greedy
  – Upper confidence bound (UCB)
• Alternative Bayesian approaches:
  – Thompson sampling
  – Gittins indices
Bayesian Learning
• Notation:
  – R^a: random variable for arm a's rewards
  – Pr(R^a; θ): unknown distribution (parameterized by θ)
  – R(a) = E[R^a]: unknown average reward
• Idea:
  – Express uncertainty about θ by a prior Pr(θ)
  – Compute the posterior Pr(θ | r_1^a, r_2^a, ..., r_n^a) based on the samples r_1^a, r_2^a, ..., r_n^a observed for a so far
• Bayes theorem:
  Pr(θ | r_1^a, r_2^a, ..., r_n^a) ∝ Pr(θ) Pr(r_1^a, r_2^a, ..., r_n^a | θ)
Distributional Information
• The posterior over θ allows us to estimate:
  – Distribution over the next reward r^a:
    Pr(r^a | r_1^a, r_2^a, ..., r_n^a) = ∫_θ Pr(r^a; θ) Pr(θ | r_1^a, r_2^a, ..., r_n^a) dθ
  – Distribution over R(a) when θ includes the mean:
    Pr(R(a) | r_1^a, r_2^a, ..., r_n^a) = Pr(θ | r_1^a, r_2^a, ..., r_n^a) if θ = R(a)
• To guide exploration:
  – UCB: Pr(R(a) ≤ bound(r_1^a, r_2^a, ..., r_n^a)) ≥ 1 − δ
  – Bayesian techniques: Pr(R(a) | r_1^a, r_2^a, ..., r_n^a)
Coin Example
• Consider two biased coins C_1 and C_2:
  R(C_1) = Pr(C_1 = head)
  R(C_2) = Pr(C_2 = head)
• Problem:
  – Maximize the # of heads in k flips
  – Which coin should we choose for each flip?
Bernoulli Variables
• R^{C_1}, R^{C_2} are Bernoulli variables with domain {0, 1}
• Bernoulli distributions are parameterized by their mean, i.e.
  Pr(R^{C_1} = 1; θ_1) = θ_1 = R(C_1)
  Pr(R^{C_2} = 1; θ_2) = θ_2 = R(C_2)
Beta Distribution
• Let the prior Pr(θ) be a Beta distribution:
  Beta(θ; α, β) ∝ θ^(α−1) (1 − θ)^(β−1)
• α − 1: # of heads
• β − 1: # of tails
• E[θ] = α / (α + β)
[Figure: Pr(θ) as a function of θ for Beta(θ; 1, 1), Beta(θ; 2, 8), and Beta(θ; 20, 80)]
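For concreteness, here is a minimal Python sketch (not from the slides) that draws samples from the three Beta priors in the figure and checks the mean formula E[θ] = α/(α+β); the sample count and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha, beta in [(1, 1), (2, 8), (20, 80)]:
    theta = rng.beta(alpha, beta, size=10_000)   # draws of theta ~ Beta(alpha, beta)
    print(f"Beta({alpha},{beta}): sample mean = {theta.mean():.3f}, "
          f"E[theta] = {alpha / (alpha + beta):.3f}")
```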
Belief Update
• Prior: Pr(θ) = Beta(θ; α, β) ∝ θ^(α−1) (1 − θ)^(β−1)
• Posterior after a coin flip:
  Pr(θ | head) ∝ Pr(θ) Pr(head | θ) ∝ θ^(α−1) (1 − θ)^(β−1) θ = θ^((α+1)−1) (1 − θ)^(β−1) ∝ Beta(θ; α + 1, β)
  Pr(θ | tail) ∝ Pr(θ) Pr(tail | θ) ∝ θ^(α−1) (1 − θ)^(β−1) (1 − θ) = θ^(α−1) (1 − θ)^((β+1)−1) ∝ Beta(θ; α, β + 1)
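A minimal Python sketch of this conjugate update (the function name update_beta and the example flips are illustrative, not from the slides): observing a head increments α, observing a tail increments β.

```python
def update_beta(alpha, beta, outcome):
    """Posterior Beta parameters after one coin flip (1 = head, 0 = tail)."""
    if outcome == 1:
        return alpha + 1, beta      # Pr(theta | head) = Beta(theta; alpha + 1, beta)
    return alpha, beta + 1          # Pr(theta | tail) = Beta(theta; alpha, beta + 1)

# Starting from the uniform prior Beta(1, 1), the flips head, head, tail give Beta(3, 2).
params = (1, 1)
for flip in [1, 1, 0]:
    params = update_beta(*params, flip)
print(params)   # (3, 2)
```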
Thompson Sampling
• Idea:
  – Sample several potential average rewards for each a:
    R_1(a), ..., R_m(a) ~ Pr(R(a) | r_1^a, ..., r_n^a)
  – Estimate the empirical average: R̂(a) = (1/m) Σ_{i=1}^m R_i(a)
  – Execute argmax_a R̂(a)
• Coin example:
  Pr(R(a) | r_1^a, ..., r_n^a) = Beta(θ_a; α_a, β_a)
  where α_a − 1 = #heads and β_a − 1 = #tails observed for arm a
Thompson Sampling Algorithm (Bernoulli Rewards)

ThompsonSampling(h)
  V ← 0
  For t = 1 to h
    Sample R_1(a), ..., R_m(a) ~ Pr(R(a))  ∀a
    R̂(a) ← (1/m) Σ_{i=1}^m R_i(a)  ∀a
    a* ← argmax_a R̂(a)
    Execute a* and receive r
    V ← V + r
    Update Pr(R(a*)) based on r
  Return V
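Below is a hedged Python sketch of this algorithm for Bernoulli rewards, using Beta(α_a, β_a) posteriors as in the coin example; the arm probabilities true_means, the sample size m, and the seed are illustrative assumptions, not part of the slides.

```python
import numpy as np

def thompson_sampling(true_means, horizon, m=1, seed=0):
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)            # alpha_a - 1 = #heads observed for arm a
    beta = np.ones(k)             # beta_a  - 1 = #tails observed for arm a
    V = 0.0                       # cumulative reward, as in the pseudocode
    for _ in range(horizon):
        # Sample R_1(a), ..., R_m(a) ~ Beta(alpha_a, beta_a) and average them.
        r_hat = rng.beta(alpha, beta, size=(m, k)).mean(axis=0)
        a_star = int(np.argmax(r_hat))
        r = float(rng.random() < true_means[a_star])    # Bernoulli reward
        V += r
        alpha[a_star] += r                              # conjugate belief update
        beta[a_star] += 1.0 - r
    return V

print(thompson_sampling([0.4, 0.6], horizon=1000))      # illustrative arm means
```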
Comparison: Thompson Sampling vs. Greedy Strategy

Thompson Sampling
• Action selection: a* = argmax_a R̂(a)
• Empirical mean: R̂(a) = (1/m) Σ_{i=1}^m R_i(a)
• Samples: R_1(a), ..., R_m(a) ~ Pr(R(a) | r_1^a, ..., r_n^a)
• Some exploration

Greedy Strategy
• Action selection: a* = argmax_a R̂(a)
• Empirical mean: R̂(a) = (1/n) Σ_{i=1}^n r_i^a
• Samples: r_1^a, ..., r_n^a ~ Pr(R^a; θ)
• No exploration
Sample Size
• In Thompson sampling, the amount of data n and the sample size m regulate the amount of exploration
• As n and m increase, R̂(a) becomes less stochastic, which reduces exploration:
  – As n → ∞, Pr(R(a) | r_1^a, ..., r_n^a) becomes more peaked
  – As m → ∞, R̂(a) approaches E[R(a) | r_1^a, ..., r_n^a]
• The stochasticity of R̂(a) ensures that all actions are chosen with some probability
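A quick numeric illustration of the first point (numbers are illustrative, not from the slides): the posterior standard deviation of θ under Beta(1 + #heads, 1 + #tails) shrinks as n grows, so the sampled values of R(a) concentrate around the empirical mean and exploration fades.

```python
from math import sqrt

def beta_std(alpha, beta):
    # Standard deviation of Beta(alpha, beta): sqrt(ab / ((a+b)^2 (a+b+1)))
    return sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))

for n in [10, 100, 10_000]:
    heads = int(0.6 * n)                      # suppose 60% of the n flips were heads
    print(n, round(beta_std(1 + heads, 1 + n - heads), 4))
```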
Continuous Rewards
• So far we assumed that r ∈ {0, 1}
• What about continuous rewards, i.e. r ∈ [0, 1]?
  – NB: rewards in [a, b] can be remapped to [0, 1] by an affine transformation without changing the problem
• Idea:
  – When we receive a reward r, sample r̃ ~ Bernoulli(r) s.t. r̃ ∈ {0, 1}
Thompson Sampling Algorithm (Continuous Rewards)

ThompsonSampling(h)
  V ← 0
  For t = 1 to h
    Sample R_1(a), ..., R_m(a) ~ Pr(R(a))  ∀a
    R̂(a) ← (1/m) Σ_{i=1}^m R_i(a)  ∀a
    a* ← argmax_a R̂(a)
    Execute a* and receive r
    V ← V + r
    Sample r̃ ~ Bernoulli(r)
    Update Pr(R(a*)) based on r̃
  Return V
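A hedged Python sketch of this continuous-reward variant: each observed reward r ∈ [0, 1] is converted to a Bernoulli outcome r̃ ~ Bernoulli(r) before the Beta update. The two uniform reward distributions, sample size m, and seed are illustrative assumptions.

```python
import numpy as np

def thompson_sampling_continuous(reward_fns, horizon, m=1, seed=0):
    """Thompson sampling with rewards in [0, 1], using the Bernoulli trick above."""
    rng = np.random.default_rng(seed)
    k = len(reward_fns)
    alpha, beta = np.ones(k), np.ones(k)
    V = 0.0
    for _ in range(horizon):
        r_hat = rng.beta(alpha, beta, size=(m, k)).mean(axis=0)
        a_star = int(np.argmax(r_hat))
        r = reward_fns[a_star](rng)              # continuous reward in [0, 1]
        V += r
        r_tilde = float(rng.random() < r)        # sample r_tilde ~ Bernoulli(r)
        alpha[a_star] += r_tilde                 # update the belief with r_tilde, not r
        beta[a_star] += 1.0 - r_tilde
    return V

# Two illustrative arms with uniform rewards of different means.
arms = [lambda rng: rng.uniform(0.0, 0.6), lambda rng: rng.uniform(0.2, 1.0)]
print(thompson_sampling_continuous(arms, horizon=1000))
```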
Analysis
• Thompson sampling converges to the best arm
• Theory:
  – Expected cumulative regret: O(log n), where n is the number of plays
  – On par with UCB and ε-greedy
• Practice:
  – Sample size m often set to 1
  – Used by Bing for ad placement
• Graepel, Candela, Borchert, Herbrich (2010). Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine. ICML.