  1. The Multi-Armed Bandit Problem. Nicolò Cesa-Bianchi, Università degli Studi di Milano

  2. The bandit problem [Robbins, 1952]. There are K slot machines, and the rewards $X_{i,1}, X_{i,2}, \dots$ of machine i are i.i.d. $[0,1]$-valued random variables. An allocation policy prescribes which machine $I_t$ to play at time t based on the realization of $X_{I_1,1}, \dots, X_{I_{t-1},t-1}$. The goal is to play as often as possible the machine with the largest reward expectation $\mu^* = \max_{i=1,\dots,K} \mathbb{E}[X_{i,1}]$.
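
A minimal simulation sketch of this setting (my own illustration, not from the slides): K Bernoulli machines, a special case of $[0,1]$-valued i.i.d. rewards, with hypothetical means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical means, unknown to the allocation policy.
means = np.array([0.2, 0.5, 0.7])
K = len(means)

def pull(i):
    """Sample one i.i.d. reward X_{i,t} in {0, 1} from machine i."""
    return float(rng.random() < means[i])
```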

  3. Bandits for targeting content. Choose the best content to display to the next visitor of your website; the goal is to elicit a response from the visitor (e.g., a click on a banner). Content options = slot machines; response rate = reward expectation. Simplifying assumptions: (1) fixed response rates; (2) no visitor profiles.

  4. Regrets, I've had a few (F. Sinatra). Definition (Regret after n plays): $\mu^* n - \mathbb{E}\sum_{t=1}^{n} X_{I_t,t}$. Theorem (Lai and Robbins, 1985): there exist allocation policies satisfying $\mu^* n - \mathbb{E}\sum_{t=1}^{n} X_{I_t,t} \le c\,K \ln n$ uniformly over n. The constant c is roughly equal to $1/\Delta^*$, where $\Delta^* = \mu^* - \max_{j \,:\, \mu_j < \mu^*} \mu_j$.
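
A standard step behind such bounds, not spelled out on the slide: the regret decomposes over machines, so bounding it reduces to bounding how often each suboptimal machine is played.

```latex
\[
\mu^* n - \mathbb{E}\sum_{t=1}^{n} X_{I_t,t}
  = \sum_{i=1}^{K} \Delta_i\, \mathbb{E}\big[T_i(n)\big],
\qquad \Delta_i = \mu^* - \mu_i,
\]
% where T_i(n) is the number of plays of machine i up to time n; logarithmic
% regret means every suboptimal machine is played only O(ln n) times.
```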

  5. A simple policy: UCB1 [Agrawal, 1995]. (1) At the beginning, play each machine once. (2) At each time t > K, play the machine $I_t$ maximizing $\bar X_{i,t} + \sqrt{\frac{2 \ln t}{T_{i,t}}}$ over $i = 1, \dots, K$, where $\bar X_{i,t}$ is the average reward obtained from machine i and $T_{i,t}$ is the number of times machine i has been played.
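
A Python sketch of this index policy, reusing the hypothetical `pull` environment from above; variable names are my own.

```python
import math
import numpy as np

def ucb1(pull, K, n):
    """Play each machine once, then maximize avg reward + sqrt(2 ln t / T_i)."""
    counts = np.zeros(K)   # T_{i,t}: number of plays of machine i
    sums = np.zeros(K)     # cumulative reward of machine i
    for t in range(1, n + 1):
        if t <= K:
            i = t - 1      # initialization: play each machine once
        else:
            index = sums / counts + np.sqrt(2.0 * math.log(t) / counts)
            i = int(np.argmax(index))
        sums[i] += pull(i)
        counts[i] += 1
    return sums.sum()      # total reward collected over n plays
```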

  6. A finite-time regret bound. Theorem (Auer, C-B, and Fischer, 2002): at any time n, the regret of the UCB1 policy is at most $\frac{8K}{\Delta^*} \ln n + 5K$.

  7. Upper confidence bounds. $\sqrt{(2 \ln t)/T_{i,t}}$ is the size (via Chernoff-Hoeffding bounds) of the one-sided confidence interval for the average reward within which $\mu_i$ falls with probability at least $1 - \frac{1}{t}$.
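
Where this width comes from, sketched via Hoeffding's inequality (the constant bookkeeping is mine; the slide only states the conclusion):

```latex
% For T = T_{i,t} i.i.d. [0,1]-valued samples with empirical mean \bar X_{i,t},
% Hoeffding's inequality gives
\[
\Pr\big( \mu_i > \bar X_{i,t} + \varepsilon \big) \le e^{-2 T \varepsilon^2},
\qquad
\varepsilon = \sqrt{\frac{2 \ln t}{T}}
\;\Longrightarrow\;
e^{-2 T \varepsilon^2} = t^{-4} \le \frac{1}{t},
\]
% so \mu_i lies within the one-sided interval with probability at least 1 - 1/t.
```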

  8. The epsilon-greedy policy. Input parameter: a schedule $\varepsilon_1, \varepsilon_2, \dots$ with $0 \le \varepsilon_t \le 1$. At each time t: (1) with probability $1 - \varepsilon_t$, play the machine $I_t$ with the highest average reward; (2) with probability $\varepsilon_t$, play a random machine. Is there a schedule of $\varepsilon_t$ guaranteeing logarithmic regret?
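
A sketch of the policy in the same hypothetical setup as before; `schedule` maps t to $\varepsilon_t$, and playing each machine once first is my initialization choice to avoid empty averages.

```python
import numpy as np

def epsilon_greedy(pull, K, n, schedule, seed=0):
    """With prob. 1 - eps_t exploit the best average; with prob. eps_t explore."""
    rng = np.random.default_rng(seed)
    counts = np.ones(K)    # start by playing each machine once
    sums = np.array([pull(i) for i in range(K)])
    for t in range(K + 1, n + 1):
        if rng.random() < schedule(t):
            i = int(rng.integers(K))           # explore: uniform random machine
        else:
            i = int(np.argmax(sums / counts))  # exploit: highest average reward
        sums[i] += pull(i)
        counts[i] += 1
    return sums.sum()
```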

  9. The tuned epsilon-greedy policy. Theorem (Auer, C-B, and Fischer, 2002): if $\varepsilon_t = 12/(d^2 t)$, where d satisfies $0 < d \le \Delta^*$, then the instantaneous regret of tuned ε-greedy at any time n is at most $O\!\left(\frac{K}{d\,n}\right)$.
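
With the sketch above, the tuned schedule plugs in as follows (d = 0.2 is a hypothetical lower bound on $\Delta^*$; capping at 1 keeps $\varepsilon_t$ a valid probability for small t):

```python
d = 0.2  # assumed: some d with 0 < d <= Delta*
total = epsilon_greedy(pull, K, n=10_000,
                       schedule=lambda t: min(1.0, 12.0 / (d * d * t)))
```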

  10. Practical performance. The UCB1-TUNED policy: $\sqrt{\frac{2 \ln t}{T_{i,t}}}$ is replaced by $\sqrt{\frac{\ln t}{T_{i,t}} \min\left\{\frac{1}{4},\, V_{j,t}\right\}}$, where $V_{j,t}$ is an upper confidence bound for the variance of machine j.
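
A sketch of the tuned index; since the slide does not spell out $V_{j,t}$, I take it to be the empirical variance plus a $\sqrt{2 \ln t / T_{j,t}}$ slack, the choice made by Auer, C-B, and Fischer (2002).

```python
import math
import numpy as np

def tuned_index(sums, sq_sums, counts, t):
    """UCB1-TUNED index for all machines; sq_sums holds cumulative squared rewards."""
    mean = sums / counts
    # Upper confidence bound on the variance of each machine.
    var_ucb = sq_sums / counts - mean**2 + np.sqrt(2.0 * math.log(t) / counts)
    return mean + np.sqrt(math.log(t) / counts * np.minimum(0.25, var_ucb))
```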

  11. Practical performance. Optimally tuned ε-greedy almost always performs best, unless there are several nonoptimal machines with wildly different response rates. The performance of ε-greedy is quite sensitive to bad tuning. UCB1-TUNED performs comparably to a well-tuned ε-greedy and is not very sensitive to large differences in the response rates.

  12. The nonstochastic bandit problem [Auer, C-B, Freund, and Schapire, 2002]. What if probability is removed altogether? Nonstochastic bandits: bounded real rewards $x_{i,1}, x_{i,2}, \dots$ are deterministically assigned to each machine i. There are analogies with repeated play of an unknown game [Baños, 1968; Megiddo, 1980]. Allocation policies are allowed to randomize.

  13. [Figure: a K × n grid of deterministic rewards $x_{i,t}$, one row per machine.] Definition (Regret): $\max_{i=1,\dots,K} \sum_{t=1}^{n} x_{i,t} - \mathbb{E}\sum_{t=1}^{n} x_{I_t,t}$.

  14. Competing against arbitrary policies. [Figure: the same K × n reward grid as on the previous slide.]

  15. Tracking regret. Regret against an arbitrary and unknown policy $(j_1, j_2, \dots, j_n)$: $\sum_{t=1}^{n} x_{j_t,t} - \mathbb{E}\sum_{t=1}^{n} x_{I_t,t}$. Theorem (Auer, C-B, Freund, and Schapire, 2002): for all fixed S, the regret of the weight-sharing policy against any policy $j = (j_1, j_2, \dots, j_n)$ is at most $\sqrt{S\, n K \ln K}$, where S is the number of times j switches to a different machine.
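
The slides do not show the weight-sharing policy itself; the sketch below follows Exp3.S from the cited paper (Auer, C-B, Freund, and Schapire, 2002) as I recall it, with `gamma` (exploration rate) and `alpha` (sharing rate) as tuning parameters left unspecified here.

```python
import math
import numpy as np

def exp3s(rewards, gamma=0.1, alpha=0.01, seed=0):
    """rewards[t, i] = deterministic reward x_{i,t} in [0, 1]."""
    rng = np.random.default_rng(seed)
    n, K = rewards.shape
    w = np.ones(K)
    total = 0.0
    for t in range(n):
        # Mix the exponential weights with uniform exploration.
        p = (1.0 - gamma) * w / w.sum() + gamma / K
        i = int(rng.choice(K, p=p))
        x = rewards[t, i]
        total += x
        xhat = np.zeros(K)
        xhat[i] = x / p[i]   # importance-weighted reward estimate
        # Exponential update plus egalitarian weight sharing, which lets the
        # policy recover quickly after the comparison policy switches machines.
        w = w * np.exp(gamma * xhat / K) + (math.e * alpha / K) * w.sum()
    return total
```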
