The Multi-Arm Bandit Framework
A. LAZARIC (SequeL Team @ INRIA-Lille)
ENS Cachan - Master 2 MVA
MVA-RL Course
In This Lecture
Question: which route should we take?
Problem: each day we obtain only limited feedback: the traveling time of the chosen route.
Observation: if we do not repeatedly try different options, we cannot learn.
Solution: trade off between optimization and learning.
Outline
◮ Mathematical Tools
◮ The General Multi-arm Bandit Problem
◮ The Stochastic Multi-arm Bandit Problem
◮ The Non-Stochastic Multi-arm Bandit Problem
◮ Connections to Game Theory
◮ Other Stochastic Multi-arm Bandit Problems
Mathematical Tools

Concentration Inequalities

Proposition (Chernoff-Hoeffding Inequality)
Let $X_i \in [a_i, b_i]$, $i = 1, \dots, n$, be independent random variables with mean $\mu_i = \mathbb{E}[X_i]$. Then
$$\mathbb{P}\left( \Big| \sum_{i=1}^{n} (X_i - \mu_i) \Big| \geq \epsilon \right) \leq 2 \exp\left( - \frac{2 \epsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right).$$
Proof.
$$\mathbb{P}\left( \sum_{i=1}^{n} (X_i - \mu_i) \geq \epsilon \right) = \mathbb{P}\left( e^{s \sum_{i=1}^{n} (X_i - \mu_i)} \geq e^{s \epsilon} \right)$$
$$\leq e^{-s\epsilon}\, \mathbb{E}\left[ e^{s \sum_{i=1}^{n} (X_i - \mu_i)} \right] \qquad \text{(Markov inequality)}$$
$$= e^{-s\epsilon} \prod_{i=1}^{n} \mathbb{E}\left[ e^{s (X_i - \mu_i)} \right] \qquad \text{(independent random variables)}$$
$$\leq e^{-s\epsilon} \prod_{i=1}^{n} e^{s^2 (b_i - a_i)^2 / 8} \qquad \text{(Hoeffding's lemma)}$$
$$= e^{-s\epsilon + s^2 \sum_{i=1}^{n} (b_i - a_i)^2 / 8}.$$
If we choose $s = 4\epsilon / \sum_{i=1}^{n} (b_i - a_i)^2$, the result follows. Similar arguments hold for $\mathbb{P}\left( \sum_{i=1}^{n} (X_i - \mu_i) \leq -\epsilon \right)$.
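To make the last step explicit, substituting the chosen value of $s$ into the exponent gives
$$- s\epsilon + \frac{s^2}{8} \sum_{i=1}^{n} (b_i - a_i)^2 \;\Big|_{\, s = 4\epsilon / \sum_{i} (b_i - a_i)^2} = - \frac{4 \epsilon^2}{\sum_{i} (b_i - a_i)^2} + \frac{2 \epsilon^2}{\sum_{i} (b_i - a_i)^2} = - \frac{2 \epsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2},$$
which is exactly the exponent in the statement of the proposition.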
Finite sample guarantee: for $n$ i.i.d. random variables $X_t \in [a, b]$, the same inequality can be read in three equivalent ways.

Fixed accuracy $\epsilon$:
$$\mathbb{P}\left( \underbrace{\Big| \frac{1}{n} \sum_{t=1}^{n} X_t - \mathbb{E}[X_1] \Big|}_{\text{deviation}} > \underbrace{\epsilon}_{\text{accuracy}} \right) \leq \underbrace{2 \exp\left( - \frac{2 n \epsilon^2}{(b-a)^2} \right)}_{\text{confidence}}$$

Fixed confidence $\delta$:
$$\mathbb{P}\left( \Big| \frac{1}{n} \sum_{t=1}^{n} X_t - \mathbb{E}[X_1] \Big| > (b-a) \sqrt{\frac{\log(2/\delta)}{2n}} \right) \leq \delta$$

Fixed accuracy and confidence:
$$\mathbb{P}\left( \Big| \frac{1}{n} \sum_{t=1}^{n} X_t - \mathbb{E}[X_1] \Big| > \epsilon \right) \leq \delta \quad \text{if} \quad n \geq \frac{(b-a)^2 \log(2/\delta)}{2 \epsilon^2}.$$
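A small numerical illustration of these three readings of the bound; the helper names below are ours, and rewards are assumed to lie in an interval of width $b - a = 1$ by default.

```python
import math

def hoeffding_confidence(n, epsilon, b_minus_a=1.0):
    # P(|empirical mean - E[X]| > epsilon) <= 2 exp(-2 n epsilon^2 / (b-a)^2)
    return 2 * math.exp(-2 * n * epsilon**2 / b_minus_a**2)

def hoeffding_accuracy(n, delta, b_minus_a=1.0):
    # Deviation epsilon that holds with probability at least 1 - delta
    return b_minus_a * math.sqrt(math.log(2 / delta) / (2 * n))

def hoeffding_sample_size(epsilon, delta, b_minus_a=1.0):
    # Smallest n such that the deviation is <= epsilon with probability >= 1 - delta
    return math.ceil(b_minus_a**2 * math.log(2 / delta) / (2 * epsilon**2))

print(hoeffding_confidence(n=1000, epsilon=0.05))        # ~0.0135
print(hoeffding_accuracy(n=1000, delta=0.05))            # ~0.043
print(hoeffding_sample_size(epsilon=0.05, delta=0.05))   # 738
```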
The General Multi-arm Bandit Problem
The Multi-armed Bandit Game
The learner has $i = 1, \dots, N$ arms (options, experts, ...).
At each round $t = 1, \dots, n$:
◮ At the same time
  ◮ the environment chooses a vector of rewards $\{X_{i,t}\}_{i=1}^{N}$
  ◮ the learner chooses an arm $I_t$
◮ The learner receives the reward $X_{I_t, t}$
◮ The environment does not reveal the rewards of the other arms
The Multi-armed Bandit Game (cont'd)
The regret of an algorithm $A$:
$$R_n(A) = \max_{i=1,\dots,N} \mathbb{E}\left[ \sum_{t=1}^{n} X_{i,t} \right] - \mathbb{E}\left[ \sum_{t=1}^{n} X_{I_t,t} \right]$$
The expectation summarizes any possible source of randomness (either in $X$ or in the algorithm).
The Exploration-Exploitation Dilemma
Problem 1: the environment does not reveal the rewards of the arms not pulled by the learner
⇒ the learner should gain information by repeatedly pulling all the arms ⇒ exploration
Problem 2: whenever the learner pulls a bad arm, it suffers some regret
⇒ the learner should reduce the regret by repeatedly pulling the best arm ⇒ exploitation
Challenge: the learner should solve two opposite problems, i.e., the exploration-exploitation dilemma!
The Multi-armed Bandit Game (cont'd)
Examples:
◮ Packet routing
◮ Clinical trials
◮ Web advertising
◮ Computer games
◮ Resource mining
◮ ...
The Stochastic Multi-arm Bandit Problem
The Stochastic Multi-armed Bandit Problem
Definition
The environment is stochastic:
◮ each arm $i$ has a distribution $\nu_i$ bounded in $[0, 1]$ and characterized by an expected value $\mu_i$
◮ the rewards are i.i.d., $X_{i,t} \sim \nu_i$
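A minimal sketch of such an environment, using Bernoulli arms as a concrete instance of distributions bounded in $[0, 1]$; the class and method names are ours.

```python
import random

class BernoulliBandit:
    """Stochastic bandit: arm i returns 1 with probability mu_i, else 0."""
    def __init__(self, means):
        self.means = means            # expected values mu_i, one per arm
        self.n_arms = len(means)

    def pull(self, i):
        # i.i.d. reward X_{i,t} ~ nu_i, bounded in [0, 1]
        return 1.0 if random.random() < self.means[i] else 0.0

bandit = BernoulliBandit([0.3, 0.5, 0.7])   # arm 2 is optimal
print(bandit.pull(2))
```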
The Stochastic Multi-armed Bandit Problem (cont'd)
Notation
◮ Number of times arm $i$ has been pulled after $n$ rounds:
$$T_{i,n} = \sum_{t=1}^{n} \mathbb{I}\{I_t = i\}$$
◮ Regret:
$$R_n(A) = \max_{i=1,\dots,N} \mathbb{E}\left[ \sum_{t=1}^{n} X_{i,t} \right] - \mathbb{E}\left[ \sum_{t=1}^{n} X_{I_t,t} \right]
= \max_{i=1,\dots,N} (n \mu_i) - \mathbb{E}\left[ \sum_{t=1}^{n} X_{I_t,t} \right]
= \max_{i=1,\dots,N} (n \mu_i) - \sum_{i=1}^{N} \mathbb{E}[T_{i,n}]\, \mu_i
= n \mu_{i^*} - \sum_{i=1}^{N} \mathbb{E}[T_{i,n}]\, \mu_i$$
The Stochastic Multi-armed Bandit Problem (cont'd)
Since $\sum_{i=1}^{N} T_{i,n} = n$, writing $\Delta_i = \mu_{i^*} - \mu_i$ for the gap of arm $i$,
$$R_n(A) = \sum_{i \neq i^*} \mathbb{E}[T_{i,n}]\, \Delta_i$$
⇒ we only need to study the expected number of pulls of the suboptimal arms.
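A small numerical check of this decomposition; the arm means and pull counts below are made up for the example.

```python
mus = [0.3, 0.5, 0.7]                 # expected rewards, arm 2 is optimal
expected_pulls = [50, 150, 800]       # E[T_{i,n}] after n = 1000 rounds
n = sum(expected_pulls)
mu_star = max(mus)

# Direct definition: n * mu_star - sum_i E[T_{i,n}] * mu_i
regret_direct = n * mu_star - sum(t * mu for t, mu in zip(expected_pulls, mus))

# Gap decomposition: sum over suboptimal arms of E[T_{i,n}] * Delta_i
gaps = [mu_star - mu for mu in mus]
regret_gaps = sum(t * d for t, d in zip(expected_pulls, gaps))

print(regret_direct, regret_gaps)     # both equal 50*0.4 + 150*0.2 = 50.0
```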
The Stochastic Multi-armed Bandit Problem (cont'd)
Optimism in the Face of Uncertainty Learning (OFUL)
Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm.
Why it works:
◮ If the best possible world is correct ⇒ no regret
◮ If the best possible world is wrong ⇒ the reduction in the uncertainty is maximized
The Stochastic Multi-armed Bandit Problem (cont'd)
[Figure: histograms of the observed rewards of four arms, estimated from pulls = 100, 200, 50, and 20 respectively.]
[Figure: the same four reward histograms, illustrating optimism in the face of uncertainty.]
The Upper-Confidence Bound (UCB) Algorithm
The idea: [Figure: mean-reward estimates with confidence intervals for arms 1-4, pulled 10, 73, 3, and 23 times respectively.]
Show time!
The Upper-Confidence Bound (UCB) Algorithm (cont'd)
At each round $t = 1, \dots, n$:
◮ Compute the score of each arm $i$: $B_i = (\text{optimistic score of arm } i)$
◮ Pull arm $I_t = \arg\max_{i=1,\dots,N} B_{i,s,t}$
◮ Update the number of pulls: $T_{I_t,t} = T_{I_t,t-1} + 1$
The Upper-Confidence Bound (UCB) Algorithm (cont'd)
The score (with parameters $\rho$ and $\delta$):
$$B_{i,s,t} = (\text{optimistic score of arm } i \text{ if pulled } s \text{ times up to round } t) = \underbrace{\hat{\mu}_{i,s}}_{\text{knowledge}} + \underbrace{\rho \sqrt{\frac{\log 1/\delta}{2 s}}}_{\text{uncertainty (optimism)}}$$
Optimism in the face of uncertainty:
◮ current knowledge: the average reward $\hat{\mu}_{i,s}$
◮ current uncertainty: the number of pulls $s$
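A minimal sketch of this rule in Python, reusing the BernoulliBandit class from the earlier sketch; the score is exactly the one above, with $\rho$ and $\delta$ left as parameters (how they should be tuned is not addressed by this sketch).

```python
import math

def ucb(bandit, n, rho=1.0, delta=0.05):
    """UCB with score B_{i,s,t} = mu_hat_{i,s} + rho * sqrt(log(1/delta) / (2s))."""
    N = bandit.n_arms
    pulls = [0] * N          # T_{i,t}
    means = [0.0] * N        # mu_hat_{i,s}
    total_reward = 0.0

    for t in range(1, n + 1):
        if t <= N:
            arm = t - 1      # pull each arm once so every score is defined
        else:
            scores = [means[i] + rho * math.sqrt(math.log(1 / delta) / (2 * pulls[i]))
                      for i in range(N)]
            arm = max(range(N), key=lambda i: scores[i])

        x = bandit.pull(arm)
        pulls[arm] += 1
        means[arm] += (x - means[arm]) / pulls[arm]   # incremental average
        total_reward += x

    return pulls, means, total_reward

pulls, means, total = ucb(BernoulliBandit([0.3, 0.5, 0.7]), n=5000)
print(pulls)   # most pulls should go to the optimal arm (index 2)
```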