
The Exploration-Exploitation Dilemma
A. LAZARIC (SequeL Team @ INRIA-Lille)
ENS Cachan - Master 2 MVA, MVA-RL Course, Reinforcement Learning, Fall 2017


The Exploration-Exploitation Dilemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner
⇒ the learner should gain information by repeatedly pulling all the arms ⇒ exploration

Problem 2: Whenever the learner pulls a bad arm, it suffers some regret
⇒ the learner should reduce the regret by repeatedly pulling the best arm ⇒ exploitation

Challenge: The learner should solve two opposite problems, i.e., the exploration-exploitation dilemma!

The Multi-armed Bandit Game (cont'd)

Examples
◮ Packet routing
◮ Clinical trials
◮ Web advertising
◮ Computer games
◮ Resource mining
◮ ...

The Stochastic Multi-armed Bandit Problem

Definition: The environment is stochastic
◮ Each arm $i$ has a reward distribution $\nu_i$ bounded in $[0,1]$ and characterized by its expected value $\mu_i$
◮ The rewards are i.i.d., $X_{i,t} \sim \nu_i$ (as in the MDP model)

The Stochastic Multi-armed Bandit Problem (cont'd)

Notation
◮ Number of times arm $i$ has been pulled after $n$ rounds:
$T_{i,n} = \sum_{t=1}^{n} \mathbb{I}\{I_t = i\}$
◮ Regret of an algorithm $A$:
$R_n(A) = \max_{i=1,\dots,K} \mathbb{E}\Big[\sum_{t=1}^{n} X_{i,t}\Big] - \mathbb{E}\Big[\sum_{t=1}^{n} X_{I_t,t}\Big]$
$\quad\ = \max_{i=1,\dots,K} (n\,\mu_i) - \sum_{i=1}^{K} \mathbb{E}[T_{i,n}]\,\mu_i$
$\quad\ = n\,\mu_{i^*} - \sum_{i=1}^{K} \mathbb{E}[T_{i,n}]\,\mu_i$
$\quad\ = \sum_{i \neq i^*} \mathbb{E}[T_{i,n}]\,(\mu_{i^*} - \mu_i)$
$\quad\ = \sum_{i \neq i^*} \mathbb{E}[T_{i,n}]\,\Delta_i$
◮ Gap: $\Delta_i = \mu_{i^*} - \mu_i$

The Stochastic Multi-armed Bandit Problem (cont'd)

$R_n(A) = \sum_{i \neq i^*} \mathbb{E}[T_{i,n}]\,\Delta_i$

⇒ we only need to study the expected number of pulls of the suboptimal arms.
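As an illustration (not from the slides; the arm means and the uniform-random policy below are arbitrary choices), a minimal simulation sketch showing that the pull-count decomposition matches the definition of regret, up to sampling noise:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.8, 0.5, 0.3])   # hypothetical arm means (arm 0 is optimal)
n, K = 10_000, len(mu)
gaps = mu.max() - mu

# A deliberately naive policy: pull an arm uniformly at random each round.
pulls = np.zeros(K, dtype=int)
reward_sum = 0.0
for t in range(n):
    i = rng.integers(K)
    pulls[i] += 1
    reward_sum += rng.random() < mu[i]   # Bernoulli(mu_i) reward in {0, 1}

# Regret via the definition (n * mu_star minus collected reward)
# vs. the decomposition sum_i T_{i,n} * Delta_i; the two agree up to noise.
print("n*mu* - collected reward :", n * mu.max() - reward_sum)
print("sum_i T_i,n * Delta_i    :", (pulls * gaps).sum())
```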

The Stochastic Multi-armed Bandit Problem (cont'd)

Optimism in Face of Uncertainty Learning (OFUL)
Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm.

Why it works:
◮ If the best possible world is correct ⇒ no regret
◮ If the best possible world is wrong ⇒ the reduction in the uncertainty is maximized

The Upper-Confidence Bound (UCB) Algorithm

The idea
[Figure: estimated reward of each of 4 arms; x-axis: Arms, labeled with the number of pulls in parentheses, 1 (10), 2 (73), 3 (3), 4 (23); y-axis: Reward]

The Upper-Confidence Bound (UCB) Algorithm

Show time!

The Upper-Confidence Bound (UCB) Algorithm (cont'd)

At each round $t = 1, \dots, n$
◮ Compute the score of each arm $i$: $B_i = (\text{optimistic score of arm } i)$
◮ Pull arm $I_t = \arg\max_{i=1,\dots,K} B_{i,s,t}$
◮ Update the number of pulls $T_{I_t,t} = T_{I_t,t-1} + 1$ and the other statistics

The Upper-Confidence Bound (UCB) Algorithm (cont'd)

The score (with parameters $\rho$ and $\delta$):
$B_{i,s,t} = (\text{optimistic score of arm } i \text{ if pulled } s \text{ times up to round } t) = \underbrace{\text{knowledge} + \text{uncertainty}}_{\text{optimism}}$

Optimism in face of uncertainty:
◮ Current knowledge: the average reward $\hat\mu_{i,s}$
◮ Current uncertainty: the number of pulls $s$

$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log 1/\delta}{2s}}$

The Upper-Confidence Bound (UCB) Algorithm (cont'd)

At each round $t = 1, \dots, n$
◮ Compute the score of each arm $i$:
$B_{i,t} = \hat\mu_{i,T_{i,t}} + \rho\sqrt{\frac{\log t}{2\,T_{i,t}}}$
◮ Pull arm $I_t = \arg\max_{i=1,\dots,K} B_{i,t}$
◮ Update the number of pulls $T_{I_t,t} = T_{I_t,t-1} + 1$ and $\hat\mu_{i,T_{i,t}}$
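A minimal runnable sketch of this anytime UCB rule (the Bernoulli rewards, the arm means, and the default $\rho = 1$ below are illustrative assumptions, not from the slides):

```python
import math
import random

def ucb(mu, n, rho=1.0, seed=0):
    """Anytime UCB: score B_i = empirical mean + rho * sqrt(log(t) / (2 * T_i))."""
    random.seed(seed)
    K = len(mu)
    pulls = [0] * K          # T_{i,t}
    means = [0.0] * K        # empirical means \hat{mu}_{i,T_{i,t}}
    regret = 0.0
    for t in range(1, n + 1):
        if t <= K:           # pull each arm once to initialize the statistics
            i = t - 1
        else:
            i = max(range(K),
                    key=lambda a: means[a] + rho * math.sqrt(math.log(t) / (2 * pulls[a])))
        x = 1.0 if random.random() < mu[i] else 0.0    # Bernoulli(mu_i) reward
        pulls[i] += 1
        means[i] += (x - means[i]) / pulls[i]          # incremental mean update
        regret += max(mu) - mu[i]                      # pseudo-regret accumulated
    return regret, pulls

print(ucb([0.9, 0.8, 0.5], n=10_000))   # regret should grow only logarithmically in n
```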

The Upper-Confidence Bound (UCB) Algorithm (cont'd)

Theorem (Chernoff-Hoeffding)
Let $X_1, \dots, X_n$ be i.i.d. samples from a distribution bounded in $[a, b]$. Then for any $\delta \in (0, 1)$
$\mathbb{P}\left[\;\Big|\frac{1}{n}\sum_{t=1}^{n} X_t - \mathbb{E}[X_1]\Big| > (b-a)\sqrt{\frac{\log 2/\delta}{2n}}\;\right] \leq \delta$
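A quick numerical sanity check of this bound (not part of the slides; the uniform distribution and parameter values are arbitrary): the fraction of runs where the empirical mean deviates by more than the Hoeffding radius should stay below $\delta$.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n, delta, runs = 0.0, 1.0, 200, 0.05, 20_000

radius = (b - a) * np.sqrt(np.log(2 / delta) / (2 * n))   # Hoeffding deviation radius
samples = rng.uniform(a, b, size=(runs, n))               # true mean is (a + b) / 2
violations = np.abs(samples.mean(axis=1) - (a + b) / 2) > radius

# The empirical violation frequency is (typically much) smaller than delta.
print(violations.mean(), "<=", delta)
```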

The Upper-Confidence Bound (UCB) Algorithm (cont'd)

After $s$ pulls of arm $i$:
$\mathbb{P}\left[\mu_i \leq \frac{1}{s}\sum_{t=1}^{s} X_{i,t} + \sqrt{\frac{\log 1/\delta}{2s}}\right] \geq 1 - \delta$
that is,
$\mathbb{P}\left[\mu_i \leq \hat\mu_{i,s} + \sqrt{\frac{\log 1/\delta}{2s}}\right] \geq 1 - \delta$

⇒ UCB uses an upper confidence bound on the expectation.

The Upper-Confidence Bound (UCB) Algorithm (cont'd)

Theorem
For any set of $K$ arms with distributions bounded in $[0, b]$, if $\delta = 1/t$, then UCB($\rho$) with $\rho > 1$ achieves the regret
$R_n(A) \leq \sum_{i \neq i^*}\left[\frac{4 b^2}{\Delta_i}\,\rho \log(n) + \Delta_i\left(\frac{3}{2} + \frac{1}{2(\rho - 1)}\right)\right]$

The Upper-Confidence Bound (UCB) Algorithm (cont'd)

Let $K = 2$ with $i^* = 1$. Then
$R_n(A) \leq O\!\left(\frac{1}{\Delta}\,\rho \log(n)\right)$

Remark 1: the cumulative regret slowly increases as $\log(n)$.
Remark 2: the smaller the gap, the bigger the regret... why?

The Upper-Confidence Bound (UCB) Algorithm (cont'd)

Show time (again)!

The Worst-case Performance

Remark: the regret bound is distribution-dependent:
$R_n(A; \Delta) \leq O\!\left(\frac{1}{\Delta}\,\rho \log(n)\right)$

Meaning: the algorithm is able to adapt to the specific problem at hand!

Worst-case performance: which distribution leads to the worst possible performance of UCB? What is the distribution-free performance of UCB?
$R_n(A) = \sup_{\Delta} R_n(A; \Delta)$

The Worst-case Performance

Problem: it seems like if $\Delta \to 0$ then the regret tends to infinity...
... nonsense, because the regret is defined as
$R_n(A; \Delta) = \mathbb{E}[T_{2,n}]\,\Delta$
so if $\Delta$ is small, the regret is also small...

In fact
$R_n(A; \Delta) = \min\left\{ O\!\left(\tfrac{1}{\Delta}\,\rho \log(n)\right),\ \mathbb{E}[T_{2,n}]\,\Delta \right\} \leq \min\left\{ O\!\left(\tfrac{1}{\Delta}\,\rho \log(n)\right),\ n\Delta \right\}$
since $\mathbb{E}[T_{2,n}] \leq n$.

The Worst-case Performance

Then
$R_n(A) = \sup_{\Delta} R_n(A; \Delta) = \sup_{\Delta} \min\left\{ O\!\left(\tfrac{1}{\Delta}\,\rho \log(n)\right),\ n\Delta \right\} \approx \sqrt{n}$
(up to logarithmic factors), attained for $\Delta \approx \sqrt{1/n}$.
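A short worked derivation of where the $\sqrt{n}$ rate comes from (keeping the logarithmic factor that the slide drops): the supremum of the min of a decreasing and an increasing function of $\Delta$ is attained where the two terms balance.

```latex
\[
\frac{\rho \log n}{\Delta} = n\Delta
\quad\Longrightarrow\quad
\Delta = \sqrt{\frac{\rho \log n}{n}}
\quad\Longrightarrow\quad
\sup_{\Delta}\,\min\Big\{\tfrac{\rho\log n}{\Delta},\, n\Delta\Big\}
= n\Delta = \sqrt{\rho\, n \log n},
\]
i.e. of order $\sqrt{n}$ once logarithmic factors are ignored, as stated on the slide.
```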

Tuning the confidence δ of UCB

Remark: UCB is an anytime algorithm ($\delta = 1/t$):
$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log t}{2s}}$

Remark: if the time horizon $n$ is known, then the optimal choice is $\delta = 1/n$:
$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log n}{2s}}$

Tuning the confidence δ of UCB (cont'd)

Intuition: UCB should pull the suboptimal arms
◮ Enough: so as to understand which arm is the best
◮ Not too much: so as to keep the regret as small as possible

The confidence $1 - \delta$ has the following impact (similarly for $\rho$):
◮ Big $1 - \delta$: high level of exploration
◮ Small $1 - \delta$: high level of exploitation

Solution: depending on the time horizon, we can tune the trade-off between exploration and exploitation.

UCB Proof

Let's dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]
$\mathcal{E} = \left\{ \forall i, s:\ \big|\hat\mu_{i,s} - \mu_i\big| \leq \sqrt{\tfrac{\log 1/\delta}{2s}} \right\}$
By Chernoff-Hoeffding, $\mathbb{P}[\mathcal{E}] \geq 1 - nK\delta$.

At time $t$ we pull arm $i$ [algorithm], i.e. $B_{i,T_{i,t-1}} \geq B_{i^*,T_{i^*,t-1}}$:
$\hat\mu_{i,T_{i,t-1}} + \sqrt{\frac{\log 1/\delta}{2\,T_{i,t-1}}} \geq \hat\mu_{i^*,T_{i^*,t-1}} + \sqrt{\frac{\log 1/\delta}{2\,T_{i^*,t-1}}}$

On the event $\mathcal{E}$ we have [math]
$\mu_i + 2\sqrt{\frac{\log 1/\delta}{2\,T_{i,t-1}}} \geq \mu_{i^*}$

UCB Proof (cont'd)

Assume $t$ is the last time $i$ is pulled, so $T_{i,n} = T_{i,t-1} + 1$; thus
$\mu_i + 2\sqrt{\frac{\log 1/\delta}{2\,(T_{i,n} - 1)}} \geq \mu_{i^*}$

Reordering [math]
$T_{i,n} \leq \frac{2\log 1/\delta}{\Delta_i^2} + 1$
under the event $\mathcal{E}$, and thus with probability at least $1 - nK\delta$.

Moving to the expectation [statistics]
$\mathbb{E}[T_{i,n}] = \mathbb{E}[T_{i,n}\mathbb{I}_{\mathcal{E}}] + \mathbb{E}[T_{i,n}\mathbb{I}_{\mathcal{E}^C}] \leq \frac{2\log 1/\delta}{\Delta_i^2} + 1 + n\,(nK\delta)$

Trading off the two terms with $\delta = 1/n^2$, the score becomes
$B_{i,t} = \hat\mu_{i,T_{i,t-1}} + \sqrt{\frac{2\log n}{2\,T_{i,t-1}}}$
and
$\mathbb{E}[T_{i,n}] \leq \frac{4\log n}{\Delta_i^2} + 1 + K$
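Plugging this bound on the expected number of pulls into the regret decomposition from earlier closes the argument (a straightforward step the slides leave implicit):

```latex
\[
R_n(A) \;=\; \sum_{i \neq i^*} \mathbb{E}[T_{i,n}]\,\Delta_i
\;\leq\; \sum_{i \neq i^*} \left( \frac{4 \log n}{\Delta_i} + (1 + K)\,\Delta_i \right)
\;=\; O\!\left( \sum_{i \neq i^*} \frac{\log n}{\Delta_i} \right).
\]
```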

Tuning the confidence δ of UCB (cont'd)

Multi-armed Bandit: the regret is the same for $\delta = 1/t$ and $\delta = 1/n$...
... almost (i.e., in expectation).

Tuning the confidence δ of UCB (cont'd)

The value-at-risk of the regret for UCB-anytime.

Tuning the ρ of UCB (cont'd)

UCB values (for the $\delta = 1/n$ algorithm):
$B_{i,s} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log n}{2s}}$

Theory
◮ $\rho < 0.5$: polynomial regret w.r.t. $n$
◮ $\rho > 0.5$: logarithmic regret w.r.t. $n$

Practice: $\rho = 0.2$ is often the best choice.

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate confidence intervals.

Algorithm
◮ Compute the score of each arm $i$:
$B_{i,t} = \hat\mu_{i,T_{i,t}} + \sqrt{\frac{2\,\hat\sigma^2_{i,T_{i,t}} \log t}{T_{i,t}}} + \frac{8 \log t}{3\,T_{i,t}}$
◮ Pull arm $I_t = \arg\max_{i=1,\dots,K} B_{i,t}$
◮ Update the number of pulls $T_{I_t,t}$, $\hat\mu_{i,T_{i,t}}$ and $\hat\sigma^2_{i,T_{i,t}}$

Regret
$R_n \leq O\!\left(\frac{\sigma^2}{\Delta}\,\log n\right)$
(the range-based $1/\Delta$ factor of UCB is replaced by a variance-based one, which is tighter for low-variance arms)
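A hedged sketch of the UCB-V score above (this helper and its example values are illustrative, not from the slides; rewards are assumed in $[0,1]$):

```python
import math

def ucbv_score(mean, var, pulls, t):
    """UCB-V score sketch: empirical mean + empirical-Bernstein bonus
    (a variance-based term plus a range/pulls term), for rewards in [0, 1]."""
    if pulls == 0:
        return float("inf")          # force at least one pull of every arm
    bonus = math.sqrt(2 * var * math.log(t) / pulls) + 8 * math.log(t) / (3 * pulls)
    return mean + bonus

# Example: a low-variance arm gets a much tighter bonus than a range-based
# (Hoeffding) one, which is where the O(sigma^2 / Delta * log n) regret comes from.
print(ucbv_score(mean=0.7, var=0.01, pulls=50, t=1000))
print(ucbv_score(mean=0.7, var=0.25, pulls=50, t=1000))
```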

Improvements: KL-UCB

Idea: use even tighter confidence intervals based on the Kullback-Leibler divergence
$d(p, q) = p \log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}$
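The slides stop at the divergence; as an illustration of how it is typically turned into an index (a sketch under the assumption of Bernoulli rewards and a plain $\log t$ exploration budget, not the slides' own construction), the KL-UCB score of an arm is the largest mean $q$ whose divergence from the empirical mean stays within the budget, found here by bisection:

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Bernoulli KL divergence d(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, iters=50):
    """Largest q >= mean with pulls * d(mean, q) <= log(t), found by bisection."""
    budget = math.log(t) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= budget:
            lo = mid            # mid is still feasible, push the index up
        else:
            hi = mid
    return lo

print(kl_ucb_index(mean=0.6, pulls=20, t=1000))   # optimistic index above 0.6
```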
