
Introduction to Multi-Armed Bandits and Reinforcement Learning

Introduction to Multi-Armed Bandits and Reinforcement Learning. Training School on Machine Learning for Communications, Paris, 23-25 September 2019. Who am I? Hi, I'm Lilian Besson, finishing my PhD in telecommunications and machine learning…


  1. Regret decomposition
  ◮ $N_a(t)$: number of selections of arm $a$ in the first $t$ rounds
  ◮ $\Delta_a := \mu^\star - \mu_a$: sub-optimality gap of arm $a$
  Regret decomposition: $R_\nu(\mathcal{A}, T) = \sum_{a=1}^{K} \Delta_a \, \mathbb{E}[N_a(T)]$.

  2. Regret decomposition: proof
  $R_\nu(\mathcal{A}, T) = \mu^\star T - \mathbb{E}\Big[\sum_{t=1}^{T} X_{A_t, t}\Big] = \mu^\star T - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{A_t}\Big] = \mathbb{E}\Big[\sum_{t=1}^{T} (\mu^\star - \mu_{A_t})\Big]$
  $\qquad = \sum_{a=1}^{K} \underbrace{(\mu^\star - \mu_a)}_{\Delta_a} \, \mathbb{E}\Big[\underbrace{\textstyle\sum_{t=1}^{T} \mathbb{1}(A_t = a)}_{N_a(T)}\Big].$
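The decomposition can be checked numerically. Below is a minimal Monte-Carlo sketch (not from the slides: the Bernoulli arms, horizon and the uniformly-random policy are assumptions chosen for illustration) comparing the directly-measured regret with the sum of gaps times expected selection counts.

```python
# Sketch: check R_nu(A, T) = sum_a Delta_a * E[N_a(T)] on assumed Bernoulli arms,
# using a hypothetical uniformly-random policy as the algorithm A.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.1, 0.5, 0.8])    # assumed arm means (mu_star = 0.8)
K, T, n_runs = len(mu), 1000, 2000
gaps = mu.max() - mu              # Delta_a = mu_star - mu_a

regret_direct, regret_decomp = 0.0, 0.0
for _ in range(n_runs):
    arms = rng.integers(K, size=T)            # uniformly random arm choices
    rewards = rng.random(T) < mu[arms]        # Bernoulli rewards
    regret_direct += mu.max() * T - rewards.sum()
    counts = np.bincount(arms, minlength=K)   # N_a(T)
    regret_decomp += gaps @ counts

print(regret_direct / n_runs, regret_decomp / n_runs)  # both close to T * mean(Delta_a)
```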

  3. Regret decomposition: consequence
  A strategy with small regret should:
  ◮ not select the arms for which $\Delta_a > 0$ (the sub-optimal arms) too often,
  ◮ ... which requires trying all the arms, in order to estimate the values of the $\Delta_a$
  $\Rightarrow$ Exploration / Exploitation trade-off!

  5. Two naive strategies
  ◮ Idea 1: draw each arm $T/K$ times $\Rightarrow$ EXPLORATION
  $\hookrightarrow R_\nu(\mathcal{A}, T) = \Big(\frac{1}{K} \sum_{a : \mu_a < \mu^\star} \Delta_a\Big) T = \Omega(T)$
  ◮ Idea 2: always trust the empirical best arm $\Rightarrow$ EXPLOITATION
  $A_{t+1} = \underset{a \in \{1, \dots, K\}}{\mathrm{argmax}}\ \widehat{\mu}_a(t)$, using the estimates $\widehat{\mu}_a(t) = \frac{1}{N_a(t)} \sum_{s=1}^{t} X_{a,s} \mathbb{1}(A_s = a)$ of the unknown means $\mu_a$
  $\hookrightarrow R_\nu(\mathcal{A}, T) \geq (1 - \mu_1) \times \mu_2 \times (\mu_1 - \mu_2) \, T = \Omega(T)$ (with $K = 2$ Bernoulli arms of means $\mu_1 \neq \mu_2$)
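A small simulation sketch of these two naive strategies, under assumed Bernoulli arms (the means, horizon and number of runs are illustrative choices, not from the slides); both exhibit regret growing linearly with T.

```python
# Sketch: pure exploration (round-robin) versus pure exploitation ("follow the leader").
import numpy as np

def pure_exploration(mu, T, rng):
    K = len(mu)
    arms = np.arange(T) % K                        # each arm played about T/K times
    return (rng.random(T) < mu[arms]).sum()

def follow_the_leader(mu, T, rng):
    K = len(mu)
    counts, sums, total = np.zeros(K), np.zeros(K), 0.0
    for t in range(T):
        a = t if t < K else np.argmax(sums / counts)   # play each arm once, then greedy
        r = float(rng.random() < mu[a])
        counts[a] += 1; sums[a] += r; total += r
    return total

rng = np.random.default_rng(1)
mu, T, runs = np.array([0.5, 0.6]), 2000, 200
for policy in (pure_exploration, follow_the_leader):
    avg_reward = np.mean([policy(mu, T, rng) for _ in range(runs)])
    print(policy.__name__, "regret ~", mu.max() * T - avg_reward)   # both grow linearly in T
```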

  6. A better idea: Explore-Then-Commit (ETC)
  Given $m \in \{1, \dots, T/K\}$:
  ◮ draw each arm $m$ times
  ◮ compute the empirical best arm $\widehat{a} = \mathrm{argmax}_a\ \widehat{\mu}_a(Km)$
  ◮ keep playing this arm until round $T$: $A_{t+1} = \widehat{a}$ for $t \geq Km$
  $\Rightarrow$ EXPLORATION followed by EXPLOITATION

  8. Explore-Then-Commit: analysis for K = 2 arms
  If $\mu_1 > \mu_2$, let $\Delta := \mu_1 - \mu_2$, and write $\widehat{\mu}_{a,m}$ for the empirical mean of the first $m$ observations from arm $a$. Then
  $R_\nu(\mathrm{ETC}, T) = \Delta \, \mathbb{E}[N_2(T)] = \Delta \, \mathbb{E}\big[m + (T - Km) \, \mathbb{1}(\widehat{a} = 2)\big] \leq \Delta m + (\Delta T) \times \mathbb{P}\big(\widehat{\mu}_{2,m} \geq \widehat{\mu}_{1,m}\big)$
  $\Rightarrow$ bounding this probability requires a concentration inequality.

  9. Explore-Then-Commit: analysis for two arms, bounded rewards
  $\mu_1 > \mu_2$, $\Delta := \mu_1 - \mu_2$. Assumption 1: $\nu_1, \nu_2$ are bounded in $[0, 1]$. Then
  $R_\nu(\mathrm{ETC}, T) = \Delta \, \mathbb{E}[N_2(T)] = \Delta \, \mathbb{E}\big[m + (T - Km) \, \mathbb{1}(\widehat{a} = 2)\big] \leq \Delta m + (\Delta T) \times \exp(-m \Delta^2 / 2)$
  $\Rightarrow$ by Hoeffding's inequality.
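A minimal Explore-Then-Commit sketch, assuming Bernoulli rewards and an arbitrary per-arm budget m (both are illustrative choices; the slides' analysis does not fix them).

```python
# Sketch of Explore-Then-Commit for K arms with a per-arm exploration budget m.
import numpy as np

def etc(draw, K, T, m, rng):
    """draw(a, rng) returns one reward from arm a; m = per-arm exploration budget."""
    rewards = []
    means = np.zeros(K)
    for a in range(K):                 # exploration phase: m pulls of each arm
        samples = [draw(a, rng) for _ in range(m)]
        means[a] = np.mean(samples)
        rewards += samples
    best = int(np.argmax(means))       # empirical best arm after K*m pulls
    rewards += [draw(best, rng) for _ in range(T - K * m)]   # commit until round T
    return np.sum(rewards)

rng = np.random.default_rng(2)
mu = np.array([0.2, 0.25])             # example Bernoulli means
T, m = 10_000, 200
total = etc(lambda a, r: float(r.random() < mu[a]), len(mu), T, m, rng)
print("regret ~", mu.max() * T - total)
```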

  10. Explore-Then-Commit: analysis for two arms, Gaussian rewards
  $\mu_1 > \mu_2$, $\Delta := \mu_1 - \mu_2$. Assumption 2: $\nu_1 = \mathcal{N}(\mu_1, \sigma^2)$ and $\nu_2 = \mathcal{N}(\mu_2, \sigma^2)$ are Gaussian arms. Then
  $R_\nu(\mathrm{ETC}, T) = \Delta \, \mathbb{E}[N_2(T)] = \Delta \, \mathbb{E}\big[m + (T - Km) \, \mathbb{1}(\widehat{a} = 2)\big] \leq \Delta m + (\Delta T) \times \exp(-m \Delta^2 / (4\sigma^2))$
  $\Rightarrow$ by a Gaussian tail inequality.

  13. Explore-Then-Commit: tuning the exploration length m
  $\mu_1 > \mu_2$, $\Delta := \mu_1 - \mu_2$, Gaussian arms $\nu_a = \mathcal{N}(\mu_a, \sigma^2)$. For $m = \frac{4\sigma^2}{\Delta^2} \log\Big(\frac{T \Delta^2}{4\sigma^2}\Big)$,
  $R_\nu(\mathrm{ETC}, T) \leq \frac{4\sigma^2}{\Delta} \Big[\log\Big(\frac{T \Delta^2}{4\sigma^2}\Big) + 1\Big] = O\Big(\frac{1}{\Delta} \log(T)\Big).$
  + logarithmic regret!
  − requires the knowledge of $T$ ($\simeq$ OKAY) and of $\Delta$ (NOT OKAY)

  14. Sequential Explore-Then-Commit (2 Gaussian arms)
  ◮ explore uniformly until the random time
  $\tau = \inf\Big\{ t \in \mathbb{N} : |\widehat{\mu}_1(t) - \widehat{\mu}_2(t)| > \sqrt{\tfrac{8\sigma^2 \log(T/t)}{t}} \Big\}$
  [Figure: one run of the exploration phase over 1000 rounds, with the empirical mean difference crossing the stopping threshold]
  ◮ $\widehat{a}_\tau = \mathrm{argmax}_a\ \widehat{\mu}_a(\tau)$, then $A_{t+1} = \widehat{a}_\tau$ for $t \in \{\tau + 1, \dots, T\}$
  $R_\nu(\text{S-ETC}, T) \leq \frac{4\sigma^2}{\Delta} \log(T \Delta^2) + \text{lower-order terms} = O\Big(\frac{1}{\Delta} \log(T)\Big).$
  $\Rightarrow$ same regret rate, without knowing $\Delta$! [Garivier et al. 2016]
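A sketch of the Sequential-ETC stopping rule above for two Gaussian arms; the alternating exploration scheme and all numerical values are assumptions made here for illustration, not prescribed by [Garivier et al. 2016].

```python
# Sketch: explore uniformly until |mu_hat_1 - mu_hat_2| exceeds sqrt(8 sigma^2 log(T/t) / t),
# then commit to the empirical best arm.
import numpy as np

def sequential_etc(mu, sigma, T, rng):
    sums, counts, total = np.zeros(2), np.zeros(2), 0.0
    for t in range(1, T + 1):
        a = t % 2                                   # uniform exploration: alternate the two arms
        r = rng.normal(mu[a], sigma)
        sums[a] += r; counts[a] += 1; total += r
        if counts.min() >= 1:
            gap_hat = abs(sums[0] / counts[0] - sums[1] / counts[1])
            if gap_hat > np.sqrt(8 * sigma**2 * np.log(T / t) / t):
                break                               # random stopping time tau
    best = int(np.argmax(sums / counts))
    total += sum(rng.normal(mu[best], sigma) for _ in range(T - t))   # commit until T
    return total

rng = np.random.default_rng(3)
mu, sigma, T = (1.0, 1.5), 1.0, 5000
runs = [sequential_etc(mu, sigma, T, rng) for _ in range(20)]
print("regret ~", max(mu) * T - np.mean(runs))
```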

  15. Numerical illustration
  Two Gaussian arms: $\nu_1 = \mathcal{N}(1, 1)$ and $\nu_2 = \mathcal{N}(1.5, 1)$.
  [Figure: expected regret over 1000 rounds, estimated over N = 500 runs, for Sequential-ETC versus the two naive baselines (uniform exploration and follow-the-leader); dashed lines show the empirical 0.05 and 0.95 quantiles of the regret]

  16. Is this a good regret rate?
  For two-armed Gaussian bandits,
  $R_\nu(\mathrm{ETC}, T) \lesssim \frac{4\sigma^2}{\Delta} \log\Big(\frac{T \Delta^2}{4\sigma^2}\Big) = O\Big(\frac{1}{\Delta} \log(T)\Big)$
  $\Rightarrow$ problem-dependent logarithmic regret bound: $R_\nu(\text{algo}, T) = O(\log(T))$.
  Observation: this bound blows up when $\Delta$ tends to zero...
  $R_\nu(\mathrm{ETC}, T) \lesssim \min\Big(\frac{4\sigma^2}{\Delta} \log\Big(\frac{T \Delta^2}{4\sigma^2}\Big), \ \Delta T\Big) \leq \sup_{u > 0} \min\Big(\frac{4\sigma^2}{u} \log\Big(\frac{T u^2}{4\sigma^2}\Big), \ u T\Big) \leq C \sqrt{T}$
  $\Rightarrow$ problem-independent square-root regret bound: $R_\nu(\text{algo}, T) = O(\sqrt{T})$.

  17. Best possible regret? Lower Bounds

  18. The Lai and Robbins lower bound
  Context: a parametric bandit model where each arm is parameterized by its mean,
  $\nu_\mu = (\nu_{\mu_1}, \dots, \nu_{\mu_K})$, $\mu_a \in \mathcal{I}$: distributions $\nu_\mu$ $\Leftrightarrow$ means $\mu = (\mu_1, \dots, \mu_K)$.
  Key tool: the Kullback-Leibler divergence,
  $\mathrm{kl}(\mu, \mu') := \mathrm{KL}\big(\nu_\mu, \nu_{\mu'}\big) = \mathbb{E}_{X \sim \nu_\mu}\Big[\log \frac{d\nu_\mu}{d\nu_{\mu'}}(X)\Big].$
  Theorem [Lai and Robbins, 1985]. For uniformly efficient algorithms (i.e. $R_\mu(\mathcal{A}, T) = o(T^\alpha)$ for all $\alpha \in (0, 1)$ and all $\mu \in \mathcal{I}^K$),
  $\mu_a < \mu^\star \;\Longrightarrow\; \liminf_{T \to \infty} \frac{\mathbb{E}_\mu[N_a(T)]}{\log T} \geq \frac{1}{\mathrm{kl}(\mu_a, \mu^\star)}.$

  19. The Lai and Robbins lower bound: Gaussian case
  For Gaussian bandits with variance $\sigma^2$, the divergence specializes to $\mathrm{kl}(\mu, \mu') = \frac{(\mu - \mu')^2}{2\sigma^2}$, and the same lower bound applies.

  20. The Lai and Robbins lower bound: Bernoulli case
  For Bernoulli bandits, the divergence specializes to $\mathrm{kl}(\mu, \mu') = \mu \log\frac{\mu}{\mu'} + (1 - \mu) \log\frac{1 - \mu}{1 - \mu'}$, and the same lower bound applies.
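The Bernoulli kl above and the resulting asymptotic lower bound on the number of sub-optimal draws can be evaluated directly; the arm means and horizon below are example values only.

```python
# Sketch: Bernoulli KL divergence and the Lai & Robbins asymptotic lower bound
# E[N_a(T)] >= (1 + o(1)) * log(T) / kl(mu_a, mu_star) for each sub-optimal arm a.
import numpy as np

def kl_bernoulli(mu, mu_prime, eps=1e-12):
    mu, mu_prime = np.clip(mu, eps, 1 - eps), np.clip(mu_prime, eps, 1 - eps)
    return mu * np.log(mu / mu_prime) + (1 - mu) * np.log((1 - mu) / (1 - mu_prime))

mu = np.array([0.1, 0.05, 0.02])       # example Bernoulli means, mu_star = 0.1
T = 20_000
for a, mu_a in enumerate(mu[1:], start=1):
    lb = np.log(T) / kl_bernoulli(mu_a, mu[0])
    print(f"arm {a}: at least about {lb:.0f} draws are forced for T = {T}")
```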

  21. Some room for better algorithms?
  ◮ For two-armed Gaussian bandits, ETC satisfies $R_\nu(\mathrm{ETC}, T) \lesssim \frac{4\sigma^2}{\Delta} \log\Big(\frac{T \Delta^2}{4\sigma^2}\Big) = O\big(\frac{1}{\Delta} \log(T)\big)$, with $\Delta = |\mu_1 - \mu_2|$.
  ◮ The Lai and Robbins lower bound yields, for large values of $T$, $R_\nu(\mathcal{A}, T) \gtrsim \frac{2\sigma^2}{\Delta} \log(T) = \Omega\big(\frac{1}{\Delta} \log(T)\big)$, since $\mathrm{kl}(\mu_1, \mu_2) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}$.
  $\Rightarrow$ Explore-Then-Commit is not asymptotically optimal: its leading constant is a factor 2 above the lower bound.

  22. Mixing Exploration and Exploitation

  23. A simple strategy: ε-greedy
  The ε-greedy rule [Sutton and Barto, 98] is the simplest way to alternate exploration and exploitation.
  ε-greedy strategy: at round t,
  ◮ with probability $\varepsilon$: $A_t \sim \mathcal{U}(\{1, \dots, K\})$,
  ◮ with probability $1 - \varepsilon$: $A_t = \mathrm{argmax}_{a = 1, \dots, K}\ \widehat{\mu}_a(t)$.
  $\Rightarrow$ Linear regret: $R_\nu(\varepsilon\text{-greedy}, T) \geq \varepsilon \, \frac{K - 1}{K} \, \Delta_{\min} \, T$, where $\Delta_{\min} = \min_{a : \mu_a < \mu^\star} \Delta_a$.

  24. A simple strategy: ε_t-greedy
  A simple fix: make ε decreasing!
  ε_t-greedy strategy: at round t,
  ◮ with probability $\varepsilon_t := \min\big(1, \frac{K}{d^2 t}\big)$ (a probability decreasing with t): $A_t \sim \mathcal{U}(\{1, \dots, K\})$,
  ◮ with probability $1 - \varepsilon_t$: $A_t = \mathrm{argmax}_{a = 1, \dots, K}\ \widehat{\mu}_a(t - 1)$.
  Theorem [Auer et al. 02]. If $0 < d \leq \Delta_{\min}$, then $R_\nu(\varepsilon_t\text{-greedy}, T) = O\big(\frac{K}{d^2} \log(T)\big)$.
  $\Rightarrow$ requires the knowledge of a lower bound $d$ on $\Delta_{\min}$.
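A sketch of the ε_t-greedy rule; the constant d (a guessed lower bound on Δ_min) and the Bernoulli model are assumptions made here for illustration.

```python
# Sketch of the decreasing-epsilon rule epsilon_t = min(1, K / (d^2 t)).
import numpy as np

def epsilon_t_greedy(mu, T, d, rng):
    K = len(mu)
    counts, sums, total = np.zeros(K), np.zeros(K), 0.0
    for t in range(1, T + 1):
        eps_t = min(1.0, K / (d**2 * t))              # exploration probability at round t
        if rng.random() < eps_t or counts.min() == 0:
            a = rng.integers(K)                        # explore: uniform random arm
        else:
            a = int(np.argmax(sums / counts))          # exploit: empirical best arm
        r = float(rng.random() < mu[a])
        counts[a] += 1; sums[a] += r; total += r
    return total

rng = np.random.default_rng(4)
mu, T = np.array([0.2, 0.25]), 10_000
print("regret ~", mu.max() * T - epsilon_t_greedy(mu, T, d=0.05, rng=rng))
```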

  25. The Optimism Principle: Upper Confidence Bound Algorithms

  26. The optimism principle
  Step 1: construct a set of statistically plausible models.
  ◮ For each arm a, build a confidence interval on the mean $\mu_a$:
  $\mathcal{I}_a(t) = [\mathrm{LCB}_a(t), \mathrm{UCB}_a(t)]$ (LCB = Lower Confidence Bound, UCB = Upper Confidence Bound)
  [Figure: confidence intervals on the means after t rounds]

  27. The optimism principle
  Step 2: act as if the best possible model were the true model ("optimism in the face of uncertainty").
  [Figure: confidence intervals on the means after t rounds]
  Optimistic bandit model: the model in the plausible set $\mathcal{C}(t)$ maximizing $\max_{a = 1, \dots, K} \mu_a$.
  ◮ That is, select $A_{t+1} = \mathrm{argmax}_{a = 1, \dots, K}\ \mathrm{UCB}_a(t)$.

  28. Optimistic Algorithms: Building Confidence Intervals, Analysis of UCB(α)

  29. How to build confidence intervals?
  We need $\mathrm{UCB}_a(t)$ such that $\mathbb{P}(\mu_a \leq \mathrm{UCB}_a(t)) \gtrsim 1 - 1/t$.
  $\Rightarrow$ tool: concentration inequalities.
  Example: rewards are $\sigma^2$ sub-Gaussian, i.e.
  $\mathbb{E}[Z] = \mu$ and $\mathbb{E}\big[e^{\lambda (Z - \mu)}\big] \leq e^{\lambda^2 \sigma^2 / 2}$.   (1)
  Hoeffding inequality. For $Z_i$ i.i.d. satisfying (1) and any (fixed) $s \geq 1$,
  $\mathbb{P}\Big(\frac{Z_1 + \dots + Z_s}{s} \geq \mu + x\Big) \leq e^{-s x^2 / (2\sigma^2)}.$
  ◮ $\nu_a$ bounded in $[0, 1]$: $1/4$ sub-Gaussian
  ◮ $\nu_a = \mathcal{N}(\mu_a, \sigma^2)$: $\sigma^2$ sub-Gaussian

  31. How to build confidence intervals?
  The same inequality bounds the lower tail: for any (fixed) $s \geq 1$,
  $\mathbb{P}\Big(\frac{Z_1 + \dots + Z_s}{s} \leq \mu - x\Big) \leq e^{-s x^2 / (2\sigma^2)}.$
  Warning: it cannot be used directly in a bandit model, as the number of observations s from each arm is random!
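A quick numerical sanity check of the Hoeffding bound for a fixed sample size s (my own sketch, not from the slides); Bernoulli rewards are 1/4 sub-Gaussian, so σ² = 1/4 below.

```python
# Sketch: compare the empirical tail P(sample mean >= mu + x) with exp(-s x^2 / (2 sigma^2)).
import numpy as np

rng = np.random.default_rng(5)
mu, s, x, n_runs = 0.3, 100, 0.1, 200_000
samples = rng.random((n_runs, s)) < mu            # n_runs batches of s Bernoulli(mu) draws
empirical = np.mean(samples.mean(axis=1) >= mu + x)
bound = np.exp(-s * x**2 / (2 * 0.25))            # sigma^2 = 1/4 for rewards in [0, 1]
print(f"empirical tail {empirical:.4f} <= Hoeffding bound {bound:.4f}")
```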

  32. How to build confidence intervals?
  ◮ $N_a(t) = \sum_{s=1}^{t} \mathbb{1}(A_s = a)$: number of selections of arm a after t rounds
  ◮ $\widehat{\mu}_{a,s} = \frac{1}{s} \sum_{k=1}^{s} Y_{a,k}$: average of the first s observations from arm a
  ◮ $\widehat{\mu}_a(t) = \widehat{\mu}_{a, N_a(t)}$: empirical estimate of $\mu_a$ after t rounds
  Hoeffding inequality + union bound:
  $\mathbb{P}\bigg(\mu_a \leq \widehat{\mu}_a(t) + \sigma \sqrt{\frac{\alpha \log(t)}{N_a(t)}}\bigg) \geq 1 - \frac{1}{t^{\alpha/2 - 1}}.$

  33. How to build confidence intervals? (proof of the union bound)
  $\mathbb{P}\bigg(\mu_a > \widehat{\mu}_a(t) + \sigma \sqrt{\frac{\alpha \log(t)}{N_a(t)}}\bigg) \leq \mathbb{P}\bigg(\exists s \leq t : \mu_a > \widehat{\mu}_{a,s} + \sigma \sqrt{\frac{\alpha \log(t)}{s}}\bigg)$
  $\leq \sum_{s=1}^{t} \mathbb{P}\bigg(\widehat{\mu}_{a,s} < \mu_a - \sigma \sqrt{\frac{\alpha \log(t)}{s}}\bigg) \leq \sum_{s=1}^{t} \frac{1}{t^{\alpha/2}} = \frac{1}{t^{\alpha/2 - 1}}.$

  34. A first UCB algorithm
  UCB(α) selects $A_{t+1} = \mathrm{argmax}_a\ \mathrm{UCB}_a(t)$, where
  $\mathrm{UCB}_a(t) = \underbrace{\widehat{\mu}_a(t)}_{\text{exploitation term}} + \underbrace{\sqrt{\frac{\alpha \log(t)}{N_a(t)}}}_{\text{exploration bonus}}.$
  ◮ this form of UCB was first proposed for Gaussian rewards [Katehakis and Robbins, 95]
  ◮ popularized by [Auer et al. 02] for bounded rewards: UCB1, with α = 2 (see the next talk at 4pm!)
  ◮ the analysis of UCB(α) was further refined to hold for α > 1/2 in that case [Bubeck 11, Cappé et al. 13]
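A minimal sketch of UCB(α) on assumed Bernoulli arms; the σ factor in the bonus is made explicit here (σ = 1/2 for rewards in [0, 1]), and α and the means are illustrative choices.

```python
# Sketch: UCB(alpha) with index = empirical mean + sigma * sqrt(alpha * log(t) / N_a(t)).
import numpy as np

def ucb_alpha(mu, T, alpha=2.0, sigma=0.5, rng=None):
    rng = rng or np.random.default_rng()
    K = len(mu)
    counts, sums, total = np.zeros(K), np.zeros(K), 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                         # initialisation: pull each arm once
        else:
            bonus = sigma * np.sqrt(alpha * np.log(t) / counts)
            a = int(np.argmax(sums / counts + bonus))         # exploitation term + exploration bonus
        r = float(rng.random() < mu[a])
        counts[a] += 1; sums[a] += r; total += r
    return mu.max() * T - total                               # realised regret of this run

mu = np.array([0.1, 0.05, 0.02, 0.01])
print("UCB(2) regret ~", np.mean([ucb_alpha(mu, 10_000) for _ in range(10)]))
```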

  35. A UCB algorithm in action
  [Figure (movie in the original slides): evolution of the UCB indices and of the number of selections of each arm during one run]

  36. Optimistic Algorithms: Building Confidence Intervals, Analysis of UCB(α)

  37. Regret of UCB(α) for bounded rewards
  Theorem [Auer et al. 02]. UCB(α) with parameter α = 2 (i.e. UCB1) satisfies
  $R_\nu(\mathrm{UCB1}, T) \leq 8 \bigg(\sum_{a : \mu_a < \mu^\star} \frac{1}{\Delta_a}\bigg) \log(T) + \Big(1 + \frac{\pi^2}{3}\Big) \bigg(\sum_{a=1}^{K} \Delta_a\bigg).$
  Theorem. For every α > 1 and every sub-optimal arm a, there exists a constant $C_\alpha > 0$ such that
  $\mathbb{E}_\mu[N_a(T)] \leq \frac{4\alpha}{(\mu^\star - \mu_a)^2} \log(T) + C_\alpha.$
  It follows that
  $R_\nu(\mathrm{UCB}(\alpha), T) \leq 4\alpha \bigg(\sum_{a : \mu_a < \mu^\star} \frac{1}{\Delta_a}\bigg) \log(T) + K C_\alpha.$

  38. Intermediate summary
  ◮ Several ways to solve the exploration/exploitation trade-off:
  ◮ Explore-Then-Commit
  ◮ ε-greedy
  ◮ Upper Confidence Bound algorithms
  ◮ Good concentration inequalities are crucial to build good UCB algorithms!
  ◮ Performance lower bounds motivate the design of (optimal) algorithms.

  39. A Bayesian Look at the MAB Model

  40. Bayesian Bandits: Two points of view, Bayes-UCB, Thompson Sampling

  42. Historical perspective
  1933  Thompson: a Bayesian mechanism for clinical trials
  1952  Robbins: formulation of the MAB problem
  1956  Bradt et al., Bellman: optimal solution of a Bayesian MAB problem
  1979  Gittins: first Bayesian index policy
  1985  Lai and Robbins: lower bound, first asymptotically optimal algorithm
  1985  Berry and Fristedt: Bandit Problems, a survey on the Bayesian MAB
  1987  Lai: asymptotic regret of kl-UCB + study of its Bayesian regret
  1995  Agrawal: UCB algorithms
  1995  Katehakis and Robbins: a UCB algorithm for Gaussian bandits
  2002  Auer et al.: UCB1 with a finite-time regret bound
  2009  UCB-V, MOSS, ...
  2010  Thompson Sampling is re-discovered
  2011, 2013  Cappé et al.: finite-time regret bound for kl-UCB
  2012, 2013  Thompson Sampling is shown asymptotically optimal

  43. Frequentist versus Bayesian bandit
  ◮ Two probabilistic models for $\nu_\mu = (\nu_{\mu_1}, \dots, \nu_{\mu_K}) \in \mathcal{P}^K$, hence two points of view!
  Frequentist model: $\mu_1, \dots, \mu_K$ are unknown parameters; arm a: $(Y_{a,s})_s$ i.i.d. $\sim \nu_{\mu_a}$.
  Bayesian model: $\mu_1, \dots, \mu_K$ are drawn from a prior distribution, $\mu_a \sim \pi_a$; arm a: $(Y_{a,s})_s \mid \mu$ i.i.d. $\sim \nu_{\mu_a}$.
  ◮ The regret can be computed in each case:
  Frequentist regret (regret): $R_\mu(\mathcal{A}, T) = \mathbb{E}_\mu\Big[\sum_{t=1}^{T} (\mu^\star - \mu_{A_t})\Big]$.
  Bayesian regret (Bayes risk): $R_\pi(\mathcal{A}, T) = \mathbb{E}_{\mu \sim \pi}\Big[\sum_{t=1}^{T} (\mu^\star - \mu_{A_t})\Big] = \int R_\mu(\mathcal{A}, T) \, d\pi(\mu)$.

  44. Frequentist and Bayesian algorithms
  ◮ Two types of tools to build bandit algorithms:
  Frequentist tools: MLE estimators of the means, confidence intervals.
  Bayesian tools: posterior distributions $\pi_a^t = \mathcal{L}(\mu_a \mid Y_{a,1}, \dots, Y_{a,N_a(t)})$.
  [Figure: confidence intervals (frequentist view) versus posterior distributions (Bayesian view) on the arm means during a run]

  45. Example: Bernoulli bandits
  Bernoulli bandit model: $\mu = (\mu_1, \dots, \mu_K)$.
  ◮ Bayesian view: $\mu_1, \dots, \mu_K$ are random variables, with prior distribution $\mu_a \sim \mathcal{U}([0, 1])$.
  $\Rightarrow$ posterior distribution:
  $\pi_a(t) = \mathcal{L}(\mu_a \mid R_1, \dots, R_t) = \mathrm{Beta}\big(\underbrace{S_a(t)}_{\#\text{ones}} + 1, \ \underbrace{N_a(t) - S_a(t)}_{\#\text{zeros}} + 1\big),$
  where $S_a(t) = \sum_{s=1}^{t} R_s \mathbb{1}(A_s = a)$ is the sum of the rewards from arm a.
  [Figure: the posterior density $\pi_a(t)$ and its update $\pi_a(t+1)$, if $X_{t+1} = 1$ or if $X_{t+1} = 0$]
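A sketch of this Beta posterior update for one Bernoulli arm with a uniform prior; the true mean and number of pulls are example values, and `scipy` is used for the Beta distribution.

```python
# Sketch: posterior pi_a(t) = Beta(#ones + 1, #zeros + 1) after observing Bernoulli rewards.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(6)
mu_a, n_pulls = 0.35, 50                          # assumed true mean and number of observations
rewards = rng.random(n_pulls) < mu_a
ones, zeros = int(rewards.sum()), int((~rewards).sum())
posterior = beta(ones + 1, zeros + 1)             # Beta(S_a(t) + 1, N_a(t) - S_a(t) + 1)
print("posterior mean:", posterior.mean(), "95% credible interval:", posterior.interval(0.95))
```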

  46. Bayesian algorithm
  A Bayesian bandit algorithm exploits the posterior distributions of the means to decide which arm to select.
  [Figure: snapshot of the posterior distributions of the arm means during one run]

  47. Bayesian Bandits: Two points of view, Bayes-UCB, Thompson Sampling

  48. The Bayes-UCB algorithm
  ◮ Let $\Pi_0 = (\pi_1(0), \dots, \pi_K(0))$ be a prior distribution over $(\mu_1, \dots, \mu_K)$.
  ◮ Let $\Pi_t = (\pi_1(t), \dots, \pi_K(t))$ be the posterior distribution over the means $(\mu_1, \dots, \mu_K)$ after t observations.
  The Bayes-UCB algorithm chooses at time t
  $A_{t+1} = \underset{a = 1, \dots, K}{\mathrm{argmax}}\ Q\Big(1 - \frac{1}{t (\log t)^c}, \ \pi_a(t)\Big),$
  where $Q(\alpha, \pi)$ is the quantile of order $\alpha$ of the distribution $\pi$.

  49. The Bayes-UCB algorithm: Bernoulli rewards with a uniform prior
  ◮ $\pi_a(0) \overset{\text{i.i.d.}}{\sim} \mathcal{U}([0, 1]) = \mathrm{Beta}(1, 1)$
  ◮ $\pi_a(t) = \mathrm{Beta}\big(S_a(t) + 1, \ N_a(t) - S_a(t) + 1\big)$

  50. The Bayes-UCB algorithm: Gaussian rewards with a Gaussian prior
  ◮ $\pi_a(0) \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \kappa^2)$
  ◮ $\pi_a(t) = \mathcal{N}\Big(\frac{S_a(t)}{N_a(t) + \sigma^2/\kappa^2}, \ \frac{\sigma^2}{N_a(t) + \sigma^2/\kappa^2}\Big)$
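A sketch of one Bayes-UCB decision step in the Bernoulli case; the per-arm statistics passed below are hypothetical numbers, and c is left as a parameter.

```python
# Sketch: pick argmax_a of the quantile of order 1 - 1/(t (log t)^c) of the Beta posterior.
import numpy as np
from scipy.stats import beta

def bayes_ucb_choice(successes, counts, t, c=0):
    level = 1.0 - 1.0 / (t * max(np.log(t), 1.0) ** c)     # quantile order (guarded for small t)
    indices = [beta.ppf(level, s + 1, n - s + 1)            # quantile of Beta(S+1, N-S+1)
               for s, n in zip(successes, counts)]
    return int(np.argmax(indices))

# hypothetical statistics after t = 100 rounds for K = 3 arms
print(bayes_ucb_choice(successes=[10, 4, 1], counts=[60, 30, 10], t=100))
```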

  51. Bayes-UCB in action
  [Figure (movie in the original slides): evolution of the posterior quantiles and of the number of selections of each arm during one run]

  52. Theoretical results in the Bernoulli case
  ◮ Bayes-UCB is asymptotically optimal for Bernoulli rewards.
  Theorem [K., Cappé, Garivier 2012]. Let ε > 0. The Bayes-UCB algorithm using a uniform prior over the arms and parameter $c \geq 5$ satisfies
  $\mathbb{E}_\mu[N_a(T)] \leq \frac{1 + \varepsilon}{\mathrm{kl}(\mu_a, \mu^\star)} \log(T) + o_{\varepsilon, c}(\log(T)).$

  53. Bayesian Bandits: Insights from the Optimal Solution, Bayes-UCB, Thompson Sampling

  54. Historical perspective
  1933  Thompson: in the context of clinical trials, the allocation of a treatment should be some increasing function of its posterior probability of being optimal
  2010  Thompson Sampling rediscovered under different names: Bayesian Learning Automaton [Granmo, 2010], Randomized Probability Matching [Scott, 2010]
  2011  An empirical evaluation of Thompson Sampling: an efficient algorithm, beyond simple bandit models [Chapelle and Li, 2011]
  2012  First (logarithmic) regret bound for Thompson Sampling [Agrawal and Goyal, 2012]
  2012, 2013  Thompson Sampling is asymptotically optimal for Bernoulli bandits [K., Korda and Munos, 2012] [Agrawal and Goyal, 2013]
  2013-  Many successful uses of Thompson Sampling beyond Bernoulli bandits (contextual bandits, reinforcement learning)

  55. Thompson Sampling
  Two equivalent interpretations:
  ◮ "select an arm at random according to its probability of being the best"
  ◮ "draw a possible bandit model from the posterior distribution and act optimally in this sampled model"
  ($\neq$ optimistic!)
  Thompson Sampling is a randomized Bayesian algorithm:
  for all $a \in \{1, \dots, K\}$, sample $\theta_a(t) \sim \pi_a(t)$, then select $A_{t+1} = \underset{a = 1, \dots, K}{\mathrm{argmax}}\ \theta_a(t)$.
  [Figure: posterior densities of two arms, each with one sample $\theta_a(t)$ drawn from it]
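A minimal Thompson Sampling sketch for Bernoulli arms with Beta(1, 1) priors; the means and horizon are illustrative choices, not the experiments shown later.

```python
# Sketch: sample one theta_a from each Beta posterior and play the argmax.
import numpy as np

def thompson_sampling(mu, T, rng):
    K = len(mu)
    successes, failures, total = np.zeros(K), np.zeros(K), 0.0
    for _ in range(T):
        theta = rng.beta(successes + 1, failures + 1)   # one posterior sample per arm
        a = int(np.argmax(theta))
        r = rng.random() < mu[a]                        # Bernoulli reward
        successes[a] += r; failures[a] += 1 - r; total += r
    return mu.max() * T - total                         # realised regret of this run

rng = np.random.default_rng(7)
mu = np.array([0.3, 0.25, 0.2, 0.1])
print("TS regret ~", np.mean([thompson_sampling(mu, 20_000, rng) for _ in range(5)]))
```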

  56. Thompson Sampling is asymptotically optimal
  Problem-dependent regret: for all ε > 0,
  $\mathbb{E}_\mu[N_a(T)] \leq \frac{1 + \varepsilon}{\mathrm{kl}(\mu_a, \mu^\star)} \log(T) + o_{\mu, \varepsilon}(\log(T)).$
  This result holds:
  ◮ for Bernoulli bandits, with a uniform prior [K., Korda, Munos 12] [Agrawal and Goyal 13]
  ◮ for Gaussian bandits, with a Gaussian prior [Agrawal and Goyal 17]
  ◮ for exponential family bandits, with the Jeffreys prior [Korda et al. 13]
  Problem-independent regret [Agrawal and Goyal 13]: for Bernoulli and Gaussian bandits, Thompson Sampling satisfies
  $R_\mu(\mathrm{TS}, T) = O\big(\sqrt{K T \log(T)}\big).$
  ◮ Thompson Sampling is also asymptotically optimal for Gaussian arms with unknown mean and variance [Honda and Takemura, 14]

  57. Understanding Thompson Sampling
  ◮ A key ingredient in the analysis of [K., Korda and Munos 12]:
  Proposition. There exist constants $b = b(\mu) \in (0, 1)$ and $C_b < \infty$ such that
  $\sum_{t=1}^{\infty} \mathbb{P}\big(N_1(t) \leq t^b\big) \leq C_b.$
  Indeed, $\{N_1(t) \leq t^b\}$ means that there exists a time range of length at least $t^{1-b} - 1$ with no draw of arm 1.
  [Figure: posterior densities with the means $\mu_1$, $\mu_2$ and the level $\mu_2 + \delta$ marked]

  58. Bayesian versus frequentist algorithms
  ◮ Short horizon, T = 1000 (average over N = 10000 runs), on K = 2 Bernoulli arms with $\mu_1 = 0.2$ and $\mu_2 = 0.25$.
  [Figure: regret curves of KLUCB, KLUCB+, KLUCB-H+, Bayes-UCB, Thompson Sampling and FH-Gittins]

  59. Bayesian versus frequentist algorithms
  ◮ Long horizon, T = 20000 (average over N = 50000 runs), on a K = 10 Bernoulli arms bandit problem with
  $\mu = [0.1, 0.05, 0.05, 0.05, 0.02, 0.02, 0.02, 0.01, 0.01, 0.01]$.

  60. Other Bandit Models

  61. Other Bandit Models: Many different extensions, Piece-wise stationary bandits, Multi-player bandits

  70. Many other bandit models and problems (1/2)
  Most famous extensions:
  ◮ (centralized) multiple-actions
  ◮ multiple choice: choose $m \in \{2, \dots, K - 1\}$ arms (fixed size)
  ◮ combinatorial: choose a subset of arms $S \subset \{1, \dots, K\}$ (large space)
  ◮ non stationary
  ◮ piece-wise stationary / abruptly changing
  ◮ slowly-varying
  ◮ adversarial...
  ◮ (decentralized) collaborative/communicating bandits over a graph
  ◮ (decentralized) non-communicating multi-player bandits
  $\hookrightarrow$ Implemented in our library SMPyBandits!

  71. Many other bandit models and problems (2/2)
  And many more extensions...
  ◮ non stochastic, rested/restless Markov models
  ◮ best arm identification (vs. reward maximization): fixed budget setting, fixed confidence setting, PAC (probably approximately correct) algorithms
  ◮ bandits with (differential) privacy constraints
  ◮ for some applications (content recommendation): contextual bandits (observe a reward and a context $C_t \in \mathbb{R}^d$), cascading bandits, delayed feedback bandits
  ◮ structured bandits (low-rank, many-armed, Lipschitz, etc.)
  ◮ $\mathcal{X}$-armed, continuous-armed bandits

  72. Other Bandit Models: Many different extensions, Piece-wise stationary bandits, Multi-player bandits

  75. Piece-wise stationary bandits
  Stationary MAB problems: arm a gives rewards sampled from the same distribution at every time step,
  $\forall t, \ r_a(t) \overset{\text{i.i.d.}}{\sim} \nu_a = \mathcal{B}(\mu_a).$
  Non-stationary MAB problems? (Possibly) different distributions at every time step:
  $\forall t, \ r_a(t) \sim \nu_a(t) = \mathcal{B}(\mu_a(t)).$
  $\Rightarrow$ a harder problem! And very hard if $\mu_a(t)$ can change at any step!
  Piece-wise stationary problems!
  $\hookrightarrow$ the literature usually focuses on the easier case, where there are at most $\Upsilon_T = o(\sqrt{T})$ intervals, on which the means are all stationary.
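A small sketch of how such a piece-wise stationary Bernoulli problem can be simulated; the break-point positions and the means below are arbitrary illustrative values, not those of the example that follows.

```python
# Sketch: Bernoulli arms whose means are constant on segments separated by break-points.
import numpy as np

rng = np.random.default_rng(8)
T, K = 5000, 3
breakpoints = [0, 1000, 2000, 3000, 4000]                     # Upsilon_T = 4 changes, 5 segments
means_per_segment = rng.uniform(0.1, 0.9, size=(len(breakpoints), K))

def mu_of_t(t):
    seg = np.searchsorted(breakpoints, t, side="right") - 1   # index of the stationary segment
    return means_per_segment[seg]

rewards_arm0 = [float(rng.random() < mu_of_t(t)[0]) for t in range(T)]
print("average reward of arm 0 over the whole horizon:", np.mean(rewards_arm0))
```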

  76. Example of a piece-wise stationary MAB problem
  We plot the means $\mu_1(t), \mu_2(t), \mu_3(t)$ of K = 3 arms. There are $\Upsilon_T = 4$ break-points, hence 5 stationary sequences, between t = 1 and t = T = 5000.
  [Figure: history of the means of the K = 3 Bernoulli arms, with 4 break-points, over the T = 5000 time steps]
