
On the Complexity of Best Arm Identification in Multi-Armed Bandit Models



  1. On the Complexity of Best Arm Identification in Multi-Armed Bandit Models. Aurélien Garivier, Institut de Mathématiques de Toulouse. Information Theory, Learning and Big Data, Simons Institute, Berkeley, March 2015.

  2. Simple Multi-Armed Bandit Model. Roadmap:
     1 Simple Multi-Armed Bandit Model
     2 Complexity of Best Arm Identification
       Lower bounds on the complexities: Gaussian Feedback, Binary Feedback

  3. Simple Multi-Armed Bandit Model. The (stochastic) Multi-Armed Bandit Model.
     Environment: K arms with parameters θ = (θ_1, ..., θ_K) such that, for any choice of arm A_t ∈ {1, ..., K} at time t, one receives the reward X_t = X_{A_t, t}, where X_{a,s} ∼ ν_a for every 1 ≤ a ≤ K and s ≥ 1, and the (X_{a,s})_{a,s} are independent.
     Reward distributions: ν_a ∈ F, a parametric family or not (canonical exponential family, general bounded rewards). Example, Bernoulli rewards: θ ∈ [0, 1]^K, ν_a = B(θ_a).
     Strategy: the agent's actions follow a dynamical strategy π = (π_1, π_2, ...) such that A_t = π_t(X_1, ..., X_{t−1}).
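The model above translates directly into a tiny simulator. The following Python sketch (not from the slides; the class and function names are illustrative) instantiates the Bernoulli example θ ∈ [0, 1]^K:

```python
import numpy as np

class BernoulliBandit:
    """K-armed environment: arm a returns a reward drawn from B(theta_a),
    independently across pulls, matching the Bernoulli example above."""
    def __init__(self, theta, seed=None):
        self.theta = np.asarray(theta, dtype=float)  # theta = (theta_1, ..., theta_K)
        self.K = self.theta.size
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        """Observe X_{a,t} ~ B(theta_a)."""
        return float(self.rng.random() < self.theta[a])

# A strategy pi = (pi_1, pi_2, ...) maps past observations to the next arm A_t;
# uniform random sampling is the simplest (and weakest) example of such a rule.
def uniform_strategy(past_rewards, K, rng):
    return int(rng.integers(K))
```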

  4. Simple Multi-Armed Bandit Model. Real challenges.
     Randomized clinical trials: the original motivation since the 1930s; dynamic strategies can save resources.
     Recommender systems: advertisement, website optimization, news, blog posts, ...
     Computer experiments: large systems can be simulated in order to optimize some criterion over a set of parameters, but the simulation cost may be high, so that only few choices are possible for the parameters.
     Games and planning (tree-structured options).

  5. Simple Multi-Armed Bandit Model. Performance Evaluation: Cumulated Regret.
     Cumulated reward: S_T = Σ_{t=1}^T X_t.
     Goal: choose π so as to maximize
       E[S_T] = E[ Σ_{t=1}^T Σ_{a=1}^K E[ X_t 1{A_t = a} | X_1, ..., X_{t−1} ] ] = Σ_{a=1}^K µ_a E[N^π_a(T)],
     where N^π_a(T) = Σ_{t ≤ T} 1{A_t = a} is the number of draws of arm a up to time T, and µ_a = E(ν_a).
     Regret minimization: maximizing E[S_T] is equivalent to minimizing
       R_T = T µ* − E[S_T] = Σ_{a : µ_a < µ*} (µ* − µ_a) E[N^π_a(T)],
     where µ* = max{ µ_a : 1 ≤ a ≤ K }.
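The regret decomposition can be checked numerically. A small helper (an illustrative sketch assuming NumPy) that evaluates R_T = Σ_{a: µ_a < µ*} (µ* − µ_a) E[N_a(T)] from the arm means and the (expected) draw counts:

```python
import numpy as np

def regret(mu, expected_counts):
    """R_T = sum over suboptimal arms of (mu* - mu_a) * E[N_a(T)].
    Optimal arms contribute zero, so summing over all arms is equivalent."""
    mu = np.asarray(mu, dtype=float)
    counts = np.asarray(expected_counts, dtype=float)
    return float(np.sum((mu.max() - mu) * counts))

# Example with illustrative numbers: mu = (0.5, 0.4, 0.3), 100 draws of each suboptimal arm.
print(regret([0.5, 0.4, 0.3], [800, 100, 100]))  # (0.1 + 0.2) * 100 = 30.0
```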

  6. Simple Multi-Armed Bandit Model. Upper Confidence Bound Strategies.
     UCB [Lai & Robbins '85; Agrawal '95; Auer et al. '02]: construct an upper confidence bound for the expected reward of each arm,
       S_a(t)/N_a(t) + sqrt( log(t) / (2 N_a(t)) )
     (estimated reward plus exploration bonus), and choose the arm with the highest UCB.
     It is an index strategy [Gittins '79]; its behavior is easily interpretable and intuitively appealing.
     Listen to Robert Nowak's talk tomorrow!
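A minimal sketch of this index policy (Python, illustrative; arms never pulled are handled by giving them an infinite index, and ties go to the first maximizer):

```python
import math

def ucb_index(sum_rewards, n_pulls, t):
    """Index from the slide: estimated reward S_a(t)/N_a(t) plus the
    exploration bonus sqrt(log(t) / (2 N_a(t)))."""
    return sum_rewards / n_pulls + math.sqrt(math.log(t) / (2 * n_pulls))

def ucb_choose(sums, counts, t):
    """Choose the arm with the highest upper confidence bound."""
    indices = [math.inf if n == 0 else ucb_index(s, n, t)
               for s, n in zip(sums, counts)]
    return max(range(len(indices)), key=indices.__getitem__)
```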

  7. Simple Multi-Armed Bandit Model. Optimality? Generalization of [Lai & Robbins '85].
     Theorem [Burnetas and Katehakis '96]: if π is a uniformly efficient strategy, then for any θ ∈ [0, 1]^K and any suboptimal arm a,
       liminf_{T→∞} E[N_a(T)] / log(T) ≥ 1 / K_inf(ν_a, µ*),
     where K_inf(ν_a, µ*) = inf { KL(ν_a, ν') : ν' ∈ F_a, E(ν') ≥ µ* }.
     Idea: change of distribution.
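For Bernoulli arms, K_inf(ν_a, µ*) reduces to the binary relative entropy kl(θ_a, µ*), so the lower bound can be evaluated numerically. A quick check with illustrative numbers (not from the talk):

```python
import math

def kl_bernoulli(p, q):
    """Binary relative entropy kl(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    eps = 1e-12  # clip away from {0, 1} to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# A suboptimal Bernoulli arm with mean 0.4 when the best mean is 0.5: the theorem forces
# roughly log(T) / kl(0.4, 0.5) draws of that arm for any uniformly efficient strategy.
print(math.log(10_000) / kl_bernoulli(0.4, 0.5))  # about 457 draws at horizon T = 10^4
```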

  8. Simple Multi-Armed Bandit Model. Reaching Optimality: Empirical Likelihood.
     The KL-UCB Algorithm, AoS 2013, joint work with O. Cappé, O.-A. Maillard, R. Munos, G. Stoltz.
     Parameters: an operator Π_F : M_1(S) → F; a non-decreasing function f : N → R.
     Initialization: pull each arm of {1, ..., K} once.
     for t = K to T − 1 do
       compute for each arm a the quantity
         U_a(t) = sup { E(ν) : ν ∈ F and KL( Π_F(ν̂_a(t)), ν ) ≤ f(t) / N_a(t) }
       pick an arm A_{t+1} ∈ argmax_{a ∈ {1,...,K}} U_a(t)
     end for
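For Bernoulli rewards, the supremum defining U_a(t) can be computed by bisection, since q ↦ kl(µ̂_a(t), q) is increasing on [µ̂_a(t), 1]. A sketch (illustrative, reusing kl_bernoulli from the sketch above):

```python
import math

def klucb_index(mean, n_pulls, f_t, tol=1e-6):
    """Largest q in [mean, 1] with kl(mean, q) <= f(t) / N_a(t), found by bisection."""
    level = f_t / n_pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

def klucb_choose(sums, counts, t):
    """One round of KL-UCB with f(t) = log(t) + log(log(t)), the choice used
    in the regret bound on the next slide (requires t >= 2)."""
    f_t = math.log(t) + math.log(math.log(t))
    indices = [klucb_index(s / n, n, f_t) for s, n in zip(sums, counts)]
    return max(range(len(indices)), key=indices.__getitem__)
```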

  9. Simple Multi-Armed Bandit Model. Regret bound.
     Theorem: assume that F is the set of finitely supported probability distributions over S = [0, 1], that µ_a > 0 for all arms a and that µ* < 1. There exists a constant M(ν_a, µ*) > 0, depending only on ν_a and µ*, such that with the choice f(t) = log(t) + log(log(t)) for t ≥ 2, for all T ≥ 3:
       E[N_a(T)] ≤ log(T) / K_inf(ν_a, µ*)
                   + 36 / (µ*)^4 · (log(T))^{4/5} · log(log(T))
                   + ( 72 / (µ*)^4 + 2 µ* / ( (1 − µ*) K_inf(ν_a, µ*)^2 ) ) · (log(T))^{4/5}
                   + (1 − µ*)^2 M(ν_a, µ*) · (log(T))^{2/5}
                   + ( 2 (µ*)^2 / K_inf(ν_a, µ*)^2 + 2 µ* / ( (1 − µ*) K_inf(ν_a, µ*) ) ) · log(log(T)) + 4.


  11. Complexity of Best Arm Identification. Roadmap:
     1 Simple Multi-Armed Bandit Model
     2 Complexity of Best Arm Identification
       Lower bounds on the complexities: Gaussian Feedback, Binary Feedback

  12. Complexity of Best Arm Identification. Best Arm Identification: Strategies.
     A two-armed bandit model is a pair ν = (ν_1, ν_2) of probability distributions ('arms') with respective means µ_1 and µ_2; a* = argmax_a µ_a is the (unknown) best arm.
     A strategy consists of:
       a sampling rule (A_t)_{t ∈ N}, where A_t ∈ {1, 2} is the arm chosen at time t (based on past observations) and a sample Z_t ∼ ν_{A_t} is observed;
       a stopping rule τ, indicating when the agent stops sampling the arms;
       a recommendation rule â_τ ∈ {1, 2}, indicating which arm the agent thinks is best (at the end of the interaction).
     In classical A/B testing, the sampling rule A_t is uniform on {1, 2} and the stopping rule τ = t is fixed in advance.

  13. Complexity of Best Arm Identification. Best Arm Identification.
     Joint work with Emilie Kaufmann and Olivier Cappé (Telecom ParisTech).
     Goal: design a strategy A = ((A_t), τ, â_τ) such that:
       Fixed-budget setting: τ = t is fixed in advance; p_t(ν) := P_ν(â_t ≠ a*) as small as possible.
       Fixed-confidence setting: P_ν(â_τ ≠ a*) ≤ δ; E_ν[τ] as small as possible.
     See also: [Mannor & Tsitsiklis '04], [Even-Dar et al. '06], [Audibert et al. '10], [Bubeck et al. '11, '13], [Kalyanakrishnan et al. '12], [Karnin et al. '13], [Jamieson et al. '14], ...

  14. Complexity of Best Arm Identification. Two possible goals.
     Goal: design a strategy A = ((A_t), τ, â_τ) such that:
       Fixed-budget setting: τ = t is fixed in advance; p_t(ν) := P_ν(â_t ≠ a*) as small as possible.
       Fixed-confidence setting: P_ν(â_τ ≠ a*) ≤ δ; E_ν[τ] as small as possible.
     In the particular case of uniform sampling:
       Fixed-budget setting: a classical test of (µ_1 > µ_2) against (µ_1 < µ_2) based on t samples.
       Fixed-confidence setting: a sequential test of (µ_1 > µ_2) against (µ_1 < µ_2) with probability of error uniformly bounded by δ.
     [Siegmund '85]: sequential tests can save samples!
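To make the comparison concrete, here is a schematic fixed-confidence procedure with uniform sampling for two unit-variance Gaussian arms (a sketch under that Gaussian assumption; the stopping threshold is a conservative union-bound calibration, not the optimal one discussed in the talk):

```python
import math
import numpy as np

def uniform_sequential_test(sample1, sample2, delta, max_rounds=10**6):
    """Pull both arms once per round; stop when the empirical gap exceeds twice a
    time-uniform confidence radius, then recommend the empirically best arm.
    For 1-subgaussian rewards this is delta-PAC (conservatively calibrated)."""
    sums = np.zeros(2)
    for t in range(1, max_rounds + 1):
        sums[0] += sample1()
        sums[1] += sample2()
        means = sums / t
        radius = math.sqrt(2 * math.log(4 * t * (t + 1) / delta) / t)
        if abs(means[0] - means[1]) > 2 * radius:
            return int(np.argmax(means)), t  # (recommended arm, stopping round tau)
    return int(np.argmax(sums)), max_rounds

# Example run with hypothetical arms N(0.5, 1) and N(0.0, 1) and risk delta = 0.05.
rng = np.random.default_rng(0)
best, tau = uniform_sequential_test(lambda: rng.normal(0.5, 1.0),
                                    lambda: rng.normal(0.0, 1.0), delta=0.05)
print(best, tau)
```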

  15. Complexity of Best Arm Identification. The complexities of best-arm identification.
     For a class M of bandit models, an algorithm A = ((A_t), τ, â_τ) is:
       consistent on M if, for all ν ∈ M, p_t(ν) = P_ν(â_t ≠ a*) → 0 as t → ∞ (fixed-budget setting);
       δ-PAC on M if, for all ν ∈ M, P_ν(â_τ ≠ a*) ≤ δ (fixed-confidence setting).
     From the literature:
       fixed budget: p_t(ν) ≃ exp( −t / (C H(ν)) ) [Audibert et al. '10], [Bubeck et al. '11], [Bubeck et al. '13], ...
       fixed confidence: E_ν[τ] ≃ C' H'(ν) log(1/δ) [Mannor & Tsitsiklis '04], [Even-Dar et al. '06], [Kalyanakrishnan et al. '12], ...
     ⇒ two complexities:
       κ_B(ν) = inf_{A consistent} ( limsup_{t→∞} −(1/t) log p_t(ν) )^{−1}
       κ_C(ν) = inf_{A δ-PAC} limsup_{δ→0} E_ν[τ] / log(1/δ)
     For a probability of error ≤ δ: a budget t ≃ κ_B(ν) log(1/δ) in the fixed-budget setting, and E_ν[τ] ≃ κ_C(ν) log(1/δ) in the fixed-confidence setting.
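The exponential decay of p_t(ν) that defines κ_B is easy to visualize by simulation. A small Monte-Carlo sketch (assumed unit-variance Gaussian arms, uniform sampling, illustrative parameters), exploiting the fact that the empirical mean of t pulls of a N(µ_a, 1) arm is exactly N(µ_a, 1/t):

```python
import numpy as np

def error_probability(mu1, mu2, t, n_trials=100_000, seed=0):
    """Monte-Carlo estimate of p_t(nu) = P(hat a_t != a*) for uniform sampling of two
    N(mu, 1) arms and the 'recommend the empirically best arm' rule."""
    rng = np.random.default_rng(seed)
    m1 = rng.normal(mu1, 1.0 / np.sqrt(t), size=n_trials)  # empirical mean of arm 1
    m2 = rng.normal(mu2, 1.0 / np.sqrt(t), size=n_trials)  # empirical mean of arm 2
    return float(np.mean(m1 <= m2))  # wrong recommendation when arm 1 is the best

# The error decays exponentially in t (roughly like exp(-t (mu1 - mu2)^2 / 4) here),
# which is exactly the kind of rate that kappa_B(nu) quantifies.
for t in (25, 50, 100):
    print(t, error_probability(0.5, 0.0, t))
```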

  16. Complexity of Best Arm Identification. Lower bounds on the complexities: changes of distribution.
     Theorem (how to use, and hide, the change of distribution): let ν and ν' be two bandit models with K arms such that, for all a, the distributions ν_a and ν'_a are mutually absolutely continuous. For any almost-surely finite stopping time σ with respect to (F_t),
       Σ_{a=1}^K E_ν[N_a(σ)] KL(ν_a, ν'_a) ≥ sup_{E ∈ F_σ} kl( P_ν(E), P_ν'(E) ),
     where kl(x, y) = x log(x/y) + (1 − x) log( (1 − x)/(1 − y) ).
     Useful remark: for all δ ∈ [0, 1], kl(δ, 1 − δ) ≥ log( 1 / (2.4 δ) ).
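The "useful remark" is what turns this theorem into an explicit sample-complexity bound. For a δ-PAC algorithm (with δ ≤ 1/2), apply the inequality with σ = τ, the event E = {â_τ ≠ a*}, and an alternative model ν' in which a* is no longer the best arm, so that P_ν(E) ≤ δ while P_ν'(E) ≥ 1 − δ; since kl(x, y) is decreasing in x and increasing in y when x ≤ y, this yields the following chain (standard reasoning, spelled out here for completeness):

       Σ_{a=1}^K E_ν[N_a(τ)] KL(ν_a, ν'_a) ≥ kl( P_ν(E), P_ν'(E) ) ≥ kl(δ, 1 − δ) ≥ log( 1 / (2.4 δ) ),

and since E_ν[τ] = Σ_a E_ν[N_a(τ)], this gives E_ν[τ] ≥ log( 1 / (2.4 δ) ) / max_a KL(ν_a, ν'_a), to be optimized over the choice of the alternative model ν'.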
