Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously
Julian Zimmert (University of Copenhagen), Haipeng Luo (University of Southern California), Chen-Yu Wei (University of Southern California)
Semi-bandits Example
Day 1: 15 mins, Day 2: 13 mins, Day 3: 16 mins, ...
Goal: minimize the average commuting time.
Types of Environments
Environments range from i.i.d. (more benign) to adversarial.
◮ Algorithms designed for the i.i.d. case perform badly in the adversarial case.
◮ Algorithms designed for the adversarial case do not take advantage of an i.i.d. environment.
⇒ To achieve optimal performance, the learner needs to know which environment it is in and pick the corresponding algorithm.
Motivation
[Spectrum of environments: i.i.d. (more benign) ... mixed ... adversarial; which one we face is unknown.]
What if:
1. We have no prior knowledge about the environment.
2. The environment is usually i.i.d., but we want to be robust to adversarial attacks.
3. The environment is usually arbitrary, but we want to exploit the benignness when we get lucky.
Our Results ◮ We propose the first semi-bandit algorithm that has optimal performance guarantees in both i.i.d. and adversarial environments, without knowing which environment it is in.
Formalizing Semi-bandits
Given: action set $\mathcal{X} = \{X^{(1)}, X^{(2)}, \ldots\} \subseteq \{0,1\}^d$ (the set of all paths; $d$ = #edges).
For $t = 1, \ldots, T$:
◮ The learner chooses $X_t \in \mathcal{X}$ (choose a path).
◮ The environment reveals $\ell_{ti}$ for each $i$ with $X_{ti} = 1$ (reveal the cost of each chosen edge).
◮ The learner suffers loss $\langle X_t, \ell_t \rangle$ (the path cost).
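To make the protocol concrete, here is a minimal Python sketch of the interaction loop (my own illustration, not from the paper; the three-path graph, the uniform-random learner, and the loss distribution are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5  # number of edges
# Action set: each action is a binary indicator vector over the d edges (a path).
# Hypothetical three-path graph, purely for illustration.
actions = np.array([
    [1, 1, 0, 0, 0],   # path using edges 0, 1
    [0, 0, 1, 1, 0],   # path using edges 2, 3
    [1, 0, 0, 1, 1],   # path using edges 0, 3, 4
])

T = 1000
mean_loss = rng.uniform(0.2, 0.8, size=d)      # an i.i.d. environment, for this demo

learner_cost = 0.0
hindsight_cost = np.zeros(len(actions))        # total cost of every fixed path (for regret only)

for t in range(T):
    loss = np.clip(mean_loss + 0.1 * rng.standard_normal(d), 0.0, 1.0)  # today's edge costs

    X = actions[rng.integers(len(actions))]    # learner picks a path (uniformly at random here)
    observed = {i: loss[i] for i in range(d) if X[i] == 1}  # semi-bandit feedback: chosen edges only
    learner_cost += sum(observed.values())     # suffer the path cost <X_t, loss_t>

    hindsight_cost += actions @ loss           # not visible to the learner; used only to measure regret

regret = learner_cost - hindsight_cost.min()
print(f"regret of the uniform-random learner over {T} rounds: {regret:.1f}")
```

Replacing the uniform-random choice with a smarter rule is exactly what the rest of the talk is about.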
Semi-bandits Regret Bounds
Goal: minimize
$$\mathrm{Regret} = \underbrace{\mathbb{E}\Big[\sum_{t=1}^{T} \langle X_t, \ell_t \rangle\Big]}_{\text{learner's total cost}} - \underbrace{\min_{X \in \mathcal{X}} \mathbb{E}\Big[\sum_{t=1}^{T} \langle X, \ell_t \rangle\Big]}_{\text{best fixed action's total cost}}.$$
◮ When the $\ell_t$ are i.i.d.: $\mathrm{Regret} = \Theta(\log T)$.
◮ When the $\ell_t$ are adversarially generated: $\mathrm{Regret} = \Theta(\sqrt{T})$.
Our algorithm: always $O(\sqrt{T})$, but gets $O(\log T)$ when the losses happen to be i.i.d.
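As a side note (mine, not on the slide): in the i.i.d. case with mean loss vector $\mu = \mathbb{E}[\ell_t]$ and best action $X^* = \operatorname{argmin}_{X \in \mathcal{X}} \langle X, \mu \rangle$, the regret can be rewritten and lower bounded as
$$\mathrm{Regret} = \sum_{t=1}^{T} \mathbb{E}\big[\langle X_t - X^*, \mu \rangle\big] \;\ge\; \sum_{t=1}^{T} \Delta_{\min} \Pr[X_t \neq X^*],$$
where $\Delta_{\min}$ is the gap between the expected losses of the second-best and the best action. This inequality reappears in the self-bounding argument near the end of the talk.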
Related Work in Multi-armed Bandits (MAB)
MAB is the special case of semi-bandits with $\mathcal{X} = \{e_1, \ldots, e_d\}$.

Algorithm | Idea
SAO [BS12], SAPO [AC16] | i.i.d. algorithm + non-i.i.d. detection
EXP3++ [SS14, SL17] | adversarial algorithm (EXP3) + sophisticated exploration mechanism
BROAD [WL18], T-INF [ZS19] | adversarial algorithm (FTRL with a special regularizer) + improved analysis (optimal)

Our work generalizes the ideas of [WL18] and [ZS19] to semi-bandits.
Algorithm: Follow-the-Regularized-Leader (FTRL)
Learning rate $\eta_t = 1/\sqrt{t}$, regularizer $\Psi$.
For $t = 1, 2, 3, \ldots$:
◮ Compute
$$x_t = \operatorname*{argmin}_{x \in \mathrm{Conv}(\mathcal{X})} \Big\langle x, \sum_{s=1}^{t-1} \hat{\ell}_s \Big\rangle + \eta_t^{-1} \Psi(x).$$
◮ Sample $X_t$ such that $\mathbb{E}[X_t] = x_t$, and observe $\ell_{ti}$ for each $i$ with $X_{ti} = 1$.
◮ Construct an unbiased estimator $\hat{\ell}_t$ of $\ell_t$: $\hat{\ell}_{ti} = \dfrac{\ell_{ti}\,\mathbf{1}[X_{ti}=1]}{x_{ti}}$.
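Below is a rough Python sketch of one round of this procedure for the $m$-set case $\mathcal{X} = \{X \in \{0,1\}^d : \|X\|_1 = m\}$ (my own illustration, not the authors' code). It hands the argmin to a generic solver over $\mathrm{Conv}(\mathcal{X}) = \{x \in [0,1]^d : \sum_i x_i = m\}$ and samples $X_t$ with the required marginals via systematic (Madow) sampling; numerical corner cases are ignored.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

def psi(x):
    """Two-sided hybrid regularizer: sum_i [ -sqrt(x_i) + (1 - x_i) * log(1 - x_i) ]."""
    return np.sum(-np.sqrt(x) + xlogy(1.0 - x, 1.0 - x))

def ftrl_point(cum_loss_hat, eta, m, eps=1e-6):
    """x_t = argmin_{x in Conv(X)} <x, sum_s lhat_s> + (1/eta) * Psi(x),
    where Conv(X) = {x in [0,1]^d : sum_i x_i = m} (the m-set case)."""
    d = len(cum_loss_hat)
    objective = lambda x: x @ cum_loss_hat + psi(x) / eta
    x0 = np.full(d, m / d)                               # feasible starting point
    res = minimize(objective, x0, method="SLSQP",
                   bounds=[(eps, 1.0 - eps)] * d,
                   constraints=[{"type": "eq", "fun": lambda x: x.sum() - m}])
    return res.x

def madow_sample(x, rng):
    """Sample X in {0,1}^d with exactly m ones and E[X_i] = x_i (systematic sampling)."""
    cdf = np.concatenate(([0.0], np.cumsum(x)))          # cdf[-1] == m (up to solver tolerance)
    u = rng.uniform()
    X = np.zeros(len(x), dtype=int)
    for k in range(int(round(cdf[-1]))):                 # one pick per unit of total mass
        i = min(np.searchsorted(cdf, u + k, side="right") - 1, len(x) - 1)
        X[i] = 1
    return X

# One round of the algorithm on a toy instance.
rng = np.random.default_rng(0)
d, m, t = 6, 2, 10
cum_loss_hat = rng.uniform(0.0, t, size=d)   # stand-in for the sum of past loss estimates
eta_t = 1.0 / np.sqrt(t)

x_t = ftrl_point(cum_loss_hat, eta_t, m)     # FTRL point in Conv(X)
X_t = madow_sample(x_t, rng)                 # action with E[X_t] = x_t
loss_t = rng.uniform(0.0, 1.0, size=d)       # environment's losses this round

# Importance-weighted estimator: lhat_ti = loss_ti * 1[X_ti = 1] / x_ti (unbiased given x_t).
loss_hat_t = np.where(X_t == 1, loss_t / x_t, 0.0)
print(x_t.round(3), X_t, loss_hat_t.round(3))
```

For a general action set one would additionally need to decompose $x_t$ into a convex combination of actions in order to sample $X_t$; Madow sampling is a shortcut that works for $m$-sets.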
Regularizer (Key Contribution)
Two-sided hybrid regularizer:
$$\Psi(x) = \underbrace{\sum_{i=1}^{d} \big({-\sqrt{x_i}}\big)}_{\text{[AB09]'s Poly-INF}} + \underbrace{\sum_{i=1}^{d} (1 - x_i)\log(1 - x_i)}_{\text{neg-entropy for the complement}}.$$
Intuition:
◮ When $x_i$ is close to 0, the learner is starved for information ⇒ like a bandit problem ⇒ use the optimal regularizer for bandits (Poly-INF).
◮ When $x_i$ is close to 1 ⇒ like a full-information problem ⇒ use the optimal regularizer for full information (neg-entropy).
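One way to see these two regimes (my own remark, not on the slide) is through the curvature of the per-coordinate regularizer $\psi(x_i) = -\sqrt{x_i} + (1 - x_i)\log(1 - x_i)$:
$$\psi''(x_i) = \frac{1}{4}\, x_i^{-3/2} + \frac{1}{1 - x_i},$$
so near $x_i = 0$ the curvature is dominated by the Poly-INF term $\frac{1}{4} x_i^{-3/2}$ (bandit-style behaviour), while near $x_i = 1$ it is dominated by the $\frac{1}{1 - x_i}$ term coming from the negative entropy of the complement (full-information-style behaviour).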
Results Overview

Env. \ $\mathcal{X}$ | General | $\{X \in \{0,1\}^d : \|X\|_1 = m\}$ | $\{0,1\}^d$
i.i.d. | $\dfrac{md \log T}{\Delta_{\min}}$ | $\sum_{i > m} \dfrac{\log T}{\Delta_i}$ | $\sum_{i} \dfrac{\log T}{\Delta_i}$
Adversarial | $\sqrt{mdT}$ | $\sqrt{mdT}$ for $m \le \frac{d}{2}$; $(d - m)\sqrt{T \log d}$ for $m > \frac{d}{2}$ | $d\sqrt{T}$

$m \triangleq \max_{X \in \mathcal{X}} \|X\|_1$. $\Delta_{\min} = \mathbb{E}[\text{second-best action's loss}] - \mathbb{E}[\text{best action's loss}]$ (minimal optimality gap).
Analysis Steps
1. Analyze FTRL with the new regularizer and get $O(\sqrt{T})$ for the adversarial setting.
2. Further use the self-bounding technique to get $O(\log T)$ for the i.i.d. setting.
Analyzing FTRL with the New Regularizer
Key lemma:
$$\mathrm{Reg} \le \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \sum_{i} \min\Big\{ \sqrt{x_{ti}},\; (1 - x_{ti})\Big(1 + \log\frac{1}{1 - x_{ti}}\Big) \Big\}.$$
Remarks:
1. The analysis is mostly standard but needs more care (do not drop certain terms as is done in the usual analysis).
2. The two-sidedness of the regularizer is the key to getting the "$\min\{\cdot, \cdot\}$".
3. From this bound, we easily get the $O(\sqrt{T})$ bound.
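To make remark 3 concrete, here is a short sketch (my own, under the assumption that every action satisfies $\|X\|_1 \le m$, so that $\sum_i x_{ti} \le m$): bounding the min by its first argument and applying Cauchy-Schwarz,
$$\sum_{i} \min\{\cdot, \cdot\} \le \sum_{i} \sqrt{x_{ti}} \le \sqrt{d \sum_{i} x_{ti}} \le \sqrt{md},
\qquad\text{hence}\qquad
\mathrm{Reg} \le \sum_{t=1}^{T} \frac{\sqrt{md}}{\sqrt{t}} \le 2\sqrt{mdT}.$$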
Self-bounding to Get the $O(\log T)$ Bound
$$\mathrm{Reg} \le \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \underbrace{\sum_{i} \min\Big\{ \sqrt{x_{ti}},\; (1 - x_{ti})\Big(1 + \log\frac{1}{1 - x_{ti}}\Big) \Big\}}_{\text{Goal: upper bound this by } C\sqrt{\Pr[X_t \neq X^*]}}$$
Intuitively true: $\Pr[X_t \neq X^*] \to 0 \;\Rightarrow\; x_t \to X^* \;\Rightarrow\;$ the above expression $\to 0$.
Assume it is proved. Then
$$\sum_{t} \Delta_{\min} \Pr[X_t \neq X^*] \;\le\; \mathrm{Reg} \;\le\; \sum_{t} \frac{C\sqrt{\Pr[X_t \neq X^*]}}{\sqrt{t}} \;\le\; \sum_{t} \Big( \frac{\Delta_{\min} \Pr[X_t \neq X^*]}{2} + \frac{C^2}{2t\,\Delta_{\min}} \Big) \quad \text{(AM-GM)}.$$
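The slide stops at the AM-GM step; rearranging (a standard finish, added here for completeness) gives the logarithmic bound. Since the first sum is at most $\mathrm{Reg}$ and $\sum_{t=1}^{T} \frac{1}{t} \le 1 + \log T$,
$$\mathrm{Reg} \le \frac{\mathrm{Reg}}{2} + \frac{C^2 (1 + \log T)}{2\Delta_{\min}}
\quad\Longrightarrow\quad
\mathrm{Reg} \le \frac{C^2 (1 + \log T)}{\Delta_{\min}} = O\!\left(\frac{\log T}{\Delta_{\min}}\right).$$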