Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality (PowerPoint PPT Presentation)


  1. Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality. Kwang-Sung Jun, joint work with Chicheng Zhang.

  2. Structured bandits. E.g., linear: 𝒜 = {a₁, …, a_K} ⊂ ℝ^d, ℱ = {a ↦ θᵀa : θ ∈ ℝ^d}.
  • Input: arm set 𝒜, hypothesis class ℱ ⊂ (𝒜 → ℝ), "the set of possible configurations of the mean rewards".
  • Initialize: the environment chooses f* ∈ ℱ (unknown to the learner).
  • For t = 1, …, n:
    • Learner: chooses an arm a_t ∈ 𝒜.
    • Environment: generates the reward r_t = f*(a_t) + (zero-mean stochastic noise).
    • Learner: receives r_t.
  • Goal: minimize the cumulative regret 𝔼[Reg_n] = 𝔼[ n · max_{a∈𝒜} f*(a) − Σ_{t=1}^{n} f*(a_t) ].
  • Note: fixed arm set (= non-contextual), realizability f* ∈ ℱ.
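  To make the protocol concrete, here is a minimal runnable sketch of the interaction loop in Python, using the linear example above. The uniform-random learner and all variable names are illustrative placeholders, not the paper's method.

```python
# Minimal sketch of the structured bandit protocol (linear example).
# The uniform-random learner is a placeholder, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 2, 5, 1000
arms = rng.standard_normal((K, d))         # arm set A = {a_1, ..., a_K} in R^d
theta_star = rng.standard_normal(d)        # environment picks f*(a) = <theta*, a>
f_star = arms @ theta_star                 # mean reward f*(a) for each arm

regret = 0.0
for t in range(n):
    a = rng.integers(K)                    # learner chooses an arm a_t
    r = f_star[a] + rng.standard_normal()  # r_t = f*(a_t) + zero-mean noise
    regret += f_star.max() - f_star[a]     # contribution to E[Reg_n]
print(f"cumulative regret ~ {regret:.1f}")
```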

  3. Structured bandits
  • Why relevant? Techniques may transfer to RL (e.g., ergodic RL [Ok18]).
  • Naive strategy: UCB ⟹ (K/Δ_min)·log n regret bound (instance-dependent).
    • Scales with the number of arms K.
    • Instead, the complexity of the hypothesis class ℱ should appear.
  • The asymptotically optimal regret is well-defined.
    • E.g., linear bandits: c*·log n for some well-defined c* ≪ K/Δ_min.
  • The goal of this paper: achieve asymptotic optimality with improved finite-time regret for any ℱ (the worst-case regret is beyond the scope).
  [Ok18] Ok et al., Exploration in Structured Reinforcement Learning, NeurIPS, 2018.

  4. Asymptotic optimality (instance-dependent)
  • Optimism in the face of uncertainty (e.g., UCB, Thompson sampling) ⟹ optimal asymptotic / worst-case regret in K-armed bandits.
  • Linear bandits: optimal worst-case rate d√n. Asymptotically optimal regret? ⟹ No! (AISTATS'17)
  • Intuition: do they like orange or apple? Maybe have them try lemon and see if they are sensitive to sourness.
  [Figure: arms plotted on (sweet, sour) axes, e.g., (1, 0) and (0.95, 0.1); mean reward = 1·sweet + 0·sour.]

  5. Asymptotic optimality: lower bound
  • 𝔼[Reg_n] ≥ c(f*)·log n (asymptotically), where Δ_a = max_{b∈𝒜} f*(b) − f*(a) and
      c(f*) = min_{δ₁,…,δ_K ≥ 0} Σ_{a=1}^{K} δ_a·Δ_a
      s.t. δ_{a*(f*)} = 0,
           ∀h ∈ 𝒞(f*): Σ_{a=1}^{K} δ_a·KL_ξ(f*(a), h(a)) ≥ 1,
    where 𝒞(f*) is the set of "competing" hypotheses and KL_ξ is the KL divergence under the noise distribution ξ.
  • δ* = (δ₁*, …, δ_K*) ≥ 0: the solution. To be optimal, we must pull arm a about δ_a*·log n times.
    • E.g., δ*_lemon = 8, δ*_orange = 0 ⟹ lemon is the informative arm!
  • When c(f*) = 0: bounded regret! (except for pathological ones [Lattimore14])
  [Lattimore14] Lattimore & Munos, Bounded regret for finite-armed structured bandits, 2014.

  6. Existing asymptotically optimal algorithms
  • Mostly use forced exploration [Lattimore+17, Combes+17, Hao+20]
    ⟹ ensures every arm's pull count is an unbounded function of n, such as log n / log log n
    ⟹ 𝔼[Reg_n] ≲ c(f*)·log n + K·(log n / log log n).
  • Issues:
    1. K appears in the regret* ⟹ what if K is exponentially large?
    2. Cannot achieve bounded regret when c(f*) = 0.
  • Parallel studies avoid forced exploration, but still depend on K [Menard+20, Degenne+20].
  *Dependence on K can be avoided in special cases (e.g., linear).

  7. Contribution
  • Research question: assume ℱ is finite. Can we design an algorithm that
    • enjoys asymptotic optimality,
    • adapts to bounded regret whenever possible,
    • does not necessarily depend on K?
  • Proposed algorithm: CRush Optimism with Pessimism (CROP)
    • No forced exploration 😁
    • The regret scales not with K but with K′ ≤ K (defined in the paper).
    • An interesting log log n term in the regret* (*it's necessary; will be updated on arXiv).

  8. Preliminaries

  9. Assumptions
  • |ℱ| < ∞.
  • The noise model: r_t = f*(a_t) + η_t, where η_t is 1-sub-Gaussian (generalized to σ² in the paper).
  • Notation: a*(f) := argmax_{a∈𝒜} f(a), μ*(f) := f(a*(f)).
    • f supports arm a ⟺ a*(f) = a.
    • f supports reward w ⟺ μ*(f) = w.
  • [Assumption] Every f ∈ ℱ has a unique best arm (i.e., |a*(f)| = 1).
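  As a running convention for the sketches below, a finite class ℱ can be encoded as a |ℱ| × K table of mean rewards; the helpers here implement a*(f), μ*(f), and the two "supports" relations. The encoding and all names are illustrative, not from the paper.

```python
# Encode a hypothesis f over K arms as a length-K vector of mean rewards;
# a finite class F is then a (|F| x K) array. Illustrative encoding only.
import numpy as np

def a_star(f):
    """a*(f) := argmax_a f(a); unique by assumption."""
    return int(np.argmax(f))

def mu_star(f):
    """mu*(f) := f(a*(f)), the optimal mean reward of f."""
    return float(f[a_star(f)])

def supports_arm(f, a):      # f supports arm a   <=>  a*(f) = a
    return a_star(f) == a

def supports_reward(f, w):   # f supports reward w <=> mu*(f) = w
    return mu_star(f) == w
```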

  10. Competing hypotheses
  • 𝒞(f*) consists of f ∈ ℱ such that
    • (1) f assigns the same reward to the best arm a*(f*),
    • (2) but supports a different arm: a*(f) ≠ a*(f*).
  • Importance: it's why we get log(n) regret!
  [Figure: mean rewards of hypotheses f₁, …, f₆ over arms 1–3, with f* at optimal reward .75.]

  11. Lower bound revisited
  • 𝔼[Reg_n] ≥ c(f*)·log n, asymptotically. Assume Gaussian rewards.
  • With Δ_a = max_{b∈𝒜} f*(b) − f*(a):
      c(f*) := min_{δ₁,…,δ_K ≥ 0} Σ_{a=1}^{K} δ_a·Δ_a
      s.t. δ_{a*(f*)} = 0,
           ∀h ∈ 𝒞(f*) ("competing" hypotheses): Σ_{a=1}^{K} δ_a·(f*(a) − h(a))²/2 ≥ 1.
  • Interpretation: δ_a·ln n samples of each a ∈ 𝒜 can distinguish f* from h confidently.
  • The program finds arm-pull allocations that (1) eliminate competing hypotheses and (2) are 'reward'-efficient.
  Agrawal, Teneketzis, Anantharam. Asymptotically Efficient Adaptive Allocation Schemes for Controlled I.I.D. Processes: Finite Parameter Space, 1989.
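  Under the Gaussian assumption the constraints are linear in δ, so c(f*) is a linear program. Below is a sketch using scipy.optimize.linprog and the reward-table encoding from the slide 9 sketch; the competing-set test and all names are illustrative, not the paper's implementation.

```python
# Sketch: solve the lower-bound program c(f*) as an LP (Gaussian noise).
import numpy as np
from scipy.optimize import linprog

def c_of_fstar(F, i_star):
    """F: (|F| x K) reward table with F[i_star] = f*. Returns (c(f*), delta*)."""
    fs = F[i_star]
    astar = int(np.argmax(fs))
    Delta = fs.max() - fs                           # gaps Delta_a
    # competing: same reward at a*(f*), but a different best arm
    comp = [h for h in F
            if h[astar] == fs[astar] and int(np.argmax(h)) != astar]
    if not comp:
        return 0.0, np.zeros(len(fs))               # c(f*) = 0: bounded regret
    # for each competing h: sum_a delta_a * (f*(a) - h(a))^2 / 2 >= 1
    A_ub = -np.array([(fs - h) ** 2 / 2 for h in comp])
    b_ub = -np.ones(len(comp))
    bounds = [(0, 0) if a == astar else (0, None) for a in range(len(fs))]
    res = linprog(c=Delta, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return float(res.fun), res.x
```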

  12. Example: cheating code
  • K′ base arms and log₂ K′ cheating arms.
  • Rewards: base arms in {1 − ε, 1, 1 + ε}, cheating arms in {0, Λ}.
  • ε > 0: very small (like 0.0001); Λ > 0: not too small (like 0.5).

      Arms:  A1     A2     A3     A4     A5   A6
      f₁:    1      1−ε    1−ε    1−ε    0    0
      f₂:    1−ε    1      1−ε    1−ε    0    Λ
      f₃:    1−ε    1−ε    1      1−ε    Λ    0
      f₄:    1−ε    1−ε    1−ε    1      Λ    Λ
      f₅:    1+ε    1      1−ε    1−ε    0    0
      f₆:    1      1+ε    1−ε    1−ε    0    Λ
      f₇:    1−ε    1−ε    1+ε    1      Λ    0
      …      …      …      …      …      …    …

  • The lower bound: Θ((log₂ K′ / Λ²)·ln n).
  • UCB: Θ((K′/ε²)·ln n).
  • Exponential gap in K′!
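  The first K′ rows of this table can be generated mechanically: the {0, Λ} pattern on the cheating arms binary-encodes the index of the best base arm. A small illustrative constructor (function name and bit order are arbitrary choices, not from the paper):

```python
# Sketch: build the first K' hypotheses of the cheating-code class.
import numpy as np

def cheating_code_class(K_base=4, eps=1e-4, Lam=0.5):
    m = int(np.log2(K_base))                    # number of cheating arms
    F = np.full((K_base, K_base + m), 1 - eps)  # default base reward 1 - eps
    for i in range(K_base):
        F[i, i] = 1.0                           # best base arm of f_{i+1}
        for b in range(m):                      # binary code on cheating arms
            F[i, K_base + b] = Lam if (i >> (m - 1 - b)) & 1 else 0.0
    return F
```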

  13. The function classes and their regret contributions
  • 𝒞(f*): Competing ⟹ not distinguishable using a*(f*), but supports a different arm. Regret contribution: Θ(log n).
  • 𝒟(f*): Docile ⟹ distinguishable using a*(f*). Regret contribution: Θ(1).
  • ℰ(f*): Equivalent ⟹ supports a*(f*) and the reward μ*(f*). Regret contribution: Θ(log log n).
  • [Proposition 2] ℱ = 𝒞(f*) ∪ 𝒟(f*) ∪ ℰ(f*) (disjoint union).
  [Figure: the mean-reward plot from slide 10, with ℱ partitioned into 𝒞(f*), 𝒟(f*), and ℰ(f*) around f*.]
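  The membership tests read straight off the definitions, so Proposition 2's partition is a few lines under the same reward-table encoding; this is a sketch with illustrative names, not the paper's code.

```python
# Sketch: partition F into competing C, docile D, and equivalent E, given f*.
import numpy as np

def partition(F, i_star):
    fs = F[i_star]
    astar, mustar = int(np.argmax(fs)), float(fs.max())
    C, D, E = [], [], []
    for j, f in enumerate(F):
        if f[astar] != mustar:            # distinguishable at a*(f*)
            D.append(j)                   # docile     -> Theta(1) regret
        elif int(np.argmax(f)) != astar:  # same reward at a*, other best arm
            C.append(j)                   # competing  -> Theta(log n)
        else:                             # supports a*(f*) and mu*(f*)
            E.append(j)                   # equivalent -> Theta(log log n)
    return C, D, E                        # f* itself lands in E
```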

  14. CRush Optimism with Pessimism (CROP)

  15. CROP: Overview
  • The confidence set (confidence level 1 − 1/poly(t)):
      L_t(f) := Σ_{s=1}^{t} (r_s − f(a_s))²
      ℱ_t := { f ∈ ℱ : L_{t−1}(f) − min_{h∈ℱ} L_{t−1}(h) ≤ γ_t := Θ(ln(t·|ℱ|)) }
    (the minimizer over h is the ERM)
  • Four important branches: Exploit, Feasible, Fallback, Conflict.
  • Exploit:
    • Does every f ∈ ℱ_t support the same best arm?
    • If yes, pull that arm.
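  A sketch of this confidence set under the table encoding; the constant c0 in γ_t is an illustrative stand-in, not the paper's exact choice.

```python
# Sketch: squared-loss confidence set F_t from the history up to time t-1.
import numpy as np

def confidence_set(F, hist_arms, hist_rewards, t, c0=8.0):
    """Indices of f in F whose cumulative loss is within gamma_t of the ERM."""
    if not hist_arms:
        return np.arange(len(F))
    preds = F[:, hist_arms]                                   # (|F| x (t-1))
    L = ((np.asarray(hist_rewards) - preds) ** 2).sum(axis=1) # L_{t-1}(f)
    gamma_t = c0 * np.log(t * len(F))                         # Theta(ln(t|F|))
    return np.flatnonzero(L - L.min() <= gamma_t)
```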

  16. CROP v1
  At time t:
  • Maintain a confidence set ℱ_t ⊆ ℱ. (Cf. optimism: f̃_t = argmax_{f∈ℱ_t} max_{a∈𝒜} f(a).)
  • If every f ∈ ℱ_t agrees on the best arm:
    • (Exploit) pull that arm.
  • Else: (Feasible)
    • Compute the pessimism: f̂_t = argmin_{f∈ℱ_t} max_{a∈𝒜} f(a) (break ties by the cumulative loss).
    • Compute δ* := the solution of the optimization problem c(f̂_t).
    • (Tracking) Pull a_t = argmin_{a∈𝒜} PullCount_t(a)/δ*_a.
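  Putting the pieces together, here is a sketch of the v1 loop, reusing confidence_set and c_of_fstar from the earlier sketches. The Fallback and Conflict branches of the full algorithm, and the paper's exact tie-breaking, are omitted.

```python
# Sketch: CROP v1 (Exploit / Feasible / Tracking only).
# Reuses confidence_set and c_of_fstar defined in the earlier sketches.
import numpy as np

def crop_v1(F, i_true, n, rng=np.random.default_rng(0)):
    K = F.shape[1]
    pulls, hist_a, hist_r = np.zeros(K), [], []
    for t in range(1, n + 1):
        Ft = confidence_set(F, hist_a, hist_r, t)       # F_t subset of F
        best = {int(np.argmax(F[j])) for j in Ft}
        if len(best) == 1:                              # (Exploit)
            a = best.pop()
        else:                                           # (Feasible)
            f_hat = min(Ft, key=lambda j: F[j].max())   # pessimism
            _, delta = c_of_fstar(F, f_hat)             # solution of c(f_hat_t)
            # if delta* = 0 the full algorithm falls back (omitted here)
            ratio = np.where(delta > 0,
                             pulls / np.maximum(delta, 1e-12), np.inf)
            a = int(np.argmin(ratio))                   # (Tracking)
        r = F[i_true, a] + rng.standard_normal()        # r_t = f*(a_t) + noise
        pulls[a] += 1
        hist_a.append(a)
        hist_r.append(r)
    return pulls
```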

  17. Why pessimism?

      Arms:  A1    A2    A3    A4    A5
      f₁:    1     .99   .98   0     0
      f₂:    .98   .99   .98   .25   0
      f₃:    .97   .97   .98   .25   .25

  • Suppose ℱ_t = {f₁, f₂, f₃}.
  • If I knew f*, I could track δ(f*) (= the solution of c(f*)).
  • Which f should I track?
  • Pessimism (here f₃, the smallest optimal reward): it either does the right thing, or eliminates itself.
  • Other choices may get stuck (and so does ERM).
  • Key idea: the LB constraints prescribe how to distinguish f* from the hypotheses supporting higher rewards.
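  Concretely, on this slide's table the pessimistic choice falls out of one argmin (a tiny illustrative demo):

```python
# Sketch: the pessimism on slide 17's table picks the hypothesis with the
# smallest optimal reward, here f3.
import numpy as np

F_t = np.array([
    [1.00, .99, .98, .00, .00],   # f1
    [ .98, .99, .98, .25, .00],   # f2
    [ .97, .97, .98, .25, .25],   # f3
])
f_hat = int(np.argmin(F_t.max(axis=1)))   # pessimism: argmin_f max_a f(a)
print(f"track f{f_hat + 1}")              # -> track f3
```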

  18. But we may still get stuck.

      Arms:  A1    A2    A3    A4     A5
      f₁:    1     .99   .98   0      0      (= f*)
      f₂:    .98   .99   .98   .25    0
      f₃:    .97   .97   1     .25    .25
      f₄:    .97   .97   1     .2499  .25

  • Due to docile hypotheses.
  • We must do something else:
      ω(f) := argmin_{δ ∈ [0,∞)^K} Δ_max(f)·δ_{a*(f)} + Σ_{a ≠ a*(f)} Δ_a(f)·δ_a
      s.t. ∀h ∈ 𝒞(f) ∪ 𝒟(f) with μ*(h) ≥ μ*(f): Σ_a δ_a·(f(a) − h(a))²/2 ≥ 1,
           δ ≥ max(δ(f), ρ(f))   (ρ(f) is defined in the paper).
  • Includes docile hypotheses whose best rewards are higher than μ*(f).
