Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality
Kwang-Sung Jun, joint work with Chicheng Zhang
Structured bandits

• E.g., linear: $\mathcal{A} = \{a_1, \dots, a_K\} \subseteq \mathbb{R}^d$, $\mathcal{F} = \{a \mapsto a^\top \theta : \theta \in \mathbb{R}^d\}$.
• Input: arm set $\mathcal{A}$, hypothesis class $\mathcal{F} \subseteq (\mathcal{A} \to \mathbb{R})$, "the set of possible configurations of the mean rewards".
• Initialize: the environment chooses $f^* \in \mathcal{F}$ (unknown to the learner).
• For $t = 1, \dots, n$:
  • Learner: chooses an arm $A_t \in \mathcal{A}$.
  • Environment: generates the reward $y_t = f^*(A_t) + (\text{zero-mean stochastic noise})$.
  • Learner: receives $y_t$.
• Goal: minimize the cumulative regret $\mathbb{E}[\mathrm{Reg}_n] = \mathbb{E}\Big[\sum_{t=1}^{n} \big(\max_{a \in \mathcal{A}} f^*(a) - f^*(A_t)\big)\Big]$.
• Note: fixed arm set (= non-contextual), realizability $f^* \in \mathcal{F}$.
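A minimal sketch of this interaction protocol, assuming Gaussian noise and a hypothetical `learner` object exposing `choose`/`update` (these names are for illustration only, not from the paper):

```python
import numpy as np

def run_structured_bandit(f_star, learner, n, seed=None):
    """Play n rounds against the mean-reward vector f_star; return the cumulative (pseudo-)regret.

    `learner` is any object with choose(t) -> arm index and update(arm, reward);
    this interface is only illustrative.
    """
    rng = np.random.default_rng(seed)
    best = f_star.max()
    regret = 0.0
    for t in range(1, n + 1):
        a = learner.choose(t)              # learner picks A_t
        y = f_star[a] + rng.normal()       # y_t = f*(A_t) + zero-mean noise
        learner.update(a, y)               # the learner only ever sees y_t
        regret += best - f_star[a]         # max_a f*(a) - f*(A_t)
    return regret
```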
Structured bandits

• Why relevant? Techniques may transfer to RL (e.g., ergodic RL [Ok18]).
• Naive strategy: UCB ⟹ $\frac{K}{\Delta}\log n$ regret bound (instance-dependent)
  • Scales with the number of arms $K$.
  • Instead, the complexity of the hypothesis class $\mathcal{F}$ should appear.
• The asymptotically optimal regret is well-defined.
  • E.g., linear bandits: $c^* \cdot \log n$ for some well-defined $c^* \ll \frac{K}{\Delta}$.

The goal of this paper: achieve the asymptotic optimality with improved finite-time regret for any $\mathcal{F}$ (the worst-case regret is beyond the scope).

[Ok18] Ok et al., Exploration in Structured Reinforcement Learning, NeurIPS, 2018.
Asymptotic optimality (instance-dependent)

• Optimism in the face of uncertainty (e.g., UCB, Thompson sampling) ⟹ optimal asymptotic / worst-case regret in $K$-armed bandits.
• Linear bandits: optimal worst-case rate $= d\sqrt{n}$.
• Asymptotically optimal regret? ⟹ No! (AISTATS'17)

[Figure: arms as 2D (sweet, sour) feature vectors, e.g., $(1, 0)$ and $(0.95, 0.1)$, with mean reward $= 1\cdot\text{sweet} + 0\cdot\text{sour}$. Speech bubble: "Do they like orange or apple? Maybe have them try lemon and see if they are sensitive to sourness."]
Asymptotic optimality: lower bound

• $\mathbb{E}[\mathrm{Reg}_n] \ge c(f^*) \cdot \log n$ (asymptotically), where
$$c(f^*) \ =\ \min_{\delta_1, \dots, \delta_K \ge 0} \ \sum_{i=1}^{K} \delta_i \cdot \Delta_i, \qquad \Delta_i = \max_{a} f^*(a) - f^*(i),$$
$$\text{s.t.} \quad \delta_i = 0 \ \text{whenever} \ \Delta_i = 0, \qquad \forall g \in \mathcal{C}(f^*): \ \sum_{i=1}^{K} \delta_i \cdot \mathrm{KL}\big(\nu_{f^*(i)},\, \nu_{g(i)}\big) \ \ge\ 1,$$
where $\mathcal{C}(f^*)$ is the set of "competing" hypotheses and the KL divergence is taken with respect to the noise distribution $\nu$.
• $\delta^* = (\delta_1^*, \dots, \delta_K^*) \ge 0$: the solution.
• To be optimal, we must pull arm $i$ roughly $\delta_i^* \cdot \log n$ times.
• E.g., $\delta_{\text{lemon}} = 8$, $\delta_{\text{orange}} = 0$ ⟹ lemon is the informative arm!
• When $c(f^*) = 0$: bounded regret! (except for pathological cases [Lattimore14])

[Lattimore14] Lattimore & Munos, Bounded regret for finite-armed structured bandits, 2014.
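For reference, when the noise is Gaussian with unit variance (the case assumed later on the "Lower bound revisited" slide), the KL term reduces to a squared distance, which is exactly where the quadratic constraint used there comes from:

$$\mathrm{KL}\big(\mathcal{N}(f^*(i),1),\, \mathcal{N}(g(i),1)\big) \;=\; \frac{\big(f^*(i)-g(i)\big)^2}{2}, \qquad \text{so the constraint becomes} \qquad \sum_{i=1}^{K} \delta_i \,\frac{\big(f^*(i)-g(i)\big)^2}{2} \;\ge\; 1 .$$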
Existing asymptotically optimal algorithms

• Mostly use forced exploration [Lattimore+17, Combes+17, Hao+20]
  ⟹ ensures every arm's pull count is an unbounded function of $n$, such as $\frac{\log n}{\log\log n}$
  ⟹ $\mathbb{E}[\mathrm{Reg}_n] \lesssim c(f^*) \cdot \log n + K \cdot \frac{\log n}{\log\log n}$
• Issues:
  1. $K$ appears in the regret* ⟹ what if $K$ is exponentially large?
  2. Cannot achieve bounded regret when $c(f^*) = 0$.
• Parallel studies avoid forced exploration, but still depend on $K$ [Menard+20, Degenne+20].

*Dependence on $K$ can be avoided in special cases (e.g., linear).
Contribution

Research question: Assume $\mathcal{F}$ is finite. Can we design an algorithm that
• enjoys the asymptotic optimality,
• adapts to bounded regret whenever possible,
• does not necessarily depend on $K$?

Proposed algorithm: CRush Optimism with Pessimism (CROP)
• No forced exploration.
• The regret scales not with $K$ but with $K_\Psi \le K$ (defined in the paper).
• An interesting $\log\log n$ term in the regret* (*it's necessary; will be updated on arXiv).
Preliminaries
Assumptions

• $|\mathcal{F}| < \infty$.
• The noise model: $y_t = f^*(A_t) + \eta_t$, where $\eta_t$ is 1-sub-Gaussian (generalized to $\sigma$-sub-Gaussian in the paper).
• Notation: $a^*(f) := \arg\max_{a \in \mathcal{A}} f(a)$.
  • $f$ supports arm $a$ ⟺ $a^*(f) = a$.
  • $f$ supports reward $w$ ⟺ $f(a^*(f)) = w$.
• [Assumption] Every $f \in \mathcal{F}$ has a unique best arm (i.e., $|\arg\max_{a \in \mathcal{A}} f(a)| = 1$).
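The notation above is easy to mirror in code; a small sketch, assuming a hypothesis $f$ is encoded as a length-$K$ numpy array of mean rewards (an illustrative encoding, not fixed by the slides):

```python
import numpy as np

def a_star(f):
    """The best arm a*(f); unique by the assumption above."""
    return int(np.argmax(f))

def supports_arm(f, a):
    """f supports arm a  <=>  a*(f) = a."""
    return a_star(f) == a

def supports_reward(f, w, tol=1e-12):
    """f supports reward w  <=>  f(a*(f)) = w (up to a numerical tolerance)."""
    return abs(f[a_star(f)] - w) <= tol
```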
Competing hypotheses

• $\mathcal{C}(f^*)$ consists of those $g \in \mathcal{F}$ that
  • (1) assign the same reward to the best arm $a^*(f^*)$, i.e., $g(a^*(f^*)) = f^*(a^*(f^*))$,
  • (2) but support a different arm, i.e., $a^*(g) \ne a^*(f^*)$.
• Importance: it's why we get $\log(n)$ regret!

[Figure: mean-reward curves of hypotheses $f_1, \dots, f_5$ over arms 1–3 (y-axis from 0 to 1); one curve is marked $= f^*$ (with best reward .75).]
Lower bound revisited

• $\mathbb{E}[\mathrm{Reg}_n] \ge c(f^*) \cdot \log n$, asymptotically. Assume Gaussian rewards.
$$c(f^*) \ =\ \min_{\delta_1, \dots, \delta_K \ge 0} \ \sum_{i=1}^{K} \delta_i \cdot \Delta_i, \qquad \Delta_i = \max_{a} f^*(a) - f^*(i),$$
$$\text{s.t.} \quad \delta_i = 0 \ \text{whenever} \ \Delta_i = 0, \qquad \forall g \in \mathcal{C}(f^*) \ (\text{"competing" hypotheses}): \ \sum_{i=1}^{K} \delta_i \cdot \frac{\big(f^*(i) - g(i)\big)^2}{2} \ \ge\ 1 .$$
• Interpretation: $\delta_i \ln n$ samples of each arm $i \in [K]$ suffice to distinguish $f^*$ from $g$ confidently.
• The program finds arm-pull allocations that (1) eliminate the competing hypotheses and (2) are "reward"-efficient.

[Agrawal+89] Agrawal, Teneketzis, Anantharam. Asymptotically Efficient Adaptive Allocation Schemes for Controlled I.I.D. Processes: Finite Parameter Space, 1989.
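Since the allocation problem above is a linear program in $\delta$, it can be solved off the shelf. Below is a minimal sketch for unit-variance Gaussian noise, assuming hypotheses are length-$K$ vectors of means and `competing` is a nonempty list representing $\mathcal{C}(f^*)$; the function name `lower_bound_allocation` is illustrative, not from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def lower_bound_allocation(f_star, competing):
    """Solve the lower-bound LP: min <delta, gaps> s.t. the per-hypothesis info constraints."""
    gaps = f_star.max() - f_star                       # Delta_i for each arm
    # Constraint for each competing g: sum_i delta_i * (f*(i)-g(i))^2 / 2 >= 1.
    # linprog uses A_ub @ x <= b_ub, so negate both sides.
    A_ub = np.array([-(f_star - g) ** 2 / 2.0 for g in competing])
    b_ub = -np.ones(len(competing))
    # Note: the best arm has zero weight in the objective and (for competing g)
    # zero coefficient in every constraint, so its allocation stays at 0.
    res = linprog(c=gaps, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(f_star))
    return res.x, res.fun                              # delta*, c(f*)
```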
Example: Cheating code

• $K_0$ base arms (A1–A4 below) and $\log_2 K_0$ cheating arms (A5–A6 below).
• $\epsilon > 0$: very small (like 0.0001). Base-arm rewards lie in $\{1-\epsilon,\ 1,\ 1+\epsilon\}$.
• $\Delta > 0$: not too small (like 0.5). Cheating-arm rewards lie in $\{0,\ \Delta\}$.

  Arms   A1     A2     A3     A4     A5    A6
  f_1    1      1−ε    1−ε    1−ε    0     0
  f_2    1−ε    1      1−ε    1−ε    0     Δ
  f_3    1−ε    1−ε    1      1−ε    Δ     0
  f_4    1−ε    1−ε    1−ε    1      Δ     Δ
  f_5    1+ε    1      1−ε    1−ε    0     0
  f_6    1      1+ε    1−ε    1−ε    0     Δ
  f_7    1−ε    1−ε    1+ε    1      Δ     0
  ⋮      ⋮      ⋮      ⋮      ⋮      ⋮     ⋮

• The lower bound: $O\!\big(\tfrac{\log_2 K}{\Delta^2}\big)\ln n$ (only the cheating arms, which binary-encode the index of the best base arm, need exploration).
• UCB: $\Omega\!\big(\tfrac{K}{\epsilon}\big)\ln n$ (it must explore every base arm, whose gaps are only of order $\epsilon$).
• Exponential gap in $K$!
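For concreteness, here is a hypothetical generator for the first block $f_1, \dots, f_{K_0}$ of such a class (best base arm reward 1, the rest $1-\epsilon$, cheating arms binary-encoding the best arm's index). The slide's class also contains further blocks such as $f_5, f_6, \dots$ with shifted base rewards, which are not built here.

```python
import numpy as np

def cheating_code_block(K0=4, eps=1e-4, Delta=0.5):
    """Build hypotheses f_1..f_{K0}: K0 base arms plus ceil(log2 K0) cheating arms."""
    m = int(np.ceil(np.log2(K0)))                      # number of cheating arms
    F = []
    for j in range(K0):
        base = np.full(K0, 1.0 - eps)
        base[j] = 1.0                                  # best base arm of f_{j+1}
        # Cheating arms spell out j in binary (most significant bit first).
        bits = [(j >> b) & 1 for b in reversed(range(m))]
        F.append(np.concatenate([base, Delta * np.array(bits, dtype=float)]))
    return np.array(F)                                 # rows match f_1..f_4 of the table
```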
The function classes

• $\mathcal{C}(f^*)$: Competing ⟹ cannot be distinguished by pulling $a^*(f^*)$, but supports a different arm. Regret contribution: $\Theta(\log n)$.
• $\mathcal{D}(f^*)$: Docile ⟹ distinguishable by pulling $a^*(f^*)$. Regret contribution: $\Theta(1)$.
• $\mathcal{E}(f^*)$: Equivalent ⟹ supports the arm $a^*(f^*)$ and the reward $f^*(a^*(f^*))$. Regret contribution: can be $\Theta(\log\log n)$.
• [Proposition 2] $\mathcal{F} = \mathcal{C}(f^*) \cup \mathcal{D}(f^*) \cup \mathcal{E}(f^*)$ (disjoint union).

[Figure: the mean-reward plot over arms 1–3 again, with the hypotheses $f_1, \dots, f_6$ grouped into $\mathcal{C}$, $\mathcal{D}$, and $\mathcal{E} \ni f^*$.]
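A sketch of the partition in Proposition 2 under the same vector encoding as before, with a small numerical tolerance standing in for exact reward equality:

```python
import numpy as np

def partition(F, f_star, tol=1e-12):
    """Split F into competing C(f*), docile D(f*), and equivalent E(f*)."""
    a_s = int(np.argmax(f_star))                       # a*(f*)
    w_s = f_star[a_s]                                  # f*(a*(f*))
    C, D, E = [], [], []
    for g in F:
        if abs(g[a_s] - w_s) > tol:
            D.append(g)                                # distinguishable by pulling a*(f*)
        elif int(np.argmax(g)) != a_s:
            C.append(g)                                # same reward at a*(f*), different best arm
        else:
            E.append(g)                                # supports both the arm and the reward of f*
    return C, D, E
```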
CRush Optimism with Pessimism (CROP)
CROP: Overview

• The confidence set (confidence level $1 - \mathrm{poly}(1/t)$):
$$\mathcal{F}_t := \Big\{ f \in \mathcal{F} \ :\ L_{t-1}(f) - \min_{g \in \mathcal{F}} L_{t-1}(g) \le \gamma_t \Big\}, \qquad L_{t-1}(f) := \sum_{s=1}^{t-1} \big(y_s - f(A_s)\big)^2 ,$$
where the minimizer of $L_{t-1}$ is the ERM and the threshold $\gamma_t$ is of order $\ln t$.
• Four important branches: Exploit, Feasible, Fallback, Conflict.
• Exploit:
  • Does every $f \in \mathcal{F}_t$ support the same best arm?
  • If yes, pull that arm.
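A minimal sketch of this confidence set, assuming the history is stored as parallel arrays of pulled arm indices and observed rewards and that a threshold `gamma_t` (of order $\ln t$) is supplied:

```python
import numpy as np

def confidence_set(F, arms, rewards, gamma_t):
    """Hypotheses whose cumulative squared loss is within gamma_t of the ERM's loss."""
    arms = np.asarray(arms, dtype=int)
    rewards = np.asarray(rewards, dtype=float)
    losses = np.array([np.sum((rewards - f[arms]) ** 2) for f in F])
    return [f for f, L in zip(F, losses) if L - losses.min() <= gamma_t]
```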
CROP v1

At time $t$:
• Maintain a confidence set $\mathcal{F}_t \subseteq \mathcal{F}$.
• If every $f \in \mathcal{F}_t$ agrees on the best arm:
  • (Exploit) pull that arm.
  (Cf. optimism: $\tilde{f}_t = \arg\max_{f \in \mathcal{F}_t} \max_{a \in \mathcal{A}} f(a)$.)
• Else (Feasible):
  • Compute the pessimism $\check{f}_t = \arg\min_{f \in \mathcal{F}_t} \max_{a \in \mathcal{A}} f(a)$ (break ties by the cumulative loss).
  • Compute $\delta^*$, the solution of the optimization problem $P(\check{f}_t)$.
  • (Tracking) Pull $A_t = \arg\min_{a \in \mathcal{A}} \dfrac{\text{pull count}(a)}{\delta^*_a}$.
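A minimal sketch of one round of this loop, assuming hypotheses are length-$K$ mean vectors, `losses` holds each hypothesis's cumulative squared loss, and `solve_P` is any solver for the allocation problem $P(f)$ (e.g., the LP sketch after the lower-bound slide). The Fallback/Conflict branches and other refinements of the actual algorithm are omitted.

```python
import numpy as np

def crop_v1_step(F_t, pull_counts, losses, solve_P):
    """Return the arm to pull this round (Exploit or Feasible+Tracking only)."""
    best_arms = {int(np.argmax(f)) for f in F_t}
    if len(best_arms) == 1:                            # (Exploit) everyone agrees
        return best_arms.pop()
    # (Feasible) pessimism: smallest supported reward, ties broken by cumulative loss.
    order = sorted(range(len(F_t)), key=lambda j: (F_t[j].max(), losses[j]))
    f_check = F_t[order[0]]
    delta = solve_P(f_check)                           # allocation from P(f_check)
    # (Tracking) pull the arm most behind its target allocation; this sketch
    # assumes delta has at least one strictly positive entry.
    active = [a for a in range(len(f_check)) if delta[a] > 0]
    return min(active, key=lambda a: pull_counts[a] / delta[a])
```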
Why pessimism?

  Arms   A1     A2     A3     A4     A5
  f_1    1      .99    .98    0      0
  f_2    .98    .99    .98    .25    0
  f_3    .97    .97    .98    .25    .25

• Suppose $\mathcal{F}_t = \{f_1, f_2, f_3\}$.
• If I knew $f^*$, I could track $\delta^{f^*}$ (= the solution of $P(f^*)$).
• Which $f$ should I track?
• Pessimism: either does the right thing, or eliminates itself.
• Other choices may get stuck (so does ERM).
• Key idea: the LB constraints prescribe how to distinguish $f^*$ from those supporting higher rewards.
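Plugging the table above into the pessimism rule (a quick illustrative check; tie-breaking by cumulative loss is not needed here):

```python
import numpy as np

# The three hypotheses from the table, as mean-reward vectors over arms A1..A5.
F_t = {"f1": np.array([1.00, 0.99, 0.98, 0.00, 0.00]),
       "f2": np.array([0.98, 0.99, 0.98, 0.25, 0.00]),
       "f3": np.array([0.97, 0.97, 0.98, 0.25, 0.25])}

# Pessimism: the hypothesis with the smallest supported (best-arm) reward.
pessimistic = min(F_t, key=lambda name: F_t[name].max())
print(pessimistic, F_t[pessimistic].max())   # -> f3, whose supported reward is 0.98
```

So here CROP would track $\delta^{f_3}$; per the slide, the pessimistic choice either does the right thing or the resulting pulls eliminate it from the confidence set.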
But we may still get stuck.

  Arms   A1     A2     A3     A4      A5
  f_1    1      .99    .98    0       0
  f_2    .98    .99    .98    .25     0      (= f^*)
  f_3    .97    .97    1      .25     .25
  f_4    .97    .97    1      .2499   .25

• Due to docile hypotheses: here $f_3$ and $f_4$ support reward 1 at A3, yet they (nearly) agree with $f^* = f_2$ on the informative arms of $\delta^{f_2}$ (e.g., A4), so tracking alone never eliminates them.
• We must do something else: solve a modified allocation problem for $f = \check{f}_t$ in which pulls of the best arm $a^*(f)$ are also charged, subject to
$$\sum_{i=1}^{K} \delta_i \,\frac{\big(f(i) - g(i)\big)^2}{2} \ \ge\ 1 \qquad \forall\, g \in \mathcal{C}(f) \cup \big\{ g \in \mathcal{D}(f) : g(a^*(g)) \ge f(a^*(f)) \big\} .$$
• I.e., the constraint set now also includes the docile hypotheses whose supported rewards are at least $f$'s supported reward $f(a^*(f))$.