Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality
Kwang-Sung Jun, joint work with Chicheng Zhang
Structured bandits

• E.g., linear: $\mathcal{A} = \{a_1, \dots, a_K\} \subseteq \mathbb{R}^d$, $\mathcal{F} = \{a \mapsto a^\top \theta : \theta \in \mathbb{R}^d\}$.
• Input: arm set $\mathcal{A}$, hypothesis class $\mathcal{F} \subseteq (\mathcal{A} \to \mathbb{R})$, "the set of possible configurations of the mean rewards".
• Initialize: the environment chooses $f^* \in \mathcal{F}$ (unknown to the learner).
• For $t = 1, \dots, n$:
  • Learner: chooses an arm $A_t \in \mathcal{A}$.
  • Environment: generates the reward $y_t = f^*(A_t) + (\text{zero-mean stochastic noise})$.
  • Learner: receives $y_t$.
• Goal: minimize the cumulative regret $\mathbb{E}[\mathrm{Reg}_n] = \mathbb{E}\Big[\sum_{t=1}^{n} \big(\max_{a \in \mathcal{A}} f^*(a) - f^*(A_t)\big)\Big]$.
• Note: fixed arm set (= non-contextual), realizability $f^* \in \mathcal{F}$.
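A minimal sketch of this interaction protocol, assuming Gaussian noise and a hypothetical `learner` object exposing `choose`/`update` (these names are for illustration only, not from the paper):

```python
import numpy as np

def run_structured_bandit(f_star, learner, n, seed=None):
    """Play n rounds against the mean-reward vector f_star; return the cumulative (pseudo-)regret.

    `learner` is any object with choose(t) -> arm index and update(arm, reward);
    this interface is only illustrative.
    """
    rng = np.random.default_rng(seed)
    best = f_star.max()
    regret = 0.0
    for t in range(1, n + 1):
        a = learner.choose(t)              # learner picks A_t
        y = f_star[a] + rng.normal()       # y_t = f*(A_t) + zero-mean noise
        learner.update(a, y)               # the learner only ever sees y_t
        regret += best - f_star[a]         # max_a f*(a) - f*(A_t)
    return regret
```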
Structured bandits

• Why relevant? Techniques may transfer to RL (e.g., ergodic RL [Ok18]).
• Naive strategy: UCB ⟹ $\frac{K}{\Delta}\log n$ regret bound (instance-dependent)
  • Scales with the number of arms $K$.
  • Instead, the complexity of the hypothesis class $\mathcal{F}$ should appear.
• The asymptotically optimal regret is well-defined.
  • E.g., linear bandits: $c^* \cdot \log n$ for some well-defined $c^* \ll \frac{K}{\Delta}$.

The goal of this paper: achieve the asymptotic optimality with improved finite-time regret for any $\mathcal{F}$ (the worst-case regret is beyond the scope).

[Ok18] Ok et al., Exploration in Structured Reinforcement Learning, NeurIPS, 2018.
Asymptotic optimality (instance-dependent)

• Optimism in the face of uncertainty (e.g., UCB, Thompson sampling) ⟹ optimal asymptotic / worst-case regret in $K$-armed bandits.
• Linear bandits: optimal worst-case rate $= d\sqrt{n}$.
• Asymptotically optimal regret? ⟹ No! (AISTATS'17)

[Figure: arms as 2D (sweet, sour) feature vectors, e.g., $(1, 0)$ and $(0.95, 0.1)$, with mean reward $= 1\cdot\text{sweet} + 0\cdot\text{sour}$. Speech bubble: "Do they like orange or apple? Maybe have them try lemon and see if they are sensitive to sourness."]
Asymptotic optimality: lower bound

• $\mathbb{E}[\mathrm{Reg}_n] \ge c(f^*) \cdot \log n$ (asymptotically), where
$$c(f^*) \ =\ \min_{\delta_1, \dots, \delta_K \ge 0} \ \sum_{i=1}^{K} \delta_i \cdot \Delta_i, \qquad \Delta_i = \max_{a} f^*(a) - f^*(i),$$
$$\text{s.t.} \quad \delta_i = 0 \ \text{whenever} \ \Delta_i = 0, \qquad \forall g \in \mathcal{C}(f^*): \ \sum_{i=1}^{K} \delta_i \cdot \mathrm{KL}\big(\nu_{f^*(i)},\, \nu_{g(i)}\big) \ \ge\ 1,$$
where $\mathcal{C}(f^*)$ is the set of "competing" hypotheses and the KL divergence is taken with respect to the noise distribution $\nu$.
• $\delta^* = (\delta_1^*, \dots, \delta_K^*) \ge 0$: the solution.
• To be optimal, we must pull arm $i$ roughly $\delta_i^* \cdot \log n$ times.
• E.g., $\delta_{\text{lemon}} = 8$, $\delta_{\text{orange}} = 0$ ⟹ lemon is the informative arm!
• When $c(f^*) = 0$: bounded regret! (except for pathological cases [Lattimore14])

[Lattimore14] Lattimore & Munos, Bounded regret for finite-armed structured bandits, 2014.
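For reference, when the noise is Gaussian with unit variance (the case assumed later on the "Lower bound revisited" slide), the KL term reduces to a squared distance, which is exactly where the quadratic constraint used there comes from:

$$\mathrm{KL}\big(\mathcal{N}(f^*(i),1),\, \mathcal{N}(g(i),1)\big) \;=\; \frac{\big(f^*(i)-g(i)\big)^2}{2}, \qquad \text{so the constraint becomes} \qquad \sum_{i=1}^{K} \delta_i \,\frac{\big(f^*(i)-g(i)\big)^2}{2} \;\ge\; 1 .$$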
Existing asymptotically optimal algorithms

• Mostly use forced exploration [Lattimore+17, Combes+17, Hao+20]
  ⟹ ensures every arm's pull count is an unbounded function of $n$, such as $\frac{\log n}{\log\log n}$
  ⟹ $\mathbb{E}[\mathrm{Reg}_n] \lesssim c(f^*) \cdot \log n + K \cdot \frac{\log n}{\log\log n}$
• Issues:
  1. $K$ appears in the regret* ⟹ what if $K$ is exponentially large?
  2. Cannot achieve bounded regret when $c(f^*) = 0$.
• Parallel studies avoid forced exploration, but still depend on $K$ [Menard+20, Degenne+20].

*Dependence on $K$ can be avoided in special cases (e.g., linear).
Contribution

Research question: Assume $\mathcal{F}$ is finite. Can we design an algorithm that
• enjoys the asymptotic optimality,
• adapts to bounded regret whenever possible,
• does not necessarily depend on $K$?

Proposed algorithm: CRush Optimism with Pessimism (CROP)
• No forced exploration.
• The regret scales not with $K$ but with $K_\Psi \le K$ (defined in the paper).
• An interesting $\log\log n$ term in the regret* (*it's necessary; will be updated on arXiv).
Preliminaries
Assumptions

• $|\mathcal{F}| < \infty$.
• The noise model: $y_t = f^*(A_t) + \eta_t$, where $\eta_t$ is 1-sub-Gaussian (generalized to $\sigma$-sub-Gaussian in the paper).
• Notation: $a^*(f) := \arg\max_{a \in \mathcal{A}} f(a)$.
  • $f$ supports arm $a$ ⟺ $a^*(f) = a$.
  • $f$ supports reward $w$ ⟺ $f(a^*(f)) = w$.
• [Assumption] Every $f \in \mathcal{F}$ has a unique best arm (i.e., $|\arg\max_{a \in \mathcal{A}} f(a)| = 1$).
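The notation above is easy to mirror in code; a small sketch, assuming a hypothesis $f$ is encoded as a length-$K$ numpy array of mean rewards (an illustrative encoding, not fixed by the slides):

```python
import numpy as np

def a_star(f):
    """The best arm a*(f); unique by the assumption above."""
    return int(np.argmax(f))

def supports_arm(f, a):
    """f supports arm a  <=>  a*(f) = a."""
    return a_star(f) == a

def supports_reward(f, w, tol=1e-12):
    """f supports reward w  <=>  f(a*(f)) = w (up to a numerical tolerance)."""
    return abs(f[a_star(f)] - w) <= tol
```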
Competing hypotheses

• $\mathcal{C}(f^*)$ consists of those $g \in \mathcal{F}$ that
  • (1) assign the same reward to the best arm $a^*(f^*)$, i.e., $g(a^*(f^*)) = f^*(a^*(f^*))$,
  • (2) but support a different arm, i.e., $a^*(g) \ne a^*(f^*)$.
• Importance: it's why we get $\log(n)$ regret!

[Figure: mean-reward curves of hypotheses $f_1, \dots, f_5$ over arms 1–3 (y-axis from 0 to 1); one curve is marked $= f^*$ (with best reward .75).]
Lower bound revisited

• $\mathbb{E}[\mathrm{Reg}_n] \ge c(f^*) \cdot \log n$, asymptotically. Assume Gaussian rewards.
$$c(f^*) \ =\ \min_{\delta_1, \dots, \delta_K \ge 0} \ \sum_{i=1}^{K} \delta_i \cdot \Delta_i, \qquad \Delta_i = \max_{a} f^*(a) - f^*(i),$$
$$\text{s.t.} \quad \delta_i = 0 \ \text{whenever} \ \Delta_i = 0, \qquad \forall g \in \mathcal{C}(f^*) \ (\text{"competing" hypotheses}): \ \sum_{i=1}^{K} \delta_i \cdot \frac{\big(f^*(i) - g(i)\big)^2}{2} \ \ge\ 1 .$$
• Interpretation: $\delta_i \ln n$ samples of each arm $i \in [K]$ suffice to distinguish $f^*$ from $g$ confidently.
• The program finds arm-pull allocations that (1) eliminate the competing hypotheses and (2) are "reward"-efficient.

[Agrawal+89] Agrawal, Teneketzis, Anantharam. Asymptotically Efficient Adaptive Allocation Schemes for Controlled I.I.D. Processes: Finite Parameter Space, 1989.
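Since the allocation problem above is a linear program in $\delta$, it can be solved off the shelf. Below is a minimal sketch for unit-variance Gaussian noise, assuming hypotheses are length-$K$ vectors of means and `competing` is a nonempty list representing $\mathcal{C}(f^*)$; the function name `lower_bound_allocation` is illustrative, not from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def lower_bound_allocation(f_star, competing):
    """Solve the lower-bound LP: min <delta, gaps> s.t. the per-hypothesis info constraints."""
    gaps = f_star.max() - f_star                       # Delta_i for each arm
    # Constraint for each competing g: sum_i delta_i * (f*(i)-g(i))^2 / 2 >= 1.
    # linprog uses A_ub @ x <= b_ub, so negate both sides.
    A_ub = np.array([-(f_star - g) ** 2 / 2.0 for g in competing])
    b_ub = -np.ones(len(competing))
    # Note: the best arm has zero weight in the objective and (for competing g)
    # zero coefficient in every constraint, so its allocation stays at 0.
    res = linprog(c=gaps, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(f_star))
    return res.x, res.fun                              # delta*, c(f*)
```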
Example: Cheating code

• $K_0$ base arms (A1–A4 below) and $\log_2 K_0$ cheating arms (A5–A6 below).
• $\epsilon > 0$: very small (like 0.0001). Base-arm rewards lie in $\{1-\epsilon,\ 1,\ 1+\epsilon\}$.
• $\Delta > 0$: not too small (like 0.5). Cheating-arm rewards lie in $\{0,\ \Delta\}$.

  Arms   A1     A2     A3     A4     A5    A6
  f_1    1      1−ε    1−ε    1−ε    0     0
  f_2    1−ε    1      1−ε    1−ε    0     Δ
  f_3    1−ε    1−ε    1      1−ε    Δ     0
  f_4    1−ε    1−ε    1−ε    1      Δ     Δ
  f_5    1+ε    1      1−ε    1−ε    0     0
  f_6    1      1+ε    1−ε    1−ε    0     Δ
  f_7    1−ε    1−ε    1+ε    1      Δ     0
  ⋮      ⋮      ⋮      ⋮      ⋮      ⋮     ⋮

• The lower bound: $O\!\big(\tfrac{\log_2 K}{\Delta^2}\big)\ln n$ (only the cheating arms, which binary-encode the index of the best base arm, need exploration).
• UCB: $\Omega\!\big(\tfrac{K}{\epsilon}\big)\ln n$ (it must explore every base arm, whose gaps are only of order $\epsilon$).
• Exponential gap in $K$!
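For concreteness, here is a hypothetical generator for the first block $f_1, \dots, f_{K_0}$ of such a class (best base arm reward 1, the rest $1-\epsilon$, cheating arms binary-encoding the best arm's index). The slide's class also contains further blocks such as $f_5, f_6, \dots$ with shifted base rewards, which are not built here.

```python
import numpy as np

def cheating_code_block(K0=4, eps=1e-4, Delta=0.5):
    """Build hypotheses f_1..f_{K0}: K0 base arms plus ceil(log2 K0) cheating arms."""
    m = int(np.ceil(np.log2(K0)))                      # number of cheating arms
    F = []
    for j in range(K0):
        base = np.full(K0, 1.0 - eps)
        base[j] = 1.0                                  # best base arm of f_{j+1}
        # Cheating arms spell out j in binary (most significant bit first).
        bits = [(j >> b) & 1 for b in reversed(range(m))]
        F.append(np.concatenate([base, Delta * np.array(bits, dtype=float)]))
    return np.array(F)                                 # rows match f_1..f_4 of the table
```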
The function classes

• $\mathcal{C}(f^*)$: Competing ⟹ cannot be distinguished by pulling $a^*(f^*)$, but supports a different arm. Regret contribution: $\Theta(\log n)$.
• $\mathcal{D}(f^*)$: Docile ⟹ distinguishable by pulling $a^*(f^*)$. Regret contribution: $\Theta(1)$.
• $\mathcal{E}(f^*)$: Equivalent ⟹ supports the arm $a^*(f^*)$ and the reward $f^*(a^*(f^*))$. Regret contribution: can be $\Theta(\log\log n)$.
• [Proposition 2] $\mathcal{F} = \mathcal{C}(f^*) \cup \mathcal{D}(f^*) \cup \mathcal{E}(f^*)$ (disjoint union).

[Figure: the mean-reward plot over arms 1–3 again, with the hypotheses $f_1, \dots, f_6$ grouped into $\mathcal{C}$, $\mathcal{D}$, and $\mathcal{E} \ni f^*$.]
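A sketch of the partition in Proposition 2 under the same vector encoding as before, with a small numerical tolerance standing in for exact reward equality:

```python
import numpy as np

def partition(F, f_star, tol=1e-12):
    """Split F into competing C(f*), docile D(f*), and equivalent E(f*)."""
    a_s = int(np.argmax(f_star))                       # a*(f*)
    w_s = f_star[a_s]                                  # f*(a*(f*))
    C, D, E = [], [], []
    for g in F:
        if abs(g[a_s] - w_s) > tol:
            D.append(g)                                # distinguishable by pulling a*(f*)
        elif int(np.argmax(g)) != a_s:
            C.append(g)                                # same reward at a*(f*), different best arm
        else:
            E.append(g)                                # supports both the arm and the reward of f*
    return C, D, E
```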
CRush Optimism with Pessimism (CROP)
CROP: Overview

• The confidence set (confidence level $1 - \mathrm{poly}(1/t)$):
$$\mathcal{F}_t := \Big\{ f \in \mathcal{F} \ :\ L_{t-1}(f) - \min_{g \in \mathcal{F}} L_{t-1}(g) \le \gamma_t \Big\}, \qquad L_{t-1}(f) := \sum_{s=1}^{t-1} \big(y_s - f(A_s)\big)^2 ,$$
where the minimizer of $L_{t-1}$ is the ERM and the threshold $\gamma_t$ is of order $\ln t$.
• Four important branches: Exploit, Feasible, Fallback, Conflict.
• Exploit:
  • Does every $f \in \mathcal{F}_t$ support the same best arm?
  • If yes, pull that arm.
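A minimal sketch of this confidence set, assuming the history is stored as parallel arrays of pulled arm indices and observed rewards and that a threshold `gamma_t` (of order $\ln t$) is supplied:

```python
import numpy as np

def confidence_set(F, arms, rewards, gamma_t):
    """Hypotheses whose cumulative squared loss is within gamma_t of the ERM's loss."""
    arms = np.asarray(arms, dtype=int)
    rewards = np.asarray(rewards, dtype=float)
    losses = np.array([np.sum((rewards - f[arms]) ** 2) for f in F])
    return [f for f, L in zip(F, losses) if L - losses.min() <= gamma_t]
```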
CROP v1

At time $t$:
• Maintain a confidence set $\mathcal{F}_t \subseteq \mathcal{F}$.
• If every $f \in \mathcal{F}_t$ agrees on the best arm:
  • (Exploit) pull that arm.
  (Cf. optimism: $\tilde{f}_t = \arg\max_{f \in \mathcal{F}_t} \max_{a \in \mathcal{A}} f(a)$.)
• Else (Feasible):
  • Compute the pessimism $\check{f}_t = \arg\min_{f \in \mathcal{F}_t} \max_{a \in \mathcal{A}} f(a)$ (break ties by the cumulative loss).
  • Compute $\delta^*$, the solution of the optimization problem $P(\check{f}_t)$.
  • (Tracking) Pull $A_t = \arg\min_{a \in \mathcal{A}} \dfrac{\text{pull count}(a)}{\delta^*_a}$.
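A minimal sketch of one round of this loop, assuming hypotheses are length-$K$ mean vectors, `losses` holds each hypothesis's cumulative squared loss, and `solve_P` is any solver for the allocation problem $P(f)$ (e.g., the LP sketch after the lower-bound slide). The Fallback/Conflict branches and other refinements of the actual algorithm are omitted.

```python
import numpy as np

def crop_v1_step(F_t, pull_counts, losses, solve_P):
    """Return the arm to pull this round (Exploit or Feasible+Tracking only)."""
    best_arms = {int(np.argmax(f)) for f in F_t}
    if len(best_arms) == 1:                            # (Exploit) everyone agrees
        return best_arms.pop()
    # (Feasible) pessimism: smallest supported reward, ties broken by cumulative loss.
    order = sorted(range(len(F_t)), key=lambda j: (F_t[j].max(), losses[j]))
    f_check = F_t[order[0]]
    delta = solve_P(f_check)                           # allocation from P(f_check)
    # (Tracking) pull the arm most behind its target allocation; this sketch
    # assumes delta has at least one strictly positive entry.
    active = [a for a in range(len(f_check)) if delta[a] > 0]
    return min(active, key=lambda a: pull_counts[a] / delta[a])
```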
Why pessimism?

  Arms   A1     A2     A3     A4     A5
  f_1    1      .99    .98    0      0
  f_2    .98    .99    .98    .25    0
  f_3    .97    .97    .98    .25    .25

• Suppose $\mathcal{F}_t = \{f_1, f_2, f_3\}$.
• If I knew $f^*$, I could track $\delta^{f^*}$ (= the solution of $P(f^*)$).
• Which $f$ should I track?
• Pessimism: either does the right thing, or eliminates itself.
• Other choices may get stuck (so does ERM).
• Key idea: the LB constraints prescribe how to distinguish $f^*$ from those supporting higher rewards.
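Plugging the table above into the pessimism rule (a quick illustrative check; tie-breaking by cumulative loss is not needed here):

```python
import numpy as np

# The three hypotheses from the table, as mean-reward vectors over arms A1..A5.
F_t = {"f1": np.array([1.00, 0.99, 0.98, 0.00, 0.00]),
       "f2": np.array([0.98, 0.99, 0.98, 0.25, 0.00]),
       "f3": np.array([0.97, 0.97, 0.98, 0.25, 0.25])}

# Pessimism: the hypothesis with the smallest supported (best-arm) reward.
pessimistic = min(F_t, key=lambda name: F_t[name].max())
print(pessimistic, F_t[pessimistic].max())   # -> f3, whose supported reward is 0.98
```

So here CROP would track $\delta^{f_3}$; per the slide, the pessimistic choice either does the right thing or the resulting pulls eliminate it from the confidence set.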
But we may still get stuck.

  Arms   A1     A2     A3     A4      A5
  f_1    1      .99    .98    0       0
  f_2    .98    .99    .98    .25     0      (= f^*)
  f_3    .97    .97    1      .25     .25
  f_4    .97    .97    1      .2499   .25

• Due to docile hypotheses: here $f_3$ and $f_4$ support reward 1 at A3, yet they (nearly) agree with $f^* = f_2$ on the informative arms of $\delta^{f_2}$ (e.g., A4), so tracking alone never eliminates them.
• We must do something else: solve a modified allocation problem for $f = \check{f}_t$ in which pulls of the best arm $a^*(f)$ are also charged, subject to
$$\sum_{i=1}^{K} \delta_i \,\frac{\big(f(i) - g(i)\big)^2}{2} \ \ge\ 1 \qquad \forall\, g \in \mathcal{C}(f) \cup \big\{ g \in \mathcal{D}(f) : g(a^*(g)) \ge f(a^*(f)) \big\} .$$
• I.e., the constraint set now also includes the docile hypotheses whose supported rewards are at least $f$'s supported reward $f(a^*(f))$.