On adaptive regret bounds for non-stochastic bandits, Gergely Neu - PowerPoint PPT Presentation
On adaptive regret bounds for non-stochastic bandits

Gergely Neu
INRIA Lille, SequeL team → Universitat Pompeu Fabra, Barcelona
Outline

• Online learning and bandits
• Adaptive bounds in online learning
• Adaptive bounds for bandits
  • What we already have
  • What's new: first-order bounds
  • What may be possible
  • What seems impossible* (*opinion alert!)
Online learning and non-stochastic bandits

Full information:
For each round t = 1, 2, …, T
• Learner chooses action I_t ∈ {1, 2, …, N}
• Environment chooses losses ℓ_{t,i} ∈ [0,1] for all i
• Learner suffers loss ℓ_{t,I_t}
• Learner observes the losses ℓ_{t,i} for all i

Bandit:
For each round t = 1, 2, …, T
• Learner chooses action I_t ∈ {1, 2, …, N}
• Environment chooses losses ℓ_{t,i} ∈ [0,1] for all i
• Learner suffers loss ℓ_{t,I_t}
• Learner observes only its own loss ℓ_{t,I_t} → need to explore!
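As an aside, the bandit protocol above is easy to simulate. The sketch below is purely illustrative: the uniform-random learner and the toy loss sequence are made up for the example, not part of the talk.

```python
import random

def run_bandit(losses, choose_arm, seed=0):
    """Play the bandit protocol: each round the learner picks one arm
    and is shown only the loss of that arm (bandit feedback)."""
    rng = random.Random(seed)
    total = 0.0
    for t in range(len(losses)):
        i = choose_arm(rng)        # learner chooses I_t
        total += losses[t][i]      # suffers ℓ_{t,I_t}; only this value is observed
    return total

# Toy example: N = 2 arms, arm 0 always loses 0.1, arm 1 always loses 0.9.
losses = [[0.1, 0.9] for _ in range(100)]
uniform = lambda rng: rng.randrange(2)   # pure exploration: pick uniformly
total = run_bandit(losses, uniform)
print(total)   # around 50 in expectation; the best fixed arm would pay 10
```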
Minimax regret

• Define the (expected) regret against action i as
  R_{T,i} = E[ Σ_{t=1}^T ℓ_{t,I_t} ] − Σ_{t=1}^T ℓ_{t,i}
• Goal: minimize the regret against the best action i*:
  R_T = R_{T,i*} = max_i R_{T,i}

Full information: R_T = Θ(√(T log N))        Bandit: R_T = Θ(√(NT))
Beyond minimax: i.i.d. losses

Full information: R_T = Θ(√(T log N)) → Θ(log N)
Bandit: R_T = Θ(√(NT)) → Θ(N log T)
Beyond minimax: “higher-order” bounds

• Minimax: full information R_T = O(√(T log N)); bandit R_T = O(√(NT)).
• First-order: full information R_T = O(√(L_{T,i*} log N)), where L_{T,i} = Σ_t ℓ_{t,i}; bandit: (it's complicated).
• Second-order: full information R_T = O(√(Q_{T,i*} log N)), where Q_{T,i} = Σ_t ℓ_{t,i}² (Cesa-Bianchi, Mansour and Stoltz, 2005); bandit R_T = O(√(Σ_i Q_{T,i})) (Auer et al., 2002, + some hacking).
• Variance: full information R_T = O(√(V_{T,i*} log N)), where V_{T,i} = Σ_t (ℓ_{t,i} − μ)² (Hazan and Kale, 2010); bandit R_T = O(N² √(Σ_i V_{T,i})) (Hazan and Kale, 2011, with a little cheating).
First-order bounds for bandits (it's complicated)

• “Small-gain” bounds:
  • Consider the gain game with g_{t,i} = 1 − ℓ_{t,i}
  • Auer, Cesa-Bianchi, Freund and Schapire (2002): R_T = O(√(N G_{T,i*} log N)), where G_{T,i} = Σ_t g_{t,i}
  Problem: only good if the best expert is bad!
First-order bounds for bandits (it's complicated)

• “Small-gain” bounds: R_T = O(√(N G_{T,i*} log N))
• A slightly trickier analysis gives
  R_T = O(√(Σ_t Σ_i g_{t,i} log N))  or  R_T = O(√(Σ_t Σ_i ℓ_{t,i} log N))
  Problem: one misbehaving action ruins the bound!
First-order bounds for bandits (it's complicated)

• “Small-gain” bounds: R_T = O(√(N G_{T,i*} log N))
• A slightly trickier analysis gives R_T = O(√(Σ_t Σ_i ℓ_{t,i} log N))
• Some obscure actual first-order bounds:
  • Stoltz (2005): O(N √(L_{T,i*}))
  • Allenberg, Auer, Györfi and Ottucsák (2006): O(√(N L_{T,i*}))
  • Rakhlin and Sridharan (2013): O(N^{3/2} √(L_{T,i*}))
  Problem: no real insight from the analyses!
First-order bounds for non-stochastic bandits
A typical bandit algorithm

For every round t = 1, 2, …, T
• Choose arm I_t = i with probability p_{t,i}
• Compute the unbiased loss estimate
  ℓ̂_{t,i} = (ℓ_{t,i} / p_{t,i}) · 𝟙{I_t = i}
• Use ℓ̂_{t,i} in a black-box online learning algorithm to compute p_{t+1}
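The key property of this estimator is unbiasedness: E[ℓ̂_{t,i}] = p_{t,i} · (ℓ_{t,i}/p_{t,i}) = ℓ_{t,i}. A small Monte Carlo sketch of this fact (all the concrete numbers are made up for illustration):

```python
import random

def iw_estimate(ell, p, I, i):
    """Importance-weighted estimate ℓ̂_{t,i} = (ℓ_{t,i}/p_{t,i}) · 1{I_t = i}."""
    return ell[i] / p[i] if I == i else 0.0

rng = random.Random(1)
p = [0.2, 0.3, 0.5]      # learner's distribution p_t over N = 3 arms
ell = [0.4, 0.8, 0.1]    # true (hidden) losses ℓ_t
n = 200_000
avg = [0.0] * 3
for _ in range(n):
    I = rng.choices(range(3), weights=p)[0]   # draw I_t ~ p_t
    for i in range(3):
        avg[i] += iw_estimate(ell, p, I, i) / n

print(avg)   # each entry concentrates near the corresponding true loss
```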
A typical regret bound  (η: “learning rate”)

R_T ≤ log N / η + η · E[ Σ_{t=1}^T Σ_{i=1}^N p_{t,i} ℓ̂_{t,i}² ]
    ≤ log N / η + η · E[ Σ_{t=1}^T Σ_{i=1}^N ℓ̂_{t,i} ]
    = log N / η + η · Σ_{i=1}^N L_{T,i}
    = O( √( Σ_{i=1}^N L_{T,i} · log N ) )    (for an appropriate η)
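The last step just balances the two terms in η: f(η) = log N/η + η·Σ_i L_{T,i} is minimized at η = √(log N / Σ_i L_{T,i}), where it equals 2·√(Σ_i L_{T,i} · log N). A quick numeric check of this calculus step (N and the total loss are made-up numbers):

```python
import math

def f(eta, logN, L):
    # the two terms of the bound: log N / η + η · Σ_i L_{T,i}
    return logN / eta + eta * L

logN, L = math.log(10), 500.0
eta_star = math.sqrt(logN / L)     # the balancing learning rate
best = f(eta_star, logN, L)
assert abs(best - 2 * math.sqrt(L * logN)) < 1e-9   # f(η*) = 2·sqrt(L·log N)
for eta in (eta_star / 2, eta_star * 2, 0.01, 0.2): # any other η does worse
    assert f(eta, logN, L) >= best
print(eta_star, best)
```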
A typical regret bound

R_T = O( √( Σ_{i=1}^N L_{T,i} ) )

It's all because E[L̂_{T,i}] = L_{T,i} !!!
Idea: try to enforce E[L̂_{T,i}] = O(L_{T,i*}) → need optimistic estimates!
A typical algorithm – fixed!

For every round t = 1, 2, …, T
• Choose arm I_t = i with probability p_{t,i}
• Compute the biased loss estimate
  ℓ̂_{t,i} = (ℓ_{t,i} / (p_{t,i} + γ)) · 𝟙{I_t = i}
  (“implicit exploration”; Kocák, N, Valko and Munos, 2015)
• Use ℓ̂_{t,i} in a black-box online learning algorithm to compute p_{t+1}
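Why is this estimate optimistic? The indicator fires with probability p_{t,i}, so E[ℓ̂_{t,i}] = p_{t,i} · ℓ_{t,i} / (p_{t,i} + γ) ≤ ℓ_{t,i}: the estimate is biased downward, and the bias vanishes as γ → 0. This can be checked exactly (the concrete numbers are illustrative):

```python
def ix_expectation(ell, p, gamma):
    """Exact expectation of the IX estimate ℓ̂ = ℓ/(p+γ) · 1{I = i}:
    the indicator equals 1 with probability p."""
    return p * ell / (p + gamma)

ell, p = 0.6, 0.25
for gamma in (0.0, 0.01, 0.1):
    e = ix_expectation(ell, p, gamma)
    assert e <= ell + 1e-12    # always biased downward: optimistic
    print(gamma, e)

# γ = 0 recovers the unbiased importance-weighted estimate
assert abs(ix_expectation(ell, p, 0.0) - ell) < 1e-12
```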