On adaptive regret bounds for non-stochastic bandits
Gergely Neu
INRIA Lille, SequeL team → Universitat Pompeu Fabra, Barcelona
Outline
• Online learning and bandits
• Adaptive bounds in online learning
• Adaptive bounds for bandits
  • What we already have
  • What's new: first-order bounds
  • What may be possible
  • What seems impossible* (*opinion alert!)
Online learning and non-stochastic bandits

Full-information setting: for each round $t = 1, 2, \dots, T$
• Learner chooses action $I_t \in \{1, 2, \dots, N\}$
• Environment chooses losses $\ell_{t,i} \in [0,1]$ for all $i$
• Learner suffers loss $\ell_{t,I_t}$
• Learner observes the losses $\ell_{t,i}$ for all $i$

Bandit setting: for each round $t = 1, 2, \dots, T$
• Learner chooses action $I_t \in \{1, 2, \dots, N\}$
• Environment chooses losses $\ell_{t,i} \in [0,1]$ for all $i$
• Learner suffers loss $\ell_{t,I_t}$
• Learner observes only its own loss $\ell_{t,I_t}$ → need to explore!
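To make the feedback difference concrete, here is a minimal Python sketch of the two protocols. It is not from the talk; the learner interface (act, update, update_full) is a hypothetical placeholder.

```python
import numpy as np

def play(learner, losses, bandit=True):
    """Run the online learning protocol on a T x N loss matrix.

    losses[t, i] is the loss of action i in round t (all entries in [0, 1]).
    With bandit=True the learner only sees the loss of the action it played;
    with bandit=False it sees the whole loss vector of the round.
    """
    T, N = losses.shape
    total_loss = 0.0
    for t in range(T):
        i = learner.act()                    # I_t, drawn from the learner's distribution
        total_loss += losses[t, i]           # learner suffers its loss for the round
        if bandit:
            learner.update(i, losses[t, i])  # observes only its own loss
        else:
            learner.update_full(losses[t])   # observes the full loss vector
    best_fixed = losses.sum(axis=0).min()    # loss of the best fixed action in hindsight
    return total_loss - best_fixed           # realized regret
```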
Minimax regret

• Define the (expected) regret against action $i$ as
  $R_{T,i} = \mathbf{E}\left[\sum_{t=1}^{T} \ell_{t,I_t}\right] - \sum_{t=1}^{T} \ell_{t,i}$
• Goal: minimize the regret against the best action $i^*$:
  $R_T = R_{T,i^*} = \max_i R_{T,i}$

Full information: $R_T = \Theta\big(\sqrt{T \log N}\big)$        Bandit: $R_T = \Theta\big(\sqrt{NT}\big)$
Beyond minimax: i.i.d. losses

                 Full information                  Bandit
minimax          $\Theta\big(\sqrt{T \log N}\big)$     $\Theta\big(\sqrt{NT}\big)$
i.i.d. losses    $\Theta(\log N)$                      $\Theta(N \log T)$
Beyond minimax: "higher-order" bounds

• minimax:
  • Full information: $R_T = O\big(\sqrt{T \log N}\big)$
  • Bandit: $R_T = O\big(\sqrt{NT}\big)$
• first-order, with $L_{T,i} = \sum_t \ell_{t,i}$:
  • Full information: $R_T = O\big(\sqrt{L_{T,i^*} \log N}\big)$
  • Bandit: (it's complicated)
• second-order, with $S_{T,i} = \sum_t \ell_{t,i}^2$:
  • Full information: $R_T = O\big(\sqrt{S_{T,i^*} \log N}\big)$ (Cesa-Bianchi, Mansour and Stoltz, 2005)
  • Bandit: $R_T = O\big(\sqrt{\sum_i S_{T,i}}\big)$ (Auer et al., 2002, plus some hacking)
• variance, with $V_{T,i} = \sum_t (\ell_{t,i} - m)^2$:
  • Full information: $R_T = O\big(\sqrt{V_{T,i^*} \log N}\big)$ (Hazan and Kale, 2010, with a little cheating)
  • Bandit: $R_T = O\big(N^2 \sqrt{\sum_i V_{T,i}}\big)$ (Hazan and Kale, 2011)
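These adaptive quantities can only improve on the minimax rate. Assuming $m$ in the variance row denotes the empirical mean loss of the given action, a one-line calculation (standard, not spelled out on the slide) orders them:

```latex
V_{T,i} = \sum_{t=1}^{T} (\ell_{t,i} - m)^2 \;\le\; \sum_{t=1}^{T} \ell_{t,i}^2 = S_{T,i}
        \;\le\; \sum_{t=1}^{T} \ell_{t,i} = L_{T,i} \;\le\; T ,
```

so, up to log factors and leading constants, variance bounds imply second-order bounds, which imply first-order bounds, which imply the minimax rate (at least in the full-information column).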
First-order bounds for bandits (it's complicated)

• "Small-gain" bounds:
  • Consider the gain game with $g_{t,i} = 1 - \ell_{t,i}$ and $G_{T,i} = \sum_t g_{t,i}$
  • Auer, Cesa-Bianchi, Freund and Schapire (2002): $R_T = O\big(\sqrt{N\, G_{T,i^*} \log N}\big)$
  • Problem: only good if the best expert is bad! (see the calculation after this list)
• A slightly trickier analysis gives
  $R_T = O\big(\sqrt{\sum_t \sum_i g_{t,i}\, \log N}\big)$  or  $R_T = O\big(\sqrt{\sum_t \sum_i \ell_{t,i}\, \log N}\big)$
  • Problem: one misbehaving action ruins the bound!
• Some obscure actual first-order bounds (writing $L_T^* = L_{T,i^*}$):
  • Stoltz (2005): $N \sqrt{L_T^*}$
  • Allenberg, Auer, Györfi and Ottucsák (2006): $\sqrt{N L_T^*}$
  • Rakhlin and Sridharan (2013): $N^{3/2} \sqrt{L_T^*}$
  • Problem: no real insight from the analyses!
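To see why the small-gain bound is only useful when the best action performs poorly (a standard back-of-the-envelope step, filled in here for context): since $g_{t,i} = 1 - \ell_{t,i}$, the best action's total gain is $G_{T,i^*} = T - L_{T,i^*}$, so

```latex
R_T = O\Big(\sqrt{N\, G_{T,i^*}\, \log N}\Big)
    = O\Big(\sqrt{N\,(T - L_{T,i^*})\, \log N}\Big),
```

which is essentially the minimax $\sqrt{NT \log N}$ whenever $L_{T,i^*} \ll T$, i.e., exactly the regime where a bound scaling with $L_{T,i^*}$ should be small.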
First-order bounds for non-stochastic bandits
A typical bandit algorithm

For every round $t = 1, 2, \dots, T$:
• Choose arm $I_t = i$ with probability $p_{t,i}$
• Compute the unbiased loss estimate
  $\hat{\ell}_{t,i} = \dfrac{\ell_{t,i}}{p_{t,i}}\, \mathbf{1}\{I_t = i\}$
• Use $\hat{\ell}_{t,i}$ in a black-box online learning algorithm to compute $p_{t+1}$
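As a concrete instance of this template, here is a minimal Python sketch with an exponential-weights learner as the black box. This is an illustrative implementation, not the talk's; the class and parameter names are made up.

```python
import numpy as np

class Exp3:
    """Exponential weights over N arms, fed with importance-weighted loss estimates."""

    def __init__(self, N, eta, rng=None):
        self.eta = eta                               # learning rate η
        self.weights = np.ones(N)                    # one weight per arm
        self.rng = rng or np.random.default_rng()

    def probabilities(self):
        return self.weights / self.weights.sum()     # p_t

    def act(self):
        p = self.probabilities()
        return self.rng.choice(len(p), p=p)          # draw I_t from p_t

    def update(self, arm, loss):
        p = self.probabilities()
        loss_est = np.zeros_like(p)
        loss_est[arm] = loss / p[arm]                # unbiased estimate: loss * 1{I_t=i} / p_{t,i}
        self.weights *= np.exp(-self.eta * loss_est)
```

This learner can be run with the play() loop sketched earlier (bandit=True).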
A typical regret bound

$R_T \le \dfrac{\log N}{\eta} + \eta\, \mathbf{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i}\, \hat{\ell}_{t,i}^{\,2}\right]$
$\quad\;\; \le \dfrac{\log N}{\eta} + \eta\, \mathbf{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} \hat{\ell}_{t,i}\right]$
$\quad\;\; = \dfrac{\log N}{\eta} + \eta \sum_{i=1}^{N} L_{T,i}$
$\quad\;\; = O\left(\sqrt{\sum_{i=1}^{N} L_{T,i}\, \log N}\right)$   (for an appropriate $\eta$)

($\eta$: "learning rate")
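Filling in the final step (a standard calculation, assuming $\eta$ can be tuned using $\sum_i L_{T,i}$, e.g., via a doubling trick):

```latex
\eta = \sqrt{\frac{\log N}{\sum_{i=1}^{N} L_{T,i}}}
\qquad\Longrightarrow\qquad
\frac{\log N}{\eta} + \eta \sum_{i=1}^{N} L_{T,i}
  \;=\; 2\sqrt{\sum_{i=1}^{N} L_{T,i}\,\log N}
  \;\le\; 2\sqrt{NT\,\log N} ,
```

so the bound recovers the minimax $\sqrt{NT \log N}$ rate in the worst case, but it is driven by the cumulative losses of all $N$ arms rather than by $L_{T,i^*}$ alone.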
A typical regret bound

$R_T = O\left(\sqrt{\sum_{i=1}^{N} L_{T,i}\, \log N}\right)$

• It's all because $\mathbf{E}\big[\hat{L}_{T,i}\big] = L_{T,i}$ (where $\hat{L}_{T,i} = \sum_t \hat{\ell}_{t,i}$)!!!
• Idea: try to enforce $\mathbf{E}\big[\hat{L}_{T,i}\big] = O(L_{T,i^*})$ instead
• Need optimistic estimates!
A typical algorithm – fixed!

For every round $t = 1, 2, \dots, T$:
• Choose arm $I_t = i$ with probability $p_{t,i}$
• Compute the biased loss estimate ("implicit exploration"):
  $\hat{\ell}_{t,i} = \dfrac{\ell_{t,i}}{p_{t,i} + \gamma}\, \mathbf{1}\{I_t = i\}$
  (Kocák, Neu, Valko and Munos, 2015)
• Use $\hat{\ell}_{t,i}$ in a black-box online learning algorithm to compute $p_{t+1}$
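A minimal sketch of the modified estimator (illustrative Python with a fixed bias parameter gamma; the talk's tuning of $\gamma$ is not shown in this excerpt). The key property is that the estimate is optimistic in expectation: $\mathbf{E}\big[\hat{\ell}_{t,i} \mid p_t\big] = \dfrac{p_{t,i}}{p_{t,i}+\gamma}\, \ell_{t,i} \le \ell_{t,i}$.

```python
import numpy as np

def ix_loss_estimate(observed_loss, probs, chosen_arm, gamma):
    """Implicit-exploration (IX) loss estimate.

    Compared with the unbiased estimate observed_loss / probs[chosen_arm],
    dividing by probs[chosen_arm] + gamma makes the estimate biased downwards:
    E[est[i]] = probs[i] / (probs[i] + gamma) * loss_i <= loss_i.
    """
    est = np.zeros(len(probs))
    est[chosen_arm] = observed_loss / (probs[chosen_arm] + gamma)
    return est
```

Plugged into the exponential-weights sketch above, only the line computing the loss estimate in the update step changes.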