On adaptive regret bounds for non-stochastic bandits, Gergely Neu - PowerPoint PPT Presentation
On adaptive regret bounds for non-stochastic bandits

Gergely Neu
INRIA Lille, SequeL team → Universitat Pompeu Fabra, Barcelona
Outline

• Online learning and bandits
• Adaptive bounds in online learning
• Adaptive bounds for bandits
  • What we already have
  • What's new: first-order bounds
  • What may be possible
  • What seems impossible* (*opinion alert!)
Online learning and non-stochastic bandits

Full information:
For each round t = 1, 2, …, T
• Learner chooses action I_t ∈ {1, 2, …, N}
• Environment chooses losses ℓ_{t,i} ∈ [0,1] for all i
• Learner suffers loss ℓ_{t,I_t}
• Learner observes the losses ℓ_{t,i} for all i

Bandit:
For each round t = 1, 2, …, T
• Learner chooses action I_t ∈ {1, 2, …, N}
• Environment chooses losses ℓ_{t,i} ∈ [0,1] for all i
• Learner suffers loss ℓ_{t,I_t}
• Learner observes only its own loss ℓ_{t,I_t} → need to explore!
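As an aside, the bandit protocol above is easy to simulate. The sketch below is purely illustrative: the uniform-random learner and the toy loss sequence are made up for the example, not part of the talk.

```python
import random

def run_bandit(losses, choose_arm, seed=0):
    """Play the bandit protocol: each round the learner picks one arm
    and is shown only the loss of that arm (bandit feedback)."""
    rng = random.Random(seed)
    total = 0.0
    for t in range(len(losses)):
        i = choose_arm(rng)        # learner chooses I_t
        total += losses[t][i]      # suffers ℓ_{t,I_t}; only this value is observed
    return total

# Toy example: N = 2 arms, arm 0 always loses 0.1, arm 1 always loses 0.9.
losses = [[0.1, 0.9] for _ in range(100)]
uniform = lambda rng: rng.randrange(2)   # pure exploration: pick uniformly
total = run_bandit(losses, uniform)
print(total)   # around 50 in expectation; the best fixed arm would pay 10
```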
Minimax regret

• Define the (expected) regret against action i as
  R_{T,i} = E[ Σ_{t=1}^T ℓ_{t,I_t} ] − Σ_{t=1}^T ℓ_{t,i}
• Goal: minimize the regret against the best action i*:
  R_T = R_{T,i*} = max_i R_{T,i}

Full information: R_T = Θ(√(T log N))        Bandit: R_T = Θ(√(NT))
Beyond minimax: i.i.d. losses

Full information: R_T = Θ(√(T log N)) → Θ(log N)
Bandit: R_T = Θ(√(NT)) → Θ(N log T)
Beyond minimax: “higher-order” bounds

• Minimax: full information R_T = O(√(T log N)); bandit R_T = O(√(NT)).
• First-order: full information R_T = O(√(L_{T,i*} log N)), where L_{T,i} = Σ_t ℓ_{t,i}; bandit: (it's complicated).
• Second-order: full information R_T = O(√(Q_{T,i*} log N)), where Q_{T,i} = Σ_t ℓ_{t,i}² (Cesa-Bianchi, Mansour and Stoltz, 2005); bandit R_T = O(√(Σ_i Q_{T,i})) (Auer et al., 2002, + some hacking).
• Variance: full information R_T = O(√(V_{T,i*} log N)), where V_{T,i} = Σ_t (ℓ_{t,i} − μ)² (Hazan and Kale, 2010); bandit R_T = O(N² √(Σ_i V_{T,i})) (Hazan and Kale, 2011, with a little cheating).
First-order bounds for bandits (it's complicated)

• “Small-gain” bounds:
  • Consider the gain game with g_{t,i} = 1 − ℓ_{t,i}
  • Auer, Cesa-Bianchi, Freund and Schapire (2002): R_T = O(√(N G_{T,i*} log N)), where G_{T,i} = Σ_t g_{t,i}
  Problem: only good if the best expert is bad!
First-order bounds for bandits (it's complicated)

• “Small-gain” bounds: R_T = O(√(N G_{T,i*} log N))
• A slightly trickier analysis gives
  R_T = O(√(Σ_t Σ_i g_{t,i} log N))  or  R_T = O(√(Σ_t Σ_i ℓ_{t,i} log N))
  Problem: one misbehaving action ruins the bound!
First-order bounds for bandits (it's complicated)

• “Small-gain” bounds: R_T = O(√(N G_{T,i*} log N))
• A slightly trickier analysis gives R_T = O(√(Σ_t Σ_i ℓ_{t,i} log N))
• Some obscure actual first-order bounds:
  • Stoltz (2005): O(N √(L_{T,i*}))
  • Allenberg, Auer, Györfi and Ottucsák (2006): O(√(N L_{T,i*}))
  • Rakhlin and Sridharan (2013): O(N^{3/2} √(L_{T,i*}))
  Problem: no real insight from the analyses!
First-order bounds for non-stochastic bandits
A typical bandit algorithm

For every round t = 1, 2, …, T
• Choose arm I_t = i with probability p_{t,i}
• Compute the unbiased loss estimate
  ℓ̂_{t,i} = (ℓ_{t,i} / p_{t,i}) · 𝟙{I_t = i}
• Use ℓ̂_{t,i} in a black-box online learning algorithm to compute p_{t+1}
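The key property of this estimator is unbiasedness: E[ℓ̂_{t,i}] = p_{t,i} · (ℓ_{t,i}/p_{t,i}) = ℓ_{t,i}. A small Monte Carlo sketch of this fact (all the concrete numbers are made up for illustration):

```python
import random

def iw_estimate(ell, p, I, i):
    """Importance-weighted estimate ℓ̂_{t,i} = (ℓ_{t,i}/p_{t,i}) · 1{I_t = i}."""
    return ell[i] / p[i] if I == i else 0.0

rng = random.Random(1)
p = [0.2, 0.3, 0.5]      # learner's distribution p_t over N = 3 arms
ell = [0.4, 0.8, 0.1]    # true (hidden) losses ℓ_t
n = 200_000
avg = [0.0] * 3
for _ in range(n):
    I = rng.choices(range(3), weights=p)[0]   # draw I_t ~ p_t
    for i in range(3):
        avg[i] += iw_estimate(ell, p, I, i) / n

print(avg)   # each entry concentrates near the corresponding true loss
```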
A typical regret bound  (η: “learning rate”)

R_T ≤ log N / η + η · E[ Σ_{t=1}^T Σ_{i=1}^N p_{t,i} ℓ̂_{t,i}² ]
    ≤ log N / η + η · E[ Σ_{t=1}^T Σ_{i=1}^N ℓ̂_{t,i} ]
    = log N / η + η · Σ_{i=1}^N L_{T,i}
    = O( √( Σ_{i=1}^N L_{T,i} · log N ) )    (for an appropriate η)
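The last step just balances the two terms in η: f(η) = log N/η + η·Σ_i L_{T,i} is minimized at η = √(log N / Σ_i L_{T,i}), where it equals 2·√(Σ_i L_{T,i} · log N). A quick numeric check of this calculus step (N and the total loss are made-up numbers):

```python
import math

def f(eta, logN, L):
    # the two terms of the bound: log N / η + η · Σ_i L_{T,i}
    return logN / eta + eta * L

logN, L = math.log(10), 500.0
eta_star = math.sqrt(logN / L)     # the balancing learning rate
best = f(eta_star, logN, L)
assert abs(best - 2 * math.sqrt(L * logN)) < 1e-9   # f(η*) = 2·sqrt(L·log N)
for eta in (eta_star / 2, eta_star * 2, 0.01, 0.2): # any other η does worse
    assert f(eta, logN, L) >= best
print(eta_star, best)
```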
A typical regret bound

R_T = O( √( Σ_{i=1}^N L_{T,i} ) )

It's all because E[L̂_{T,i}] = L_{T,i} !!!
Idea: try to enforce E[L̂_{T,i}] = O(L_{T,i*}) → need optimistic estimates!
A typical algorithm – fixed!

For every round t = 1, 2, …, T
• Choose arm I_t = i with probability p_{t,i}
• Compute the biased loss estimate
  ℓ̂_{t,i} = (ℓ_{t,i} / (p_{t,i} + γ)) · 𝟙{I_t = i}
  (“implicit exploration”; Kocák, N, Valko and Munos, 2015)
• Use ℓ̂_{t,i} in a black-box online learning algorithm to compute p_{t+1}
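Why is this estimate optimistic? The indicator fires with probability p_{t,i}, so E[ℓ̂_{t,i}] = p_{t,i} · ℓ_{t,i} / (p_{t,i} + γ) ≤ ℓ_{t,i}: the estimate is biased downward, and the bias vanishes as γ → 0. This can be checked exactly (the concrete numbers are illustrative):

```python
def ix_expectation(ell, p, gamma):
    """Exact expectation of the IX estimate ℓ̂ = ℓ/(p+γ) · 1{I = i}:
    the indicator equals 1 with probability p."""
    return p * ell / (p + gamma)

ell, p = 0.6, 0.25
for gamma in (0.0, 0.01, 0.1):
    e = ix_expectation(ell, p, gamma)
    assert e <= ell + 1e-12    # always biased downward: optimistic
    print(gamma, e)

# γ = 0 recovers the unbiased importance-weighted estimate
assert abs(ix_expectation(ell, p, 0.0) - ell) < 1e-12
```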