
On adaptive regret bounds for non-stochastic bandits
Gergely Neu
INRIA Lille, SequeL team; Universitat Pompeu Fabra, Barcelona


  1. On adaptive regret bounds for non-stochastic bandits
     Gergely Neu
     INRIA Lille, SequeL team → Universitat Pompeu Fabra, Barcelona

  2. Outline
     • Online learning and bandits
     • Adaptive bounds in online learning
     • Adaptive bounds for bandits
       • What we already have
       • What’s new: first-order bounds
       • What may be possible
       • What seems impossible*
     *Opinion alert!

  3. Online learning and non-stochastic bandits
     For each round $t = 1, 2, \dots, T$:
     • Learner chooses action $I_t \in \{1, 2, \dots, N\}$
     • Environment chooses losses $\ell_{t,i} \in [0,1]$ for all $i$
     • Learner suffers loss $\ell_{t,I_t}$
     • Learner observes losses $\ell_{t,i}$ for all $i$

  4. Online learning and non-stochastic bandits
     Full information. For each round $t = 1, 2, \dots, T$:
     • Learner chooses action $I_t \in \{1, 2, \dots, N\}$
     • Environment chooses losses $\ell_{t,i} \in [0,1]$ for all $i$
     • Learner suffers loss $\ell_{t,I_t}$
     • Learner observes losses $\ell_{t,i}$ for all $i$
     Bandit feedback. For each round $t = 1, 2, \dots, T$:
     • Learner chooses action $I_t \in \{1, 2, \dots, N\}$
     • Environment chooses losses $\ell_{t,i} \in [0,1]$ for all $i$
     • Learner suffers loss $\ell_{t,I_t}$
     • Learner observes its own loss $\ell_{t,I_t}$

  5. Online learning and non-stochastic bandits
     Full information. For each round $t = 1, 2, \dots, T$:
     • Learner chooses action $I_t \in \{1, 2, \dots, N\}$
     • Environment chooses losses $\ell_{t,i} \in [0,1]$ for all $i$
     • Learner suffers loss $\ell_{t,I_t}$
     • Learner observes losses $\ell_{t,i}$ for all $i$
     Bandit feedback. For each round $t = 1, 2, \dots, T$:
     • Learner chooses action $I_t \in \{1, 2, \dots, N\}$
     • Environment chooses losses $\ell_{t,i} \in [0,1]$ for all $i$
     • Learner suffers loss $\ell_{t,I_t}$
     • Learner observes its own loss $\ell_{t,I_t}$
     Need to explore!
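The two feedback models differ only in what the learner gets to see after each round. A minimal sketch of the protocol, assuming the losses are given as a table with one row per round and one column per action; the `play` function and the `choose` callback are hypothetical names introduced here for illustration:

```python
import random


def play(losses, choose, bandit=True, seed=0):
    """Run the protocol on a loss table with entries in [0, 1].

    `choose(t, history, rng)` is a hypothetical learner callback that returns
    an action index; `history` holds what the learner has observed so far.
    """
    rng = random.Random(seed)
    history, total = [], 0.0
    for t, row in enumerate(losses):
        action = choose(t, history, rng)
        total += row[action]  # the learner suffers the loss of its own action
        if bandit:
            # Bandit feedback: only the chosen action's loss is revealed.
            history.append({action: row[action]})
        else:
            # Full information: the whole loss vector is revealed.
            history.append(dict(enumerate(row)))
    return total, history


def uniform(t, history, rng, n_actions=3):
    """Toy learner that ignores its observations and plays uniformly."""
    return rng.randrange(n_actions)


losses = [[0.1, 0.9, 0.5], [0.2, 0.8, 0.4]]
bandit_total, bandit_seen = play(losses, uniform, bandit=True)
full_total, full_seen = play(losses, uniform, bandit=False)
```

Under bandit feedback each history entry holds a single loss, so a learner that never tries an action never learns anything about it.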

  6. Minimax regret
     • Define (expected) regret against action $i$ as
       $R_{T,i} = \mathbb{E}\left[\sum_{t=1}^{T} \ell_{t,I_t} - \sum_{t=1}^{T} \ell_{t,i}\right]$
     • Goal: minimize regret against the best action $i^*$:
       $R_T = R_{T,i^*} = \max_i R_{T,i}$

  7. Minimax regret
     • Define (expected) regret against action $i$ as
       $R_{T,i} = \mathbb{E}\left[\sum_{t=1}^{T} \ell_{t,I_t} - \sum_{t=1}^{T} \ell_{t,i}\right]$
     • Goal: minimize regret against the best action $i^*$:
       $R_T = R_{T,i^*} = \max_i R_{T,i}$
     • Full information: $R_T = \Theta(\sqrt{T \log N})$
     • Bandit: $R_T = \Theta(\sqrt{NT})$
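The regret definition can be evaluated directly once a loss table and a sequence of played actions are fixed. A minimal sketch; note that it computes the realized regret of one fixed action sequence, not the expectation over the learner's randomness, and the function names are hypothetical:

```python
def regret_against(losses, actions, i):
    """Learner's total loss minus the total loss of the fixed action i."""
    incurred = sum(row[a] for row, a in zip(losses, actions))
    return incurred - sum(row[i] for row in losses)


def regret(losses, actions):
    """Regret against the best fixed action in hindsight."""
    n_actions = len(losses[0])
    return max(regret_against(losses, actions, i) for i in range(n_actions))


# Three rounds, two actions; the learner always plays action 1.
losses = [[0, 1], [1, 0], [0, 1]]
actions = [1, 1, 1]
print(regret(losses, actions))  # prints 1
```

Here the learner's total loss is 2 while action 0 would have suffered 1, so the regret is 1.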

  8. Beyond minimax: i.i.d. losses
     • Full information: minimax $R_T = \Theta(\sqrt{T \log N})$; i.i.d. $\Theta(\log N)$
     • Bandit: minimax $R_T = \Theta(\sqrt{NT})$; i.i.d. $\Theta(N \log T)$

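  9. Beyond minimax: “Higher-order” bounds
     • Minimax: full information $R_T = O(\sqrt{T \log N})$; bandit $R_T = O(\sqrt{NT})$
     • First-order, with $L_{T,i} = \sum_t \ell_{t,i}$: full information $R_T = O(\sqrt{L_{T,i^*} \log N})$
     • Second-order, with $S_{T,i} = \sum_t \ell_{t,i}^2$: full information $R_T = O(\sqrt{S_{T,i^*} \log N})$, Cesa-Bianchi, Mansour, Stoltz (2005)
     • Variance, with $V_{T,i} = \sum_t (\ell_{t,i} - m)^2$: full information $R_T = O(\sqrt{V_{T,i^*} \log N})$, Hazan and Kale (2010) (with a little cheating)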

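  10. Beyond minimax: “Higher-order” bounds
     • Minimax: full information $R_T = O(\sqrt{T \log N})$; bandit $R_T = O(\sqrt{NT})$
     • First-order, with $L_{T,i} = \sum_t \ell_{t,i}$: full information $R_T = O(\sqrt{L_{T,i^*} \log N})$
     • Second-order, with $S_{T,i} = \sum_t \ell_{t,i}^2$: full information $R_T = O(\sqrt{S_{T,i^*} \log N})$, Cesa-Bianchi, Mansour, Stoltz (2005); bandit $R_T = O(\sqrt{\sum_i S_{T,i}})$, Auer et al. (2002) + some hacking
     • Variance, with $V_{T,i} = \sum_t (\ell_{t,i} - m)^2$: full information $R_T = O(\sqrt{V_{T,i^*} \log N})$, Hazan and Kale (2010); bandit $R_T = O(N^2 \sqrt{\sum_i V_{T,i}})$, Hazan and Kale (2011) (with a little cheating)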

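  11. Beyond minimax: “Higher-order” bounds
     • Minimax: full information $R_T = O(\sqrt{T \log N})$; bandit $R_T = O(\sqrt{NT})$
     • First-order, with $L_{T,i} = \sum_t \ell_{t,i}$: full information $R_T = O(\sqrt{L_{T,i^*} \log N})$; bandit: (it’s complicated)
     • Second-order, with $S_{T,i} = \sum_t \ell_{t,i}^2$: full information $R_T = O(\sqrt{S_{T,i^*} \log N})$, Cesa-Bianchi, Mansour, Stoltz (2005); bandit $R_T = O(\sqrt{\sum_i S_{T,i}})$, Auer et al. (2002) + some hacking
     • Variance, with $V_{T,i} = \sum_t (\ell_{t,i} - m)^2$: full information $R_T = O(\sqrt{V_{T,i^*} \log N})$, Hazan and Kale (2010); bandit $R_T = O(N^2 \sqrt{\sum_i V_{T,i}})$, Hazan and Kale (2011) (with a little cheating)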
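The three data-dependent quantities in the table are simple statistics of one action's loss sequence. A minimal sketch, assuming that the centering term in the variance is the empirical mean of that action's losses (the slide does not spell this out) and using a hypothetical function name:

```python
def loss_statistics(column):
    """Return (L, S, V) for one action's loss sequence with values in [0, 1].

    L: cumulative loss, S: sum of squared losses,
    V: sum of squared deviations from the empirical mean (an assumption here).
    """
    rounds = len(column)
    cumulative = sum(column)
    squared = sum(x * x for x in column)
    mean = cumulative / rounds
    variation = sum((x - mean) ** 2 for x in column)
    return cumulative, squared, variation


L, S, V = loss_statistics([0.0, 1.0, 0.0, 1.0])
print(L, S, V)  # prints 2.0 2.0 1.0
```

For losses in [0, 1] these satisfy V ≤ S ≤ L ≤ T, which is why each row of the table can only improve on the row above it.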

  12. First-order bounds for bandits (it’s complicated)
     • “Small-gain” bounds:
       • Consider the gain game with $g_{t,i} = 1 - \ell_{t,i}$ and $G_{T,i} = \sum_t g_{t,i}$
       • Auer, Cesa-Bianchi, Freund and Schapire (2002): $R_T = O(\sqrt{N G_{T,i^*} \log N})$

  13. First-order bounds for bandits (it’s complicated)
     • “Small-gain” bounds:
       • Consider the gain game with $g_{t,i} = 1 - \ell_{t,i}$ and $G_{T,i} = \sum_t g_{t,i}$
       • Auer, Cesa-Bianchi, Freund and Schapire (2002): $R_T = O(\sqrt{N G_{T,i^*} \log N})$
     Problem: only good if best expert is bad!
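A quick numeric illustration of why a small-gain bound only helps when the best action is bad. This is a sketch that drops the constants hidden by the O-notation; the numbers (10 actions, 10,000 rounds, best action with total loss 100) are made up for illustration, and `loss_based_target` is the hypothetical quantity a first-order bound would scale with, not a bound claimed on the slide:

```python
import math


def small_gain_bound(n_actions, best_gain):
    """sqrt(N * G * log N): the small-gain rate with constants dropped."""
    return math.sqrt(n_actions * best_gain * math.log(n_actions))


def loss_based_target(n_actions, best_loss):
    """Same expression with the best action's loss in place of its gain."""
    return math.sqrt(n_actions * best_loss * math.log(n_actions))


n_actions, rounds, best_loss = 10, 10_000, 100
best_gain = rounds - best_loss  # gains are 1 - loss, summed over all rounds

gain_rate = small_gain_bound(n_actions, best_gain)
loss_rate = loss_based_target(n_actions, best_loss)
ratio = gain_rate / loss_rate  # equals sqrt(best_gain / best_loss)
```

When the best action is good, its gain is close to the number of rounds, so the small-gain rate is essentially the minimax rate and roughly ten times larger than the loss-based quantity in this example.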

  14. First-order bounds for bandits (it’s complicated)
     • “Small-gain” bounds: $R_T = O(\sqrt{N G_{T,i^*} \log N})$
     • A little trickier analysis gives
       $R_T = O(\sqrt{\sum_t \sum_i g_{t,i} \log N})$ or $R_T = O(\sqrt{\sum_t \sum_i \ell_{t,i} \log N})$

  15. First-order bounds for bandits (it’s complicated)
     • “Small-gain” bounds: $R_T = O(\sqrt{N G_{T,i^*} \log N})$
     • A little trickier analysis gives
       $R_T = O(\sqrt{\sum_t \sum_i g_{t,i} \log N})$ or $R_T = O(\sqrt{\sum_t \sum_i \ell_{t,i} \log N})$
     Problem: one misbehaving action ruins the bound!

  16. First-order bounds for bandits (it’s complicated)
     • “Small-gain” bounds: $R_T = O(\sqrt{N G_{T,i^*} \log N})$
     • A little trickier analysis gives $R_T = O(\sqrt{\sum_t \sum_i \ell_{t,i} \log N})$
     • Some obscure actual first-order bounds:
       › Stoltz (2005): $N \sqrt{L_T^*}$
       › Allenberg, Auer, Györfi and Ottucsák (2006): $\sqrt{N L_T^*}$
       › Rakhlin and Sridharan (2013): $N^{3/2} \sqrt{L_T^*}$

  17. First-order bounds for bandits (it’s complicated)
     • “Small-gain” bounds: $R_T = O(\sqrt{N G_{T,i^*} \log N})$
     • A little trickier analysis gives $R_T = O(\sqrt{\sum_t \sum_i \ell_{t,i} \log N})$
     • Some obscure actual first-order bounds:
       › Stoltz (2005): $N \sqrt{L_T^*}$
       › Allenberg, Auer, Györfi and Ottucsák (2006): $\sqrt{N L_T^*}$
       › Rakhlin and Sridharan (2013): $N^{3/2} \sqrt{L_T^*}$
     Problem: no real insight from analyses!

  18. First-order bounds for non-stochastic bandits

  19. A typical bandit algorithm
     For every round $t = 1, 2, \dots, T$:
     • Choose arm $I_t = i$ with probability $p_{t,i}$
     • Compute unbiased loss estimate $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}} \mathbb{1}\{I_t = i\}$
     • Use $\hat\ell_{t,i}$ in a black-box online learning algorithm to compute $\boldsymbol{p}_{t+1}$
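The three steps above can be sketched in a few lines. This version plugs in exponential weights as the black-box online learning algorithm, which is an assumption made here (the slide leaves the black box unspecified); the function name `exp3_sketch` and the toy loss table are also illustrative:

```python
import math
import random


def exp3_sketch(losses, eta, seed=0):
    """Importance-weighted bandit learner with exponential weights inside.

    `losses` has one row per round and one column per arm, entries in [0, 1].
    Returns the learner's total loss and the cumulative loss estimates.
    """
    rng = random.Random(seed)
    n_arms = len(losses[0])
    cum_est = [0.0] * n_arms  # running sums of the loss estimates per arm
    total = 0.0
    for row in losses:
        # Sampling distribution: weights exp(-eta * estimate), normalized.
        # Subtracting the minimum keeps the exponentials numerically stable.
        low = min(cum_est)
        weights = [math.exp(-eta * (c - low)) for c in cum_est]
        norm = sum(weights)
        probs = [w / norm for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        total += row[arm]
        # Unbiased estimate: observed loss divided by the probability of
        # playing that arm; all other arms get an estimate of zero.
        cum_est[arm] += row[arm] / probs[arm]
    return total, cum_est


# Toy run: arm 0 always has loss 0, arm 1 always has loss 1.
toy_losses = [[0.0, 1.0]] * 200
total, estimates = exp3_sketch(toy_losses, eta=0.1)
```

Only the played arm's estimate changes in a round, and dividing by a small probability is what makes the estimates large for rarely played arms.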

  20. A typical regret bound
     $R_T \le \frac{\log N}{\eta} + \eta\, \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i}\, \hat\ell_{t,i}^2\right]$
     ($\eta$: “learning rate”)

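  21. A typical regret bound
     $R_T \le \frac{\log N}{\eta} + \eta\, \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i}\, \hat\ell_{t,i}^2\right]$
     $\le \frac{\log N}{\eta} + \eta\, \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} \hat\ell_{t,i}\right]$
     ($\eta$: “learning rate”)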

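  22. A typical regret bound
     $R_T \le \frac{\log N}{\eta} + \eta\, \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i}\, \hat\ell_{t,i}^2\right]$
     $\le \frac{\log N}{\eta} + \eta\, \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} \hat\ell_{t,i}\right]$
     $= \frac{\log N}{\eta} + \eta \sum_{i=1}^{N} L_{T,i}$
     ($\eta$: “learning rate”)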

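  23. A typical regret bound
     $R_T \le \frac{\log N}{\eta} + \eta\, \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i}\, \hat\ell_{t,i}^2\right]$
     $\le \frac{\log N}{\eta} + \eta\, \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} \hat\ell_{t,i}\right]$
     $= \frac{\log N}{\eta} + \eta \sum_{i=1}^{N} L_{T,i}$
     $= O\left(\sqrt{\sum_{i=1}^{N} L_{T,i}}\right)$ (for appropriate $\eta$)
     ($\eta$: “learning rate”)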
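The “appropriate” learning rate in the last step can be made explicit. A short worked derivation, assuming the sum of the cumulative losses of all actions is known in advance when tuning the learning rate (an assumption made here for illustration), and keeping the constant and the logarithmic factor rather than absorbing them:

```latex
\begin{align*}
f(\eta) &= \frac{\log N}{\eta} + \eta \sum_{i=1}^{N} L_{T,i}, \\
f'(\eta) = 0 \;&\Longrightarrow\; \eta^\ast = \sqrt{\frac{\log N}{\sum_{i=1}^{N} L_{T,i}}}, \\
R_T \le f(\eta^\ast) &= 2 \sqrt{\log N \sum_{i=1}^{N} L_{T,i}}.
\end{align*}
```

The two terms of the bound are equal at this choice, which is the usual balancing argument; the result agrees with the rate stated on the slide up to the logarithmic factor.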

  24. A typical regret bound
     $R_T = O\left(\sqrt{\sum_{i=1}^{N} L_{T,i}}\right)$

  25. A typical regret bound
     $R_T = O\left(\sqrt{\sum_{i=1}^{N} L_{T,i}}\right)$
     It’s all because $\mathbb{E}[\hat L_{T,i}] = L_{T,i}$!!!
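The identity blamed on this slide follows in one line from the importance-weighted estimate. A sketch of the computation, assuming every sampling probability is strictly positive, that the losses are fixed before the learner randomizes, and writing $\mathcal{F}_{t-1}$ for the history up to round $t-1$ and $\hat L_{T,i}$ for the sum of the per-round estimates (both notations are introduced here, not taken from the slides):

```latex
\begin{align*}
\mathbb{E}\bigl[\hat\ell_{t,i} \,\big|\, \mathcal{F}_{t-1}\bigr]
  &= \sum_{j=1}^{N} p_{t,j}\, \frac{\ell_{t,i}}{p_{t,i}}\, \mathbb{1}\{j = i\}
   = \ell_{t,i}, \\
\mathbb{E}\bigl[\hat L_{T,i}\bigr]
  &= \sum_{t=1}^{T} \mathbb{E}\bigl[\hat\ell_{t,i}\bigr]
   = \sum_{t=1}^{T} \ell_{t,i}
   = L_{T,i}.
\end{align*}
```

Because the estimate is unbiased for every action, the bound inherits the cumulative loss of every action, including the bad ones.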

  26. A typical regret bound
     $R_T = O\left(\sqrt{\sum_{i=1}^{N} L_{T,i}}\right)$
     It’s all because $\mathbb{E}[\hat L_{T,i}] = L_{T,i}$!!!
     Idea: try to enforce $\mathbb{E}[\hat L_{T,i}] = O(L_{T,i^*})$

  27. A typical regret bound
     $R_T = O\left(\sqrt{\sum_{i=1}^{N} L_{T,i}}\right)$
     It’s all because $\mathbb{E}[\hat L_{T,i}] = L_{T,i}$!!!
     Idea: try to enforce $\mathbb{E}[\hat L_{T,i}] = O(L_{T,i^*})$
     Need optimistic estimates!

  28. A typical algorithm – fixed!
     For every round $t = 1, 2, \dots, T$:
     • Choose arm $I_t = i$ with probability $p_{t,i}$
     • Compute unbiased loss estimate $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}} \mathbb{1}\{I_t = i\}$
     • Use $\hat\ell_{t,i}$ in a black-box online learning algorithm to compute $\boldsymbol{p}_{t+1}$

  29. A typical algorithm – fixed!
     For every round $t = 1, 2, \dots, T$:
     • Choose arm $I_t = i$ with probability $p_{t,i}$
     • Compute biased loss estimate $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma} \mathbb{1}\{I_t = i\}$, replacing the unbiased $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}} \mathbb{1}\{I_t = i\}$
       (“Implicit exploration”; Kocák, N, Valko and Munos, 2015)
     • Use $\hat\ell_{t,i}$ in a black-box online learning algorithm to compute $\boldsymbol{p}_{t+1}$
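The effect of the extra term in the denominator can be checked with an exact expectation over the learner's single random choice. A minimal sketch with hypothetical function names; `gamma` plays the role of the implicit exploration parameter, and setting it to zero recovers the unbiased estimate:

```python
import math


def ix_estimate(loss, prob, played, gamma):
    """Loss estimate for one arm: loss / (prob + gamma) if the arm was played,
    zero otherwise."""
    return loss / (prob + gamma) if played else 0.0


def expected_estimate(loss, prob, gamma):
    """Exact expectation of the estimate: the arm is played with probability
    `prob`, and the estimate is zero in every other case."""
    return prob * ix_estimate(loss, prob, True, gamma)


unbiased = expected_estimate(loss=0.5, prob=0.25, gamma=0.0)
optimistic = expected_estimate(loss=0.5, prob=0.25, gamma=0.25)
print(optimistic)  # prints 0.25
```

The expectation equals the true loss scaled by prob / (prob + gamma), so the estimate never overshoots on average and undershoots most for rarely played arms, which is the optimism the previous slides ask for.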
