Meta-Learning Contextual Bandit Exploration


  1. Meta-Learning Contextual Bandit Exploration. Amr Sharaf (University of Maryland, amr@cs.umd.edu) and Hal Daumé III (Microsoft Research & University of Maryland, me@hal3.name).

  2. Can we learn to explore in contextual bandits?

  3. Contextual Bandits: News Display

  4. Contextual Bandits: News Display [figure: news display with several articles marked NEW]

  5. Contextual Bandits: News Display [figure: news display with one article marked NEW]

  6. Contextual Bandits: News Display. Goal: maximize the sum of rewards.
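
A minimal sketch of the contextual-bandit interaction protocol behind this example, assuming a simulated news-display environment; all names and the uniform-exploration choice below are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_articles = 4          # candidate articles (arms)
d = 10                  # dimensionality of the user context
T = 1000                # number of rounds

# Hypothetical per-article click model, unknown to the learner.
true_weights = rng.normal(size=(n_articles, d))
total_reward = 0.0

for t in range(T):
    x_t = rng.normal(size=d)                     # observe a user context
    a_t = rng.integers(n_articles)               # policy picks one article (here: uniform exploration)
    click_prob = 1.0 / (1.0 + np.exp(-true_weights[a_t] @ x_t))
    r_t = float(rng.random() < click_prob)       # bandit feedback: reward only for the shown article
    total_reward += r_t                          # objective: maximize the sum of rewards

print("average reward:", total_reward / T)
```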

  7. Training Mêlée by Imitation. At training time we have access to a reference policy π*. Roll in with the learned policy π up to a deviation point t, branch on the candidate decisions (exploit vs. explore), roll out each branch with π* to measure loss_exploit and loss_explore, and use these losses to train π. Goal: learn π. [diagram: roll-in with π, deviation at step t, roll-out with π*, along an examples/time axis]
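
A hedged sketch of this training loop, assuming bandit episodes are simulated from fully labeled data so that a reference policy π* and full roll-out values are available; every name here (train_melee_epoch, rollout_value, episode.steps, pi.act, pi.update) is illustrative rather than the authors' implementation.

```python
def train_melee_epoch(pi, pi_star, episodes, meta_features, rollout_value):
    """One imitation-learning pass (illustrative sketch): roll in with the
    learned exploration policy pi, deviate at each step with every candidate
    action (exploit = the base classifier's prediction, or an exploratory
    action), roll out the remainder with the reference policy pi_star, and
    record a cost-sensitive example over meta-features."""
    examples = []
    for episode in episodes:
        history = []
        for t, (x_t, f_t) in enumerate(episode.steps):
            phi = meta_features(f_t, x_t, history, t)
            costs = []
            for a in range(episode.n_actions):
                # Deviation: take action a now, then complete the episode with pi_star.
                value = rollout_value(episode, history, t, first_action=a, policy=pi_star)
                costs.append(-value)               # low cost = high roll-out reward
            examples.append((phi, costs))
            # Roll-in: the trajectory itself is continued with the policy being learned.
            a_t = pi.act(phi)
            history.append((x_t, a_t, episode.reward(t, a_t)))
    pi.update(examples)                            # cost-sensitive (AggreVaTe-style) update
    return pi
```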

  8. Generalization: Meta-Features
     - No direct dependency on the contexts x.
     - Features include (a sketch of their computation follows this list):
       - the calibrated predicted probability p(a_t | f_t, x_t);
       - the entropy of the predicted probability distribution;
       - a one-hot encoding of the predicted action f_t(x_t);
       - the current time step t;
       - the average observed reward for each action.
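
A hedged sketch of how such a meta-feature vector could be assembled from the base classifier's predictions and running statistics, with no dependence on the raw context x; the function name and argument layout are assumptions for illustration, not the paper's interface.

```python
import numpy as np

def meta_features(probs, t, avg_reward_per_action):
    """Build the meta-feature vector from the (calibrated) predicted
    probability distribution `probs` over actions, the current time step `t`,
    and the average observed reward for each action so far."""
    probs = np.asarray(probs, dtype=float)
    predicted = int(np.argmax(probs))
    one_hot = np.zeros_like(probs)
    one_hot[predicted] = 1.0
    entropy = -np.sum(probs * np.log(probs + 1e-12))      # entropy of the prediction
    return np.concatenate([
        [probs[predicted]],                                # calibrated probability of the predicted action
        [entropy],
        one_hot,                                           # one-hot encoding of the predicted action
        [float(t)],                                        # current time step
        np.asarray(avg_reward_per_action, dtype=float),    # average observed reward per action
    ])

# Example: three actions, step t = 5.
print(meta_features([0.7, 0.2, 0.1], t=5, avg_reward_per_action=[0.3, 0.1, 0.0]))
```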

  9. A representative learning curve [figure]

  10. Win / Loss Statistics. Each (row, column) entry shows the number of times the row algorithm won against the column algorithm, minus the number of losses.

  11. Win / Loss Statistics (continued). Each (row, column) entry shows the number of times the row algorithm won against the column algorithm, minus the number of losses.

  12. Theoretical Guarantees
     - The no-regret property of AggreVaTe can be leveraged in our meta-learning setting.
     - We relate the regret of the learner to the overall regret of π.
     - This shows that, if the underlying classifier improves sufficiently quickly, Mêlée achieves sublinear regret.
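
For reference, a sketch of the quantity such a guarantee controls, written in standard contextual-bandit notation; the symbols below are illustrative and not the paper's exact statement.

```latex
% Cumulative regret of the learned policy \pi over T rounds,
% relative to an optimal policy \pi^\star (illustrative notation).
\[
  R_T \;=\; \sum_{t=1}^{T} \Bigl( r\bigl(x_t, \pi^{\star}(x_t)\bigr) - r\bigl(x_t, a_t\bigr) \Bigr),
  \qquad a_t \sim \pi(\cdot \mid x_t).
\]
% Sublinear regret means $R_T = o(T)$, i.e. $R_T / T \to 0$ as $T \to \infty$.
```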

  13. Conclusion
     - Q: Can we learn to explore in contextual bandits?
     - A: Yes, by imitating an expert exploration policy;
     - Generalize across bandit problems using meta-features;
     - Outperform alternative strategies in most settings;
     - We provide theoretical guarantees.
