Meta-Learning Contextual Bandit Exploration
Amr Sharaf, University of Maryland, amr@cs.umd.edu
Hal Daumé III, Microsoft Research & University of Maryland, me@hal3.name
Can we learn to explore in contextual bandits?
Contextual Bandits: News Display
[Figure: a news page mock-up; newly displayed articles are marked NEW.]
Goal: Maximize Sum of Rewards
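To make the setting concrete, here is a minimal sketch of the contextual bandit protocol behind the news-display example: at each round we see a context, display one article, and observe a reward only for that article. The environment, names, and the placeholder random policy below are illustrative assumptions, not part of the talk.

import numpy as np

rng = np.random.default_rng(1)
N_ARTICLES, N_ROUNDS, DIM = 4, 1000, 5
weights = rng.normal(size=(N_ARTICLES, DIM))        # hidden per-article click model

def click_probability(article, context):
    return 1.0 / (1.0 + np.exp(-weights[article] @ context))

total_reward = 0.0
for t in range(N_ROUNDS):
    context = rng.normal(size=DIM)                   # user / context features x_t
    article = int(rng.integers(N_ARTICLES))          # placeholder policy: display at random
    reward = rng.binomial(1, click_probability(article, context))
    total_reward += reward
    # Bandit feedback: only the reward of the displayed article is observed;
    # the rewards of the articles we did not show remain unknown.
print("average observed reward:", total_reward / N_ROUNDS)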
Training Mêlée by Imitation
[Diagram: a timeline over examples/time (…, t-1, t, …). We roll in with the learned policy π up to step t-1; at step t we deviate into two branches, exploit (loss_exploit) and explore (loss_explore); each branch is rolled out with the expert π*, which is available at training time. Goal: learn π.]
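The toy sketch below illustrates only the roll-in / deviate / roll-out structure of the diagram. It is not the paper's algorithm: Mêlée uses AggreVaTe-style cost-sensitive classification and an expert/reference roll-out policy π*, whereas this toy rolls out by simply following the updated internal predictor, and every name here (AvgRewardPredictor, explore_features, rollout) is hypothetical.

import copy
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N_ACTIONS, HORIZON, N_PROBLEMS = 3, 30, 20

class AvgRewardPredictor:
    """Tiny stand-in for the underlying policy f: per-action average-reward estimates."""
    def __init__(self):
        self.sums = np.zeros(N_ACTIONS)
        self.counts = np.ones(N_ACTIONS)          # add-one smoothing
    def probs(self):
        est = self.sums / self.counts
        e = np.exp(5.0 * (est - est.max()))       # turn estimates into a distribution
        return e / e.sum()
    def update(self, action, reward):
        self.sums[action] += reward
        self.counts[action] += 1

def explore_features(probs, t):
    """Simple meta-features for this toy: the probabilities, their entropy, the step."""
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return np.concatenate([probs, [entropy, t / HORIZON]])

def rollout(f, rewards, t, first_action):
    """Take first_action at step t, then act greedily with the updated predictor f."""
    f = copy.deepcopy(f)
    total, action = 0.0, first_action
    for s in range(t, HORIZON):
        r = rewards[s, action]
        total += r
        f.update(action, r)
        action = int(np.argmax(f.probs()))
    return total

X_meta, y_branch = [], []                         # training data for the exploration policy
for _ in range(N_PROBLEMS):
    # Fully labeled training problem: rewards of every action at every step are known,
    # so an omniscient reference is available at training time (unlike at test time).
    means = rng.uniform(0.1, 0.9, size=N_ACTIONS)
    rewards = rng.binomial(1, means, size=(HORIZON, N_ACTIONS)).astype(float)
    f = AvgRewardPredictor()
    for t in range(HORIZON):
        probs = f.probs()
        exploit_a = int(np.argmax(probs))          # exploit branch: trust f
        explore_a = int(rng.integers(N_ACTIONS))   # explore branch: try something else
        loss_exploit = -rollout(f, rewards, t, exploit_a)
        loss_explore = -rollout(f, rewards, t, explore_a)
        X_meta.append(explore_features(probs, t))
        y_branch.append(int(loss_explore < loss_exploit))   # 1 means "explore won"
        # Roll-in: continue the trajectory (the paper rolls in with the learned pi;
        # this toy simply continues with whichever branch looked better).
        a = explore_a if y_branch[-1] else exploit_a
        f.update(a, rewards[t, a])

if len(set(y_branch)) > 1:                         # need both labels to fit a classifier
    pi = LogisticRegression(max_iter=500).fit(X_meta, y_branch)
    fresh = explore_features(np.ones(N_ACTIONS) / N_ACTIONS, 0)
    print("P(explore) in a maximally uncertain state:", pi.predict_proba([fresh])[0, 1])

The design point the sketch tries to convey is that, because training problems are fully labeled, both the explore and exploit branches can be evaluated and the exploration policy can be trained by imitation to pick the better one.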
Generalization: Meta-Features
- No direct dependency on the contexts x.
- Features include (a sketch of the feature vector follows this list):
  - Calibrated predicted probability p(a_t | f_t, x_t);
  - Entropy of the predicted probability distribution;
  - A one-hot encoding for the predicted action f_t(x_t);
  - Current time step t;
  - Average observed rewards for each action.
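A minimal sketch of this feature vector, assuming the inputs listed above are already available; the feature list is from the slide, but the implementation details (ordering, smoothing constant, function name) are assumptions.

import numpy as np

def meta_features(action_probs, predicted_action, t, avg_rewards):
    """Meta-features for the exploration policy; none depend on the raw context x_t.

    action_probs     : calibrated probabilities p(a_t | f_t, x_t) over all actions
    predicted_action : f_t(x_t), the action the underlying classifier would choose
    t                : current time step
    avg_rewards      : running average of observed rewards for each action
    """
    entropy = -np.sum(action_probs * np.log(action_probs + 1e-12))
    one_hot = np.eye(len(action_probs))[predicted_action]
    return np.concatenate([
        [action_probs[predicted_action]],  # calibrated probability of the predicted action
        [entropy],                         # uncertainty of the prediction
        one_hot,                           # one-hot encoding of f_t(x_t)
        [t],                               # current time step
        avg_rewards,                       # average observed reward per action
    ])

# Example: 3 actions, classifier fairly confident in action 1 at step t = 10.
print(meta_features(np.array([0.2, 0.7, 0.1]), 1, 10, np.array([0.3, 0.5, 0.1])))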
[Figure: a representative learning curve.]
Win / Loss Statistics
Win statistics: each (row, column) entry shows the number of times the row algorithm won against the column algorithm, minus the number of losses.
Theoretical Guarantees
- The no-regret property of AggreVaTe can be leveraged in our meta-learning setting.
- We relate the regret of the learner to the overall regret of π.
- This shows that, if the underlying classifier improves sufficiently quickly, Mêlée will achieve sublinear regret.
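As a schematic illustration of the argument's shape (not the paper's exact theorem statement): if the per-round excess loss is controlled by the supervised error ε_t of the underlying classifier, then a quickly shrinking ε_t gives sublinear cumulative regret, e.g.

\[
  \mathrm{Regret}(T) \;\lesssim\; \sum_{t=1}^{T} \varepsilon_t,
  \qquad\text{so if } \varepsilon_t = O\!\big(t^{-1/2}\big) \text{ then } \mathrm{Regret}(T) = O\!\big(\sqrt{T}\big) = o(T).
\]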
Conclusion
- Q: Can we learn to explore in contextual bandits?
- A: Yes, by imitating an expert exploration policy;
- Generalize across bandit problems using meta-features;
- Outperform alternative strategies in most settings;
- We provide theoretical guarantees.