Warm-starting contextual bandits: robustly combining supervised and bandit feedback
Chicheng Zhang 1, Alekh Agarwal 1, Hal Daumé III 1,2, John Langford 1, Sahand Negahban 3
1 Microsoft Research, 2 University of Maryland, 3 Yale University
Warm-starting contextual bandits
• Receive warm-starting examples S = {(x, c)} drawn from a distribution Q (fully labeled)
• For timestep t = 1, 2, ..., T:
  • Observe context x_t with associated cost vector c_t = (c_t(1), ..., c_t(K)), drawn from distribution D
  • Take an action a_t ∈ {1, ..., K}
  • Receive cost c_t(a_t) ∈ [0, 1]
• Goal: incur low cumulative cost Σ_{t=1}^T c_t(a_t) (a toy simulation of this protocol follows below)
[Diagram: interaction loop between the learning algorithm and the user]
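To make the protocol concrete, here is a minimal simulation of the loop above. The epsilon-greedy per-action regression learner, the toy cost generator, and all names in it are illustrative assumptions for this sketch, not the paper's learner or data.

```python
import numpy as np

# Minimal, self-contained simulation of the warm-start contextual bandit
# protocol. Everything below is a toy stand-in, not the paper's method.

rng = np.random.default_rng(0)
K, T, d, eps = 5, 2000, 10, 0.05            # actions, bandit rounds, context dim, exploration

true_w = rng.normal(size=(K, d))            # toy ground truth used to generate costs

def draw_context_and_costs():
    """Draw (x, c) with costs in [0, 1]^K; stands in for the distribution D."""
    x = rng.normal(size=d)
    c = 1.0 / (1.0 + np.exp(-true_w @ x))   # sigmoid keeps each cost in [0, 1]
    return x, c

# Simple per-action ridge regression of cost on context.
A = np.stack([np.eye(d) for _ in range(K)])  # regularized Gram matrix per action
b = np.zeros((K, d))

def predict(x):
    return np.array([np.linalg.solve(A[a], b[a]) @ x for a in range(K)])

def update(x, a, cost):
    A[a] += np.outer(x, x)
    b[a] += cost * x

# Warm-start phase: fully labeled examples S = {(x, c)}; drawn here from the
# same generator for simplicity, although the warm-start distribution Q may
# differ from D in general.
for _ in range(200):
    x, c = draw_context_and_costs()
    for a in range(K):                       # the full cost vector is observed
        update(x, a, c[a])

# Bandit phase: only the cost of the chosen action is revealed.
cumulative_cost = 0.0
for t in range(T):
    x, c = draw_context_and_costs()
    a = rng.integers(K) if rng.random() < eps else int(np.argmin(predict(x)))
    cumulative_cost += c[a]
    update(x, a, c[a])                       # bandit feedback: c[a] only

print(f"average cost over {T} bandit rounds: {cumulative_cost / T:.3f}")
```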
Warm-starting contextual bandits: motivation
• Some labeled examples often exist in applications, e.g.
  • News recommendation: editorial relevance annotations
  • Healthcare: historical medical records w/ prescribed treatments
• Leveraging historical data can reduce unsafe exploration
• Key challenge: the warm-start distribution Q may not be the same as the bandit distribution D
  • Editors fail to capture users' preferences
  • Medical record data may come from another population
• How to utilize the warm-starting examples robustly and effectively?
Algorithm & performance guarantees
• ARRoW-CB: iteratively finds the best relative weighting of warm-start and bandit examples to rapidly learn a good policy (a simplified sketch of the weighting idea follows below)
• Theorem (informal): compared to algorithms that ignore S*, the regret of ARRoW-CB is
  • never much worse (robustness)
  • much smaller, if Q and D are close enough and |S| is large enough
  (* S ~ Q is the warm-start data)
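To illustrate the weighting idea, the sketch below trains one simple policy per candidate weighting lambda of the warm-start examples and uses an importance-weighted progressive-validation estimate on the bandit stream to decide which weighting to trust. This is only a simplified illustration under toy assumptions (the environment, candidate set, epsilon-greedy learner, and all names are placeholders), not the ARRoW-CB update rule or its guarantees.

```python
import numpy as np

# Sketch of the weighting idea: one per-action ridge regressor per candidate
# weighting lambda, where warm-start examples enter with sample weight lambda
# and bandit examples with weight 1. Importance-weighted progressive
# validation on bandit feedback selects which weighting acts.

rng = np.random.default_rng(1)
K, d, T, eps = 5, 10, 3000, 0.05
true_w = rng.normal(size=(K, d))

def draw():
    x = rng.normal(size=d)
    return x, 1.0 / (1.0 + np.exp(-true_w @ x))        # costs in [0, 1]^K

# Warm-start set S; its costs are perturbed so that Q differs from D.
S = []
for _ in range(300):
    x, c = draw()
    S.append((x, np.clip(c + 0.3 * rng.standard_normal(K), 0, 1)))

lambdas = [0.0, 0.5, 1.0]                              # candidate warm-start weights
A = {lam: np.stack([np.eye(d)] * K) for lam in lambdas}
b = {lam: np.zeros((K, d)) for lam in lambdas}

def update(lam, x, a, cost, weight=1.0):
    A[lam][a] += weight * np.outer(x, x)
    b[lam][a] += weight * cost * x

def act(lam, x):
    preds = [np.linalg.solve(A[lam][a], b[lam][a]) @ x for a in range(K)]
    return int(np.argmin(preds))

for lam in lambdas:                                    # ingest warm-start data
    for x, c in S:
        for a in range(K):
            update(lam, x, a, c[a], weight=lam)

ips_cost = {lam: 0.0 for lam in lambdas}               # progressive validation estimates
for t in range(1, T + 1):
    x, c = draw()
    best = min(lambdas, key=lambda lam: ips_cost[lam] / t)
    greedy = act(best, x)
    a = rng.integers(K) if rng.random() < eps else greedy
    p = eps / K + (1 - eps) * (a == greedy)            # propensity of the chosen action
    for lam in lambdas:                                # IPS cost estimate per candidate
        if act(lam, x) == a:
            ips_cost[lam] += c[a] / p
        update(lam, x, a, c[a], weight=1.0)

print({lam: round(ips_cost[lam] / T, 3) for lam in lambdas})
```

With this toy setup, a poorly matched warm-start distribution drives the selection toward small lambda, while a well matched one lets larger lambda win, which is the behavior the weighting search is meant to capture.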
Empirical evaluation
• 524 datasets from openml.org
• Moderate noise setting
• Algorithms compared: ARRoW-CB, Sup-Only, Bandit-Only, Sim-Bandit (uses both sources)
• Results reported as CDFs of normalized errors: for each threshold ε, the fraction of settings with normalized error ≤ ε (a sketch of this computation follows below)
[Plot: CDFs of normalized error per algorithm]
Poster: Thu #52
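For concreteness, the snippet below shows one way such a CDF-of-normalized-errors summary can be computed from a table of final errors (settings × algorithms). The min-max normalization per setting and the random placeholder results are assumptions made purely for illustration, not the paper's exact convention or numbers.

```python
import numpy as np

# Build a CDF-of-normalized-errors summary from an errors table
# (rows = dataset/settings, columns = algorithms). The normalization below
# is an assumed convention; `errors` is random placeholder data.

rng = np.random.default_rng(2)
algorithms = ["ARRoW-CB", "Sup-Only", "Bandit-Only", "Sim-Bandit"]
errors = rng.uniform(size=(524, len(algorithms)))        # placeholder results

lo = errors.min(axis=1, keepdims=True)                   # best error per setting
hi = errors.max(axis=1, keepdims=True)                   # worst error per setting
normalized = (errors - lo) / np.maximum(hi - lo, 1e-12)  # in [0, 1]

# For each algorithm, the CDF value at threshold eps is the fraction of
# settings whose normalized error is at most eps; higher curves are better.
for eps in (0.1, 0.3, 0.5):
    frac = (normalized <= eps).mean(axis=0)
    print(f"eps={eps}: " + ", ".join(f"{a}={f:.2f}" for a, f in zip(algorithms, frac)))
```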