  1. Warm-starting contextual bandits: robustly combining supervised and bandit feedback. Chicheng Zhang 1; Alekh Agarwal 1; Hal Daumé III 1,2; John Langford 1; Sahand Negahban 3. 1 Microsoft Research, 2 University of Maryland, 3 Yale University

  2. Warm-starting contextual bandits
  • For timestep t = 1, 2, ..., T:
    • Observe context x_t with associated cost vector c_t = (c_t(1), ..., c_t(K)), where (x_t, c_t) is drawn from distribution D
    • Take an action a_t ∈ {1, ..., K}
    • Receive cost c_t(a_t) ∈ [0, 1]
  • Goal: incur low cumulative cost Σ_{t=1}^T c_t(a_t)
  [Diagram: interaction loop between the learning algorithm and the user]

  3. Warm-starting contextual bandits
  • Receive warm-start examples S = {(x, c)} ~ W (fully labeled)
  • For timestep t = 1, 2, ..., T:
    • Observe context x_t with associated cost vector c_t = (c_t(1), ..., c_t(K)), where (x_t, c_t) is drawn from distribution D
    • Take an action a_t ∈ {1, ..., K}
    • Receive cost c_t(a_t) ∈ [0, 1]
  • Goal: incur low cumulative cost Σ_{t=1}^T c_t(a_t)
  [Diagram: interaction loop between the learning algorithm and the user]
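
  To make the protocol on this slide concrete, here is a minimal Python simulation of it. Everything in the sketch is an illustrative assumption: draw_example, choose_action, and the synthetic data stand in for a real environment and learner, and W is set equal to D only for brevity.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  K, d, T, n_warm = 5, 10, 1000, 200        # actions, context dim, horizon, |S|

  def draw_example(rng):
      """One (context, full cost vector) pair; stands in for a draw from W or D."""
      x = rng.normal(size=d)
      c = rng.uniform(0.0, 1.0, size=K)     # costs in [0, 1]
      return x, c

  # Warm start: fully labeled examples S = {(x, c)} ~ W (here W = D for simplicity)
  S = [draw_example(rng) for _ in range(n_warm)]

  def choose_action(x, S, history, rng):
      """Placeholder policy: a real learner would fit a model to S + history."""
      return int(rng.integers(K))

  history, total_cost = [], 0.0
  for t in range(T):                        # bandit interaction against D
      x_t, c_t = draw_example(rng)          # the learner never sees all of c_t
      a_t = choose_action(x_t, S, history, rng)
      total_cost += c_t[a_t]                # only c_t(a_t) is revealed
      history.append((x_t, a_t, c_t[a_t]))

  print(f"cumulative cost over T={T} rounds: {total_cost:.1f}")
  ```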

  4. Warm-starting contextual bandits: motivation
  • Some labeled examples often exist in applications, e.g.:
    • News recommendation: editorial relevance annotations
    • Healthcare: historical medical records w/ prescribed treatments
  • Leveraging historical data can reduce unsafe exploration

  5. Warm-starting contextual bandits: motivation
  • Some labeled examples often exist in applications, e.g.:
    • News recommendation: editorial relevance annotations
    • Healthcare: historical medical records w/ prescribed treatments
  • Leveraging historical data can reduce unsafe exploration
  • Key challenge: W may not be the same as D
    • Editors may fail to capture users' preferences
    • Medical record data may come from a different population
  How to use the warm-start examples robustly and effectively?

  6. Algorithm & performance guarantees
  ARRoW-CB: iteratively finds the best relative weighting of warm-start and bandit examples to rapidly learn a good policy

  7. Algorithm & performance guarantees
  ARRoW-CB: iteratively finds the best relative weighting of warm-start and bandit examples to rapidly learn a good policy
  • Theorem (informal): compared to algorithms that ignore S,* the regret of ARRoW-CB is
    - never much worse (robustness)
    - much smaller, if W and D are close enough and |S| is large enough
  * S ~ W is the warm-start data
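
  The one-line description of ARRoW-CB can be made concrete with a small sketch of the weighted-objective idea: each candidate weight lam yields a policy trained on lam-weighted warm-start costs plus inverse-propensity-scored (IPS) cost estimates from the bandit rounds. The linear least-squares learner, the lam grid, and the synthetic data below are all assumptions for illustration, not the authors' implementation; the paper's actual contribution is selecting among such weightings adaptively online.

  ```python
  import numpy as np

  rng = np.random.default_rng(1)
  K, d = 4, 8
  S = [(rng.normal(size=d), rng.uniform(size=K)) for _ in range(100)]   # ~ W
  history = [(rng.normal(size=d), int(rng.integers(K)),                 # bandit log:
              float(rng.uniform()), 1.0 / K) for _ in range(50)]        # (x, a, c(a), p(a))

  def combined_examples(S, history, lam):
      """Warm-start rows weighted by lam; bandit rows carry unbiased IPS costs."""
      rows = [(np.sqrt(lam) * x, np.sqrt(lam) * c) for x, c in S]
      for x, a, cost, p in history:
          c_hat = np.zeros(K)
          c_hat[a] = cost / p               # inverse-propensity cost estimate
          rows.append((x, c_hat))
      return rows

  def fit_policy(rows):
      """Per-action least-squares cost regression; policy = argmin predicted cost."""
      X = np.array([x for x, _ in rows])
      C = np.array([c for _, c in rows])
      coefs, *_ = np.linalg.lstsq(X, C, rcond=None)
      return lambda x: int(np.argmin(x @ coefs))

  # One candidate policy per relative weighting of the two data sources;
  # ARRoW-CB's role is to choose among such weightings adaptively.
  policies = {lam: fit_policy(combined_examples(S, history, lam))
              for lam in (0.0, 0.3, 1.0)}
  print(policies[1.0](rng.normal(size=d)))  # action chosen for a fresh context
  ```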

  8. Empirical evaluation
  • 524 datasets from openml.org
  • Results reported as CDFs of normalized errors: each curve plots the % of settings w/ error ≤ e against e
  [Figure: schematic CDF comparing Algorithm 1 and Algorithm 2]

  9. Empirical evaluation
  • 524 datasets from openml.org
  • Results reported as CDFs of normalized errors: each curve plots the % of settings w/ error ≤ e against e
  • Moderate-noise setting
  • Algorithms: ARRoW-CB, Sup-Only, Bandit-Only, Sim-Bandit (uses both sources)
  [Figures: schematic CDF; error CDFs for the moderate-noise setting]

  10. Empirical evaluation
  • 524 datasets from openml.org
  • Results reported as CDFs of normalized errors: each curve plots the % of settings w/ error ≤ e against e
  • Moderate-noise setting
  • Algorithms: ARRoW-CB, Sup-Only, Bandit-Only, Sim-Bandit (uses both sources)
  [Figures: schematic CDF; error CDFs for the moderate-noise setting]
  Poster: Thu #52
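
  For readers reproducing this style of plot, here is a sketch of how such CDFs can be computed. The min-max normalization of errors across algorithms per setting is an assumption for illustration (the paper defines its own normalization), and the per-dataset errors below are random stand-ins for real measurements.

  ```python
  import numpy as np
  import matplotlib.pyplot as plt

  rng = np.random.default_rng(2)
  algs = ["ARRoW-CB", "Sup-Only", "Bandit-Only", "Sim-Bandit"]
  raw = {a: rng.uniform(size=524) for a in algs}      # stand-in per-dataset errors

  # Assumed normalization: min-max across algorithms within each setting
  lo = np.min([raw[a] for a in algs], axis=0)
  hi = np.max([raw[a] for a in algs], axis=0)
  norm = {a: (raw[a] - lo) / np.maximum(hi - lo, 1e-12) for a in algs}

  es = np.linspace(0.0, 1.0, 101)
  for a in algs:
      cdf = [(norm[a] <= e).mean() for e in es]       # % settings w/ error <= e
      plt.plot(es, cdf, label=a)
  plt.xlabel("normalized error e")
  plt.ylabel("% settings w/ error <= e")
  plt.legend()
  plt.show()
  ```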
