Warm-starting contextual bandits: robustly combining supervised and bandit feedback
Chicheng Zhang 1, Alekh Agarwal 1, Hal Daumé III 1,2, John Langford 1, Sahand Negahban 3
1 Microsoft Research, 2 University of Maryland, 3 Yale University
Warm-starting contextual bandits
• Receive warm-starting examples S = {(x, c)} drawn from a distribution Q (fully labeled)
• For timestep t = 1, 2, ..., T:
  • Observe context x_t with associated cost vector c_t = (c_t(1), ..., c_t(K)), drawn from distribution D
  • Take an action a_t ∈ {1, ..., K}
  • Receive cost c_t(a_t) ∈ [0, 1]
• Goal: incur low cumulative cost Σ_{t=1}^T c_t(a_t) (a toy simulation of this protocol follows below)
[Diagram: interaction loop between the learning algorithm and the user]
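To make the protocol concrete, here is a minimal simulation of the loop above. The epsilon-greedy per-action regression learner, the toy cost generator, and all names in it are illustrative assumptions for this sketch, not the paper's learner or data.

```python
import numpy as np

# Minimal, self-contained simulation of the warm-start contextual bandit
# protocol. Everything below is a toy stand-in, not the paper's method.

rng = np.random.default_rng(0)
K, T, d, eps = 5, 2000, 10, 0.05            # actions, bandit rounds, context dim, exploration

true_w = rng.normal(size=(K, d))            # toy ground truth used to generate costs

def draw_context_and_costs():
    """Draw (x, c) with costs in [0, 1]^K; stands in for the distribution D."""
    x = rng.normal(size=d)
    c = 1.0 / (1.0 + np.exp(-true_w @ x))   # sigmoid keeps each cost in [0, 1]
    return x, c

# Simple per-action ridge regression of cost on context.
A = np.stack([np.eye(d) for _ in range(K)])  # regularized Gram matrix per action
b = np.zeros((K, d))

def predict(x):
    return np.array([np.linalg.solve(A[a], b[a]) @ x for a in range(K)])

def update(x, a, cost):
    A[a] += np.outer(x, x)
    b[a] += cost * x

# Warm-start phase: fully labeled examples S = {(x, c)}; drawn here from the
# same generator for simplicity, although the warm-start distribution Q may
# differ from D in general.
for _ in range(200):
    x, c = draw_context_and_costs()
    for a in range(K):                       # the full cost vector is observed
        update(x, a, c[a])

# Bandit phase: only the cost of the chosen action is revealed.
cumulative_cost = 0.0
for t in range(T):
    x, c = draw_context_and_costs()
    a = rng.integers(K) if rng.random() < eps else int(np.argmin(predict(x)))
    cumulative_cost += c[a]
    update(x, a, c[a])                       # bandit feedback: c[a] only

print(f"average cost over {T} bandit rounds: {cumulative_cost / T:.3f}")
```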
Warm-starting contextual bandits: motivation
• Some labeled examples often exist in applications, e.g.
  • News recommendation: editorial relevance annotations
  • Healthcare: historical medical records w/ prescribed treatments
• Leveraging historical data can reduce unsafe exploration
• Key challenge: the warm-start distribution Q may not be the same as the bandit distribution D
  • Editors fail to capture users' preferences
  • Medical record data may come from another population
• How to utilize the warm-starting examples robustly and effectively?
Algorithm & performance guarantees
• ARRoW-CB: iteratively finds the best relative weighting of warm-start and bandit examples to rapidly learn a good policy (a simplified sketch of the weighting idea follows below)
• Theorem (informal): compared to algorithms that ignore S*, the regret of ARRoW-CB is
  • never much worse (robustness)
  • much smaller, if Q and D are close enough and |S| is large enough
  (* S ~ Q is the warm-start data)
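To illustrate the weighting idea, the sketch below trains one simple policy per candidate weighting lambda of the warm-start examples and uses an importance-weighted progressive-validation estimate on the bandit stream to decide which weighting to trust. This is only a simplified illustration under toy assumptions (the environment, candidate set, epsilon-greedy learner, and all names are placeholders), not the ARRoW-CB update rule or its guarantees.

```python
import numpy as np

# Sketch of the weighting idea: one per-action ridge regressor per candidate
# weighting lambda, where warm-start examples enter with sample weight lambda
# and bandit examples with weight 1. Importance-weighted progressive
# validation on bandit feedback selects which weighting acts.

rng = np.random.default_rng(1)
K, d, T, eps = 5, 10, 3000, 0.05
true_w = rng.normal(size=(K, d))

def draw():
    x = rng.normal(size=d)
    return x, 1.0 / (1.0 + np.exp(-true_w @ x))        # costs in [0, 1]^K

# Warm-start set S; its costs are perturbed so that Q differs from D.
S = []
for _ in range(300):
    x, c = draw()
    S.append((x, np.clip(c + 0.3 * rng.standard_normal(K), 0, 1)))

lambdas = [0.0, 0.5, 1.0]                              # candidate warm-start weights
A = {lam: np.stack([np.eye(d)] * K) for lam in lambdas}
b = {lam: np.zeros((K, d)) for lam in lambdas}

def update(lam, x, a, cost, weight=1.0):
    A[lam][a] += weight * np.outer(x, x)
    b[lam][a] += weight * cost * x

def act(lam, x):
    preds = [np.linalg.solve(A[lam][a], b[lam][a]) @ x for a in range(K)]
    return int(np.argmin(preds))

for lam in lambdas:                                    # ingest warm-start data
    for x, c in S:
        for a in range(K):
            update(lam, x, a, c[a], weight=lam)

ips_cost = {lam: 0.0 for lam in lambdas}               # progressive validation estimates
for t in range(1, T + 1):
    x, c = draw()
    best = min(lambdas, key=lambda lam: ips_cost[lam] / t)
    greedy = act(best, x)
    a = rng.integers(K) if rng.random() < eps else greedy
    p = eps / K + (1 - eps) * (a == greedy)            # propensity of the chosen action
    for lam in lambdas:                                # IPS cost estimate per candidate
        if act(lam, x) == a:
            ips_cost[lam] += c[a] / p
        update(lam, x, a, c[a], weight=1.0)

print({lam: round(ips_cost[lam] / T, 3) for lam in lambdas})
```

With this toy setup, a poorly matched warm-start distribution drives the selection toward small lambda, while a well matched one lets larger lambda win, which is the behavior the weighting search is meant to capture.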
Empirical evaluation
• 524 datasets from openml.org
• Moderate noise setting
• Algorithms compared: ARRoW-CB, Sup-Only, Bandit-Only, Sim-Bandit (uses both sources)
• Results reported as CDFs of normalized errors: for each threshold ε, the fraction of settings with normalized error ≤ ε (a sketch of this computation follows below)
[Plot: CDFs of normalized error per algorithm]
Poster: Thu #52
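For concreteness, the snippet below shows one way such a CDF-of-normalized-errors summary can be computed from a table of final errors (settings × algorithms). The min-max normalization per setting and the random placeholder results are assumptions made purely for illustration, not the paper's exact convention or numbers.

```python
import numpy as np

# Build a CDF-of-normalized-errors summary from an errors table
# (rows = dataset/settings, columns = algorithms). The normalization below
# is an assumed convention; `errors` is random placeholder data.

rng = np.random.default_rng(2)
algorithms = ["ARRoW-CB", "Sup-Only", "Bandit-Only", "Sim-Bandit"]
errors = rng.uniform(size=(524, len(algorithms)))        # placeholder results

lo = errors.min(axis=1, keepdims=True)                   # best error per setting
hi = errors.max(axis=1, keepdims=True)                   # worst error per setting
normalized = (errors - lo) / np.maximum(hi - lo, 1e-12)  # in [0, 1]

# For each algorithm, the CDF value at threshold eps is the fraction of
# settings whose normalized error is at most eps; higher curves are better.
for eps in (0.1, 0.3, 0.5):
    frac = (normalized <= eps).mean(axis=0)
    print(f"eps={eps}: " + ", ".join(f"{a}={f:.2f}" for a, f in zip(algorithms, frac)))
```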