Toward Better Use of Data in Contextual and Linear Bandits
Nima Hamidi and Mohsen Bayati, Stanford University
October 2, 2020
References: arXiv 2002.05152 & arXiv 2006.06790
Overview
1. Motivation
2. Confidence-based Policies
3. Sieved-Greedy
How to Test New Medical Interventions?
A hospital wants to reduce post-discharge complications:
– Use one of two newly designed telehealth interventions (A or B)
– Should select one of A or B per patient
– An A/B test or Randomized Controlled Trial (RCT) has a high opportunity cost
– In healthcare, experimentation is costly or unethical¹

¹ Sibbald, Bonnie. 1998. Understanding controlled trials: Why are randomized controlled trials important? British Medical Journal (Clinical Research Ed.) 316(201).
Beyond Healthcare
“Today, Microsoft and several other leading companies, including Amazon, Booking.com, Facebook, and Google, each conduct more than 10,000 online controlled experiments annually, with many tests engaging millions of users.”
Kohavi and Thomke, Harvard Business Review, 2017
Multi-armed Bandit Experiments
Example (Google Analytics)²
A/B testing: website configurations A and B with conversion rates 4% and 5%, respectively.
Using Thompson Sampling instead of A/B testing, the experiment can be run with 78.5% less data → 97.5 conversions saved (on average).

² Source: Google Analytics Support Page
Stochastic Linear Bandit Problem
Let $\Theta_\star \in \mathbb{R}^d$ be fixed (and unknown).
At time $t$, the action set $\mathcal{A}_t \subseteq \mathbb{R}^d$ is revealed to a policy $\pi$.
The policy chooses $\hat{A}_t \in \mathcal{A}_t$ and observes a reward
$$r_t = \langle \Theta_\star, \hat{A}_t \rangle + \varepsilon_t,$$
where, conditional on the history, $\varepsilon_t$ has zero mean.
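As a minimal illustration of this interaction protocol (not part of the slides), the following Python sketch generates one round; the Gaussian noise, the dimension, and the random 10-action set are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_star = rng.normal(size=d)                  # fixed but unknown to the policy

# One round: the action set A_t is revealed, an action is chosen,
# and a noisy linear reward is observed.
action_set = rng.normal(size=(10, d))            # illustrative A_t with 10 actions
a_t = action_set[0]                              # stand-in for the policy's choice
r_t = theta_star @ a_t + rng.normal(scale=0.5)   # r_t = <Theta_star, A_t> + eps_t
```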
Evaluation Metric
The objective is to improve using past experiences.
The cumulative regret is defined as
$$\mathrm{Regret}(T, \Theta_\star, \pi) := \mathbb{E}\left[\sum_{t=1}^{T}\left(\sup_{A \in \mathcal{A}_t} \langle \Theta_\star, A \rangle - \langle \Theta_\star, \hat{A}_t \rangle\right) \,\middle|\, \Theta_\star\right].$$
In the Bayesian setting, the Bayesian regret is given by
$$\mathrm{BayesRegret}(T, \pi) := \mathbb{E}_{\Theta_\star \sim \mathcal{P}}\left[\mathrm{Regret}(T, \Theta_\star, \pi)\right].$$
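To make the metric concrete, here is a hedged toy sketch that estimates cumulative regret by simulation; the environment, the noise level, and the policy interface (select/update methods) are assumptions for illustration only.

```python
import numpy as np

def cumulative_regret(policy, theta_star, T=1000, n_actions=10, sigma=0.5, seed=0):
    """Monte-Carlo estimate of Regret(T, Theta_star, policy) on a toy environment.

    `policy` is assumed to expose select(action_set) -> action and
    update(action, reward); this interface is illustrative, not from the slides.
    """
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    total = 0.0
    for _ in range(T):
        action_set = rng.normal(size=(n_actions, d))       # illustrative A_t
        a_t = policy.select(action_set)                     # policy's choice
        r_t = theta_star @ a_t + rng.normal(scale=sigma)    # noisy linear reward
        policy.update(a_t, r_t)                             # policy learns from feedback
        total += (action_set @ theta_star).max() - theta_star @ a_t
    return total
```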
Special Cases
– Standard multi-armed bandit problem
– k-armed contextual bandit problem
– Dynamic pricing with demand covariates:
  Expected Demand = $\alpha + \beta p + \langle \Gamma, X \rangle$
  Expected Revenue = $\alpha p + \beta p^2 + \langle \Gamma, X \rangle p$
  This can be mapped to a linear bandit by setting
  $\mathcal{A} = \{(p, p^2, pX) \mid p \in [p_{\min}, p_{\max}]\}$ and $\Theta_\star = (\alpha, \beta, \Gamma)$ (see the sketch below).
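A small Python sketch of the pricing-to-linear-bandit feature map; the function name, the interface, and the example values are hypothetical, added only to illustrate the mapping on the slide.

```python
import numpy as np

def pricing_action_set(X, prices):
    """Map candidate prices to linear-bandit actions (p, p^2, p*X).

    X      : demand covariates for the current customer, shape (k,)
    prices : candidate prices in [p_min, p_max]
    Returns one action per candidate price; the expected revenue of an action
    is <Theta_star, action> with Theta_star = (alpha, beta, Gamma).
    """
    return np.array([np.concatenate(([p, p ** 2], p * X)) for p in prices])

# Example: 3 covariates, a grid of 20 prices between p_min = 1 and p_max = 10.
actions = pricing_action_set(np.array([0.2, -0.1, 0.5]), np.linspace(1.0, 10.0, 20))
```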
Related Literature
– UCB/OFUL: Auer, Cesa-Bianchi, and Fischer 2002; Dani, Hayes, and Kakade 2008; Rusmevichientong and Tsitsiklis 2010; Abbasi-Yadkori, Pál, and Szepesvári 2011
– Thompson sampling: Agrawal and Goyal 2013; Russo and Van Roy 2014, 2016; Abeille and Lazaric 2017
– ε-Greedy and variants: Langford and Zhang 2008; Goldenshluger and Zeevi 2013
– Learning and earning in operations: Carvalho and Puterman 2005, Araman and Caldentey 2009, Besbes and Zeevi 2009-2011, Harrison et al. 2012, den Boer and Zwart 2014-2016, Keskin and Zeevi 2014-2016, Gur et al. 2014, Johnson et al. 2015, Chen et al. 2015, Cohen et al. 2016, Bayati and Bastani 2015, Kallus and Udell 2016, Javanmard and Nazerzadeh 2016, Javanmard 2017, Elmachtoub et al. 2017, Ban and Keskin 2017, Cheung et al. 2018, Bastani et al. 2019, and many more!
Algorithms
Greedy
At time $t = 1, 2, \dots, T$:
– Using the set of observations $\mathcal{H}_{t-1} := \{(\hat{A}_1, r_1), \dots, (\hat{A}_{t-1}, r_{t-1})\}$, construct an estimate $\hat{\Theta}_{t-1}$ of $\Theta_\star$.
– Choose the action $A \in \mathcal{A}_t$ with the largest $\langle A, \hat{\Theta}_{t-1} \rangle$.
(Figure: feedback loop, greedy decision → reward → update history → estimate Θ⋆ → next decision.)
Greedy
The ridge estimator is used to obtain $\hat{\Theta}_t$ (for a fixed $\lambda$):
$$V_t := \lambda I + \sum_{i=1}^{t} \hat{A}_i \hat{A}_i^\top \in \mathbb{R}^{d \times d}, \qquad (1)$$
$$\hat{\Theta}_t := V_t^{-1}\left(\sum_{i=1}^{t} \hat{A}_i r_i\right) \in \mathbb{R}^d. \qquad (2)$$
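A minimal numpy sketch of Eqs. (1)-(2), assuming the past actions and rewards are stacked into arrays; the function name and the default λ are illustrative.

```python
import numpy as np

def ridge_estimate(actions, rewards, lam=1.0):
    """Ridge estimate of Theta_star from past actions/rewards, as in Eqs. (1)-(2).

    actions : array of shape (t, d), one past action per row
    rewards : array of shape (t,)
    """
    d = actions.shape[1]
    V_t = lam * np.eye(d) + actions.T @ actions   # V_t = lam*I + sum_i A_i A_i^T
    b_t = actions.T @ rewards                     # sum_i A_i r_i
    return np.linalg.solve(V_t, b_t)              # Theta_hat_t = V_t^{-1} b_t
```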
Greedy
Algorithm 1: Greedy algorithm
1: for t = 1 to T do
2:   Pull $\hat{A}_t := \arg\max_{A \in \mathcal{A}_t} \langle A, \hat{\Theta}_{t-1} \rangle$
3:   Observe the reward $r_t$
4:   Compute $V_t = \lambda I + \sum_{i=1}^{t} \hat{A}_i \hat{A}_i^\top$
5:   Compute $\hat{\Theta}_t = V_t^{-1}\left(\sum_{i=1}^{t} \hat{A}_i r_i\right)$
6: end for

Greedy makes wrong decisions due to over- or under-estimating the true rewards.
– Over-estimation is automatically corrected: the over-estimated action gets pulled, and its estimate is refined.
– Under-estimation can cause linear regret: an under-estimated optimal action may never be pulled again. (A runnable sketch of the loop follows this slide.)
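To make the loop concrete, here is a hedged Python sketch of Algorithm 1 on a toy environment; the random action sets, Gaussian noise, and horizon are assumptions, not part of the slides.

```python
import numpy as np

def run_greedy(theta_star, T=1000, n_actions=10, lam=1.0, sigma=0.5, seed=0):
    """Greedy linear bandit with a ridge estimate (Algorithm 1) on a toy environment."""
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    V = lam * np.eye(d)        # V_t, initialized to lam * I
    b = np.zeros(d)            # running sum of A_i * r_i
    theta_hat = np.zeros(d)    # current estimate Theta_hat_t
    regret = 0.0
    for _ in range(T):
        actions = rng.normal(size=(n_actions, d))        # illustrative action set A_t
        a = actions[np.argmax(actions @ theta_hat)]      # greedy choice
        r = theta_star @ a + rng.normal(scale=sigma)     # observed reward
        V += np.outer(a, a)                              # rank-one update of V_t
        b += r * a
        theta_hat = np.linalg.solve(V, b)                # ridge estimate, Eqs. (1)-(2)
        regret += (actions @ theta_star).max() - theta_star @ a
    return regret

print(run_greedy(np.array([1.0, -0.5, 0.3, 0.0, 0.7])))
```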
Greedy
(Figure: a toy instance with five actions A1, ..., A5, illustrating which action Greedy selects.)
Optimism in Face of Uncertainty (OFU) Algorithm
Key idea: be optimistic when estimating the reward of actions.
For $\rho > 0$, define the confidence set $\mathcal{C}_{t-1}(\rho)$ to be
$$\mathcal{C}_{t-1}(\rho) := \{\Theta \mid \|\Theta - \hat{\Theta}_{t-1}\|_{V_{t-1}} \le \rho\},$$
where $\|X\|_{V_{t-1}}^2 = X^\top V_{t-1} X \in \mathbb{R}_+$.

Theorem (Informal, Abbasi-Yadkori, Pál, and Szepesvári 2011)
Letting $\rho := \tilde{O}(\sqrt{d})$, we have $\Theta_\star \in \mathcal{C}_{t-1}(\rho)$ with high probability.
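A minimal sketch of the OFU action choice, assuming the standard closed form for maximizing a linear function over the ellipsoid $\mathcal{C}_{t-1}(\rho)$: the optimistic index of an action $A$ is $\langle \hat{\Theta}_{t-1}, A \rangle + \rho \|A\|_{V_{t-1}^{-1}}$. The function name and interface are illustrative, not from the slides.

```python
import numpy as np

def ofu_choice(actions, theta_hat, V, rho):
    """Pick the action with the largest optimistic reward over C_{t-1}(rho).

    actions   : array of shape (n, d), candidate actions in A_t
    theta_hat : ridge estimate Theta_hat_{t-1}
    V         : regularized Gram matrix V_{t-1}
    rho       : confidence radius, e.g. O~(sqrt(d))
    """
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum('ij,jk,ik->i', actions, V_inv, actions))  # ||A||_{V^{-1}}
    ucb = actions @ theta_hat + rho * widths   # optimistic reward estimate per action
    return actions[np.argmax(ucb)]
```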