Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation
Josiah Hanna (1), Peter Stone (1), Scott Niekum (2)
(1) Learning Agents Research Group, UT Austin
(2) Personal Autonomous Robotics Lab, UT Austin
May 10th, 2017

Motivation
Determine a lower bound on the expected performance of an autonomous control policy given data generated from a different policy.

Preliminaries
The agent samples actions from a policy, $A_t \sim \pi(\cdot \mid S_t)$.
The environment responds with $S_{t+1} \sim P(\cdot \mid S_t, A_t)$.
[Diagram: $S_0 \to A_0 \to S_1 \to A_1 \to \dots$]
The policy and environment determine a distribution over trajectories, $H : S_1, A_1, S_2, A_2, \dots, S_L, A_L$.
• $H \sim \pi$.
• $V(\pi) = \mathbb{E}\left[\sum_{t=1}^{L} r(S_t, A_t) \,\middle|\, H \sim \pi\right]$ is the expected return of $\pi$.

Confidence Intervals for Off-Policy Evaluation
Given:
• Trajectories generated by a behavior policy, $\pi_b$: $\{H, \pi_b\} \in D$.
• An evaluation policy, $\pi_e$.
• A confidence level, $\delta \in [0, 1]$.
Determine a lower bound $\hat{V}_{lb}(\pi_e, D)$ such that $V(\pi_e) \geq \hat{V}_{lb}(\pi_e, D)$ with probability $1 - \delta$.

Existing Methods
• Exact confidence intervals [Thomas et al., 2015a].
• Clip importance weights [Bottou et al., 2013].
• Bootstrap importance-sampling [Thomas et al., 2015b].
[Figure: the slide marks where "Our work" sits relative to these methods.]

Data-Efficient Confidence Intervals
We draw on two ideas to reduce the number of trajectories required for tight confidence bounds.
• Replace exact confidence bounds with bootstrap confidence intervals.
• Use learned models of the environment's transition function to reduce variance.
Contributions:
1. Two bootstrap methods that incorporate models for approximate high-confidence policy evaluation.
2. Theoretical bound on model bias.
3. Empirical evaluation of proposed methods.

Bootstrap Confidence Intervals
[Diagram: the dataset $D$ is sampled with replacement to form bootstrap datasets $D_0, \dots, D_m$; $V(\pi_e)$ is estimated on each one, giving estimates $\hat{V}_0, \dots, \hat{V}_m$.]
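
The resampling procedure in the diagram above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the slides do not say which bootstrap interval is used, so the percentile method here is an assumption, and the names `bootstrap_lower_bound`, `estimator`, and `n_boot` are hypothetical.

```python
import numpy as np

def bootstrap_lower_bound(trajectories, estimator, delta=0.05, n_boot=2000, seed=0):
    """Approximate (1 - delta) lower bound on V(pi_e) via the percentile bootstrap.

    trajectories: list of data items (one per trajectory) forming the dataset D.
    estimator:    any off-policy estimator mapping a list of trajectories to a float.
    """
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Build D_b by sampling n trajectories from D with replacement.
        idx = rng.integers(0, n, size=n)
        resample = [trajectories[i] for i in idx]
        # Estimate V(pi_e) on the resampled dataset.
        estimates[b] = estimator(resample)
    # Percentile method (an assumption): take the delta-quantile of the estimates.
    return float(np.quantile(estimates, delta))
```

Because `estimator` is passed in as a function, the same loop can wrap an importance-sampling, model-based, or doubly robust estimate; only the estimate computed on each resampled dataset changes.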

Model-Based Off-Policy Evaluation
Trajectories are generated from an MDP, $M = \langle \mathcal{S}, \mathcal{A}, P, r \rangle$.
[Diagram: true MDP over states $s_0, s_1, s_2$, with every transition probability equal to 0.5.]
The model-based off-policy estimator uses all trajectories to estimate the unknown transition function, $P$.
[Diagram: learned model over the same states, with estimated transition probabilities 0.55 and 0.45 for one pair of transitions and 0.35 and 0.65 for the other.]
Model-based off-policy estimator: $\hat{V}(\pi_e) := V_{\hat{M}}(\pi_e)$, where $\hat{M} = \langle \mathcal{S}, \mathcal{A}, \hat{P}, r \rangle$ and $\hat{P}$ is the learned transition function.
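
The estimator above can be illustrated for the tabular case. This is a sketch under stated assumptions, not the authors' code: states and actions are integer indices, the reward table is known, the return is undiscounted over a fixed horizon, and unvisited state-action pairs fall back to a uniform next-state distribution. All function and parameter names are hypothetical.

```python
import numpy as np

def fit_transition_model(trajectories, n_states, n_actions):
    """Maximum-likelihood tabular estimate of P from (s, a, s_next) transitions."""
    counts = np.zeros((n_states, n_actions, n_states))
    for traj in trajectories:
        for s, a, s_next in traj:
            counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Assumption: with no data for (s, a), use a uniform next-state distribution.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

def model_based_value(p_hat, reward, pi_e, start_dist, horizon):
    """Evaluate pi_e in the learned model by finite-horizon dynamic programming.

    p_hat: (S, A, S) learned transitions, reward: (S, A) known rewards,
    pi_e: (S, A) action probabilities, start_dist: (S,) initial-state distribution.
    """
    v = np.zeros(p_hat.shape[0])
    for _ in range(horizon):
        q = reward + p_hat @ v        # q(s, a) = r(s, a) + E_{s'}[v(s')]
        v = (pi_e * q).sum(axis=1)    # v(s) = E_{a ~ pi_e}[q(s, a)]
    return float(start_dist @ v)
```

The uniform fallback is exactly the kind of assumption about unseen transitions that can introduce bias into a model-based estimate.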

Model Bias
Model-based approaches may have high bias.
1. Lack of data: when we lack data for a particular $(S, A)$ pair, we must make assumptions about the transition probability, $P(\cdot \mid S, A)$.
2. Model representation: the true function $P$ may be outside the class of models we consider.
We show theoretically that model bias depends on:
• The importance-sampled train / test error when building the model.
• The horizon length.
• The maximum reward.

Model-Based Bootstrap
[Diagram: the dataset $D$ is sampled with replacement to form bootstrap datasets $D_0, \dots, D_m$; a model-based estimate is computed on each one, giving estimates $\hat{V}_0, \dots, \hat{V}_m$.]

Existing Methods
[Figure: importance-sampling based methods, including bootstrap importance-sampling, shown alongside MB-Bootstrap (ours).]

Doubly Robust Estimator [Jiang and Li, 2016; Thomas and Brunskill, 2016]
$$DR(D) := \underbrace{PDIS(D)}_{\text{unbiased estimator}} - \underbrace{\sum_{i=1}^{n} \sum_{t=0}^{L} \left( w_t^i \, \hat{q}^{\pi_e}(S_t^i, A_t^i) - w_{t-1}^i \, \hat{v}^{\pi_e}(S_t^i) \right)}_{\text{zero in expectation}}$$
State value function: $\hat{v}^{\pi}(S) := \mathbb{E}_{A \sim \pi,\, S' \sim \hat{P}(\cdot \mid S, A)}\left[ r(S, A) + \hat{v}(S') \right]$.
State-action value function: $\hat{q}^{\pi}(S, A) := r(S, A) + \mathbb{E}_{S' \sim \hat{P}(\cdot \mid S, A)}\left[ \hat{v}(S') \right]$.
$w_t$ is the importance weight of the first $t$ time-steps.
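
The doubly robust formula can be made concrete with a short sketch. Several details are assumptions that the slide does not state: the weights are taken to include a $1/n$ factor, so $w_t^i = \frac{1}{n} \prod_{j \le t} \pi_e(A_j^i \mid S_j^i) / \pi_b(A_j^i \mid S_j^i)$ and $w_{-1}^i = 1/n$; the return is undiscounted; and the model value functions are supplied as time-indexed callables `q_hat(t, s, a)` and `v_hat(t, s)`. All names are hypothetical, and this is the unweighted DR estimator, not the weighted variant.

```python
def pdis_and_dr(trajectories, q_hat, v_hat):
    """Return (PDIS, DR) estimates of V(pi_e).

    trajectories: list of trajectories, each a list of (s, a, r, pe, pb) tuples,
                  where pe = pi_e(a | s) and pb = pi_b(a | s).
    q_hat, v_hat: model value functions, called as q_hat(t, s, a) and v_hat(t, s).
    """
    n = len(trajectories)
    pdis = 0.0
    correction = 0.0
    for traj in trajectories:
        w_prev = 1.0 / n  # w_{-1}: weight before any action is taken (assumed 1/n)
        for t, (s, a, r, pe, pb) in enumerate(traj):
            w = w_prev * pe / pb  # importance weight of the first t time-steps
            pdis += w * r         # per-decision importance sampling term
            # Control-variate term: zero in expectation, reduces variance.
            correction += w * q_hat(t, s, a) - w_prev * v_hat(t, s)
            w_prev = w
    return pdis, pdis - correction
```

With an all-zero model the correction vanishes and DR equals PDIS. With an exact model of a deterministic problem the terms telescope and DR returns the model's start-state value regardless of the importance weights.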

Weighted Doubly Robust Bootstrap
[Diagram: the dataset $D$ is sampled with replacement to form bootstrap datasets $D_0, \dots, D_m$; a weighted doubly robust estimate is computed on each one, giving estimates $\hat{V}_0, \dots, \hat{V}_m$.]

Bootstrapping with Models
MB-Bootstrap (Model-Based Bootstrap)
• Advantages: low variance.
• Disadvantages: potentially high bias.
WDR-Bootstrap (Weighted Doubly Robust Bootstrap)
• Advantages: low bias.
• Disadvantages: potentially higher variance.

Existing Methods
[Figure: importance-sampling based methods, including bootstrap importance-sampling, shown alongside WDR-Bootstrap (ours) and MB-Bootstrap (ours).]

Mountain Car Domain
• State and action spaces are discretized.
• Models use a tabular representation.

Mountain Car Domain
[Figure: slide content not recoverable.]
