Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation
Josiah Hanna (1), Peter Stone (1), Scott Niekum (2)
(1) Learning Agents Research Group, UT Austin
(2) Personal Autonomous Robotics Lab, UT Austin
May 10th, 2017
Motivation
Determine a lower bound on the expected performance of an autonomous control policy given data generated from a different policy.
Preliminaries
The agent samples actions from a policy, $A_t \sim \pi(\cdot \mid S_t)$.
The environment responds with $S_{t+1} \sim P(\cdot \mid S_t, A_t)$.
[Diagram: the chain $S_0, A_0, S_1, A_1, \ldots$]
The policy and environment determine a distribution over trajectories, $H$: $S_1, A_1, S_2, A_2, \ldots, S_L, A_L$.
$H \sim \pi$.
$V(\pi) = \mathbb{E}\left[\sum_{t=1}^{L} r(S_t, A_t) \,\middle|\, H \sim \pi\right]$ is the expected return of $\pi$.
Confidence Intervals for Off-Policy Evaluation
Given:
Trajectories generated by a behavior policy, $\pi_b$: $\{H, \pi_b\} \in D$.
An evaluation policy, $\pi_e$.
A confidence level, $\delta \in [0, 1]$.
Determine a lower bound $\hat{V}_{lb}(\pi_e, D)$ such that $V(\pi_e) \geq \hat{V}_{lb}(\pi_e, D)$ with probability $1 - \delta$.
Existing Methods
Exact confidence intervals: Thomas et al. [2015a].
Clip importance weights: Bottou et al. [2013].
Bootstrap importance-sampling: Thomas et al. [2015b].
[Diagram: the methods above, with "Our work" marked among them.]
Data-Efficient Confidence Intervals
We draw on two ideas to reduce the number of trajectories required for tight confidence bounds.
Replace exact confidence bounds with bootstrap confidence intervals.
Use learned models of the environment's transition function to reduce variance.
Contributions:
1. Two bootstrap methods that incorporate models for approximate high-confidence policy evaluation.
2. Theoretical bound on model bias.
3. Empirical evaluation of proposed methods.
Bootstrap Confidence Intervals
[Diagram: $D$ is sampled with replacement to form bootstrap data sets $D_0, \ldots, D_m$; $V(\pi_e)$ is estimated on each, yielding $\hat{V}_0, \ldots, \hat{V}_m$.]
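The resampling procedure in the diagram can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide does not say how the lower bound is read off the bootstrap estimates, so the sketch assumes the simple percentile bootstrap, which takes the $\delta$-quantile of the sorted estimates. The names `bootstrap_lower_bound`, `estimator`, and `num_resamples` are hypothetical.

```python
import random


def bootstrap_lower_bound(data, estimator, delta=0.05, num_resamples=2000, seed=0):
    """Percentile-bootstrap lower bound on the quantity estimated by `estimator`.

    data:      list of trajectories (or any per-trajectory records).
    estimator: function mapping a list of trajectories to a float estimate.
    delta:     confidence level; the bound is the delta-quantile of the
               bootstrap estimates (an assumption, see the lead-in).
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(num_resamples):
        # Draw n items from the data set with replacement.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    estimates.sort()
    # Index of the delta-quantile, clamped to a valid position.
    index = min(int(delta * num_resamples), num_resamples - 1)
    return estimates[index]


def mean(values):
    return sum(values) / len(values)


# Usage: here each "trajectory" is summarized by its return, and the
# estimator is the sample mean (valid only for on-policy data).
returns = [float(i % 10) for i in range(50)]
lower_bound = bootstrap_lower_bound(returns, mean, delta=0.05)
```

Any off-policy estimator of $V(\pi_e)$ can be passed as `estimator`; the choice of estimator is what distinguishes the bootstrap methods discussed in these slides.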
Data-Efficient Confidence Intervals
We draw on two ideas to reduce the number of trajectories required for tight confidence bounds.
✓ Replace exact confidence bounds with bootstrap confidence intervals.
Use learned models of the environment's transition function to reduce variance.
Model Based Off-Policy Evaluation
Trajectories are generated from an MDP, $M = \langle \mathcal{S}, \mathcal{A}, P, r \rangle$.
[Diagram: true MDP over states $s_0, s_1, s_2$; every transition probability shown is 0.5.]
Model-based off-policy estimators use all trajectories to estimate the unknown transition function, $P$.
[Diagram: learned model over the same states, with estimated transition probabilities 0.55, 0.45, 0.35, and 0.65.]
Model-based off-policy estimator: $\hat{V}(\pi_e) := V_{\widehat{M}}(\pi_e)$, where $\widehat{M} = \langle \mathcal{S}, \mathcal{A}, \hat{P}, r \rangle$ and $\hat{P}$ is the learned transition function.
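A minimal sketch of a tabular model-based estimator, under several assumptions that are not in the slides: transitions are stored as `(s, a, s_next)` tuples, the reward function and start state are known, the horizon is finite, and any unseen state-action pair is treated as ending the episode. The function names `fit_transition_model` and `model_based_value` are hypothetical.

```python
from collections import defaultdict


def fit_transition_model(trajectories):
    """Estimate P_hat(s_next | s, a) by counting observed transitions."""
    counts = defaultdict(lambda: defaultdict(int))
    for trajectory in trajectories:
        for s, a, s_next in trajectory:
            counts[(s, a)][s_next] += 1
    model = {}
    for state_action, next_counts in counts.items():
        total = sum(next_counts.values())
        model[state_action] = {s2: c / total for s2, c in next_counts.items()}
    return model


def model_based_value(model, pi_e, reward, actions, start_state, horizon):
    """Expected return of pi_e in the learned model, by backward induction.

    pi_e(a, s) returns the probability of action a in state s.
    Assumption: an unseen (s, a) pair yields its reward and then ends the episode.
    """
    states = {s for (s, _a) in model}
    states |= {s2 for dist in model.values() for s2 in dist}
    states.add(start_state)
    value = {s: 0.0 for s in states}  # value with zero steps remaining
    for _ in range(horizon):
        new_value = {}
        for s in states:
            total = 0.0
            for a in actions:
                future = sum(p * value[s2] for s2, p in model.get((s, a), {}).items())
                total += pi_e(a, s) * (reward(s, a) + future)
            new_value[s] = total
        value = new_value
    return value[start_state]


# Usage on a three-state chain with a single action "right".
data = [[(0, "right", 1), (1, "right", 2)]]
p_hat = fit_transition_model(data)
v_hat = model_based_value(
    p_hat, lambda a, s: 1.0, lambda s, a: 1.0, ["right"], start_state=0, horizon=2
)
```

The treatment of unseen state-action pairs is exactly the kind of modeling assumption that can introduce bias when data is scarce.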
Model-Bias
Model-based approaches may have high bias.
1. Lack of data: When we lack data for a particular $(S, A)$ pair, we must make assumptions about the transition probability, $P(\cdot \mid S, A)$.
2. Model representation: The true function $P$ may be outside the class of models we consider.
We show theoretically that model bias depends on:
The importance-sampled train / test error when building the model.
The horizon length.
The maximum reward.
Model-Based Bootstrap
[Diagram: $D$ is sampled with replacement to form bootstrap data sets $D_0, \ldots, D_m$; a model-based estimate is computed on each, yielding $\hat{V}_0, \ldots, \hat{V}_m$.]
Existing Methods
[Diagram: importance-sampling based methods, including bootstrap importance-sampling, shown alongside mb-bootstrap (ours).]
Doubly Robust Estimator [Jiang and Li, 2016, Thomas and Brunskill, 2016]
$$\mathrm{DR}(D) := \underbrace{\mathrm{PDIS}(D)}_{\text{Unbiased estimator}} - \underbrace{\sum_{i=1}^{n} \sum_{t=0}^{L} \left( w_t^i \, \hat{q}^{\pi_e}(S_t^i, A_t^i) - w_{t-1}^i \, \hat{v}^{\pi_e}(S_t^i) \right)}_{\text{Zero in expectation}}$$
State value function: $\hat{v}^{\pi}(S) := \mathbb{E}_{A \sim \pi,\, S' \sim \hat{P}(\cdot \mid S, A)}\left[ r(S, A) + \hat{v}^{\pi}(S') \right]$.
State-action value function: $\hat{q}^{\pi}(S, A) := r(S, A) + \mathbb{E}_{S' \sim \hat{P}(\cdot \mid S, A)}\left[ \hat{v}^{\pi}(S') \right]$.
$w_t$ is the importance weight of the first $t$ time-steps.
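The estimator above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: trajectories are lists of `(s, a, r)` tuples, both the PDIS term and the correction term are averaged over the $n$ trajectories, the weight before the first step is taken to be 1, and `q_hat` and `v_hat` are time-independent callables as in the slide's notation. All function and parameter names are hypothetical.

```python
def doubly_robust(trajectories, pi_e, pi_b, q_hat, v_hat):
    """Doubly robust estimate of V(pi_e) from trajectories generated by pi_b.

    pi_e(a, s), pi_b(a, s): probability of action a in state s.
    q_hat(s, a), v_hat(s):  model-based value estimates for pi_e.
    """
    total = 0.0
    for trajectory in trajectories:
        w_prev = 1.0  # importance weight before the first step (assumed 1)
        for s, a, r in trajectory:
            # Cumulative importance weight of the steps up to and including this one.
            w = w_prev * pi_e(a, s) / pi_b(a, s)
            # Per-decision importance-sampled reward minus the control variate.
            total += w * r - (w * q_hat(s, a) - w_prev * v_hat(s))
            w_prev = w
    return total / len(trajectories)


# Usage: a one-step problem with one state and two actions.
# Action 0 pays 1, action 1 pays 0; pi_e picks action 0 with probability 0.9.
def pi_b(a, s):
    return 0.5


def pi_e(a, s):
    return 0.9 if a == 0 else 0.1


def q_exact(s, a):
    return 1.0 if a == 0 else 0.0


def v_exact(s):
    return 0.9


data = [[(0, 0, 1.0)], [(0, 1, 0.0)], [(0, 1, 0.0)]]
estimate = doubly_robust(data, pi_e, pi_b, q_exact, v_exact)  # close to 0.9
```

In this example the value estimates are exact, so every trajectory contributes the true value 0.9 and the estimate has no variance. With zero value estimates the function reduces to per-decision importance sampling.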
Weighted Doubly Robust Bootstrap
[Diagram: $D$ is sampled with replacement to form bootstrap data sets $D_0, \ldots, D_m$; a weighted doubly robust estimate is computed on each, yielding $\hat{V}_0, \ldots, \hat{V}_m$.]
Bootstrapping with Models
MB-Bootstrap (Model-Based Bootstrap)
  Advantages: Low variance.
  Disadvantages: Potentially high bias.
WDR-Bootstrap (Weighted Doubly Robust Bootstrap)
  Advantages: Low bias.
  Disadvantages: Potentially higher variance.
Existing Methods
[Diagram: importance-sampling based methods, including bootstrap importance-sampling, shown alongside wdr-bootstrap (ours) and mb-bootstrap (ours).]
MountainCar Domain
State and action spaces are discretized.
Models use a tabular representation.
Mountain Car Domain
[Figure]