Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation (PowerPoint PPT presentation)


  1. Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation. Josiah Hanna¹, Peter Stone¹, Scott Niekum². ¹Learning Agents Research Group, UT Austin. ²Personal Autonomous Robotics Lab, UT Austin. May 10th, 2017.

  2. Motivation: Determine a lower bound on the expected performance of an autonomous control policy, given data generated from a different policy.

  5. Preliminaries: The agent samples actions from a policy, A_t ∼ π(·|S_t). The environment responds with S_{t+1} ∼ P(·|S_t, A_t). The policy and environment determine a distribution over trajectories, H: S_1, A_1, S_2, A_2, ..., S_L, A_L, written H ∼ π. The expected return of π is V(π) := E[ Σ_{t=1}^L r(S_t, A_t) | H ∼ π ].
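To make the expected-return definition concrete, here is a minimal Monte Carlo sketch of estimating V(π) by rolling out trajectories. The env.reset()/env.step() interface, the policy signature, and the horizon L are illustrative assumptions, not details from the slides.

```python
import numpy as np

def monte_carlo_value(env, policy, L, n_episodes=1000, seed=0):
    """Estimate V(pi) = E[ sum_{t=1}^L r(S_t, A_t) | H ~ pi ] by rollouts.

    `policy(s, rng)` samples A_t ~ pi(.|S_t); `env` is any simulator
    exposing reset() -> s and step(a) -> (s', r, done), a hypothetical
    interface used only for illustration.
    """
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_episodes):
        s = env.reset()
        G = 0.0
        for _ in range(L):
            a = policy(s, rng)
            s, r, done = env.step(a)
            G += r                      # accumulate undiscounted reward
            if done:
                break
        returns.append(G)
    return float(np.mean(returns))      # Monte Carlo estimate of V(pi)
```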

  7. Confidence Intervals for Off-Policy Evaluation. Given: trajectories generated by a behavior policy π_b, with (H, π_b) ∈ D; an evaluation policy π_e; and a confidence level δ ∈ [0, 1]. Determine a lower bound V̂_lb(π_e, D) such that V(π_e) ≥ V̂_lb(π_e, D) with probability 1 − δ.

  8. Existing Methods: exact confidence intervals [Thomas et al., 2015a]; clipped importance weights [Bottou et al., 2013]; bootstrapped importance sampling [Thomas et al., 2015b].

  9. Existing Methods: [Diagram: the methods above, with our work positioned among them.]

  11. Data-Efficient Confidence Intervals: We draw on two ideas to reduce the number of trajectories required for tight confidence bounds:
• Replace exact confidence bounds with bootstrap confidence intervals.
• Use learned models of the environment's transition function to reduce variance.
Contributions: (1) two bootstrap methods that incorporate models for approximate high-confidence policy evaluation; (2) a theoretical bound on model bias; (3) an empirical evaluation of the proposed methods.

  12. Bootstrap Confidence Intervals: [Diagram: sample with replacement from D to form bootstrap datasets D_0, ..., D_m; estimate V(π_e) on each to obtain V̂_0, ..., V̂_m.]
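As a concrete sketch of the procedure in the diagram, the percentile bootstrap below resamples trajectories with replacement, re-estimates V(π_e) on each resample, and returns the δ-quantile of the estimates as the lower bound. The quantile step is standard percentile-bootstrap practice rather than a detail stated on the slide, and the `estimate` callback stands in for whichever off-policy estimator is plugged in.

```python
import numpy as np

def bootstrap_lower_bound(trajectories, estimate, delta=0.05, m=2000, seed=0):
    """Percentile-bootstrap lower bound on V(pi_e).

    `estimate` maps a list of trajectories to a scalar off-policy
    estimate of V(pi_e). A minimal sketch: the resulting bound is
    approximate, since bootstrap guarantees hold only asymptotically.
    """
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    estimates = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)             # sample with replacement
        estimates.append(estimate([trajectories[i] for i in idx]))
    return float(np.quantile(estimates, delta))      # delta-quantile = lower bound
```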

  15. Data-Efficient Confidence Intervals: ✓ Replace exact confidence bounds with bootstrap confidence intervals. Next: use learned models of the environment's transition function to reduce variance.

  17. Model-Based Off-Policy Evaluation: Trajectories are generated from an MDP, M = ⟨S, A, P, r⟩. [Diagram: a three-state chain s_0, s_1, s_2 with true transition probabilities of 0.5.] Model-based off-policy estimators use all trajectories to estimate the unknown transition function P. [Diagram: the learned chain with estimated probabilities 0.55/0.45 and 0.35/0.65.] The model-based off-policy estimator is V̂(π_e) := V_M̂(π_e), where M̂ = ⟨S, A, P̂, r⟩ and P̂ is the learned transition function.
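A minimal tabular sketch of this estimator: fit a maximum-likelihood transition model from all trajectories, then evaluate π_e by rolling out in the learned MDP. The trajectory format, function names, and rollout-based evaluation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from collections import defaultdict

def fit_tabular_model(trajectories):
    """Maximum-likelihood estimate of P-hat(s'|s,a) from all trajectories.

    Assumes discrete states/actions and trajectories given as lists of
    (s, a, r, s_next) tuples (an illustrative format).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for s, a, _, s_next in traj:
            counts[(s, a)][s_next] += 1
    P_hat = {}
    for sa, succ in counts.items():
        total = sum(succ.values())
        P_hat[sa] = {s2: c / total for s2, c in succ.items()}
    return P_hat

def model_based_estimate(P_hat, policy, reward, s0, L, n_rollouts=1000, seed=0):
    """Estimate V_Mhat(pi_e) by Monte Carlo rollouts in the learned MDP."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_rollouts):
        s, G = s0, 0.0
        for _ in range(L):
            a = policy(s, rng)
            G += reward(s, a)
            succ = P_hat.get((s, a))
            if succ is None:            # unseen (s, a): a source of model bias
                break
            states = list(succ)
            s = states[rng.choice(len(states), p=list(succ.values()))]
        returns.append(G)
    return float(np.mean(returns))
```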

  19. Model Bias: Model-based approaches may have high bias.
• Lack of data: when we lack data for a particular (S, A) pair, we must make assumptions about the transition probability P(·|S, A).
• Model representation: the true function P may be outside the class of models we consider.
We show theoretically that model bias depends on: the importance-sampled train/test error when building the model; the horizon length; and the maximum reward.

  20. Model-Based Bootstrap: [Diagram: sample with replacement from D to form D_0, ..., D_m; compute a model-based estimate on each to obtain V̂_0, ..., V̂_m.]
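In terms of the two sketches above, MB-Bootstrap amounts to plugging the model-based estimator into the bootstrap. All names below refer to those illustrative sketches (with pi_e, r_fn, s0, L, and trajectories as placeholders), not to the paper's code.

```python
# MB-Bootstrap (sketch): fit a model to each bootstrap resample and
# take the delta-quantile of the resulting model-based estimates.
mb_estimate = lambda d: model_based_estimate(
    fit_tabular_model(d), policy=pi_e, reward=r_fn, s0=s0, L=L)
lower = bootstrap_lower_bound(trajectories, mb_estimate, delta=0.05)
```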

  21. Existing Methods: [Diagram: importance-sampling-based methods, bootstrapped importance sampling, and MB-Bootstrap (ours).]

  22. Doubly Robust Estimator [Jiang and Li, 2016; Thomas and Brunskill, 2016]:
DR(D) := PDIS(D) − Σ_{i=1}^n Σ_{t=0}^L ( w^i_t q̂^{π_e}(S^i_t, A^i_t) − w^i_{t−1} v̂^{π_e}(S^i_t) )
PDIS(D) is an unbiased estimator; the correction term is zero in expectation.
v̂^π(S) := E_{A∼π, S′∼P̂(·|S,A)}[ r(S, A) + v̂(S′) ] is the state value function.
q̂^π(S, A) := r(S, A) + E_{S′∼P̂(·|S,A)}[ v̂(S′) ] is the state-action value function.
w^i_t is the importance weight of the first t time-steps.
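The following sketch computes the (unweighted) DR estimate from the formula above. The trajectory format and function signatures are assumptions, and the paper's WDR variant additionally normalizes the importance weights across trajectories.

```python
def dr_estimate(trajectories, pi_e, pi_b, q_hat, v_hat):
    """Doubly robust estimate of V(pi_e) (sketch of the formula above).

    Each trajectory is a list of (s, a, r) tuples; pi_e(a, s) and
    pi_b(a, s) return action probabilities; q_hat(s, a) and v_hat(s)
    come from a learned model. Uses ordinary importance weights,
    averaged over the n trajectories; WDR would use normalized weights.
    """
    n = len(trajectories)
    total = 0.0
    for traj in trajectories:
        w = 1.0                                   # w_0 = 1
        for (s, a, r) in traj:
            w_prev = w
            w *= pi_e(a, s) / pi_b(a, s)          # w_t over the first t steps
            # per-decision IS term minus the control-variate correction,
            # which is zero in expectation
            total += w * r - (w * q_hat(s, a) - w_prev * v_hat(s))
    return total / n
```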

  23. Weighted Doubly Robust Bootstrap: [Diagram: sample with replacement from D to form D_0, ..., D_m; compute a weighted doubly robust estimate on each to obtain V̂_0, ..., V̂_m.]

  24. Bootstrapping with Models:
• MB-Bootstrap (Model-Based Bootstrap). Advantages: low variance. Disadvantages: potentially high bias.
• WDR-Bootstrap (Weighted Doubly Robust Bootstrap). Advantages: low bias. Disadvantages: potentially higher variance.

  25. Existing Methods: [Diagram: importance-sampling-based methods, bootstrapped importance sampling, MB-Bootstrap (ours), and WDR-Bootstrap (ours).]

  26. Mountain Car Domain: State and action spaces are discretized. Models use a tabular representation.
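For reference, a minimal sketch of the kind of discretization involved. The bounds are the classic Mountain Car state limits; the bin count and index scheme are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def discretize(obs, bins_per_dim=20):
    """Map a continuous Mountain Car observation (position, velocity)
    to a single discrete state index for a tabular model.
    """
    lows = np.array([-1.2, -0.07])    # position, velocity lower bounds
    highs = np.array([0.6, 0.07])     # position, velocity upper bounds
    ratios = (np.asarray(obs) - lows) / (highs - lows)
    idx = np.clip((ratios * bins_per_dim).astype(int), 0, bins_per_dim - 1)
    return int(idx[0] * bins_per_dim + idx[1])
```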

  27. Mountain Car Domain
