Data-efficient Policy Evaluation through Behavior Policy Search
Josiah Hanna¹, Philip Thomas², Peter Stone¹, Scott Niekum¹
¹ University of Texas at Austin   ² University of Massachusetts, Amherst
August 8th, 2017
Policy Evaluation
Outline
1. Demonstrate that importance sampling for policy evaluation can outperform on-policy policy evaluation.
2. Show how to improve the behavior policy for importance-sampling policy evaluation.
3. Empirically evaluate (1) and (2).
Background
Finite-horizon MDP. The agent selects actions with a stochastic policy, π. The policy and environment together determine a distribution over trajectories, H: S_0, A_0, R_0, S_1, A_1, R_1, ..., S_L, A_L, R_L.
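As a worked equation (not on the slide, but the standard factorization for a finite-horizon MDP), the trajectory distribution splits into policy terms and environment terms; here d_0 denotes the initial-state distribution and P the transition function, both assumed notation:

```latex
% Probability of a trajectory H = (S_0, A_0, R_0, ..., S_L, A_L, R_L)
% generated by running policy \pi in the MDP:
\Pr(H \mid \pi)
  \;=\; d_0(S_0)\,
        \prod_{t=0}^{L} \pi(A_t \mid S_t)\,
        \prod_{t=0}^{L-1} P(S_{t+1} \mid S_t, A_t)
```

The environment terms d_0 and P appear identically for any policy, which is why the importance-sampling re-weighting factor introduced later involves only ratios of policy probabilities.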
Policy Evaluation
Policy performance:
$$\rho(\pi) := \mathbb{E}\!\left[\,\sum_{t=0}^{L} \gamma^t R_t \,\middle|\, H \sim \pi\right]$$
Given a target policy, π_e, estimate ρ(π_e).
Let π_e ≡ π_{θ_e}.
Monte Carlo Policy Evaluation
Given a dataset D of trajectories where ∀ H_i ∈ D, H_i ∼ π_e:
$$\mathrm{MC}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{H_i \in \mathcal{D}} \sum_{t=0}^{L} \gamma^t R_t^{(i)}$$
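A minimal sketch of this on-policy estimator, assuming each trajectory is stored as a list of (state, action, reward) tuples; the data layout is an assumption for illustration, not something specified on the slide.

```python
def monte_carlo_estimate(trajectories, gamma):
    """On-policy Monte Carlo estimate of rho(pi_e).

    `trajectories` is a list of trajectories sampled by running pi_e itself;
    each trajectory is a list of (state, action, reward) tuples.
    """
    total = 0.0
    for trajectory in trajectories:
        # Discounted return of this trajectory.
        total += sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))
    return total / len(trajectories)
```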
[Two-action example: Action 1 yields +100, Action 2 yields +1.]
Target policy π_e samples the high-rewarding first action with probability 0.01.
Monte Carlo evaluation of π_e has high variance.
Importance sampling with a behavior policy that samples either action with equal probability gives a low-variance evaluation.
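A small numerical check of this claim, done in Python; the rewards (+100, +1), the target probabilities (0.01, 0.99), and the uniform behavior policy all come from the slide, and the variances are computed exactly rather than by simulation.

```python
# Two-action bandit from the slide: action 1 pays +100, action 2 pays +1.
p_e = [0.01, 0.99]      # target policy pi_e
p_b = [0.50, 0.50]      # uniform behavior policy
rewards = [100.0, 1.0]

rho = sum(p * r for p, r in zip(p_e, rewards))                          # 1.99

# Variance of the on-policy (Monte Carlo) return under pi_e.
var_mc = sum(p * r ** 2 for p, r in zip(p_e, rewards)) - rho ** 2       # ~97.03

# Variance of the importance-sampled return under the uniform behavior policy.
is_returns = [(pe / pb) * r for pe, pb, r in zip(p_e, p_b, rewards)]
var_is = sum(pb * x ** 2 for pb, x in zip(p_b, is_returns)) - rho ** 2  # ~0.0001

print(rho, var_mc, var_is)
```

The on-policy return has variance near 97, while the importance-sampled return under the uniform behavior policy has variance near 0.0001, even though both estimators have the same mean, 1.99.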
Importance-Sampling Policy Evaluation [1]
Given a dataset D of trajectories where ∀ H_i ∈ D, H_i is sampled from a behavior policy π_i:
$$\mathrm{IS}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{H_i \in \mathcal{D}} \underbrace{\prod_{t=0}^{L} \frac{\pi_e(A_t \mid S_t)}{\pi_i(A_t \mid S_t)}}_{\text{re-weighting factor}} \sum_{t=0}^{L} \gamma^t R_t^{(i)}$$
For convenience:
$$\mathrm{IS}(H, \pi) := \prod_{t=0}^{L} \frac{\pi_e(A_t \mid S_t)}{\pi(A_t \mid S_t)} \sum_{t=0}^{L} \gamma^t R_t$$
[1] Precup, Sutton, and Singh (2000)
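A minimal sketch of the ordinary importance-sampling estimator above, assuming each trajectory records the behavior policy's action probability alongside the state, action, and reward at each step; `pi_e_prob` stands for π_e(a | s) and is a hypothetical callable, not part of the slide.

```python
def importance_sampling_estimate(trajectories, pi_e_prob, gamma):
    """Ordinary importance-sampling estimate of rho(pi_e).

    Each trajectory is a list of (state, action, reward, behavior_prob) tuples,
    where behavior_prob is pi_i(a | s) for the policy that generated the data.
    """
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0          # product of per-step likelihood ratios
        ret = 0.0             # discounted return of the trajectory
        for t, (s, a, r, behavior_prob) in enumerate(trajectory):
            weight *= pi_e_prob(s, a) / behavior_prob
            ret += gamma ** t * r
        total += weight * ret
    return total / len(trajectories)
```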
The Optimal Behavior Policy
Importance sampling can achieve zero mean-squared-error policy evaluation with only a single trajectory!
We cannot analytically determine this policy:
- Requires ρ(π_e) be known!
- Requires the reward function be known.
- Requires deterministic transitions.
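A sketch of why such a policy exists, reduced to the one-step (bandit) case and assuming nonnegative rewards with ρ(π_e) > 0; the trajectory-level construction behind the slide additionally relies on the known-reward and deterministic-transition conditions listed above.

```latex
% One-step case: reward R(a) \ge 0 for each action a, and
% \rho(\pi_e) = \sum_a \pi_e(a) R(a) > 0.  Define the behavior policy
\pi^*(a) \;=\; \frac{\pi_e(a)\, R(a)}{\rho(\pi_e)} .
% For any action a with \pi^*(a) > 0, the importance-sampled return is
\mathrm{IS}(a, \pi^*) \;=\; \frac{\pi_e(a)}{\pi^*(a)}\, R(a) \;=\; \rho(\pi_e),
% i.e., every single sample equals exactly \rho(\pi_e), so the MSE is zero --
% but writing down \pi^* already requires \rho(\pi_e) and the reward function.
```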
Behavior Policy Search
Adapt the behavior policy towards the optimal behavior policy. At each iteration, i:
1. Choose behavior policy parameters, θ_i, based on all observed data D.
2. Sample m trajectories, H ∼ π_{θ_i}, and add them to the data set D.
3. Estimate ρ(π_e) with the trajectories in D.
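A minimal code skeleton of this loop, with `choose_behavior_parameters`, `sample_trajectories`, and `estimate_rho` as hypothetical helpers standing in for steps 1-3; none of these names appear on the slide.

```python
def behavior_policy_search(theta_e, n_iterations, m,
                           choose_behavior_parameters,
                           sample_trajectories,
                           estimate_rho):
    """Skeleton of the behavior policy search loop on the slide."""
    data = []                      # all trajectories observed so far
    theta = theta_e                # start from the target policy's parameters
    estimates = []
    for i in range(n_iterations):
        theta = choose_behavior_parameters(data, theta)   # step 1
        data.extend(sample_trajectories(theta, m))        # step 2
        estimates.append(estimate_rho(data))              # step 3
    return estimates
```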
Behavior Policy Gradient
Key idea: adapt the behavior policy parameters, θ, with gradient descent on the mean squared error of the importance-sampling estimator:
$$\theta_{i+1} = \theta_i - \alpha \frac{\partial}{\partial \theta} \mathrm{MSE}\big[\mathrm{IS}(H_i, \theta)\big]$$
MSE[IS(H, θ)] is not computable, but (∂/∂θ) MSE[IS(H, θ)] is computable.
Behavior Policy Gradient Theorem
Theorem:
$$\frac{\partial}{\partial \theta} \mathrm{MSE}\big(\mathrm{IS}(H, \theta)\big) = \mathbb{E}_{H \sim \pi_\theta}\!\left[ -\mathrm{IS}(H, \theta)^2 \sum_{t=0}^{L} \frac{\partial}{\partial \theta} \log \pi_\theta(A_t \mid S_t) \right]$$
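A minimal sample-based sketch of the update suggested by the theorem: average the quantity inside the expectation over trajectories sampled from π_θ, then take a gradient-descent step. It assumes θ is a NumPy parameter vector, each trajectory is a list of (state, action, reward) tuples, and `is_estimate(trajectory, theta)` and `log_prob_grad(theta, s, a)` are hypothetical helpers for IS(H, θ) and ∂/∂θ log π_θ(a | s).

```python
import numpy as np

def bpg_update(theta, trajectories, is_estimate, log_prob_grad, alpha):
    """One Behavior Policy Gradient step on the behavior policy parameters."""
    grad = np.zeros_like(theta)
    for trajectory in trajectories:
        # Score function of the trajectory: sum_t d/dtheta log pi_theta(A_t | S_t).
        score = sum(log_prob_grad(theta, s, a) for s, a, _ in trajectory)
        # Sample of the MSE gradient from the theorem: -IS(H, theta)^2 * score.
        grad += -(is_estimate(trajectory, theta) ** 2) * score
    grad /= len(trajectories)
    return theta - alpha * grad    # gradient descent on the MSE
```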
Empirical Results
[Figures: results on the Cartpole Swing-up and Acrobot domains.]
GridWorld Results
[Figures: results for a high-variance policy and a low-variance policy.]