Reducing Sampling Error in Batch Temporal Difference Learning
Brahma S. Pavse¹, Ishan Durugkar¹, Josiah Hanna², Peter Stone¹˒³
¹The University of Texas at Austin   ²The University of Edinburgh   ³Sony AI
ICML, July 2020
brahmasp@cs.utexas.edu
Reinforcement Learning Successes
How can RL agents make the most of a finite amount of experience?
• Learn an accurate estimate of the value function from a finite amount of data.
Spotlight Overview
• With a finite batch of data, on-policy single-step temporal difference learning converges to the value function for the wrong policy.
• We propose a more efficient estimator and prove that it converges to the value function for the true policy.
Spotlight Overview: Flaw in Batch TD(0)
[Figure: a small MDP with a single decision state s and two actions, a1 leading to s1 with reward +30 and a2 leading to s2 with reward +60; panels contrast the true policy and true value function with what Batch TD(0) computes from a finite-sized batch.]
• Batch TD(0) estimates the value function for the wrong policy!
• Our estimator will estimate the value function for the true policy (a worked example with assumed numbers follows).
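To make the flaw concrete, here is a worked instance of the diagram with assumed numbers: π takes a1 and a2 with probability 1/2 each, episodes last one step, and rewards are +30 for a1 and +60 for a2; a batch of three episodes happens to contain a1 twice and a2 once.

```latex
v_\pi(s) = \tfrac{1}{2}(30) + \tfrac{1}{2}(60) = 45,
\qquad \text{but} \qquad
\hat{v}(s) = \underbrace{\tfrac{2}{3}}_{\hat{\pi}(a_1 \mid s)} (30)
           + \underbrace{\tfrac{1}{3}}_{\hat{\pi}(a_2 \mid s)} (60) = 40.
```

That is, batch TD(0) converges to the value of the empirical (MLE) policy of the batch, not of the true policy π.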
Batch Linear* Value Function Learning
• Policy and environment transition dynamics: π(a|s) and P(s'|s, a).
• Generate a batch of m episodes, D = {(s_t, a_t, r_t, s_{t+1})}, where a_t ∼ π(·|s_t) and s_{t+1} ∼ P(·|s_t, a_t) (a toy code sketch follows the assumptions below).
• Estimate the value function: v_π(s) ≈ v̂(s) = w⊤x(s), with feature vector x(s) and learned weights w.
Assumptions:
1. π is known (the policy we want to learn about).
2. P is unknown (model-free).
3. The reward function is unknown.
4. On-policy (focus of this talk).
*Empirical analysis also considers non-linear TD(0)
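A minimal toy sketch of this setup, with all names, rewards, and probabilities assumed for illustration (this is not the paper's experimental code):

```python
import random

# Hypothetical one-step MDP: from state "s", action "a1" yields reward 30
# and action "a2" yields reward 60; every episode terminates after one step.
TRUE_POLICY = {"a1": 0.5, "a2": 0.5}   # pi(a|s), assumed known
REWARDS = {"a1": 30.0, "a2": 60.0}

def generate_batch(m, seed=0):
    """Sample a batch of m one-step episodes as (s, a, r, s_next, done) tuples."""
    rng = random.Random(seed)
    actions, probs = zip(*TRUE_POLICY.items())
    batch = []
    for _ in range(m):
        a = rng.choices(actions, weights=probs)[0]
        batch.append(("s", a, REWARDS[a], "terminal", True))
    return batch

batch = generate_batch(m=3)
```

With only m = 3 episodes, the empirical action frequencies will often differ from the 50/50 policy, which is exactly the sampling error the next slides discuss.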
Batch Linear* TD(0)
• Input: a fixed, finite batch of transitions.
• For each transition, compute and accumulate the TD error.
• Make one aggregated update to the weights.
• Clear the accumulated TD errors.
• Repeat until convergence (a code sketch of this loop follows).
*Empirical analysis also considers non-linear TD(0)
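A minimal sketch of the batch linear TD(0) loop just described, using NumPy; the transition format, feature map, and hyperparameters are assumptions for illustration:

```python
import numpy as np

def batch_linear_td0(batch, features, num_features, gamma=0.99,
                     alpha=0.1, tol=1e-8, max_sweeps=100_000):
    """Batch linear TD(0): repeatedly sweep the fixed batch, accumulating
    TD errors, then apply one aggregated weight update per sweep.

    batch:    list of (s, a, r, s_next, done) transitions.
    features: function mapping a state to a length-num_features vector x(s).
    """
    w = np.zeros(num_features)
    for _ in range(max_sweeps):
        update = np.zeros(num_features)              # cleared every sweep
        for s, a, r, s_next, done in batch:
            x = features(s)
            v_next = 0.0 if done else w @ features(s_next)
            td_error = r + gamma * v_next - w @ x    # one-step TD error
            update += td_error * x                   # accumulate
        w_new = w + (alpha / len(batch)) * update    # aggregated update
        if np.linalg.norm(w_new - w) < tol:          # until convergence
            return w_new
        w = w_new
    return w
```

On the toy batch above, with a tabular one-hot feature for s, this converges to the certainty-equivalence value of the batch, which the next slide makes precise.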
Batch TD(0) Value Function
• Given a finite-sized batch, batch TD(0) converges to the certainty-equivalence estimate for the MDP*: the value function of the maximum-likelihood estimates (MLE) of the policy and transition dynamics computed from the batch.
• Problem! These MLEs suffer from sampling error relative to the true policy and transition dynamics (a sketch of the count-based MLEs follows).
*Sutton (1988) proved a similar result for a Markov reward process
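The count-based MLEs that define the certainty-equivalence estimate can be computed directly from the batch; a sketch under the same assumed transition format:

```python
from collections import Counter, defaultdict

def mle_from_batch(batch):
    """Count-based MLEs of pi(a|s) and P(s_next|s, a) from a batch of
    (s, a, r, s_next, done) transitions."""
    action_counts = defaultdict(Counter)      # s -> counts over actions
    next_counts = defaultdict(Counter)        # (s, a) -> counts over s_next
    for s, a, r, s_next, done in batch:
        action_counts[s][a] += 1
        next_counts[(s, a)][s_next] += 1
    pi_mle = {s: {a: n / sum(c.values()) for a, n in c.items()}
              for s, c in action_counts.items()}
    p_mle = {sa: {sn: n / sum(c.values()) for sn, n in c.items()}
             for sa, c in next_counts.items()}
    return pi_mle, p_mle

# E.g., if the 50/50 policy happened to sample a1 twice and a2 once:
batch = [("s", "a1", 30.0, "terminal", True),
         ("s", "a1", 30.0, "terminal", True),
         ("s", "a2", 60.0, "terminal", True)]
pi_mle, _ = mle_from_batch(batch)   # pi_mle["s"] == {"a1": 2/3, "a2": 1/3}
```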
Policy Sampling Error in Batch TD(0)
[Figure: the same MDP as before; from the finite-sized batch, Batch TD(0) computes the value function of the MLE policy, which differs from the true policy and true value function.]
• Because the empirical action frequencies in a finite batch differ from π, batch TD(0) estimates the value function for the wrong policy: the MLE policy of the batch, not the true policy (a sketch of the correction follows).
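One way to correct policy sampling error, in the spirit of the estimator this talk proposes, is to importance-weight each TD update by π(a|s) / π̂_D(a|s), the ratio of the known true policy to the batch's MLE policy; the sketch below is an assumed illustration of that reweighting, not the paper's exact algorithm.

```python
from collections import Counter, defaultdict

def psec_weights(batch, true_policy):
    """Per-transition correction weights pi(a|s) / pi_mle(a|s), where pi_mle
    is the count-based MLE policy of a batch of (s, a, r, s_next, done)."""
    counts = defaultdict(Counter)
    for s, a, *_ in batch:
        counts[s][a] += 1
    weights = []
    for s, a, *_ in batch:
        pi_mle = counts[s][a] / sum(counts[s].values())
        weights.append(true_policy(a, s) / pi_mle)  # true_policy is assumed known
    return weights
```

Multiplying each transition's TD error by its weight inside the batch TD(0) sweep down-weights over-sampled actions and up-weights under-sampled ones, so the corrected update targets the value function of the true policy rather than the MLE policy.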