Reducing Sampling Error in Batch Temporal Difference Learning
Brahma S. Pavse¹, Ishan Durugkar¹, Josiah Hanna², Peter Stone¹˒³
¹The University of Texas at Austin   ²The University of Edinburgh   ³Sony AI
ICML, July 2020
brahmasp@cs.utexas.edu
Reinforcement Learning Successes
How can RL agents make the most of a finite amount of experience?
• Learn an accurate estimate of the value function from a finite amount of data.
Spotlight Overview
• With a finite batch of data, on-policy single-step temporal difference learning converges to the value function for the wrong policy.
• We propose a more efficient estimator and prove that it converges to the value function for the true policy.
Spotlight Overview: Flaw in Batch TD(0)
[Figure: a small MDP with a single decision state s and two actions, a1 leading to s1 with reward +30 and a2 leading to s2 with reward +60; panels contrast the true policy and true value function with what Batch TD(0) computes from a finite-sized batch.]
• Batch TD(0) estimates the value function for the wrong policy!
• Our estimator will estimate the value function for the true policy (a worked example with assumed numbers follows).
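To make the flaw concrete, here is a worked instance of the diagram with assumed numbers: π takes a1 and a2 with probability 1/2 each, episodes last one step, and rewards are +30 for a1 and +60 for a2; a batch of three episodes happens to contain a1 twice and a2 once.

```latex
v_\pi(s) = \tfrac{1}{2}(30) + \tfrac{1}{2}(60) = 45,
\qquad \text{but} \qquad
\hat{v}(s) = \underbrace{\tfrac{2}{3}}_{\hat{\pi}(a_1 \mid s)} (30)
           + \underbrace{\tfrac{1}{3}}_{\hat{\pi}(a_2 \mid s)} (60) = 40.
```

That is, batch TD(0) converges to the value of the empirical (MLE) policy of the batch, not of the true policy π.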
Batch Linear* Value Function Learning
• Policy and environment transition dynamics: π(a|s) and P(s'|s, a).
• Generate a batch of m episodes, D = {(s_t, a_t, r_t, s_{t+1})}, where a_t ∼ π(·|s_t) and s_{t+1} ∼ P(·|s_t, a_t) (a toy code sketch follows the assumptions below).
• Estimate the value function: v_π(s) ≈ v̂(s) = w⊤x(s), with feature vector x(s) and learned weights w.
Assumptions:
1. π is known (the policy we want to learn about).
2. P is unknown (model-free).
3. The reward function is unknown.
4. On-policy (focus of this talk).
*Empirical analysis also considers non-linear TD(0)
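A minimal toy sketch of this setup, with all names, rewards, and probabilities assumed for illustration (this is not the paper's experimental code):

```python
import random

# Hypothetical one-step MDP: from state "s", action "a1" yields reward 30
# and action "a2" yields reward 60; every episode terminates after one step.
TRUE_POLICY = {"a1": 0.5, "a2": 0.5}   # pi(a|s), assumed known
REWARDS = {"a1": 30.0, "a2": 60.0}

def generate_batch(m, seed=0):
    """Sample a batch of m one-step episodes as (s, a, r, s_next, done) tuples."""
    rng = random.Random(seed)
    actions, probs = zip(*TRUE_POLICY.items())
    batch = []
    for _ in range(m):
        a = rng.choices(actions, weights=probs)[0]
        batch.append(("s", a, REWARDS[a], "terminal", True))
    return batch

batch = generate_batch(m=3)
```

With only m = 3 episodes, the empirical action frequencies will often differ from the 50/50 policy, which is exactly the sampling error the next slides discuss.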
Batch Linear* TD(0)
• Input: a fixed, finite batch of transitions.
• For each transition, compute and accumulate the TD error.
• Make one aggregated update to the weights.
• Clear the accumulated TD errors.
• Repeat until convergence (a code sketch of this loop follows).
*Empirical analysis also considers non-linear TD(0)
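A minimal sketch of the batch linear TD(0) loop just described, using NumPy; the transition format, feature map, and hyperparameters are assumptions for illustration:

```python
import numpy as np

def batch_linear_td0(batch, features, num_features, gamma=0.99,
                     alpha=0.1, tol=1e-8, max_sweeps=100_000):
    """Batch linear TD(0): repeatedly sweep the fixed batch, accumulating
    TD errors, then apply one aggregated weight update per sweep.

    batch:    list of (s, a, r, s_next, done) transitions.
    features: function mapping a state to a length-num_features vector x(s).
    """
    w = np.zeros(num_features)
    for _ in range(max_sweeps):
        update = np.zeros(num_features)              # cleared every sweep
        for s, a, r, s_next, done in batch:
            x = features(s)
            v_next = 0.0 if done else w @ features(s_next)
            td_error = r + gamma * v_next - w @ x    # one-step TD error
            update += td_error * x                   # accumulate
        w_new = w + (alpha / len(batch)) * update    # aggregated update
        if np.linalg.norm(w_new - w) < tol:          # until convergence
            return w_new
        w = w_new
    return w
```

On the toy batch above, with a tabular one-hot feature for s, this converges to the certainty-equivalence value of the batch, which the next slide makes precise.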
Batch TD(0) Value Function
• Given a finite-sized batch, batch TD(0) converges to the certainty-equivalence estimate for the MDP*: the value function of the maximum-likelihood estimates (MLE) of the policy and transition dynamics computed from the batch.
• Problem! These MLEs suffer from sampling error relative to the true policy and transition dynamics (a sketch of the count-based MLEs follows).
*Sutton (1988) proved a similar result for a Markov reward process
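The count-based MLEs that define the certainty-equivalence estimate can be computed directly from the batch; a sketch under the same assumed transition format:

```python
from collections import Counter, defaultdict

def mle_from_batch(batch):
    """Count-based MLEs of pi(a|s) and P(s_next|s, a) from a batch of
    (s, a, r, s_next, done) transitions."""
    action_counts = defaultdict(Counter)      # s -> counts over actions
    next_counts = defaultdict(Counter)        # (s, a) -> counts over s_next
    for s, a, r, s_next, done in batch:
        action_counts[s][a] += 1
        next_counts[(s, a)][s_next] += 1
    pi_mle = {s: {a: n / sum(c.values()) for a, n in c.items()}
              for s, c in action_counts.items()}
    p_mle = {sa: {sn: n / sum(c.values()) for sn, n in c.items()}
             for sa, c in next_counts.items()}
    return pi_mle, p_mle

# E.g., if the 50/50 policy happened to sample a1 twice and a2 once:
batch = [("s", "a1", 30.0, "terminal", True),
         ("s", "a1", 30.0, "terminal", True),
         ("s", "a2", 60.0, "terminal", True)]
pi_mle, _ = mle_from_batch(batch)   # pi_mle["s"] == {"a1": 2/3, "a2": 1/3}
```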
Policy Sampling Error in Batch TD(0)
[Figure: the same MDP as before; from the finite-sized batch, Batch TD(0) computes the value function of the MLE policy, which differs from the true policy and true value function.]
• Because the empirical action frequencies in a finite batch differ from π, batch TD(0) estimates the value function for the wrong policy: the MLE policy of the batch, not the true policy (a sketch of the correction follows).
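One way to correct policy sampling error, in the spirit of the estimator this talk proposes, is to importance-weight each TD update by π(a|s) / π̂_D(a|s), the ratio of the known true policy to the batch's MLE policy; the sketch below is an assumed illustration of that reweighting, not the paper's exact algorithm.

```python
from collections import Counter, defaultdict

def psec_weights(batch, true_policy):
    """Per-transition correction weights pi(a|s) / pi_mle(a|s), where pi_mle
    is the count-based MLE policy of a batch of (s, a, r, s_next, done)."""
    counts = defaultdict(Counter)
    for s, a, *_ in batch:
        counts[s][a] += 1
    weights = []
    for s, a, *_ in batch:
        pi_mle = counts[s][a] / sum(counts[s].values())
        weights.append(true_policy(a, s) / pi_mle)  # true_policy is assumed known
    return weights
```

Multiplying each transition's TD error by its weight inside the batch TD(0) sweep down-weights over-sampled actions and up-weights under-sampled ones, so the corrected update targets the value function of the true policy rather than the MLE policy.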