Reducing Sampling Error in Batch Temporal Difference Learning




  1. Reducing Sampling Error in Batch Temporal Difference Learning. Brahma S. Pavse¹, Ishan Durugkar¹, Josiah Hanna², Peter Stone¹,³. ¹The University of Texas at Austin, ²The University of Edinburgh, ³Sony AI. ICML, July 2020. brahmasp@cs.utexas.edu

  2. Reinforcement Learning Successes [series of image slides]

  6. How can RL agents make the most of a finite amount of experience? By learning an accurate estimate of the value function from a finite amount of data.

  8. Spotlight Overview
  • With a finite batch of data, on-policy single-step temporal difference learning converges to the value function of the wrong policy.
  • We propose, and prove, a more efficient estimator that converges to the value function of the true policy.

  11. Spotlight Overview: Flaw in Batch TD(0)
  [Figure: a two-action MDP. From state s, action a1 yields reward +30 and leads to s1; action a2 yields reward +60 and leads to s2. The true policy and true value function are contrasted with what batch TD(0) computes from a finite-sized batch.]
  Given a finite-sized batch, batch TD(0) estimates the value function of the wrong policy! Our estimator will estimate the value function of the true policy.
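  To make the flaw concrete, here is a worked instance of the slide's diagram. The true action probabilities are not recoverable from the extract, so a uniform true policy is assumed; the numbers are illustrative only.

```latex
\begin{align*}
&\text{Assume a uniform true policy: } \pi(a_1\mid s) = \pi(a_2\mid s) = \tfrac12,\\
&\text{so the true value is } v^{\pi}(s) = \tfrac12(30) + \tfrac12(60) = 45.\\
&\text{If the batch happens to contain } a_1 \text{ twice and } a_2 \text{ once, then}\\
&\hat\pi_{\mathrm{MLE}}(a_1\mid s) = \tfrac23,\qquad \hat\pi_{\mathrm{MLE}}(a_2\mid s) = \tfrac13,\\
&\text{and batch TD(0) converges to } v^{\hat\pi}(s) = \tfrac23(30) + \tfrac13(60) = 40 \ne 45.
\end{align*}
```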

  16. Batch Linear* Value Function Learning
  A policy pi(a|s) interacting with environment transition dynamics P(s'|s,a) generates a batch of m episodes, where each episode is a trajectory of states, actions, and rewards. The goal is to estimate the value function of pi from this batch.
  Assumptions: 1. The policy pi is known (it is the policy we want to learn about). 2. The transition dynamics P are unknown (model-free setting). 3. The reward function is unknown. 4. On-policy evaluation (the focus of this talk).
  *Empirical analysis also considers non-linear TD(0).
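  The slide's equations did not survive extraction; the following is the standard linear setting they denote, with the feature map written as x(s) (notation assumed):

```latex
\begin{align*}
&\text{Batch of } m \text{ episodes: } \mathcal{D} = \{\tau_i\}_{i=1}^{m},\qquad
\tau_i = (s_0^i, a_0^i, r_0^i, s_1^i, a_1^i, r_1^i, \dots),\\
&\text{linear value estimate: } v_{\theta}(s) = \theta^{\top} x(s) \;\approx\; v^{\pi}(s)
= \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s\right].
\end{align*}
```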

  22. Batch Linear* TD(0)
  The algorithm takes a fixed finite batch as input. For each transition in the batch, it computes the TD error and accumulates the corresponding update; after a full sweep it makes one aggregated update to the weights, clears the accumulator, and repeats until convergence.
  *Empirical analysis also considers non-linear TD(0).
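  The pseudocode on these slides did not extract, so here is a minimal runnable sketch of batch linear TD(0) matching the annotated steps. The transition layout, feature map x, step size, and stopping rule are assumptions for illustration, not the authors' exact code.

```python
import numpy as np

def batch_linear_td0(batch, x, dim, gamma=0.99, alpha=0.01, n_sweeps=1000):
    """Batch linear TD(0): sweep a fixed finite batch, aggregate the
    TD-error updates, apply them as one step, and repeat until the
    weights stop changing.

    batch: list of transitions (s, a, r, s_next, done)
    x:     feature map, x(s) -> np.ndarray of shape (dim,)
    """
    theta = np.zeros(dim)
    for _ in range(n_sweeps):
        update = np.zeros(dim)                      # clear aggregation
        for (s, a, r, s_next, done) in batch:       # for each transition
            v_s = theta @ x(s)
            v_next = 0.0 if done else theta @ x(s_next)
            delta = r + gamma * v_next - v_s        # TD error
            update += delta * x(s)                  # accumulate TD error
        new_theta = theta + alpha * update / len(batch)  # aggregated update
        if np.linalg.norm(new_theta - theta) < 1e-8:     # until convergence
            return new_theta
        theta = new_theta
    return theta
```

  With a tabular one-hot feature map x, this procedure converges to the certainty-equivalence values discussed on the next slide.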

  30. Batch TD(0) Value Function
  Given a finite-sized batch, TD(0) converges to the certainty-equivalence estimate for the MDP*: the value function of the maximum-likelihood estimates (MLE) of the policy and transition dynamics computed from the batch.
  Problem! Those MLEs carry sampling error: with finite data the empirical policy and dynamics deviate from the true ones.
  *Sutton (1988) proved a similar result for a Markov reward process.
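  In the tabular case the certainty-equivalence estimate can be written explicitly. The slides' exact expressions did not extract; this reconstruction uses standard count-based MLEs, where n(·) counts occurrences in the batch and r-hat is the mean observed reward per state:

```latex
\begin{align*}
&\hat\pi_{\mathrm{MLE}}(a\mid s) = \frac{n(s,a)}{n(s)},\qquad
\hat P(s'\mid s,a) = \frac{n(s,a,s')}{n(s,a)},\\
&\hat P_{\hat\pi}(s'\mid s) = \sum_{a}\hat\pi_{\mathrm{MLE}}(a\mid s)\,\hat P(s'\mid s,a),\qquad
\hat v_{\mathrm{CE}} = \left(I - \gamma \hat P_{\hat\pi}\right)^{-1}\hat r.
\end{align*}
```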

  36. Policy Sampling Error in Batch TD(0)
  [Figure: the same two-action MDP as before, now showing the MLE policy computed from the finite-sized batch alongside the true policy.]
  Because batch TD(0) computes the value function of the MLE policy, and the MLE policy differs from the true policy in any finite batch, batch TD(0) estimates the value function of the wrong policy!
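  The overview promised an estimator that converges to the value function of the true policy. In the paper this is done by reweighting each transition by the ratio of the true policy to the batch's MLE policy (policy sampling error correction, PSEC). The sketch below shows that reweighting for discrete states and actions; the function name and data layout are illustrative, not the authors' implementation.

```python
from collections import defaultdict

def psec_weights(batch, pi):
    """Policy-sampling-error correction weights (sketch).

    batch: list of transitions (s, a, r, s_next, done) with hashable s, a
    pi:    true policy, pi(a, s) -> probability of action a in state s

    Each transition gets weight pi(a|s) / pi_MLE(a|s), where pi_MLE is
    the count-based empirical policy in the batch.
    """
    n_sa = defaultdict(int)   # counts of (state, action) pairs
    n_s = defaultdict(int)    # counts of states
    for (s, a, *_rest) in batch:
        n_sa[(s, a)] += 1
        n_s[s] += 1
    weights = []
    for (s, a, *_rest) in batch:
        pi_mle = n_sa[(s, a)] / n_s[s]
        weights.append(pi(a, s) / pi_mle)
    return weights
```

  Multiplying each accumulated TD-error term in the batch TD(0) sweep by its weight makes the update behave as if the actions had been sampled with their true probabilities.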
