  1. Renewal Monte Carlo: Renewal theory based reinforcement learning. Jayakumar Subramanian and Aditya Mahajan. 57th IEEE Conference on Decision and Control, Miami Beach, FL, USA, December 17-19, 2018.

  2. RL has achieved considerable success… [Images omitted; credits: Towards Data Science, MIT Technology Review, Popular Science.]
     Salient features: ⊕ Model-free method ⊕ Uses policy search
     Limitation: ⊖ Learning is slow (takes ∼10^9 to 10^15 iterations to converge)
     Can we exploit features of the model to make it learn faster, without sacrificing generality?

  3. An RL problem can be formulated as… an agent interacting with an environment. The environment is modeled as an infinite-horizon Markov decision process (MDP), with a model consisting of: state space, action space, transition probability, per-step reward. In RL, this model is unknown to the agent.

  4. Policy parametrization: π_θ is a parametrized policy. Examples: Gibbs (softmax) policy; neural network (NN) policy, where θ are the weights of the NN.
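
As a concrete illustration of the Gibbs parametrization, here is a minimal sketch of a tabular softmax policy in Python. The table shape theta[state, action] and the temperature parameter are illustrative assumptions; the slides do not fix a particular form.

    import numpy as np

    def gibbs_policy(theta, state, temperature=1.0):
        """Sample an action from a Gibbs (softmax) policy.

        theta: array of shape (num_states, num_actions), the policy parameters.
        Returns the sampled action and the action probabilities.
        """
        logits = theta[state] / temperature
        logits = logits - logits.max()              # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        action = np.random.choice(len(probs), p=probs)
        return action, probs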

  5. Policy gradient. Performance: J(θ). Gradient estimate: ĥ(θ), an estimate of ∇_θ J(θ). Stochastic gradient ascent: θ_{k+1} = θ_k + α_k ĥ(θ_k), with step sizes satisfying Σ_k α_k = ∞ and Σ_k α_k² < ∞. How do we estimate ∇_θ J(θ)?
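
In code, the ascent step is a one-line update once a gradient estimator is available. A minimal sketch (grad_estimate stands in for any of the estimators on the next slide; α_k = 1/k is one choice satisfying the step-size conditions):

    def stochastic_gradient_ascent(theta, grad_estimate, num_iters=10000):
        """Ascend J(theta) using noisy gradient estimates.

        grad_estimate(theta) should return an (unbiased) estimate of grad J(theta).
        The step sizes a_k = 1/k satisfy sum a_k = inf and sum a_k^2 < inf.
        """
        for k in range(1, num_iters + 1):
            theta = theta + (1.0 / k) * grad_estimate(theta)
        return theta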

  6. How to estimate ∇_θ J(θ)? Monte Carlo estimate (REINFORCE); actor-critic estimate (temporal difference / SARSA); actor-critic with eligibility traces (SARSA(λ)).
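
A minimal sketch of the Monte Carlo (REINFORCE) estimator for an episodic setting, reusing gibbs_policy from above. The env object with a simplified reset()/step() interface returning (next_state, reward, done) is an illustrative assumption:

    def reinforce_gradient(theta, env, gamma=0.95):
        """One-episode REINFORCE (score-function) estimate of grad J(theta)."""
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a, _ = gibbs_policy(theta, s)
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        grad = np.zeros_like(theta)
        G = 0.0
        for t in reversed(range(len(rewards))):     # returns-to-go, backwards
            G = rewards[t] + gamma * G
            _, probs = gibbs_policy(theta, states[t])
            grad_log = -probs                       # grad log pi for a tabular
            grad_log[actions[t]] += 1.0             # softmax: e_a - pi(.|s)
            grad[states[t]] += (gamma ** t) * G * grad_log
        return grad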

  7. MC vs. TD
     MC: ⊕ Unbiased ⊕ Simple & easy to implement ⊕ Handles discounted & average reward cases ⊖ High variance ⊖ End-of-episode updates ⊖ Not asymptotically optimal for infinite horizon
     TD: ⊕ Low variance ⊕ Per-step updates ⊕ Asymptotically optimal for infinite horizon ⊖ Biased ⊖ Often requires function approximation ⊖ Additional effort for average reward
     Can we get the best of both worlds?

  8. Renewal Monte Carlo. [Figure: a sample path of the state over time t = 0, 1, …, 7; each return to a designated renewal state starts a new regenerative cycle.] Let R(n) and T(n) denote the discounted reward and discounted time accumulated during the nth cycle. By renewal theory, the performance is J_θ = R_θ / T_θ, where R_θ = E[R(n)] and T_θ = E[T(n)]; J_θ is estimated by the ratio of the sample means of R(n) and T(n).
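
A sketch of the core RMC bookkeeping under the assumptions above: simulate one cycle between successive visits to the renewal state, accumulating the discounted reward R(n), the discounted time T(n), and score-function gradient estimates. The env interface and the requirement that the chain revisits renewal_state are illustrative assumptions:

    def sample_cycle(theta, env, renewal_state, gamma=0.95):
        """Simulate one regenerative cycle starting at the renewal state.

        Returns the cycle's discounted reward R, discounted time T, and
        score-function estimates gR, gT of their gradients w.r.t. theta.
        Assumes env is currently at renewal_state and revisits it w.p. 1.
        """
        s = renewal_state
        R, T, discount = 0.0, 0.0, 1.0
        score = np.zeros_like(theta)            # sum of grad log pi over cycle
        while True:
            a, probs = gibbs_policy(theta, s)
            score[s] -= probs                   # grad log pi(a|s) for the
            score[s, a] += 1.0                  # tabular softmax policy
            s, r, _ = env.step(a)
            R += discount * r
            T += discount
            discount *= gamma
            if s == renewal_state:
                return R, T, R * score, T * score

The performance is then estimated over many cycles as the ratio of sample means, J ≈ mean(R) / mean(T).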

  9. RMC based policy gradient. Performance: J_θ = R_θ / T_θ. Gradient: ∇_θ J_θ = (T_θ ∇_θ R_θ − R_θ ∇_θ T_θ) / T_θ², with estimate H = T̂ ∇R̂ − R̂ ∇T̂ (the positive factor 1/T_θ² only rescales the ascent direction). Here R̂ and T̂ are estimated using MC / TD, and ∇R̂ and ∇T̂ using the RL policy-gradient (likelihood-ratio) estimator. Stochastic gradient ascent: θ_{k+1} = θ_k + α_k H_k, with Σ_k α_k = ∞ and Σ_k α_k² < ∞.
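
Putting the pieces together, a sketch of the resulting ascent loop. Drawing two independent cycles per iteration keeps the cross products T̂ ∇R̂ and R̂ ∇T̂ unbiased (one simple way to obtain the independence used on the next slide); this pairing detail is an illustrative choice, not necessarily the exact scheme in the paper:

    def rmc_policy_gradient(theta, env, renewal_state, num_iters=1000, gamma=0.95):
        """RMC stochastic gradient ascent (sketch).

        H = T1 * gR2 - R1 * gT2 uses independent cycles, so that
        E[H] = T * grad R - R * grad T = T^2 * grad J, a positive
        rescaling of the true gradient direction.
        """
        for k in range(1, num_iters + 1):
            R1, T1, _, _ = sample_cycle(theta, env, renewal_state, gamma)
            _, _, gR2, gT2 = sample_cycle(theta, env, renewal_state, gamma)
            H = T1 * gR2 - R1 * gT2
            theta = theta + (1.0 / k) * H       # a_k = 1/k step sizes
        return theta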

  10. Convergence. Suppose (i) R̂, T̂, ∇R̂, ∇T̂ are unbiased estimators of R_θ, T_θ, ∇_θ R_θ, ∇_θ T_θ, and H = T̂ ∇R̂ − R̂ ∇T̂ is an unbiased estimator of T_θ² ∇_θ J_θ; (ii) H has bounded variance and ∇_θ J_θ is continuous; (iii) the corresponding ODE has locally asymptotically stable isolated limit points. Then the iteration θ_{k+1} = θ_k + α_k H_k for θ converges a.s. to a value θ* where ∇_θ J_{θ*} = 0.
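
Restated compactly (a reconstruction from the slide text; the precise regularity conditions are in the paper):

    \[
      \theta_{k+1} = \theta_k + \alpha_k H_k, \qquad
      \mathbb{E}[H_k \mid \theta_k] = T_{\theta_k}^2 \, \nabla_\theta J_{\theta_k}, \qquad
      \sum_k \alpha_k = \infty, \quad \sum_k \alpha_k^2 < \infty
    \]
    \[
      \implies \quad \theta_k \xrightarrow{\text{a.s.}} \theta^*
      \quad \text{with } \nabla_\theta J_{\theta^*} = 0 .
    \]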

  11. E.g. – Randomly generated MDP. [Figure: performance vs. number of samples (0 to 2×10^5); curves: Exact, S-0, S-0.25, S-0.5, S-0.75, S-1, RMC, RMC-B.]

  12. Related work
