Renewal Monte Carlo: Renewal theory based reinforcement learning
Jayakumar Subramanian and Aditya Mahajan
57th IEEE Conference on Decision and Control, Miami Beach, FL, USA, December 17-19, 2018
RL has achieved considerable success…
[Images omitted. Credits: Towards Data Science, MIT Technology Review, Popular Science]

Salient features
⊕ Model-free method
⊕ Uses policy search

Limitation
⊖ Learning is slow (takes ∼10^9 to 10^15 iterations to converge)

Can we exploit features of the model to make it learn faster… without sacrificing generality?
An RL problem can be formulated as…
[Diagram omitted: the agent acts on the environment and observes the resulting state and reward.]

Infinite-horizon Markov decision process (MDP)
Model (unknown in RL):
• State space
• Action space
• Transition probability
• Per-step reward
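To make this setting concrete, here is a minimal sketch (not from the talk) of a randomly generated finite MDP with a reset/step interface; the class name, shapes, and uniform reward model are illustrative assumptions. The learning agent only observes (state, reward) samples, never P or r directly.

```python
import numpy as np

class RandomMDP:
    """Finite MDP whose model (P, r) is hidden from the learning agent."""

    def __init__(self, n_states=10, n_actions=2, seed=0):
        rng = np.random.default_rng(seed)
        # P[a, s, s'] = probability of moving from s to s' under action a.
        self.P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
        # r[s, a] = per-step reward.
        self.r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
        self.rng = rng
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward = self.r[self.state, action]
        self.state = int(self.rng.choice(self.P.shape[1],
                                         p=self.P[action, self.state]))
        return self.state, reward
```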
Policy parametrization
$\pi_\theta$ is a parametrized policy.

Gibbs (softmax) policy: $\pi_\theta(a \mid s) = \exp(\theta_{s,a}) \big/ \sum_{a'} \exp(\theta_{s,a'})$

Neural network (NN) policy: $\theta$ = weights of the NN
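A minimal sketch of the Gibbs policy in the tabular case, assuming one preference parameter theta[s, a] per state-action pair (a detail the slides leave open; an NN policy would replace the table with a network):

```python
import numpy as np

def gibbs_policy(theta, state):
    """Gibbs (softmax) action distribution; theta has shape (n_states, n_actions)."""
    prefs = theta[state] - theta[state].max()   # shift for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def sample_action(theta, state, rng):
    """Sample an action from pi_theta(. | state)."""
    return int(rng.choice(theta.shape[1], p=gibbs_policy(theta, state)))
```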
Policy gradient
Performance: $J(\theta) = \mathbb{E}^{\pi_\theta}\big[\sum_{t=0}^{\infty} \gamma^t R_t\big]$
Gradient estimate: $\widehat{\nabla} J(\theta)$ is an estimate of $\nabla_\theta J(\theta)$.
Stochastic gradient ascent: $\theta_{k+1} = \theta_k + \alpha_k \widehat{\nabla} J(\theta_k)$, with step sizes satisfying $\sum_k \alpha_k = \infty$ and $\sum_k \alpha_k^2 < \infty$.
How do we estimate $\nabla_\theta J(\theta)$?
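For reference, the standard likelihood-ratio form of this gradient, which the estimators on the next slide approximate (a textbook identity, not reproduced from the slides):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}^{\pi_\theta}\!\Big[\sum_{t=0}^{\infty} \gamma^t\, G_t\,
      \nabla_\theta \log \pi_\theta(A_t \mid S_t)\Big],
\qquad
G_t = \sum_{s=t}^{\infty} \gamma^{\,s-t} R_s .
```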
How to estimate $\nabla_\theta J(\theta)$?
• Monte Carlo estimate (REINFORCE)
• Actor-critic estimate (temporal difference / SARSA)
• Actor-critic with eligibility traces estimate (SARSA($\lambda$))
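A minimal sketch of the first option, REINFORCE, reusing the gibbs_policy helper above; updating once after a full episode is the standard scheme, but the specific code and step size are illustrative assumptions:

```python
import numpy as np

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One Monte Carlo policy-gradient (REINFORCE) update.

    episode: list of (state, action, reward) tuples from one rollout.
    """
    G = 0.0
    grad = np.zeros_like(theta)
    # Walk the episode backwards, accumulating discounted returns.
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = r + gamma * G
        # grad of log softmax = one_hot(a) - pi(. | s)
        g = -gibbs_policy(theta, s)
        g[a] += 1.0
        grad[s] += (gamma ** t) * G * g
    return theta + alpha * grad
```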
MC vs. TD

MC
⊕ Unbiased
⊕ Simple & easy to implement
⊕ Handles discounted & average reward cases
⊖ High variance
⊖ End-of-episode updates
⊖ Not asymptotically optimal for inf. hor.

TD
⊕ Low variance
⊕ Per-step updates
⊕ Asymptotically optimal for inf. hor.
⊖ Biased
⊖ Often requires function approximation
⊖ Additional effort for average reward

Can we get the best of both worlds?
Renewal Monte Carlo
[Figure omitted: a sample state trajectory over time (t = 0, 1, …, 7); each return to a designated renewal state splits the trajectory into regenerative cycles.]

By the renewal relationship, the performance is the ratio $J_\theta = R_\theta / T_\theta$, where $R_\theta$ and $T_\theta$ are the expected (discounted) cumulative reward and time per cycle; $J_\theta$ is estimated by the ratio of their sample means over observed cycles.
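A minimal sketch of this estimator for the average-reward case, assuming state 0 is the renewal state and reusing the env and policy helpers above (the discounted case uses discounted per-cycle sums in the same ratio):

```python
import numpy as np

def rmc_performance(env, theta, rng, n_cycles=100):
    """Estimate J = E[cycle reward] / E[cycle length] from renewal cycles."""
    s = env.reset()                     # state 0 is the renewal state here
    cycle_R, cycle_T = [], []
    R, T = 0.0, 0
    while len(cycle_R) < n_cycles:
        s, r = env.step(sample_action(theta, s, rng))
        R += r
        T += 1
        if s == 0:                      # trajectory regenerates: close cycle
            cycle_R.append(R)
            cycle_T.append(T)
            R, T = 0.0, 0
    return np.mean(cycle_R) / np.mean(cycle_T)
```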
RMC based policy gradient
Performance: $J_\theta = R_\theta / T_\theta$
Gradient: $\nabla_\theta J_\theta = \big(\nabla_\theta R_\theta - J_\theta \nabla_\theta T_\theta\big) / T_\theta$, with estimate $\widehat{\nabla} H = \widehat{\nabla} R - J\,\widehat{\nabla} T$.
$\widehat{\nabla} R$ and $\widehat{\nabla} T$ are estimated over renewal cycles using MC / TD, as in RL policy gradient.
Stochastic gradient ascent:
$\theta_{k+1} = \theta_k + \alpha_k \big(\widehat{\nabla} R_k - J_k \widehat{\nabla} T_k\big)$ and $J_{k+1} = J_k + \beta_k \big(\widehat{R}_k - J_k \widehat{T}_k\big)$
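Combining the pieces, one RMC-style iteration under the same tabular assumptions, with likelihood-ratio estimates of the per-cycle gradients; the batch-of-cycles structure and step sizes are a sketch, not the paper's exact algorithm:

```python
import numpy as np

def rmc_step(env, theta, J, rng, alpha=0.01, beta=0.1, n_cycles=20):
    """One RMC update: move theta along grad_R_hat - J * grad_T_hat."""
    s = env.reset()
    gR, gT = np.zeros_like(theta), np.zeros_like(theta)
    R_bar = T_bar = 0.0
    R, T = 0.0, 0
    score = np.zeros_like(theta)        # sum of grad log pi over the cycle
    cycles = 0
    while cycles < n_cycles:
        a = sample_action(theta, s, rng)
        score[s] -= gibbs_policy(theta, s)
        score[s, a] += 1.0
        s, r = env.step(a)
        R += r
        T += 1
        if s == 0:                      # cycle complete
            gR += R * score / n_cycles  # likelihood-ratio gradient estimates
            gT += T * score / n_cycles
            R_bar += R / n_cycles
            T_bar += T / n_cycles
            R, T = 0.0, 0
            score = np.zeros_like(theta)
            cycles += 1
    J_new = J + beta * (R_bar - J * T_bar)     # stochastic approximation for J
    theta_new = theta + alpha * (gR - J_new * gT)
    return theta_new, J_new
```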
Convergence
Suppose:
• $\widehat{\nabla} R$ and $\widehat{\nabla} T$ are unbiased estimators of $\nabla_\theta R_\theta$ and $\nabla_\theta T_\theta$; then $\widehat{\nabla} H = \widehat{\nabla} R - J \widehat{\nabla} T$ is an unbiased estimator of $\nabla_\theta H_\theta$, where $H_\theta = R_\theta - J T_\theta$;
• $\widehat{\nabla} H$ has bounded variance and $\nabla_\theta H_\theta$ is continuous;
• the associated ODE has locally asymptotically stable isolated limit points.
Then the iteration for $\theta$ converges a.s. to a value $\theta^*$ where $\nabla_\theta J_{\theta^*} = 0$.
E.g. – Randomly generated MDP
[Figure omitted: performance (0–300) vs. number of samples (0 to 2.00 × 10^5). Curves: Exact, S-0, S-0.25, S-0.5, S-0.75, S-1, RMC, RMC-B.]
Related work