Optimistic Policy Optimization via Multiple Importance Sampling
Matteo Papini, Alberto Maria Metelli, Lorenzo Lupo, Marcello Restelli
11th June 2019, Thirty-sixth International Conference on Machine Learning (ICML), Long Beach, CA, USA
Policy Optimization
- Parameter space $\Theta \subseteq \mathbb{R}^d$
- A parametric policy for each $\theta \in \Theta$
- Each inducing a distribution $p_\theta$ over trajectories
- A return $R(\tau)$ for every trajectory $\tau$
- Goal: $\max_{\theta \in \Theta} J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$
- Iterative optimization (e.g., gradient ascent), as sketched below
[Figure: a parameter $\theta \in \Theta$ induces a distribution $p_\theta$ over the trajectory space $\mathcal{T}$; sampled trajectories $\tau$ yield returns $R(\tau)$ and the expected return $J(\theta)$]
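As a reference point for the iterative optimization mentioned above, here is a minimal sketch of one gradient-ascent step using a score-function (REINFORCE-style) gradient estimate. The callable names, the environment interface, and the hyperparameters are illustrative assumptions, not part of the talk.

```python
import numpy as np

def reinforce_step(theta, sample_trajectory, score_grad,
                   n_episodes=10, step_size=0.01):
    """One stochastic gradient-ascent step on J(theta) = E[R(tau)], using
        grad J(theta) ~= mean_i[ grad_theta log p_theta(tau_i) * R(tau_i) ].

    sample_trajectory(theta) -> (trajectory, return_of_trajectory)
    score_grad(theta, trajectory) -> gradient of log p_theta(trajectory)
    Both callables are placeholders for the environment and policy at hand.
    """
    grad_samples = []
    for _ in range(n_episodes):
        tau, ret = sample_trajectory(theta)
        grad_samples.append(score_grad(theta, tau) * ret)
    # Average the per-trajectory gradient estimates and take an ascent step
    return theta + step_size * np.mean(grad_samples, axis=0)
```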
Exploration in Policy Optimization
- Continuous decision process $\Rightarrow$ exploration is difficult
- Policy gradient methods tend to be greedy (e.g., TRPO [6], PGPE [7])
- Exploration is mainly undirected (e.g., entropy bonus [2])
- Lack of theoretical guarantees
- If only this were a Correlated Multi-Armed Bandit...
Policy Optimization as a Correlated MAB
- Arms: parameters $\theta$
- Payoff: expected return $J(\theta)$
- Continuous MAB [3]: we need structure
- Arm correlation [5] through trajectory distributions
- Importance Sampling (IS): see the sketch below
[Figure: two arms $\theta_A, \theta_B \in \Theta$ with trajectory distributions $p_{\theta_A}, p_{\theta_B}$ over $\mathcal{T}$ and payoffs $J(\theta_A), J(\theta_B)$; IS relates the two distributions]
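To make the arm correlation concrete, here is a minimal importance sampling sketch for the parameter-based setting, where the trajectory density ratio reduces to the ratio of hyperpolicy densities over the sampled policy parameters. It assumes Gaussian hyperpolicies with a shared covariance; all names and the data layout are illustrative, not the paper's code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def is_estimate(theta_target, theta_behavioral, cov, sampled_params, returns):
    """Plain importance sampling: estimate J(theta_target) from episodes
    collected with a Gaussian hyperpolicy N(theta_behavioral, cov).

    sampled_params[i] is the policy parameter drawn for episode i,
    returns[i] is the return of the corresponding trajectory.
    """
    target = multivariate_normal(mean=theta_target, cov=cov)
    behavioral = multivariate_normal(mean=theta_behavioral, cov=cov)
    # Importance weights w_i = p_target(x_i) / p_behavioral(x_i)
    weights = target.pdf(sampled_params) / behavioral.pdf(sampled_params)
    return np.mean(weights * np.asarray(returns))
```

Trajectories collected while playing one arm can thus be reused to estimate the payoff of any other arm, which is what correlates the arms.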
OPTIMIST
- A UCB-like index [4]:
$$B_t(\theta) = \underbrace{\widehat{J}_t(\theta)}_{\text{ESTIMATE}} + \underbrace{C \sqrt{\frac{d_2(p_\theta \,\|\, \Phi_t) \log(1/\delta_t)}{t}}}_{\text{EXPLORATION BONUS}}$$
- ESTIMATE: a truncated multiple importance sampling (MIS) estimator [8, 1]
- EXPLORATION BONUS: distributional distance of $p_\theta$ from previous solutions, measured by $d_2(p_\theta \,\|\, \Phi_t)$, where $\Phi_t$ is the mixture of the past trajectory distributions $p_{\theta_1}, \dots, p_{\theta_{t-1}}$
- Select $\theta_t = \arg\max_{\theta \in \Theta} B_t(\theta)$, as sketched below
[Figure: the MIS estimator reuses trajectories from all past distributions $p_{\theta_1}, \dots, p_{\theta_{t-1}}$; the bonus depends on the divergence $d_2$ between $p_\theta$ and their mixture $\Phi_t$]
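The following is a rough sketch of how such an index could be computed, not the paper's implementation. It assumes parameter-based exploration with Gaussian hyperpolicies sharing a fixed full covariance matrix and one sampled parameter per past iteration; the truncation threshold and the divergence surrogate are simple heuristics standing in for the exact quantities derived in the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def optimist_index(theta, past_thetas, sampled_params, returns, cov,
                   delta_t, bonus_scale=1.0):
    """Sketch of a UCB-like index B_t(theta): a truncated balance-heuristic
    MIS estimate of J(theta) plus an exploration bonus that grows with the
    divergence of p_theta from the past sampling distributions.

    past_thetas: list of t past mean parameters (each of shape (d,))
    sampled_params: array of shape (t, d), one draw per past iteration
    returns: length-t array of the corresponding trajectory returns
    cov: full (d, d) covariance matrix shared by all hyperpolicies
    """
    t = len(past_thetas)
    target = multivariate_normal(mean=theta, cov=cov)

    # Balance-heuristic MIS denominator: mixture of all past hyperpolicies
    mixture_pdf = np.mean(
        [multivariate_normal(mean=th, cov=cov).pdf(sampled_params)
         for th in past_thetas], axis=0)
    weights = target.pdf(sampled_params) / mixture_pdf

    # Surrogate for d_2(p_theta || Phi_t): t times the exponentiated
    # 2-Renyi divergence to the closest past hyperpolicy, a crude upper
    # bound on the divergence to the mixture Phi_t.
    inv_cov = np.linalg.inv(cov)
    diffs = np.asarray(past_thetas) - np.asarray(theta)
    quad = np.einsum('ij,jk,ik->i', diffs, inv_cov, diffs)
    d2 = t * np.exp(np.min(quad))

    # Truncated MIS estimate (heuristic truncation threshold)
    trunc = np.sqrt(d2 * t / np.log(1.0 / delta_t))
    estimate = np.mean(np.minimum(weights, trunc) * np.asarray(returns))

    # Exploration bonus C * sqrt(d_2 * log(1/delta_t) / t)
    bonus = bonus_scale * np.sqrt(d2 * np.log(1.0 / delta_t) / t)
    return estimate + bonus
```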
Sublinear Regret
- $\mathrm{Regret}(T) = \sum_{t=0}^{T} \big( J(\theta^\star) - J(\theta_t) \big)$
- Compact, $d$-dimensional parameter space $\Theta$
- Under mild assumptions on the policy class, with high probability: $\mathrm{Regret}(T) = \widetilde{O}\big(\sqrt{dT}\big)$
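For completeness, the regret above can be measured empirically only when the optimal expected return $J(\theta^\star)$ is known, e.g., on synthetic benchmarks; the function below is a trivial helper with hypothetical names.

```python
import numpy as np

def cumulative_regret(j_star, j_selected):
    """Regret(T) = sum over t of (J(theta*) - J(theta_t)), given the
    optimal expected return j_star and the expected returns of the
    parameters selected at iterations t = 0, ..., T."""
    return float(np.sum(j_star - np.asarray(j_selected)))
```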
Empirical Results
River Swim
[Figure: cumulative return versus episodes (0 to 5,000) on the River Swim domain, comparing OPTIMIST and PGPE]
Caveats
- Easy implementation only for parameter-based exploration [7]
- Difficult optimization of the index $\Rightarrow$ discretization (a sketch follows)
- ...
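The discretization caveat can be read as follows: instead of maximizing the index over the continuous space $\Theta$, evaluate it on a finite grid of candidates and pick the best. A minimal sketch, with the grid resolution and the index function as placeholder assumptions:

```python
import numpy as np

def select_next_theta(candidate_grid, index_fn):
    """Approximate arg max of the index over a discretized parameter
    space: evaluate index_fn (e.g., an OPTIMIST-style index) on every
    candidate and return the best one."""
    scores = [index_fn(theta) for theta in candidate_grid]
    return candidate_grid[int(np.argmax(scores))]

# Example: a uniform grid over a 2-dimensional box [-1, 1]^2
# (the resolution and the index function are placeholders).
# xs = np.linspace(-1.0, 1.0, 21)
# grid = [np.array([a, b]) for a in xs for b in xs]
# theta_next = select_next_theta(grid, some_index_fn)
```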
Thank You for Your Attention!
Poster #103
Code: github.com/WolfLo/optimist
Contact: matteo.papini@polimi.it
Web page: t3p.github.io/icml19