Reinforcement Learning in Configurable Continuous Environments
Alberto Maria Metelli, Emanuele Ghelfi, and Marcello Restelli
36th International Conference on Machine Learning, 13th June 2019
Non-Configurable Environments

Markov Decision Process (MDP; Puterman, 2014): $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, \gamma, \mu, p)$, with $S_0 \sim \mu$, $A_t \sim \pi_{\theta}(\cdot \mid S_t)$, $S_{t+1} \sim p(\cdot \mid S_t, A_t)$.

The agent (policy) plays action $A_t$; the environment returns reward $R_{t+1}$ and next state $S_{t+1}$.

Goal: learn the policy parameters $\theta$ under the fixed environment dynamics $p$:
$$\theta^* = \arg\max_{\theta \in \Theta} J(\theta), \qquad J(\theta) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t R_{t+1}\right]$$
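Not from the slides: a minimal sketch of what the objective $J(\theta)$ means operationally, estimated by Monte-Carlo rollouts. The `env.reset()`/`env.step()` interface and the callable stochastic policy are illustrative assumptions, not part of the original material.

```python
import numpy as np

def estimate_return(env, policy, gamma=0.99, n_episodes=50, horizon=200):
    """Monte-Carlo estimate of J(theta) = E[sum_t gamma^t R_{t+1}]
    under a fixed (non-configurable) transition model p."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset()                      # S_0 ~ mu
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                    # A_t ~ pi_theta(.|S_t)
            s, r, done = env.step(a)         # S_{t+1} ~ p(.|S_t, A_t), reward R_{t+1}
            g += discount * r
            discount *= gamma
            if done:
                break
        returns.append(g)
    return float(np.mean(returns))
```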
Configurable Environments

Configurable Markov Decision Process (Conf-MDP; Metelli et al., 2018): $\mathcal{CM} = (\mathcal{S}, \mathcal{A}, r, \gamma, \mu, \mathcal{P}, \Pi)$, with $S_0 \sim \mu$, $A_t \sim \pi_{\theta}(\cdot \mid S_t)$, $S_{t+1} \sim p_{\omega}(\cdot \mid S_t, A_t)$.

The agent (policy) plays action $A_t$; the configurable environment returns reward $R_{t+1}$ and next state $S_{t+1}$.

Goal: learn the policy parameters $\theta$ together with the environment configuration $\omega$:
$$(\theta^*, \omega^*) = \arg\max_{\theta \in \Theta,\, \omega \in \Omega} J(\theta, \omega), \qquad J(\theta, \omega) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t R_{t+1}\right]$$
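Again as an illustrative sketch, not the authors' algorithm: in a Conf-MDP the configuration $\omega$ is an optimization variable on the same footing as $\theta$. The `set_configuration` method and the local random search below are hypothetical and only serve to make the joint objective concrete; the sketch reuses `estimate_return` from the block above. REMPS, introduced later, performs this joint optimization in a principled way.

```python
import numpy as np

def joint_random_search(conf_env, make_policy, theta0, omega0,
                        sigma=0.1, n_iters=100):
    """Naive joint hill-climbing over (theta, omega): perturb both, keep the
    pair that improves the estimated return J(theta, omega)."""
    theta = np.asarray(theta0, dtype=float)
    omega = np.asarray(omega0, dtype=float)
    conf_env.set_configuration(omega)                 # hypothetical API: selects p_omega
    best = estimate_return(conf_env, make_policy(theta))
    for _ in range(n_iters):
        theta_c = theta + sigma * np.random.randn(*theta.shape)
        omega_c = omega + sigma * np.random.randn(*omega.shape)
        conf_env.set_configuration(omega_c)
        j = estimate_return(conf_env, make_policy(theta_c))
        if j > best:                                  # greedy acceptance
            theta, omega, best = theta_c, omega_c, j
    conf_env.set_configuration(omega)                 # restore best configuration
    return theta, omega
```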
State of the Art

Safe Policy Model Iteration (SPMI; Metelli et al., 2018) optimizes a lower bound on the performance improvement.

Limitations: it requires finite state-action spaces and full knowledge of the environment dynamics.

Similar approaches: Keren et al. (2017) and Silva et al. (2018).
Relative Entropy Model Policy Search (REMPS)

Optimization: find a new stationary distribution $d'$ in a trust region centered in $d_{\pi_\theta, p_\omega}$:
$$\max_{d'} \; J_{d'} = \mathbb{E}_{S, A, S' \sim d'}\left[ r(S, A, S') \right] \quad \text{s.t.} \quad D_{\mathrm{KL}}\left(d' \,\|\, d_{\pi_\theta, p_\omega}\right) \le \kappa$$

Projection: find a policy $\pi_{\theta'}$ and a configuration $p_{\omega'}$ inducing a stationary distribution close to $d'$:
$$\min_{\theta' \in \Theta,\, \omega' \in \Omega} \; D_{\mathrm{KL}}\left(d' \,\|\, d_{\pi_{\theta'}, p_{\omega'}}\right)$$

The transition model $p$ can also be an approximated model $\hat{p}$.

[Diagram: the optimization step moves from $d_{\pi_\theta, p_\omega}$ to $d'$ within the KL ball of radius $\kappa$; the projection step maps $d'$ back onto the set of stationary distributions induced by $\Theta \times \Omega$, yielding $(\pi_{\theta'}, p_{\omega'})$.]
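A sketch of one REMPS iteration, under the assumption (stated in the lead-in, not taken from the slides) that the KL-constrained optimization step admits the usual REPS-style exponential-reweighting solution $d' \propto d \cdot \exp(r/\eta)$, with the temperature $\eta$ obtained from a one-dimensional dual. The `fit_policy_and_model` routine, standing in for the weighted maximum-likelihood projection onto $(\theta', \omega')$, is a hypothetical placeholder.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def remps_iteration(samples, kappa, fit_policy_and_model):
    """One REMPS iteration (sketch).

    `samples` is a list of transitions (s, a, s_next, r) collected under the
    current pair (pi_theta, p_omega). The optimization step reweights the
    samples as d' ∝ d * exp(r / eta); the projection step fits (theta', omega')
    by weighted maximum likelihood, approximately minimizing
    KL(d' || d_{pi_theta', p_omega'}).
    """
    r = np.array([t[3] for t in samples], dtype=float)
    r_max = r.max()

    # Dual of the KL-constrained optimization step in the temperature eta > 0;
    # r_max is subtracted inside the exponential for numerical stability.
    def dual(eta):
        return eta * kappa + r_max + eta * np.log(np.mean(np.exp((r - r_max) / eta)))

    eta = minimize_scalar(dual, bounds=(1e-6, 1e3), method="bounded").x

    # Optimization step: weights defining the new stationary distribution d'.
    weights = np.exp((r - r_max) / eta)
    weights /= weights.sum()

    # Projection step: weighted maximum-likelihood fit of policy and
    # configuration (hypothetical user-supplied routine, e.g. gradient steps on
    # sum_i w_i * [log pi_theta'(a_i|s_i) + log p_omega'(s'_i|s_i, a_i)]).
    theta_new, omega_new = fit_policy_and_model(samples, weights)
    return theta_new, omega_new
```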
Experiments

Chain Domain; Cartpole (configure the cart force); TORCS (configure the front-rear wing orientation and the brake repartition).

[Plots: average reward / average return vs. iteration on the three domains, comparing REMPS (for $\kappa \in \{0.01, 0.1, 10\}$) against REPS, G(PO)MDP, and a bot baseline.]
Thank You for Your Attention!

Poster: Pacific Ballroom #37
Code: github.com/albertometelli/remps
Web page: albertometelli.github.io/ICML2019-REMPS
Contact: albertomaria.metelli@polimi.it
References

Keren, S., Pineda, L., Gal, A., Karpas, E., and Zilberstein, S. (2017). Equi-reward utility maximizing design in stochastic environments. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 4353–4360.

Metelli, A. M., Mutti, M., and Restelli, M. (2018). Configurable Markov decision processes. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, pages 3488–3497. PMLR.

Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Silva, R., Melo, F. S., and Veloso, M. (2018). What if the world were different? Gradient-based exploration for new optimal policies. EPiC Series in Computing, 55:229–242.