Proximal Policy Optimization Ruifan Yu (ruifan.yu@uwaterloo.ca) CS 885 June 20
Proximal Policy Optimization (OpenAI)
"PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance."
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/pdf/1707.06347
https://blog.openai.com/openai-baselines-ppo/
Policy Gradient (REINFORCE)
In practice, update on each batch (trajectory); a minimal sketch of one such update follows below.
* Uses the same notation as the paper
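A hedged sketch of one REINFORCE batch update, assuming a PyTorch policy network `policy` that maps states to action logits and an existing `optimizer`; the return-to-go is used in place of the advantage here, which is one common choice rather than the exact setup on the slide.

```python
# Hypothetical sketch of one REINFORCE batch (trajectory) update; not the slide's exact code.
import torch

def reinforce_update(policy, optimizer, states, actions, returns):
    """One gradient step on a batch (trajectory).

    states:  FloatTensor [T, state_dim]
    actions: LongTensor  [T]
    returns: FloatTensor [T]  (e.g. discounted return-to-go, standing in for A(s_t, a_t))
    """
    logits = policy(states)                                    # [T, n_actions]
    log_probs = torch.log_softmax(logits, dim=-1)              # log pi_theta(.|s)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Policy gradient objective: maximize E[log pi_theta(a_t|s_t) * A_t],
    # so minimize its negative.
    loss = -(log_pi_a * returns).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```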
Problem?
• Unstable update
  Step size is very important:
  • If the step size is too large:
    • Large step → bad policy
    • The next batch is generated from the current bad policy → collect bad samples
    • Bad samples → worse policy
    • (Compare to supervised learning: the correct labels and data in the following batches may correct it)
  • If the step size is too small: the learning process is slow
• Data inefficiency
  • On-policy method: for each new policy, we need to generate a completely new trajectory
  • The data is thrown out after just one gradient update
  • As complex neural networks need many updates, this makes the training process very slow
Importance Sampling
Estimate an expectation under one distribution by sampling from another distribution

$$\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1,\; x_i \sim p}^{N} f(x_i)$$

$$\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_{x \sim q}\!\left[ f(x)\, \frac{p(x)}{q(x)} \right] \approx \frac{1}{N} \sum_{i=1,\; x_i \sim q}^{N} f(x_i)\, \frac{p(x_i)}{q(x_i)}$$
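A small numeric illustration of the identity above (hypothetical Gaussians, not from the slides): estimate E_{x~p}[f(x)] for a target p using samples drawn from a different q, reweighted by p(x)/q(x).

```python
# Hypothetical example: importance sampling with two Gaussians (not from the slides).
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                       # any function of interest

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu_p, s_p = 0.0, 1.0                       # target p = N(0, 1)
mu_q, s_q = 0.5, 1.5                       # proposal q = N(0.5, 1.5)

# Direct Monte Carlo estimate: sample from p.
x_p = rng.normal(mu_p, s_p, size=100_000)
direct = f(x_p).mean()

# Importance-sampling estimate: sample from q, reweight by p(x)/q(x).
x_q = rng.normal(mu_q, s_q, size=100_000)
w = gaussian_pdf(x_q, mu_p, s_p) / gaussian_pdf(x_q, mu_q, s_q)
importance = (f(x_q) * w).mean()

print(direct, importance)                  # both approximate E_{x~p}[x^2] = 1
```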
Data Inefficiency → make it efficient
• We still want to evaluate the gradient of the current policy, but avoid sampling from the current policy
• Use previous samples? Like the replay buffer in DQN
• Can we estimate an expectation of one distribution without taking samples from it?
Importance Sampling in Policy Gradient

$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\!\left[ f(x)\, \frac{p(x)}{q(x)} \right]$$

$$\nabla J(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_\theta}\!\left[ \nabla \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) \right] = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\mathrm{old}}}}\!\left[ \frac{\pi_\theta(s_t, a_t)}{\pi_{\theta_{\mathrm{old}}}(s_t, a_t)}\, \nabla \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) \right]$$

$$J(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\mathrm{old}}}}\!\left[ \frac{\pi_\theta(s_t, a_t)}{\pi_{\theta_{\mathrm{old}}}(s_t, a_t)}\, A(s_t, a_t) \right]$$

Surrogate objective function
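A minimal sketch of this surrogate objective (PyTorch assumed, not from the slides): the probability ratio is computed from stored old log-probabilities so gradients flow only through the current policy.

```python
# Hypothetical sketch of the (unclipped) surrogate objective; assumes PyTorch.
import torch

def surrogate_objective(new_log_probs, old_log_probs, advantages):
    """J(theta) ≈ mean[ (pi_theta / pi_theta_old) * A ].

    new_log_probs: log pi_theta(a_t|s_t), requires grad
    old_log_probs: log pi_theta_old(a_t|s_t), from the sampling policy
    advantages:    A(s_t, a_t) estimates
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())  # pi_theta / pi_theta_old
    return (ratio * advantages).mean()
```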
Importance Sampling: Problem?
No free lunch! The two expectations are the same, but we estimate them by sampling, so the variance also matters.

$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\!\left[ f(x)\, \frac{p(x)}{q(x)} \right] \qquad \mathrm{Var}[X] = \mathbb{E}[X^2] - \left( \mathbb{E}[X] \right)^2$$

$$\mathrm{Var}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim p}\!\left[ f(x)^2 \right] - \left( \mathbb{E}_{x \sim p}[f(x)] \right)^2$$

$$\mathrm{Var}_{x \sim q}\!\left[ f(x)\, \frac{p(x)}{q(x)} \right] = \mathbb{E}_{x \sim q}\!\left[ \left( f(x)\, \frac{p(x)}{q(x)} \right)^2 \right] - \left( \mathbb{E}_{x \sim q}\!\left[ f(x)\, \frac{p(x)}{q(x)} \right] \right)^2 = \mathbb{E}_{x \sim p}\!\left[ f(x)^2\, \frac{p(x)}{q(x)} \right] - \left( \mathbb{E}_{x \sim p}[f(x)] \right)^2$$

Price (tradeoff): we may need to sample more data if $\frac{p(x)}{q(x)}$ is far away from 1
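To make the tradeoff concrete, a hypothetical continuation of the earlier Gaussian example: when q is far from p, the weights p(x)/q(x) vary wildly and the estimator's variance grows, so more samples are needed for the same accuracy.

```python
# Hypothetical illustration: the importance-sampling estimator's variance
# grows when the proposal q is far from the target p.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def is_estimate(mu_q, s_q, n=10_000):
    """One importance-sampling estimate of E_{x~N(0,1)}[x^2] using q = N(mu_q, s_q)."""
    x = rng.normal(mu_q, s_q, size=n)
    w = gaussian_pdf(x, 0.0, 1.0) / gaussian_pdf(x, mu_q, s_q)
    return (f(x) * w).mean()

for mu_q, s_q in [(0.0, 1.0), (0.5, 1.5), (2.0, 1.0)]:
    estimates = [is_estimate(mu_q, s_q) for _ in range(200)]
    print(f"q = N({mu_q}, {s_q}): std of estimate = {np.std(estimates):.3f}")
```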
Unstable Update → stable update
• Adaptive learning rate
• Make confident updates: limit the policy update range
• Can we measure the distance between two distributions?
KL Divergence
Measure the distance between two distributions

$$D_{\mathrm{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

KL divergence of two policies

$$D_{\mathrm{KL}}(\pi_1 \| \pi_2)[s] = \sum_{a \in A} \pi_1(a \mid s) \log \frac{\pi_1(a \mid s)}{\pi_2(a \mid s)}$$

* image: Kullback–Leibler divergence (Wikipedia) https://en.wikipedia.org/wiki/Kullback–Leibler_divergence
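A small sketch of the per-state policy KL above for discrete action probabilities (hypothetical helper, numpy assumed):

```python
# Hypothetical helper: KL divergence between two discrete policies at one state.
import numpy as np

def policy_kl(pi1, pi2, eps=1e-12):
    """D_KL(pi1 || pi2)[s] for action-probability vectors pi1, pi2 over the same actions."""
    pi1 = np.asarray(pi1, dtype=float)
    pi2 = np.asarray(pi2, dtype=float)
    return float(np.sum(pi1 * (np.log(pi1 + eps) - np.log(pi2 + eps))))

print(policy_kl([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))  # small positive number
print(policy_kl([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))  # ~0 for identical policies
```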
Trust Region Policy Optimization (TRPO)
Common trick in optimization: Lagrangian dual
TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of β that performs well across different problems, or even within a single problem, where the characteristics change over the course of learning.
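For reference, the constrained problem TRPO solves (the standard formulation from the TRPO/PPO papers), written in the same notation as the surrogate objective above:

$$\max_{\theta} \; \hat{\mathbb{E}}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\, \hat{A}_t \right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta$$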
Proximal Policy Optimization (PPO)
TRPO uses the conjugate gradient method to handle the constraint → Hessian matrix → expensive both in computation and space
Idea: The constraint helps the training process. However, maybe it does not need to be a strict constraint:
• Does it matter if we break the constraint just a few times?
• What if we treat it as a "soft" constraint? Add a proximal term to the objective function?
PPO with Adaptive KL Penalty
Hard to pick a good β value → use an adaptive β (sketch below)
Still need to set a KL divergence target value …
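A minimal sketch of the adaptive penalty-coefficient update described in the PPO paper (the factors 1.5 and 2 are the heuristics the paper reports; the variable names here are assumptions):

```python
# Sketch of PPO's adaptive KL penalty coefficient update (heuristics from the paper).
def update_beta(beta, kl, kl_target):
    """Adjust the penalty coefficient beta after each policy update.

    beta:      current KL penalty coefficient
    kl:        measured KL[pi_old || pi_new], averaged over the batch
    kl_target: the target KL divergence (still has to be chosen by hand)
    """
    if kl < kl_target / 1.5:
        beta = beta / 2.0     # policy moved too little -> weaken the penalty
    elif kl > kl_target * 1.5:
        beta = beta * 2.0     # policy moved too much -> strengthen the penalty
    return beta
```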
PPO with Adaptive KL Penalty * CS294 Fall 2017, Lecture 13 http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf
PPO with Clipped Objective
Fluctuation happens when r changes too quickly → limit r within a range [1 − ε, 1 + ε]?
[Figure: the clipped objective as a function of the ratio r, flat outside the interval [1 − ε, 1 + ε]]
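A minimal sketch of the clipped surrogate objective (PyTorch assumed; `epsilon` is the clipping range, 0.2 in the paper):

```python
# Hypothetical sketch of the PPO clipped surrogate objective; assumes PyTorch.
import torch

def clipped_surrogate(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """L^CLIP = E[ min( r * A, clip(r, 1 - eps, 1 + eps) * A ) ]."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())         # r = pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return torch.min(unclipped, clipped).mean()
```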
PPO with Clipped Objective * CS294 Fall 2017, Lecture 13 http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf
PPO in practice
The full objective combines:
• the surrogate objective function (L^CLIP)
• a squared-error loss for the "critic" (value function)
• an entropy bonus to ensure sufficient exploration (encourage "diversity")
* c1, c2: empirical values; in the paper, c1 = 1, c2 = 0.01
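Putting the pieces together, a hedged sketch of the paper's combined loss, L^{CLIP+VF+S} = Ê[L^CLIP − c1·L^VF + c2·S[π_θ]], reusing the `clipped_surrogate` helper sketched earlier (all other names are assumptions):

```python
# Hypothetical sketch of the full PPO loss; assumes PyTorch and clipped_surrogate above.
import torch
import torch.nn.functional as F

def ppo_loss(new_log_probs, old_log_probs, advantages,
             values, returns, entropy, c1=1.0, c2=0.01, epsilon=0.2):
    """Negative of L^{CLIP+VF+S}, suitable for a gradient-descent optimizer.

    values, returns: critic predictions and empirical returns for the VF loss
    entropy:         per-sample entropy of pi_theta(.|s_t)
    """
    policy_obj = clipped_surrogate(new_log_probs, old_log_probs, advantages, epsilon)
    value_loss = F.mse_loss(values, returns)        # squared-error loss for the critic
    entropy_bonus = entropy.mean()                  # encourages exploration / "diversity"
    # Maximize policy_obj and entropy_bonus, minimize value_loss.
    return -(policy_obj - c1 * value_loss + c2 * entropy_bonus)
```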
Performance Results from continuous control benchmark. Average normalized scores (over 21 runs of the algorithm, on 7 environments)
Performance Results in MuJoCo environments, training for one million timesteps
Related Works
[1] Emergence of Locomotion Behaviours in Rich Environments
    Distributed PPO
    Interesting fact: this paper was published before the PPO paper; DeepMind got the idea from OpenAI's talk at NIPS 2016
[2] An Adaptive Clipping Approach for Proximal Policy Optimization
    PPO-λ: changes the clipping range adaptively
[1] https://arxiv.org/abs/1707.02286
[2] https://arxiv.org/abs/1804.06461
END Thank you