Trust Region Policy Optimization


  1. Trust Region Policy Optimization. John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel @ ICML 2015. Presenter: Shivam Kalra, Shivam.kalra@uwaterloo.ca. CS 885 (Reinforcement Learning), Prof. Pascal Poupart. June 20th, 2018.

  2. The reinforcement learning landscape: value-function methods (Q-Learning), policy-gradient and actor-critic methods, and algorithms such as TRPO, PPO, A3C, and ACKTR. Ref: https://www.youtube.com/watch?v=CKaN5PgkSBc

  3. Policy Gradient
     For i = 1, 2, …
       Collect N trajectories under policy π_θ
       Estimate the advantage function Â
       Compute the policy gradient ĝ
       Update the policy parameters: θ = θ_old + α·ĝ
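
A minimal sketch of this loop in Python. The three callables (collect_trajectories, estimate_advantages, policy_gradient) are hypothetical placeholders supplied by the user, not anything defined in the slides:

```python
def vanilla_policy_gradient(theta, collect_trajectories, estimate_advantages,
                            policy_gradient, alpha=1e-2, iterations=100):
    """Sketch of the vanilla policy gradient loop shown on the slide."""
    for _ in range(iterations):
        trajs = collect_trajectories(theta)            # N rollouts from pi_theta
        advantages = estimate_advantages(trajs)        # A-hat for each (s, a)
        g = policy_gradient(theta, trajs, advantages)  # g-hat
        theta = theta + alpha * g                      # theta = theta_old + alpha * g-hat
    return theta
```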

  4. Problems of Policy Gradient
     Same loop as above. Problem 1: the input data is non-stationary. Because the policy keeps changing, the state and reward distributions that the learner sees also change from iteration to iteration.

  5. Problems of Policy Gradient
     Same loop as above. Problem 2: the advantage estimates  are very noisy early in training, so individual policy updates can be unreliable.

  6. Problems of Policy Gradient
     We need a more carefully crafted policy update: we want improvement, not degradation.
     Idea: update the old policy π_old to a new policy π̃ such that the two stay a "trusted" distance apart. Such a conservative policy update allows improvement instead of degradation.

  7. RL to Optimization
     • Most of ML is optimization; supervised learning minimizes a training loss.
     • RL: what is the policy gradient optimizing? It favors state-action pairs (s, a) that received more advantage Â.
     • Can we write down an optimization problem that lets us make a small update to a policy π based on data sampled from π itself (on-policy data)?
     Ref: https://www.youtube.com/watch?v=xvRrgxcpaHY (6:40)

  8. What loss to optimize?
     • Optimize η(π), the expected discounted return of a policy π:
         η(π) = E_{s_0 ~ ρ_0, a_t ~ π} [ Σ_{t=0}^∞ γ^t r(s_t) ]
     • We collect data with π_old and optimize this objective to obtain a new policy π̃.
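
As a concrete reading of this objective, here is a small sketch (assumed interface: `reward_sequences` is a list of per-trajectory reward lists sampled under π):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r(s_t) for a single trajectory."""
    return float(sum(gamma ** t * r for t, r in enumerate(rewards)))

def estimate_eta(reward_sequences, gamma=0.99):
    """Monte-Carlo estimate of eta(pi): the average discounted return
    over trajectories sampled from pi."""
    return float(np.mean([discounted_return(r, gamma) for r in reward_sequences]))
```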

  9. What loss to optimize?
     • We can express η(π̃), the expected return of the new policy, as the expected return of the old policy plus the old policy's advantage accumulated along trajectories sampled from the new policy [1]:
         η(π̃) = η(π_old) + E_{τ ~ π̃} [ Σ_{t=0}^∞ γ^t A_{π_old}(s_t, a_t) ]
     [1] Kakade, Sham, and John Langford. "Approximately optimal approximate reinforcement learning." ICML, 2002.
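
For completeness, the identity follows from writing the advantage as A_{π_old}(s_t, a_t) = E[ r(s_t) + γ V_{π_old}(s_{t+1}) − V_{π_old}(s_t) ] and telescoping the value terms (a sketch of the Kakade and Langford argument):

```latex
\begin{aligned}
\mathbb{E}_{\tau\sim\tilde\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}A_{\pi_{\mathrm{old}}}(s_t,a_t)\right]
&= \mathbb{E}_{\tau\sim\tilde\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\bigl(r(s_t)+\gamma V_{\pi_{\mathrm{old}}}(s_{t+1})-V_{\pi_{\mathrm{old}}}(s_t)\bigr)\right] \\
&= \mathbb{E}_{\tau\sim\tilde\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_t)\right]
   - \mathbb{E}_{s_0\sim\rho_0}\!\left[V_{\pi_{\mathrm{old}}}(s_0)\right]
   \;=\; \eta(\tilde\pi)-\eta(\pi_{\mathrm{old}}).
\end{aligned}
```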

  10. What loss to optimize?
     • The previous equation can be rewritten as [1]:
         η(π̃) = η(π_old) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_{π_old}(s, a)
       where ρ_π(s) = P(s_0 = s) + γ P(s_1 = s) + γ² P(s_2 = s) + … is the discounted state-visitation frequency under π.
     [1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.

  11. What loss to optimize?
         η(π̃) = η(π_old) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_{π_old}(s, a)
     If the advantage term (the double sum) is ≥ 0, the new expected return is at least the old expected return.

  12. What loss to optimize?
         η(π̃) = η(π_old) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_{π_old}(s, a)
     A nonnegative advantage term means New Expected Return ≥ Old Expected Return: guaranteed improvement when moving from π_old to π̃.

  13. New State Visitation is Difficult
         η(π̃) = η(π_old) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_{π_old}(s, a)
     The state-visitation term ρ_π̃(s) is based on the new policy: "the complex dependency of ρ_π̃(s) on π̃ makes the equation difficult to optimize directly." [1]
     [1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.

  14. New State Visitation is Difficult
         η(π̃)         = η(π_old) + Σ_s ρ_π̃(s)       Σ_a π̃(a|s) A_{π_old}(s, a)
         L_{π_old}(π̃) = η(π_old) + Σ_s ρ_{π_old}(s) Σ_a π̃(a|s) A_{π_old}(s, a)
     Replacing ρ_π̃ with ρ_{π_old} gives L_{π_old}(π̃), a local approximation of η(π̃). [1]
     [1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.

  15. Local approximation of η(π̃)
         L_{π_old}(π̃) = η(π_old) + Σ_s ρ_{π_old}(s) Σ_a π̃(a|s) A_{π_old}(s, a)
     The approximation is accurate within a small step around the current policy (a trust region), where π_θ'(a|s) does not change dramatically; within this region, improving L also improves η, so monotonic improvement is guaranteed. [1]
     [1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
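
From samples gathered under π_old, L can be estimated with a likelihood ratio (this importance-sampled form comes from the TRPO paper rather than these slides; the array names below are assumptions):

```python
import numpy as np

def surrogate_loss(new_action_probs, old_action_probs, advantages):
    """Sample estimate of the local approximation L(theta):
    mean over (s, a) pairs drawn under pi_old of
    [pi_new(a|s) / pi_old(a|s)] * A_old(s, a).
    The constant eta(pi_old) term is dropped: it does not affect the argmax."""
    ratio = new_action_probs / old_action_probs
    return float(np.mean(ratio * advantages))
```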

  16. Local approximation of η(π̃)
     • The following bound holds:
         η(π̃) ≥ L_{π_old}(π̃) − C · D_KL^max(π_old, π̃),   where C = 4εγ / (1 − γ)²  and  ε = max_{s,a} |A_{π_old}(s, a)|
     • Monotonically improving policies can therefore be generated by:
         π̃ = argmax_π [ L_{π_old}(π) − C · D_KL^max(π_old, π) ]

  17. Minorization-Maximization (MM) Algorithm
     The penalized objective L_{π_old}(π) − C · D_KL^max(π_old, π) is a surrogate function that lower-bounds (minorizes) the actual objective η(π); maximizing the surrogate at each iteration cannot decrease η.

  18. Optimization of Parameterized Policies
     • Policies are now parameterized: π_θ(a|s) with parameter vector θ.
     • Accordingly, the surrogate objective becomes:
         argmax_θ [ L(θ) − C · D_KL^max(θ_old, θ) ]

  19. Optimization of Parameterized Policies
         argmax_θ [ L(θ) − C · D_KL^max(θ_old, θ) ]
     In practice the penalty coefficient C leads to very small step sizes. One way to take larger steps is to constrain the KL divergence between the new policy and the old policy instead, i.e., impose a trust-region constraint:
         maximize_θ  L(θ)   subject to   D_KL(θ_old, θ) ≤ δ

  20. Solving the KL-Penalized Problem
     • maximize_θ  L(θ) − C · D_KL^max(θ_old, θ)
     • Use the mean KL divergence instead of the max:
         maximize_θ  L(θ) − C · D̄_KL(θ_old, θ)
     • Make a linear approximation to L and a quadratic approximation to the KL term:
         maximize_θ  g · (θ − θ_old) − (C/2) (θ − θ_old)ᵀ F (θ − θ_old)
       where  g = ∂/∂θ L(θ) |_{θ=θ_old}   and   F = ∂²/∂θ² D̄_KL(θ_old, θ) |_{θ=θ_old}

  21. Solving the KL-Penalized Problem
     • Linear approximation to L, quadratic approximation to the KL term:
         maximize_θ  g · (θ − θ_old) − (C/2) (θ − θ_old)ᵀ F (θ − θ_old)
       where  g = ∂/∂θ L(θ) |_{θ=θ_old}   and   F = ∂²/∂θ² D̄_KL(θ_old, θ) |_{θ=θ_old}
     • Solution: θ − θ_old = (1/C) F⁻¹ g. We don't want to form the full Hessian matrix F.
     • F⁻¹ g can be computed approximately with the conjugate gradient algorithm, without forming F explicitly.
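
One way to avoid forming F is to supply the solver with a function that returns the product F·v directly. The sketch below (not from the slides) uses a finite-difference Hessian-vector product of the KL; kl_grad_fn and the small damping term are assumptions:

```python
import numpy as np

def fisher_vector_product(kl_grad_fn, theta_old, v, eps=1e-5, damping=1e-2):
    """Approximate F @ v, where F is the Hessian of the mean KL at theta_old.
    kl_grad_fn(theta) must return grad_theta D_KL(theta_old, theta).
    The damping term keeps the product well conditioned (a common practical choice)."""
    grad_plus = kl_grad_fn(theta_old + eps * v)
    grad_minus = kl_grad_fn(theta_old - eps * v)
    return (grad_plus - grad_minus) / (2.0 * eps) + damping * v
```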

  22. Conjugate Gradient (CG)
     • The conjugate gradient algorithm approximately solves x = A⁻¹ b without explicitly forming the matrix A; it only needs matrix-vector products A·v.
     • After k iterations, CG has minimized the quadratic  (1/2) xᵀ A x − bᵀ x.
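
A minimal sketch of such a solver; the only thing it needs is a callable Avp(v) returning A·v (for TRPO, the Fisher-vector product above):

```python
import numpy as np

def conjugate_gradient(Avp, b, iters=10, tol=1e-10):
    """Approximately solve A x = b using only matrix-vector products Avp(v) = A @ v."""
    x = np.zeros_like(b)
    r = b.copy()                  # residual b - A x (x starts at zero)
    p = r.copy()                  # search direction
    rs_old = float(r @ r)
    for _ in range(iters):
        Ap = Avp(p)
        alpha = rs_old / float(p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = float(r @ r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```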

  23. TRPO: KL-Constrained
     • Unconstrained problem:  maximize_θ  L(θ) − C · D_KL(θ_old, θ)
     • Constrained problem:    maximize_θ  L(θ)   subject to   D_KL(θ_old, θ) ≤ δ
     • δ is a hyperparameter that remains fixed over the whole learning process.
     • Solve the constrained quadratic problem: compute F⁻¹ g, then rescale the step to get the correct KL:
         maximize_θ  g · (θ − θ_old)   subject to   (1/2) (θ − θ_old)ᵀ F (θ − θ_old) ≤ δ
     • Lagrangian:  ℒ(θ, λ) = g · (θ − θ_old) − (λ/2) [ (θ − θ_old)ᵀ F (θ − θ_old) − δ ]
     • Differentiating with respect to θ gives  θ − θ_old = (1/λ) F⁻¹ g.
     • We want  (1/2) sᵀ F s = δ.
     • Given a candidate step s_unscaled, rescale it to  s = √( 2δ / (s_unscaledᵀ F s_unscaled) ) · s_unscaled.
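
The rescaling on this slide as a small sketch, reusing the hypothetical Avp matrix-vector product from above:

```python
import numpy as np

def rescale_step(s_unscaled, Avp, delta):
    """Scale the candidate step so that (1/2) s^T F s = delta, i.e. the
    quadratic KL model sits exactly on the trust-region boundary."""
    sFs = float(s_unscaled @ Avp(s_unscaled))        # s^T F s
    beta = np.sqrt(2.0 * delta / max(sFs, 1e-12))    # guard against division by zero
    return beta * s_unscaled
```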

  24. TRPO Algorithm
     For i = 1, 2, …
       Collect N trajectories under policy π_θ
       Estimate the advantage function Â
       Compute the policy gradient g
       Use CG to compute F⁻¹ g
       Compute the rescaled step s = α F⁻¹ g (rescaling to the KL bound, followed by a line search)
       Apply the update: θ = θ_old + α F⁻¹ g
     Overall problem solved:  maximize_θ  L(θ)   subject to   D_KL(θ_old, θ) ≤ δ
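
Putting the pieces together, one parameter update could look like the sketch below. It reuses the conjugate_gradient and rescale_step sketches from earlier; surrogate_fn and kl_fn are hypothetical callables returning the sample surrogate and the mean KL, and the backtracking loop stands in for the "rescaling and line search" step on the slide:

```python
def trpo_update(theta_old, g, Avp, surrogate_fn, kl_fn, delta=0.01, backtracks=10):
    """One TRPO update: CG direction, rescale to the KL bound, then a
    backtracking line search that enforces improvement and the constraint."""
    direction = conjugate_gradient(Avp, g)            # approx. F^-1 g
    full_step = rescale_step(direction, Avp, delta)   # so that 1/2 s^T F s = delta
    old_surrogate = surrogate_fn(theta_old)
    for k in range(backtracks):
        theta_new = theta_old + (0.5 ** k) * full_step
        if surrogate_fn(theta_new) > old_surrogate and kl_fn(theta_old, theta_new) <= delta:
            return theta_new
    return theta_old   # no acceptable step found; keep the old parameters
```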

  25. Questions?
