Trust Region Policy Optimization
John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel @ ICML 2015
Presenter: Shivam Kalra (shivam.kalra@uwaterloo.ca)
CS 885 (Reinforcement Learning), Prof. Pascal Poupart
June 20th, 2018
Reinforcement Learning
Value-based methods (action-value function): Q-Learning. Policy-gradient methods: Actor-Critic, TRPO, PPO, A3C, ACKTR.
Ref: https://www.youtube.com/watch?v=CKaN5PgkSBc
Policy Gradient
For i = 1, 2, …
  Collect N trajectories for policy $\pi_\theta$
  Estimate advantage function $\hat{A}$
  Compute policy gradient $g$
  Update policy parameters: $\theta = \theta_{old} + \alpha g$
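As a rough illustration of this loop, here is a minimal sketch of a vanilla policy-gradient update on a toy 3-armed bandit with a softmax policy. The arm rewards, learning rate, and batch size are made-up values for illustration only, not anything from the paper or talk.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])   # hypothetical arm rewards (toy values)
theta = np.zeros(3)                      # softmax logits = policy parameters
alpha, n_iters, batch = 0.1, 200, 64

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for i in range(n_iters):
    pi = softmax(theta)
    # 1. Collect N "trajectories" (single-step episodes on the bandit)
    actions = rng.choice(3, size=batch, p=pi)
    rewards = rng.normal(true_means[actions], 0.1)
    # 2. Estimate the advantage with a mean-reward baseline
    adv = rewards - rewards.mean()
    # 3. Policy gradient g = E[ grad_theta log pi(a) * A ]
    grad_log_pi = np.eye(3)[actions] - pi          # gradient of log-softmax
    g = (grad_log_pi * adv[:, None]).mean(axis=0)
    # 4. Update: theta = theta_old + alpha * g
    theta = theta + alpha * g
```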
Problems of Policy Gradient
For i = 1, 2, …
  Collect N trajectories for policy $\pi_\theta$   ← non-stationary input data: the policy keeps changing, so the state and reward distributions change too
  Estimate advantage function $\hat{A}$
  Compute policy gradient $g$
  Update policy parameters: $\theta = \theta_{old} + \alpha g$
Problems of Policy Gradient
For i = 1, 2, …
  Collect N trajectories for policy $\pi_\theta$
  Estimate advantage function $\hat{A}$   ← the advantage estimates are very noisy early in training
  Compute policy gradient $g$
  Update policy parameters: $\theta = \theta_{old} + \alpha g$
[Figure: a noisy advantage signal ("you're bad") feeding back into the policy]
Problems of Policy Gradient
For i = 1, 2, …
  Collect N trajectories for policy $\pi_\theta$
  Estimate advantage function $\hat{A}$
  Compute policy gradient $g$
  Update policy parameters: $\theta = \theta_{old} + \alpha g$   ← we need a more carefully crafted policy update
We want improvement, not degradation.
Idea: update the old policy $\pi_{old}$ to a new policy $\tilde{\pi}$ such that the two stay a "trusted" distance apart. Such a conservative policy update yields improvement instead of degradation.
RL to Optimization
• Most of ML is optimization: supervised learning reduces a training loss.
• RL: what is the policy gradient optimizing?
  • It favours state-action pairs $(s, a)$ that gave a higher advantage $\hat{A}$.
• Can we write down an optimization problem that lets us make a small update to a policy $\pi$ based on data sampled from $\pi$ (on-policy data)?
Ref: https://www.youtube.com/watch?v=xvRrgxcpaHY (6:40)
What loss to optimize?
• Optimize $\eta(\pi)$, the expected discounted return of a policy $\pi$:
$$\eta(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\; a_t \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right]$$
• We collect data with $\pi_{old}$ and optimize the objective to get a new policy $\tilde{\pi}$.
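A small sketch of what this objective means in practice: a Monte-Carlo estimate of $\eta(\pi)$ from trajectories collected under $\pi$. The function names and the example rewards are placeholders, not from the paper.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t for a single trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

def estimate_eta(reward_trajectories, gamma=0.99):
    """Monte-Carlo estimate of eta(pi) from trajectories sampled under pi."""
    return float(np.mean([discounted_return(r, gamma) for r in reward_trajectories]))

# Example: two short trajectories with made-up rewards.
print(estimate_eta([[1.0, 0.0, 1.0], [0.0, 1.0]], gamma=0.9))
```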
What loss to optimize?
• We can express $\eta(\tilde{\pi})$, the expected return of the new policy, in terms of the advantage over the old policy [1]:
$$\eta(\tilde{\pi}) = \eta(\pi_{old}) + \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{old}}(s_t, a_t)\right]$$
(expected return of the old policy, plus the discounted advantages accumulated along trajectories sampled from the new policy)
[1] Kakade, Sham, and John Langford. "Approximately optimal approximate reinforcement learning." ICML, 2002.
What loss to optimize?
• The previous equation can be rewritten as a sum over states [1]:
$$\eta(\tilde{\pi}) = \eta(\pi_{old}) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_{\pi_{old}}(s, a)$$
where $\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$ is the discounted state-visitation frequency.
[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
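To make the visitation frequency concrete, here is a sketch for a small tabular MDP, using the identity $\rho = (I - \gamma P_\pi^{\top})^{-1} \rho_0$ where $P_\pi$ is the state-transition matrix under $\pi$. The transition probabilities and initial distribution below are invented toy values.

```python
import numpy as np

gamma = 0.9
rho0 = np.array([1.0, 0.0, 0.0])            # initial state distribution P(s_0 = s)
P_pi = np.array([[0.5, 0.5, 0.0],           # P_pi[s, s'] = P(s_{t+1}=s' | s_t=s) under pi
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])

# rho = rho0 + gamma * P^T rho0 + gamma^2 (P^T)^2 rho0 + ... = (I - gamma P^T)^{-1} rho0
rho = np.linalg.solve(np.eye(3) - gamma * P_pi.T, rho0)
print(rho)                                   # unnormalized discounted visitation frequencies
```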
What loss to optimize?
$$\underbrace{\eta(\tilde{\pi})}_{\text{new expected return}} = \underbrace{\eta(\pi_{old})}_{\text{old expected return}} + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_{\pi_{old}}(s, a)$$
If the advantage term on the right is $\geq 0$, the new expected return is at least the old one.
What loss to optimize?
$$\eta(\tilde{\pi}) = \eta(\pi_{old}) + \underbrace{\sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_{\pi_{old}}(s, a)}_{>\,0} \;>\; \eta(\pi_{old})$$
A positive advantage term gives guaranteed improvement from $\pi_{old} \to \tilde{\pi}$.
New State Visitation is Difficult
$$\eta(\tilde{\pi}) = \eta(\pi_{old}) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_{\pi_{old}}(s, a)$$
The state visitation $\rho_{\tilde{\pi}}(s)$ depends on the new policy $\tilde{\pi}$: "the complex dependency of $\rho_{\tilde{\pi}}(s)$ on $\tilde{\pi}$ makes the equation difficult to optimize directly." [1]
[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
New State Visitation is Difficult
$$\eta(\tilde{\pi}) = \eta(\pi_{old}) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_{\pi_{old}}(s, a)$$
Replace the new policy's visitation frequency with the old policy's:
$$L_{\pi_{old}}(\tilde{\pi}) = \eta(\pi_{old}) + \sum_s \rho_{\pi_{old}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_{\pi_{old}}(s, a)$$
This is the local approximation of $\eta(\tilde{\pi})$. [1]
[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
Local approximation of $\eta(\tilde{\pi})$
$$L_{\pi_{old}}(\tilde{\pi}) = \eta(\pi_{old}) + \sum_s \rho_{\pi_{old}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_{\pi_{old}}(s, a)$$
The approximation is accurate as long as the new policy stays within a step size $\delta$ of the old one (a trust region), i.e., as long as $\pi'(a \mid s)$ does not change dramatically; within the trust region, monotonic improvement is guaranteed.
[Figure: trust region of radius $\delta$ around the current policy]
[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
Local approximation of $\eta(\tilde{\pi})$
• The following lower bound holds:
$$\eta(\tilde{\pi}) \;\geq\; L_\pi(\tilde{\pi}) - C\, D_{KL}^{\max}(\pi, \tilde{\pi}), \qquad \text{where } C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}$$
• Monotonically improving policies can therefore be generated by:
$$\tilde{\pi} = \arg\max_{\tilde{\pi}} \left[ L_\pi(\tilde{\pi}) - C\, D_{KL}^{\max}(\pi, \tilde{\pi}) \right]$$
Minorization-Maximization (MM) algorithm
[Figure: the surrogate $L_\pi(\tilde{\pi}) - C\, D_{KL}^{\max}(\pi, \tilde{\pi})$ minorizes the actual objective $\eta(\tilde{\pi})$; maximizing the surrogate at each iteration pushes the actual objective up.]
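Spelling out why maximizing the surrogate improves the true objective (a short derivation following the paper, with $M_i$ denoting the surrogate at iteration $i$):
$$M_i(\pi) = L_{\pi_i}(\pi) - C\, D_{KL}^{\max}(\pi_i, \pi)$$
$$\eta(\pi_{i+1}) \geq M_i(\pi_{i+1}) \;\; \text{(the bound above)}, \qquad \eta(\pi_i) = M_i(\pi_i) \;\; \text{(the KL term vanishes at } \pi = \pi_i\text{)}$$
$$\Rightarrow\; \eta(\pi_{i+1}) - \eta(\pi_i) \;\geq\; M_i(\pi_{i+1}) - M_i(\pi_i) \;\geq\; 0 \quad \text{when } \pi_{i+1} = \arg\max_\pi M_i(\pi)$$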
Optimization of Parameterized Policies
• Now the policies are parameterized: $\pi_\theta(a \mid s)$ with parameter vector $\theta$.
• Accordingly, the surrogate objective becomes
$$\arg\max_\theta \left[ L(\theta) - C\, D_{KL}^{\max}(\theta_{old}, \theta) \right]$$
Optimization of Parameterized Policies
$$\arg\max_\theta \left[ L(\theta) - C\, D_{KL}^{\max}(\theta_{old}, \theta) \right]$$
In practice the penalty coefficient $C$ results in very small step sizes. One way to take larger steps is to constrain the KL divergence between the new policy and the old policy, i.e., use a trust-region constraint:
$$\max_\theta\; L(\theta) \quad \text{subject to} \quad D_{KL}(\theta_{old}, \theta) \leq \delta$$
Solving the KL-Penalized Problem
• $\max_\theta\; L(\theta) - C \cdot D_{KL}^{\max}(\theta_{old}, \theta)$
• Use the mean KL divergence instead of the max, i.e., $\max_\theta\; L(\theta) - C \cdot \overline{D}_{KL}(\theta_{old}, \theta)$
• Make a linear approximation to $L$ and a quadratic approximation to the KL term:
$$\max_\theta\; g \cdot (\theta - \theta_{old}) - \frac{C}{2}\, (\theta - \theta_{old})^{\top} F\, (\theta - \theta_{old})$$
where $g = \dfrac{\partial}{\partial \theta} L(\theta) \Big|_{\theta = \theta_{old}}$ and $F = \dfrac{\partial^2}{\partial \theta^2} \overline{D}_{KL}(\theta_{old}, \theta) \Big|_{\theta = \theta_{old}}$
Solving the KL-Penalized Problem
• Linear approximation to $L$, quadratic approximation to the KL term:
$$\max_\theta\; g \cdot (\theta - \theta_{old}) - \frac{C}{2}\, (\theta - \theta_{old})^{\top} F\, (\theta - \theta_{old})$$
where $g = \dfrac{\partial}{\partial \theta} L(\theta) \Big|_{\theta = \theta_{old}}$ and $F = \dfrac{\partial^2}{\partial \theta^2} \overline{D}_{KL}(\theta_{old}, \theta) \Big|_{\theta = \theta_{old}}$
• Solution: $\theta - \theta_{old} = \frac{1}{C} F^{-1} g$. But we don't want to form the full Hessian matrix $F$.
• We can compute $F^{-1} g$ approximately using the conjugate gradient algorithm without forming $F$ explicitly.
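A small sketch of what "without forming $F$ explicitly" looks like: solve $F^{-1} g$ given only a Fisher-vector product routine. Here `fisher_vector_product` just multiplies by a toy diagonal matrix for illustration; in TRPO it would be computed matrix-free from the KL (e.g., via automatic differentiation). The gradient values are made up.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

dim = 4
g = np.array([0.5, -0.2, 0.1, 0.3])        # toy policy gradient
F_dense = np.diag([2.0, 1.0, 0.5, 4.0])    # stand-in for the true Fisher / KL Hessian

def fisher_vector_product(v):
    # In TRPO this would be a matrix-free product F @ v; here we just use the toy matrix.
    return F_dense @ v

F_op = LinearOperator((dim, dim), matvec=fisher_vector_product)
step_dir, info = cg(F_op, g, maxiter=10)   # approximately F^{-1} g
print(step_dir)
```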
Conjugate Gradient (CG)
• The conjugate gradient algorithm approximately solves $x = A^{-1} b$ without explicitly forming the matrix $A$; it only needs matrix-vector products $A v$.
• After $k$ iterations, CG has minimized $\frac{1}{2} x^{\top} A x - b^{\top} x$ over the subspace spanned by its first $k$ search directions.
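For reference, a minimal sketch of the CG iteration itself, written to accept only a matrix-vector product callable (the function name and toy check matrix are illustrative):

```python
import numpy as np

def conjugate_gradient(Av, b, n_iters=10, tol=1e-10):
    """Approximately solve A x = b given only a matrix-vector product Av(v)."""
    x = np.zeros_like(b)
    r = b.copy()                 # residual b - A x  (x = 0 initially)
    p = r.copy()                 # search direction
    rs_old = r @ r
    for _ in range(n_iters):
        Ap = Av(p)
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Tiny check against a direct solve (toy symmetric positive-definite matrix).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(lambda v: A @ v, b), np.linalg.solve(A, b))
```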
TRPO: KL-Constrained
• Unconstrained (penalized) problem: $\max_\theta\; L(\theta) - C \cdot \overline{D}_{KL}(\theta_{old}, \theta)$
• Constrained problem: $\max_\theta\; L(\theta)$ subject to $\overline{D}_{KL}(\theta_{old}, \theta) \leq \delta$
• $\delta$ is a hyper-parameter and remains fixed over the whole learning process.
• Solve the constrained quadratic problem: compute $F^{-1} g$ and then rescale the step to get the correct KL:
  • $\max_\theta\; g^{\top} (\theta - \theta_{old})$ subject to $\frac{1}{2} (\theta - \theta_{old})^{\top} F\, (\theta - \theta_{old}) \leq \delta$
  • Lagrangian: $\mathcal{L}(\theta, \lambda) = g^{\top} (\theta - \theta_{old}) - \frac{\lambda}{2} \left[ (\theta - \theta_{old})^{\top} F\, (\theta - \theta_{old}) - \delta \right]$
  • Differentiating w.r.t. $\theta$ gives $\theta - \theta_{old} = \frac{1}{\lambda} F^{-1} g$
  • We want $\frac{1}{2} s^{\top} F s = \delta$
  • Given a candidate step $s_{unscaled}$, rescale to $s = \sqrt{\dfrac{2\delta}{s_{unscaled}^{\top} F\, s_{unscaled}}}\; s_{unscaled}$
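A quick numerical sketch of this rescaling step (toy Fisher matrix and gradient; the direct solve stands in for the CG solve):

```python
import numpy as np

delta = 0.01                                    # trust-region size (hyper-parameter)
F = np.diag([2.0, 1.0, 0.5, 4.0])               # toy Fisher matrix
g = np.array([0.5, -0.2, 0.1, 0.3])             # toy policy gradient

s_unscaled = np.linalg.solve(F, g)              # stands in for the CG solve of F^{-1} g
sFs = s_unscaled @ (F @ s_unscaled)             # s^T F s
s = np.sqrt(2 * delta / sFs) * s_unscaled       # now (1/2) s^T F s = delta exactly
print(0.5 * s @ (F @ s))                        # ≈ delta
```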
TRPO Algorithm
For i = 1, 2, …
  Collect N trajectories for policy $\pi_\theta$
  Estimate advantage function $\hat{A}$
  Compute policy gradient $g$
  Use CG to compute $F^{-1} g$
  Compute the rescaled step $s = \alpha F^{-1} g$ (with rescaling and line search)
  Apply update: $\theta = \theta_{old} + \alpha F^{-1} g$
Each iteration approximately solves: $\max_\theta\; L(\theta)$ subject to $\overline{D}_{KL}(\theta_{old}, \theta) \leq \delta$
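Putting the pieces together, a condensed sketch of one TRPO parameter update, assuming we already have the policy gradient `g`, a Fisher-vector product `fvp(v)`, and a surrogate-loss evaluator `surrogate(theta)`. These names and the simple backtracking line search are illustrative placeholders (the actual algorithm also re-checks the KL constraint during the line search), not the authors' implementation.

```python
import numpy as np

def trpo_step(theta_old, g, fvp, surrogate, delta=0.01, cg_iters=10, backtracks=10):
    """One TRPO update: natural-gradient direction via CG, KL rescaling, line search."""
    # 1. Conjugate gradient for the step direction F^{-1} g (never forms F).
    x = np.zeros_like(g)
    r, p, rs_old = g.copy(), g.copy(), g @ g
    for _ in range(cg_iters):
        Ap = fvp(p)
        a = rs_old / (p @ Ap)
        x, r = x + a * p, r - a * Ap
        rs_new = r @ r
        if rs_new < 1e-10:
            break
        p, rs_old = r + (rs_new / rs_old) * p, rs_new
    # 2. Rescale so that (1/2) s^T F s = delta.
    full_step = np.sqrt(2 * delta / (x @ fvp(x))) * x
    # 3. Backtracking line search: shrink the step until the surrogate improves.
    old_val = surrogate(theta_old)
    for k in range(backtracks):
        theta_new = theta_old + (0.5 ** k) * full_step
        if surrogate(theta_new) > old_val:
            return theta_new
    return theta_old                            # fall back if no improvement found
```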
Questions?