
Natural Policy Gradients (cont.), Katerina Fragkiadaki (slide transcript)



  1. Carnegie Mellon School of Computer Science, Deep Reinforcement Learning and Control. Natural Policy Gradients (cont.). Katerina Fragkiadaki.

  2. Revision

  3. Policy Gradients. For the current policy π_θ:
     1. Collect trajectories under π_θ
     2. Estimate advantages Â
     3. Compute the policy gradient ĝ
     4. Update the policy parameters: $\theta_{new} = \theta + \epsilon \cdot \hat{g}$
     5. GOTO 1
     How do we estimate this gradient?
     [Figure: a Gaussian policy with mean μ_θ(s) and standard deviation σ_θ(s), before the update (θ_old) and after it (θ_new).]

  4. Policy Gradients (same loop as slide 3). The second question: how do we choose the stepsize ε in $\theta_{new} = \theta + \epsilon \cdot \hat{g}$?

  5. Policy Gradients (same loop): why the stepsize matters.
     • Step too big: a bad policy means the next batch of data is collected under that bad policy, and we may not be able to recover. (In supervised learning, the data does not depend on the network weights.)
     • Step too small: we make inefficient use of experience. (In supervised learning, data can be trivially re-used.)
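To make steps 1-5 concrete, here is a minimal sketch of the loop as a plain REINFORCE-style policy gradient on gymnasium's CartPole-v1 with PyTorch. The network size, number of rollouts, stepsize, and the reward-to-go-minus-baseline advantage estimate are illustrative assumptions, not choices from the lecture.

```python
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
step_size = 1e-2  # the epsilon in theta_new = theta + eps * g_hat

for iteration in range(50):
    obs_buf, act_buf, ret_buf = [], [], []
    # 1. Collect trajectories under the current policy pi_theta
    for _ in range(10):
        obs, _ = env.reset()
        rewards, done = [], False
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = torch.distributions.Categorical(logits=logits).sample().item()
            obs_buf.append(obs)
            act_buf.append(action)
            obs, reward, terminated, truncated, _ = env.step(action)
            rewards.append(reward)
            done = terminated or truncated
        # reward-to-go for every timestep of this trajectory
        ret_buf.extend(sum(rewards[t:]) for t in range(len(rewards)))
    # 2. Crude advantage estimate: reward-to-go minus its mean (a constant baseline)
    returns = torch.as_tensor(ret_buf, dtype=torch.float32)
    adv = returns - returns.mean()
    # 3. Compute the policy gradient of U_PG = E[ log pi_theta(a|s) * A_hat ]
    obs_t = torch.as_tensor(np.array(obs_buf), dtype=torch.float32)
    act_t = torch.as_tensor(act_buf)
    logp = torch.distributions.Categorical(logits=policy(obs_t)).log_prob(act_t)
    loss = -(logp * adv).mean()
    policy.zero_grad()
    loss.backward()
    # 4. Update the parameters: theta_new = theta + eps * g_hat (plain gradient step)
    with torch.no_grad():
        for p in policy.parameters():
            p -= step_size * p.grad
    # 5. GOTO 1 (next iteration of the outer loop)
```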

  6. What is the underlying optimization problem?
     We started here: $\max_\theta U(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}[R(\tau)] = \sum_\tau P(\tau;\theta)\, R(\tau)$
     Policy gradients:
     $\hat{g} \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, \hat{A}(s_t^{(i)}, a_t^{(i)}), \quad \tau_i \sim \pi_\theta$
     $g = \mathbb{E}_t\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}(s_t, a_t) \right]$
     This results from differentiating the following objective function:
     $U^{PG}(\theta) = \mathbb{E}_t\!\left[ \log \pi_\theta(a_t \mid s_t)\, \hat{A}(s_t, a_t) \right]$
     But $\max_\theta U^{PG}(\theta)$ is not the right objective: we cannot optimize too far (the advantage estimates become invalid), and this constraint shows up nowhere in the optimization.
     Compare this to supervised learning with expert actions $\tilde{a} \sim \pi^*$ and a maximum likelihood objective (+ regularization):
     $U^{SL}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \log \pi_\theta(\tilde{a}_t^{(i)} \mid s_t^{(i)}), \quad \tau_i \sim \pi^*$
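To make the contrast concrete, here is a small sketch assuming PyTorch tensors of per-timestep log-probabilities and advantage estimates (the function names are illustrative): the PG surrogate weights each log-probability by an advantage that is only valid near the data-collecting policy, while the supervised objective weights expert log-probabilities uniformly.

```python
import torch

def pg_surrogate_loss(logp_actions: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # U_PG(theta) = E_t[ log pi_theta(a_t|s_t) * A_hat(s_t, a_t) ]
    # Negated so that a gradient-descent optimizer maximizes U_PG.
    return -(logp_actions * advantages).mean()

def supervised_mle_loss(logp_expert_actions: torch.Tensor) -> torch.Tensor:
    # U_SL(theta) = (1/N) sum_t log pi_theta(a_tilde_t | s_t),  a_tilde ~ pi*
    return -logp_expert_actions.mean()
```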

  7. Hard to choose stepsizes (same loop as slide 3).
     Consider a family of policies with parametrization:
     $\pi_\theta(a) = \begin{cases} \sigma(\theta) & a = 1 \\ 1 - \sigma(\theta) & a = 2 \end{cases}$
     The same parameter step, e.g. Δθ = −2, changes the policy distribution more or less dramatically depending on where in parameter space we are.
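A quick numeric check of this, assuming σ is the logistic sigmoid (the two θ values below are arbitrary illustration points):

```python
import math

def policy(theta):
    # pi_theta(a=1) = sigma(theta), pi_theta(a=2) = 1 - sigma(theta)
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return p1, 1.0 - p1

for theta in (0.0, 6.0):
    before, after = policy(theta), policy(theta - 2.0)  # same step, Delta theta = -2
    print(f"theta = {theta}: pi(a=1) moves from {before[0]:.3f} to {after[0]:.3f}")
# theta = 0.0: pi(a=1) moves from 0.500 to 0.119  (a big change in the distribution)
# theta = 6.0: pi(a=1) moves from 0.998 to 0.982  (almost no change)
```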

  8. Notation. We will use the following interchangeably to denote the parameters and the corresponding policies before and after an update: θ_old → θ_new and π_old → π_new, or equivalently θ → θ′ and π → π′.

  9. Gradient Descent in Distribution Space.
     The stepsize in ordinary gradient descent results from solving the following optimization problem, e.g. using line search:
     $d^* = \arg\max_{\|d\| \le \epsilon} U(\theta + d)$, then SGD: $\theta_{new} = \theta_{old} + d^*$
     This is a Euclidean distance constraint in parameter space: it is hard to predict its effect on the parameterized distribution, and therefore hard to pick the threshold ε.
     Natural gradient descent: the step in parameter space is instead determined by the KL divergence between the distributions before and after the update:
     $d^* = \arg\max_{d} U(\theta + d) \quad \text{s.t.} \quad \mathrm{KL}(\pi_\theta \,\|\, \pi_{\theta + d}) \le \epsilon$
     This is a distance constraint in distribution space: it is easier to pick the threshold, and the "don't optimize too much" constraint is now explicit.

  10. Solving the KL Constrained Problem.
     Objective: $U(\theta) = \mathbb{E}_t\!\left[ \log \pi_\theta(a_t \mid s_t)\, \hat{A}(s_t, a_t) \right]$
     Unconstrained penalized objective:
     $d^* = \arg\max_d \; U(\theta + d) - \lambda \left( \mathrm{D}_{KL}[\pi_\theta \,\|\, \pi_{\theta + d}] - \epsilon \right)$
     Let's solve it: take a first-order Taylor expansion of the loss and a second-order expansion of the KL:
     $d^* \approx \arg\max_d \; U(\theta_{old}) + \nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d - \tfrac{\lambda}{2}\, d^\top \nabla^2_\theta \mathrm{D}_{KL}[\pi_{\theta_{old}} \,\|\, \pi_\theta]|_{\theta=\theta_{old}}\, d + \lambda\epsilon$
     Q: How will you compute this?

  11. KL Taylor expansion (second order, around θ_old, with d = θ − θ_old):
     $\mathrm{D}_{KL}(p_{\theta_{old}} \,\|\, p_\theta) \approx \mathrm{D}_{KL}(p_{\theta_{old}} \,\|\, p_{\theta_{old}}) + d^\top \nabla_\theta \mathrm{D}_{KL}(p_{\theta_{old}} \,\|\, p_\theta)|_{\theta=\theta_{old}} + \tfrac{1}{2}\, d^\top \nabla^2_\theta \mathrm{D}_{KL}(p_{\theta_{old}} \,\|\, p_\theta)|_{\theta=\theta_{old}}\, d$
     The zeroth-order term is zero (the KL of a distribution with itself), and the first-order term vanishes because θ_old minimizes the KL, so only the quadratic term survives.

  12. KL Taylor expansion (continued):
     $\mathrm{D}_{KL}(p_{\theta_{old}} \,\|\, p_\theta) \approx \tfrac{1}{2}\, d^\top \nabla^2_\theta \mathrm{D}_{KL}(p_{\theta_{old}} \,\|\, p_\theta)|_{\theta=\theta_{old}}\, d = \tfrac{1}{2}\, d^\top F(\theta_{old})\, d = \tfrac{1}{2} (\theta - \theta_{old})^\top F(\theta_{old}) (\theta - \theta_{old})$
     Fisher information matrix: $F(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top \right]$, with $F(\theta_{old}) = \nabla^2_\theta \mathrm{D}_{KL}(p_{\theta_{old}} \,\|\, p_\theta)|_{\theta=\theta_{old}}$
     Since the KL divergence is roughly analogous to a distance measure between distributions, the Fisher information serves as a local distance metric between distributions: it measures how much the distribution changes if you move the parameters a little bit in a given direction.
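As a sanity check of the expectation form of F, here is a minimal Monte-Carlo sketch for a toy categorical policy (the 3-action setup, logits, and sample count are illustrative assumptions):

```python
import torch

# Toy categorical policy over 3 actions, parameterized by its logits theta.
theta = torch.tensor([0.5, -0.2, 0.1], requires_grad=True)

def log_prob(theta, a):
    return torch.log_softmax(theta, dim=0)[a]

def empirical_fisher(theta, n_samples=5000):
    # F(theta) = E_{x ~ p_theta}[ grad log p_theta(x) grad log p_theta(x)^T ],
    # estimated with actions sampled from the current policy itself.
    dist = torch.distributions.Categorical(logits=theta.detach())
    F = torch.zeros(theta.numel(), theta.numel())
    for a in dist.sample((n_samples,)):
        (g,) = torch.autograd.grad(log_prob(theta, a), theta)
        F += torch.outer(g, g)
    return F / n_samples

print(empirical_fisher(theta))
```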

  13. Solving the KL Constrained Problem (continued).
     Unconstrained penalized objective: $d^* = \arg\max_d \; U(\theta + d) - \lambda \left( \mathrm{D}_{KL}[\pi_\theta \,\|\, \pi_{\theta + d}] - \epsilon \right)$
     First-order Taylor expansion of the loss and second-order expansion of the KL:
     $\approx \arg\max_d \; U(\theta_{old}) + \nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d - \tfrac{\lambda}{2}\, d^\top \nabla^2_\theta \mathrm{D}_{KL}[\pi_{\theta_{old}} \,\|\, \pi_\theta]|_{\theta=\theta_{old}}\, d + \lambda\epsilon$
     Substituting the Fisher information matrix (and dropping the terms that do not depend on d):
     $= \arg\max_d \; \nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d - \tfrac{\lambda}{2}\, d^\top F(\theta_{old})\, d$
     $= \arg\min_d \; -\nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d + \tfrac{\lambda}{2}\, d^\top F(\theta_{old})\, d$

  14. Natural Gradient Descent.
     Setting the gradient with respect to d to zero:
     $0 = \frac{\partial}{\partial d}\!\left( -\nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d + \tfrac{\lambda}{2}\, d^\top F(\theta_{old})\, d \right) = -\nabla_\theta U(\theta)|_{\theta=\theta_{old}} + \lambda\, F(\theta_{old})\, d$
     $\Rightarrow \; d = \tfrac{1}{\lambda}\, F^{-1}(\theta_{old})\, \nabla_\theta U(\theta)|_{\theta=\theta_{old}}$
     The constant 1/λ is absorbed into the stepsize, which gives the natural gradient and the update:
     $g_N = F^{-1}(\theta_{old})\, \nabla_\theta U(\theta)|_{\theta=\theta_{old}}, \qquad \theta_{new} = \theta_{old} + \alpha \cdot g_N$
     The stepsize α along this direction is derived on the next slide.

  15. Stepsize along the Natural Gradient direction.
     The natural gradient update: $\theta_{new} = \theta_{old} + \alpha \cdot g_N$ with $g_N = F^{-1}(\theta_{old})\, \nabla_\theta U(\theta)|_{\theta=\theta_{old}}$
     Let's solve for the stepsize α along the natural gradient direction. Using the quadratic approximation
     $\mathrm{D}_{KL}(\pi_{\theta_{old}} \,\|\, \pi_\theta) \approx \tfrac{1}{2} (\theta - \theta_{old})^\top F(\theta_{old}) (\theta - \theta_{old}) = \tfrac{1}{2} (\alpha g_N)^\top F(\theta_{old}) (\alpha g_N)$
     and requiring the KL between the old and new policies to equal ε:
     $\tfrac{1}{2} (\alpha g_N)^\top F(\theta_{old}) (\alpha g_N) = \epsilon \;\Rightarrow\; \alpha = \sqrt{\frac{2\epsilon}{g_N^\top F(\theta_{old})\, g_N}}$
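Putting the last two slides together, a tiny numerical sketch (the 2-parameter gradient, Fisher matrix, and ε below are made-up illustration values; in practice both g and F are estimated from samples):

```python
import numpy as np

g = np.array([1.0, 0.5])                     # gradient estimate of U at theta_old
F = np.array([[2.0, 0.3],
              [0.3, 0.5]])                   # Fisher matrix estimate at theta_old
eps = 0.01                                   # desired KL radius

g_N = np.linalg.solve(F, g)                  # natural gradient: g_N = F^{-1} g
alpha = np.sqrt(2 * eps / (g_N @ F @ g_N))   # stepsize so the quadratic KL approx equals eps
theta_old = np.zeros(2)
theta_new = theta_old + alpha * g_N
# Note g_N^T F g_N = g^T F^{-1} g, so alpha is often written directly in terms of g.
```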

  16. Natural Gradient Descent. [Algorithm box: the natural policy gradient algorithm with KL radius ε.] Both the policy gradient estimate and the Fisher matrix estimate use samples from the current policy π_k = π(θ_k).

  17. Natural Gradient Descent. [Same algorithm box.] The Fisher matrix, and especially its inverse, is very expensive to compute for a large number of parameters!
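In practice one avoids forming or inverting F explicitly: the linear system F x = g can be solved with the conjugate gradient method using only Fisher-vector products, which is what TRPO-style implementations do. A minimal sketch; the explicit toy F below only stands in for an autodiff Fisher-vector product.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve F x = g using only matrix-vector products fvp(v) = F v,
    so F never has to be formed or inverted explicitly."""
    x = np.zeros_like(g)
    r = g.copy()            # residual g - F x, with x = 0 initially
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x = x + alpha * p
        r = r - alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy check with an explicit F; in a real policy, fvp(v) would be an
# autodiff Hessian-vector product of the average KL (never materializing F).
F = np.array([[2.0, 0.3],
              [0.3, 0.5]])
g = np.array([1.0, 0.5])
x = conjugate_gradient(lambda v: F @ v, g)
print(np.allclose(F @ x, g))  # True
```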

  18. What is the underlying optimization problem? (Recap of slide 6.)
     We started from $\max_\theta U(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}[R(\tau)]$, whose policy gradient $g = \mathbb{E}_t[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}(s_t, a_t)]$ results from differentiating $U^{PG}(\theta) = \mathbb{E}_t[\log \pi_\theta(a_t \mid s_t)\, \hat{A}(s_t, a_t)]$.
     Adding the "don't optimize too much" constraint as a penalty:
     $\max_d \; \mathbb{E}_t\!\left[ \log \pi_{\theta + d}(a_t \mid s_t)\, \hat{A}(s_t, a_t) \right] - \lambda\, \mathrm{D}_{KL}[\pi_\theta \,\|\, \pi_{\theta + d}]$
     We used a first-order approximation for the first term, but what if d is large?

  19. Alternative derivation (importance sampling).
     $U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_\tau \pi_\theta(\tau)\, R(\tau) = \sum_\tau \pi_{\theta_{old}}(\tau)\, \frac{\pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)}\, R(\tau) = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}}\!\left[ \frac{\pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)}\, R(\tau) \right]$
     which motivates the surrogate objective
     $\max_\theta \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}(s_t, a_t) \right] - \lambda\, \mathrm{D}_{KL}[\pi_{\theta_{old}} \,\|\, \pi_\theta]$
     Its gradient is
     $\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}}\!\left[ \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)}\, R(\tau) \right], \qquad \nabla_\theta U(\theta)|_{\theta=\theta_{old}} = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}}\!\left[ \nabla_\theta \log \pi_\theta(\tau)|_{\theta=\theta_{old}}\, R(\tau) \right]$
     The gradient evaluated at θ_old is unchanged.
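The importance-sampling step is easy to verify numerically. A toy bandit-style sketch (the two distributions and per-action returns are made-up values): samples are drawn only under the old policy, yet the ratio-weighted average estimates the new policy's expected return.

```python
import numpy as np

rng = np.random.default_rng(0)

# Old and new "policies": two categorical distributions over 3 actions.
p_old = np.array([0.5, 0.3, 0.2])
p_new = np.array([0.2, 0.3, 0.5])
R = np.array([1.0, 0.0, 2.0])   # made-up per-action return

# Draw actions under the OLD policy only...
a = rng.choice(3, size=100_000, p=p_old)
# ...then reweight by the ratio p_new / p_old to estimate the NEW policy's return.
is_estimate = np.mean((p_new[a] / p_old[a]) * R[a])
exact = np.sum(p_new * R)
print(is_estimate, exact)  # both close to 1.2
```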

  20. Trust Region Policy Optimization (TRPO).
     Constrained objective:
     $\max_\theta \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}(s_t, a_t) \right] \quad \text{subject to} \quad \mathbb{E}_t\!\left[ \mathrm{D}_{KL}[\pi_{\theta_{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)] \right] \le \delta$
     Or the unconstrained (penalized) objective:
     $\max_\theta \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}(s_t, a_t) \right] - \beta\, \mathbb{E}_t\!\left[ \mathrm{D}_{KL}[\pi_{\theta_{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)] \right]$
     J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. "Trust Region Policy Optimization". ICML 2015.
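A sketch of the penalized form as a training loss, assuming PyTorch tensors of per-timestep log-probabilities under the new and old policies; the crude per-sample KL estimate and the value of β are illustrative simplifications, not how TRPO itself enforces the constraint (TRPO uses the hard constraint with conjugate gradient and a line search).

```python
import torch

def kl_penalized_surrogate_loss(logp_new, logp_old, advantages, beta=1.0):
    # Importance ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = torch.exp(logp_new - logp_old.detach())
    surrogate = (ratio * advantages).mean()
    # Sample-based estimate of E_t[ KL(pi_old || pi_theta) ] using data from pi_old
    approx_kl = (logp_old.detach() - logp_new).mean()
    # Negated so that minimizing this loss maximizes the penalized objective
    return -(surrogate - beta * approx_kl)
```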

  21. Proximal Policy Optimization (PPO).
     Can we achieve similar performance without second-order information (no Fisher matrix)?
     Define the ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ and maximize the clipped objective
     $\max_\theta \; L^{CLIP}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\, \hat{A}(s_t, a_t),\; \mathrm{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\, \hat{A}(s_t, a_t) \right) \right]$
     J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. "Proximal Policy Optimization Algorithms". 2017.
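A minimal sketch of the clipped loss in PyTorch; the default clip range of 0.2 follows the paper, while everything about how the input tensors are produced is assumed.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = torch.exp(logp_new - logp_old.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # L_CLIP takes the elementwise min of the two terms, then averages over t
    return -torch.min(unclipped, clipped).mean()  # negated for a minimizer
```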

  22. PPO: Clipped Objective. [Figure: empirical performance comparison between PPO with the clipped objective and various other deep RL methods on a suite of MuJoCo tasks.]

  23. Training linear policies to solve control tasks with natural policy gradients https://youtu.be/frojcskMkkY

  24. State s: joint positions, joint velocities, contact info

  25. Observations: joint positions, joint velocities, contact info

  26. Carnegie Mellon School of Computer Science, Deep Reinforcement Learning and Control. Multigoal RL. Katerina Fragkiadaki.
