Carnegie Mellon School of Computer Science
Deep Reinforcement Learning and Control
Natural Policy Gradients (cont.)
Katerina Fragkiadaki
Revision
Policy Gradients

1. Collect trajectories for policy $\pi_\theta$
2. Estimate advantages $\hat{A}$
3. Compute policy gradient $\hat{g}$
4. Update policy parameters $\theta_{new} = \theta + \epsilon \cdot \hat{g}$
5. GOTO 1

How do we estimate this gradient?

[Figure: a Gaussian policy before and after the update, with mean $\mu_\theta(s)$ and std $\sigma_\theta(s)$ at $\theta_{old}$, and mean $\mu_{\theta_{new}}(s)$ and std $\sigma_{\theta_{new}}(s)$ at $\theta_{new}$.]
Policy Gradients

1. Collect trajectories for policy $\pi_\theta$
2. Estimate advantages $\hat{A}$
3. Compute policy gradient $\hat{g}$
4. Update policy parameters $\theta_{new} = \theta + \epsilon \cdot \hat{g}$
5. GOTO 1

How do we estimate the stepsize?
Policy Gradients

1. Collect trajectories for policy $\pi_\theta$
2. Estimate advantages $\hat{A}$
3. Compute policy gradient $\hat{g}$
4. Update policy parameters $\theta_{new} = \theta + \epsilon \cdot \hat{g}$
5. GOTO 1

- Step too big: a bad policy means the next batch of data is collected under that bad policy, and we may never recover. (In supervised learning, the data does not depend on the network weights.)
- Step too small: we make inefficient use of experience. (In supervised learning, data can be trivially re-used.)
What is the underlying optimization problem?

We started here:  $\max_\theta U(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}[R(\tau)] = \sum_\tau P(\tau;\theta) R(\tau)$

Policy gradients:

$$\hat{g} \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \, \hat{A}(s_t^{(i)}, a_t^{(i)}), \quad \tau_i \sim \pi_\theta$$

$$\hat{g} = \mathbb{E}_t\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}(s_t, a_t) \right]$$

This results from differentiating the following objective function:

$$U^{PG}(\theta) = \mathbb{E}_t\!\left[ \log \pi_\theta(a_t \mid s_t) \, \hat{A}(s_t, a_t) \right]$$

But $\max_\theta U^{PG}(\theta)$ is not the right objective: we can't optimize too far (the advantage values become invalid), and this constraint shows up nowhere in the optimization.

Compare this to supervised learning with expert actions $\tilde{a} \sim \pi^*$ and a maximum likelihood objective:

$$U^{SL}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \log \pi_\theta(\tilde{a}_t^{(i)} \mid s_t^{(i)}), \quad \tau_i \sim \pi^* \quad \text{(+ regularization)}$$
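A minimal sketch of the surrogate $U^{PG}(\theta)$ in PyTorch, whose gradient is the policy-gradient estimate above. This is not the course's reference code; `policy` is a hypothetical network mapping states to action logits, and the advantages are assumed precomputed.

```python
import torch

def pg_surrogate(policy, states, actions, advantages):
    """states: [N, obs_dim], actions: [N], advantages: [N] (torch tensors)."""
    logits = policy(states)                          # [N, num_actions]
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)               # log pi_theta(a_t | s_t)
    # Differentiating this mean w.r.t. theta reproduces g_hat above.
    return (log_probs * advantages).mean()

# One vanilla update would then be: theta_new = theta_old + eps * g_hat, e.g.
# grads = torch.autograd.grad(pg_surrogate(policy, states, actions, advantages),
#                             policy.parameters())
```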
Hard to choose stepsizes

1. Collect trajectories for policy $\pi_\theta$
2. Estimate advantages $\hat{A}$
3. Compute policy gradient $\hat{g}$
4. Update policy parameters $\theta_{new} = \theta + \epsilon \cdot \hat{g}$
5. GOTO 1

Consider a family of policies with parametrization:

$$\pi_\theta(a) = \begin{cases} \sigma(\theta) & a = 1 \\ 1 - \sigma(\theta) & a = 2 \end{cases}$$

The same parameter step, e.g. $\Delta\theta = -2$, changes the policy distribution more or less dramatically depending on where in the parameter space we are.
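A small numeric illustration of this slide's example: the same step $\Delta\theta = -2$ moves the two-action policy $\pi_\theta(a{=}1) = \sigma(\theta)$ by very different amounts depending on where $\theta$ starts (the starting values below are chosen for illustration).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for theta in (0.0, 6.0):
    theta_new = theta - 2.0   # the same parameter step in both cases
    print(f"theta={theta:+.1f}: pi(a=1) goes "
          f"{sigmoid(theta):.3f} -> {sigmoid(theta_new):.3f}")
# theta = 0.0: 0.500 -> 0.119   (large change in the distribution)
# theta = 6.0: 0.998 -> 0.982   (almost no change)
```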
Notation

We will use the following to denote values of parameters and corresponding policies before and after an update:

$\theta_{old} \rightarrow \theta_{new}$, $\pi_{old} \rightarrow \pi_{new}$, or equivalently $\theta \rightarrow \theta'$, $\pi \rightarrow \pi'$
Gradient Descent in Distribution Space

The step in gradient descent results from solving the following optimization problem, e.g., using line search:

$$d^* = \arg\max_{\|d\| \le \epsilon} U(\theta + d)$$

SGD: $\theta_{new} = \theta_{old} + d^*$

This constrains the Euclidean distance in parameter space. It is hard to predict the effect on the parameterized distribution, and hence hard to pick the threshold $\epsilon$.

Natural gradient descent: the step in parameter space is determined by considering the KL divergence between the distributions before and after the update:

$$d^* = \arg\max_{d \;\text{s.t.}\; KL(\pi_\theta \| \pi_{\theta+d}) \le \epsilon} U(\theta + d)$$

This constrains a divergence in distribution space, so it is easier to pick the threshold $\epsilon$ (and we have made the "don't optimize too much" constraint explicit).
Solving the KL-Constrained Problem

$$U(\theta) = \mathbb{E}_t\!\left[ \log \pi_\theta(a_t \mid s_t) \, \hat{A}(s_t, a_t) \right]$$

Unconstrained penalized objective:

$$d^* = \arg\max_d \; U(\theta + d) - \lambda \left( D_{KL}[\pi_\theta \| \pi_{\theta+d}] - \epsilon \right)$$

Let's solve it: use a first-order Taylor expansion for the loss and a second-order one for the KL:

$$d^* \approx \arg\max_d \; U(\theta_{old}) + \nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d - \frac{1}{2} \lambda \left( d^\top \nabla^2_\theta D_{KL}[\pi_{\theta_{old}} \| \pi_\theta]|_{\theta=\theta_{old}} \, d \right) + \lambda\epsilon$$

Q: How will you compute this?
KL Taylor expansion

$$D_{KL}(p_{\theta_{old}} \| p_\theta) \approx D_{KL}(p_{\theta_{old}} \| p_{\theta_{old}}) + d^\top \nabla_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)|_{\theta=\theta_{old}} + \frac{1}{2} d^\top \nabla^2_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)|_{\theta=\theta_{old}} \, d$$
KL Taylor expansion

The zeroth-order term is zero ($D_{KL}(p \| p) = 0$) and the first-order term vanishes at $\theta = \theta_{old}$, so:

$$D_{KL}(p_{\theta_{old}} \| p_\theta) \approx \frac{1}{2} d^\top \nabla^2_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)|_{\theta=\theta_{old}} \, d = \frac{1}{2} d^\top F(\theta_{old}) \, d = \frac{1}{2} (\theta - \theta_{old})^\top F(\theta_{old}) (\theta - \theta_{old})$$

Fisher information matrix:

$$F(\theta) = \mathbb{E}_\theta\!\left[ \nabla_\theta \log p_\theta(x) \, \nabla_\theta \log p_\theta(x)^\top \right], \qquad F(\theta_{old}) = \nabla^2_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)|_{\theta=\theta_{old}}$$

Since KL divergence is roughly analogous to a distance measure between distributions, the Fisher information serves as a local distance metric between distributions: it tells you how much you change the distribution if you move the parameters a little bit in a given direction.
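A minimal sketch of the sample-based Fisher matrix $F(\theta) = \mathbb{E}[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top]$, here for a categorical policy evaluated at sampled states and actions. `policy` is the same hypothetical network as before; forming $F$ explicitly like this is only feasible for small parameter counts (a point the later slides return to).

```python
import torch

def fisher_matrix(policy, states, actions):
    """Explicit [n, n] Fisher estimate; n = total number of policy parameters."""
    params = list(policy.parameters())
    n = sum(p.numel() for p in params)
    F = torch.zeros(n, n)
    for s, a in zip(states, actions):
        logits = policy(s.unsqueeze(0))
        logp = torch.distributions.Categorical(logits=logits).log_prob(a).sum()
        # flattened score function grad_theta log pi_theta(a | s)
        score = torch.cat([g.reshape(-1)
                           for g in torch.autograd.grad(logp, params)])
        F += torch.outer(score, score)
    return F / len(states)
```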
Solving the KL-Constrained Problem

Unconstrained penalized objective:

$$d^* = \arg\max_d \; U(\theta + d) - \lambda \left( D_{KL}[\pi_\theta \| \pi_{\theta+d}] - \epsilon \right)$$

First-order Taylor expansion for the loss and second-order for the KL:

$$\approx \arg\max_d \; U(\theta_{old}) + \nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d - \frac{1}{2} \lambda \left( d^\top \nabla^2_\theta D_{KL}[\pi_{\theta_{old}} \| \pi_\theta]|_{\theta=\theta_{old}} \, d \right) + \lambda\epsilon$$

Substitute the Fisher information matrix (and drop the terms that do not depend on $d$):

$$= \arg\max_d \; \nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d - \frac{1}{2} \lambda \, d^\top F(\theta_{old}) \, d$$

$$= \arg\min_d \; -\nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d + \frac{1}{2} \lambda \, d^\top F(\theta_{old}) \, d$$
Natural Gradient Descent

Setting the gradient with respect to $d$ to zero:

$$0 = \frac{\partial}{\partial d}\left( -\nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d + \frac{1}{2} \lambda \, d^\top F(\theta_{old}) \, d \right) = -\nabla_\theta U(\theta)|_{\theta=\theta_{old}} + \lambda F(\theta_{old}) \, d$$

$$d = \frac{1}{\lambda} F^{-1}(\theta_{old}) \nabla_\theta U(\theta)|_{\theta=\theta_{old}}$$

The scalar $1/\lambda$ only rescales the step, so it is absorbed into the stepsize $\alpha$ below. The natural gradient:

$$g_N = F^{-1}(\theta_{old}) \nabla_\theta U(\theta), \qquad \theta_{new} = \theta_{old} + \alpha \cdot g_N$$
Stepsize along the Natural Gradient direction

The natural gradient: $g_N = F^{-1}(\theta_{old}) \nabla_\theta U(\theta)$, with update $\theta_{new} = \theta_{old} + \alpha \cdot g_N$.

Let's solve for the stepsize along the natural gradient direction. Using the quadratic approximation

$$D_{KL}(\pi_{\theta_{old}} \| \pi_\theta) \approx \frac{1}{2} (\theta - \theta_{old})^\top F(\theta_{old}) (\theta - \theta_{old}) = \frac{1}{2} (\alpha g_N)^\top F(\theta_{old}) (\alpha g_N),$$

and requiring the KL between the old and new policies to be $\epsilon$:

$$\frac{1}{2} (\alpha g_N)^\top F(\theta_{old}) (\alpha g_N) = \epsilon \quad \Rightarrow \quad \alpha = \sqrt{\frac{2\epsilon}{g_N^\top F(\theta_{old}) \, g_N}}$$
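Putting the last two slides together, a minimal sketch of one natural-gradient update, assuming a flat gradient vector `g` and an explicit Fisher estimate `F` (e.g., from the earlier sketch); the damping term is an illustrative numerical safeguard, not part of the derivation.

```python
import torch

def natural_gradient_step(theta, g, F, eps=0.01, damping=1e-4):
    """theta, g: flat parameter / gradient vectors; F: [n, n] Fisher estimate."""
    n = F.shape[0]
    # g_N = F^{-1} g  (a small damping term keeps the solve well conditioned)
    g_nat = torch.linalg.solve(F + damping * torch.eye(n), g)
    # 1/2 (alpha g_N)^T F (alpha g_N) = eps  =>  alpha = sqrt(2 eps / (g_N^T F g_N))
    alpha = torch.sqrt(2 * eps / (g_nat @ (F @ g_nat)))
    return theta + alpha * g_nat
```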
Natural Gradient Descent

Both the policy gradient estimate and the Fisher matrix estimate use samples from the current policy $\pi_k = \pi(\theta_k)$.
Natural Gradient Descent

The Fisher matrix $F$ (and its inverse) is very expensive to compute for a large number of parameters!
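Because forming and inverting $F$ scales badly with the number of parameters, practical implementations (e.g., TRPO) typically avoid building it at all: they solve $F x = \nabla_\theta U$ with conjugate gradient, which only needs Fisher-vector products. A minimal sketch, assuming a hypothetical callable `fvp` that returns $F v$ (for instance via a Hessian-vector product of the sampled KL):

```python
import torch

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only matrix-vector products fvp(v) = F v."""
    x = torch.zeros_like(g)
    r = g.clone()          # residual g - F x (x = 0 initially)
    p = g.clone()          # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x               # approximately F^{-1} g, i.e., the natural gradient direction
```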
What is the underlying optimization problem?

We started here:  $\max_\theta U(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}[R(\tau)] = \sum_\tau P(\tau;\theta) R(\tau)$

Policy gradients:

$$\hat{g} \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \, \hat{A}(s_t^{(i)}, a_t^{(i)}), \quad \tau_i \sim \pi_\theta$$

$$\hat{g} = \mathbb{E}_t\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}(s_t, a_t) \right]$$

This results from differentiating the objective $U^{PG}(\theta) = \mathbb{E}_t\!\left[ \log \pi_\theta(a_t \mid s_t) \, \hat{A}(s_t, a_t) \right]$.

With the "don't optimize too much" constraint:

$$\max_d \; \mathbb{E}_t\!\left[ \log \pi_{\theta+d}(a_t \mid s_t) \, \hat{A}(s_t, a_t) \right] - \lambda D_{KL}[\pi_\theta \| \pi_{\theta+d}]$$

We used a first-order approximation for the first term, but what if $d$ is large?
Alternative derivation

$$U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[R(\tau)] = \sum_\tau \pi_\theta(\tau) R(\tau) = \sum_\tau \pi_{\theta_{old}}(\tau) \frac{\pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)} R(\tau) = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}}\!\left[ \frac{\pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)} R(\tau) \right]$$

This gives the importance-sampled surrogate objective:

$$\max_\theta \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}(s_t, a_t) \right] - \lambda D_{KL}[\pi_{\theta_{old}} \| \pi_\theta]$$

The gradient evaluated at $\theta_{old}$ is unchanged:

$$\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}}\!\left[ \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)} R(\tau) \right], \qquad \nabla_\theta U(\theta)|_{\theta=\theta_{old}} = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}}\!\left[ \nabla_\theta \log \pi_\theta(\tau)|_{\theta=\theta_{old}} \, R(\tau) \right]$$
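A minimal sketch of this importance-sampled surrogate with the KL penalty. As before, `policy` is a hypothetical trainable network; `old_log_probs` are the action log-probabilities recorded when the data was collected under $\pi_{\theta_{old}}$, so the mean of their difference is a sample estimate of $D_{KL}[\pi_{\theta_{old}} \| \pi_\theta]$ on the visited states.

```python
import torch

def surrogate_with_kl_penalty(policy, states, actions, advantages,
                              old_log_probs, lam=1.0):
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)     # pi_theta / pi_theta_old
    surrogate = (ratio * advantages).mean()
    # sample estimate of KL[pi_old || pi_theta] using actions drawn from pi_old
    kl = (old_log_probs - log_probs).mean()
    return surrogate - lam * kl
```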
Trust Region Policy Optimization

Constrained objective:

$$\max_\theta \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}(s_t, a_t) \right] \quad \text{subject to} \quad \mathbb{E}_t\!\left[ D_{KL}[\pi_{\theta_{old}}(\cdot \mid s_t) \| \pi_\theta(\cdot \mid s_t)] \right] \le \delta$$

Or unconstrained objective:

$$\max_\theta \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}(s_t, a_t) \right] - \beta \, \mathbb{E}_t\!\left[ D_{KL}[\pi_{\theta_{old}}(\cdot \mid s_t) \| \pi_\theta(\cdot \mid s_t)] \right]$$

J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. "Trust Region Policy Optimization". ICML 2015.
Proximal Policy Optimization

Can we achieve similar performance without second-order information (no Fisher matrix)?

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$

$$\max_\theta \; L^{CLIP}(\theta) = \mathbb{E}_t\!\left[ \min\left( r_t(\theta) \hat{A}(s_t, a_t), \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}(s_t, a_t) \right) \right]$$

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. "Proximal Policy Optimization Algorithms". 2017.
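A minimal sketch of the clipped surrogate $L^{CLIP}$, using the same hypothetical inputs as the previous sketches; the clip range $\epsilon = 0.2$ matches the value used in the PPO paper. Returning the negative mean lets any first-order optimizer maximize the objective by minimizing the loss.

```python
import torch

def ppo_clip_loss(policy, states, actions, advantages, old_log_probs, epsilon=0.2):
    dist = torch.distributions.Categorical(logits=policy(states))
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # maximizing L^CLIP  <=>  minimizing its negative
    return -torch.min(unclipped, clipped).mean()
```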
PPO: Clipped Objective

Empirical performance of PPO. Figure: performance comparison between PPO with the clipped objective and various other deep RL methods on a slate of MuJoCo tasks.
Training linear policies to solve control tasks with natural policy gradients https://youtu.be/frojcskMkkY
State s: joint positions, joint velocities, contact info
Observations: joint positions, joint velocities, contact info
Carnegie Mellon School of Computer Science
Deep Reinforcement Learning and Control
Multigoal RL
Katerina Fragkiadaki