

  1. Lecture 9: Policy Gradient II
     Emma Brunskill, CS234 Reinforcement Learning, Winter 2020
     Additional reading: Sutton and Barto 2018, Chapter 13
     With many slides from or derived from David Silver, John Schulman, and Pieter Abbeel

  2. Refresh Your Knowledge 7: Select all that are true about policy gradients.
     1. $\nabla_\theta V(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right]$
     2. $\theta$ is always increased in the direction of $\nabla_\theta \ln \pi(S_t, A_t, \theta)$.
     3. State-action pairs with higher estimated Q values will increase in probability on average.
     4. Policy gradient methods are guaranteed to converge to the global optimum of the policy class.
     5. Not sure

  3. Class Structure
     Last time: Policy Search
     This time: Policy Search
     Next time: Midterm

  4. Midterm
     Covers material from all lectures before the midterm.
     To prepare, we encourage you to (1) take past midterms, (2) review the slides and the Refresh/Check Your Understanding questions, and (3) review the homeworks.
     We will have office hours this weekend for midterm prep; see the Piazza post for details.

  5. Recall: Policy-Based RL
     Policy search: directly parametrize the policy, $\pi_\theta(s, a) = \mathbb{P}[a \mid s; \theta]$.
     Goal: find the policy $\pi$ with the highest value function $V^\pi$.
     (Pure) policy-based methods: learned policy, no value function.
     Actor-critic methods: learned policy and learned value function.

  6. Recall: Advantages of Policy-Based RL
     Advantages:
     - Better convergence properties
     - Effective in high-dimensional or continuous action spaces
     - Can learn stochastic policies
     Disadvantages:
     - Typically converges to a local rather than global optimum
     - Evaluating a policy is typically inefficient and high variance

  7. Recall: Policy Gradient Defined
     Write $V(\theta) = V^{\pi_\theta}(s_0) = V(s_0, \theta)$ to make explicit the dependence of the value on the policy parameters. Episodic MDPs are assumed.
     Policy gradient algorithms search for a local maximum of $V(\theta)$ by ascending the gradient of the value with respect to the policy parameters $\theta$:
     $\Delta\theta = \alpha \nabla_\theta V(\theta)$,
     where $\nabla_\theta V(\theta) = \left( \frac{\partial V(\theta)}{\partial \theta_1}, \ldots, \frac{\partial V(\theta)}{\partial \theta_n} \right)^{\top}$ is the policy gradient and $\alpha$ is a step-size hyperparameter.
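As a concrete illustration of the update rule on this slide, here is a minimal sketch of one gradient-ascent step on the policy parameters. The gradient estimate itself is left as a stand-in value; in practice it would come from an estimator such as the likelihood-ratio estimator recalled later in the lecture. All names and numbers are illustrative, not from the lecture.

```python
import numpy as np

def gradient_ascent_step(theta, grad_v, alpha):
    """One policy-gradient update: theta <- theta + alpha * grad_theta V(theta)."""
    return theta + alpha * grad_v

# Illustrative usage with a stand-in gradient estimate.
theta = np.zeros(4)                           # policy parameters theta_1 .. theta_n
grad_hat = np.array([0.1, -0.2, 0.05, 0.0])   # placeholder for an estimate of grad V(theta)
theta = gradient_ascent_step(theta, grad_hat, alpha=0.1)
print(theta)
```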

  8. Desired Properties of a Policy Gradient RL Algorithm
     Goal: converge as quickly as possible to a local optimum.
     We incur reward / cost as we execute the policy, so we want to minimize the number of iterations / time steps until we reach a good policy.

  9. Desired Properties of a Policy Gradient RL Algorithm
     Goal: converge as quickly as possible to a local optimum.
     We incur reward / cost as we execute the policy, so we want to minimize the number of iterations / time steps until we reach a good policy.
     During policy search we alternate between evaluating the policy and changing (improving) the policy, just like in policy iteration.
     We would like each policy update to be a monotonic improvement:
     - Gradient ascent only guarantees reaching a local optimum, and monotonic improvement will achieve this.
     - In the real world, monotonic improvement is often beneficial in its own right.

  10. Desired Properties of a Policy Gradient RL Algorithm
     Goal: obtain large, monotonic improvements to the policy at each update.
     Techniques for trying to achieve this:
     - Last time and today: get a better estimate of the gradient (intuition: this should improve the policy-parameter updates).
     - Today: change how the policy parameters are updated given the gradient.

  11. Table of Contents
     1. Better Gradient Estimates
     2. Policy Gradient Algorithms and Reducing Variance
     3. Need for Automatic Step Size Tuning
     4. Updating the Parameters Given the Gradient: Local Approximation
     5. Updating the Parameters Given the Gradient: Trust Regions
     6. Updating the Parameters Given the Gradient: TRPO Algorithm

  12. Likelihood Ratio / Score Function Policy Gradient
     Recall from last time ($m$ is the number of sampled trajectories):
     $\nabla_\theta V(s_0, \theta) \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$
     This is an unbiased estimate of the gradient, but very noisy. Fixes that can make it practical:
     - Temporal structure (discussed last time)
     - Baselines
     - Alternatives to using the Monte Carlo return $R(\tau^{(i)})$ as the target
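A minimal NumPy sketch of this estimator, under the assumption of a softmax policy that is linear in state features; the feature representation, the trajectory format (lists of (state_features, action, reward) tuples), and the function names are all illustrative choices, not the course's code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax_policy(theta, s_feats, a):
    """grad_theta log pi_theta(a | s) for a softmax policy with logits theta @ phi(s).
    theta: (n_actions, n_features), s_feats: (n_features,), a: action index."""
    probs = softmax(theta @ s_feats)
    grad = -np.outer(probs, s_feats)   # -pi(a'|s) * phi(s) for every action a'
    grad[a] += s_feats                 # +phi(s) for the action actually taken
    return grad

def likelihood_ratio_gradient(theta, trajectories):
    """(1/m) sum_i R(tau_i) sum_t grad_theta log pi_theta(a_t^i | s_t^i)."""
    total = np.zeros_like(theta)
    for traj in trajectories:                        # traj: [(s_feats, a, r), ...]
        R = sum(r for _, _, r in traj)               # total return of the trajectory
        score = sum(grad_log_softmax_policy(theta, s, a) for s, a, _ in traj)
        total += R * score
    return total / len(trajectories)
```

Note how the whole-trajectory return multiplies every per-step score term; that is exactly why this estimator is unbiased but noisy, and why the fixes listed on the slide help.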

  13. Policy Gradient: Introduce a Baseline
     Reduce variance by introducing a baseline $b(s)$:
     $\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t; \theta) \left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]$
     For any choice of $b$, the gradient estimator is unbiased.
     A near-optimal choice is the expected return, $b(s_t) \approx \mathbb{E}[r_t + r_{t+1} + \cdots + r_{T-1}]$.
     Interpretation: increase the log-probability of action $a_t$ proportionally to how much the return $\sum_{t'=t}^{T-1} r_{t'}$ is better than expected.
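A short sketch of the bracketed term above: the reward-to-go from each timestep with a baseline subtracted. The inputs are illustrative; in practice the baseline values would come from whatever $b(s)$ is being used.

```python
import numpy as np

def advantages_with_baseline(rewards, baseline_values):
    """For each t, compute sum_{t' >= t} r_{t'} - b(s_t) (undiscounted, episodic)."""
    rewards = np.asarray(rewards, dtype=float)
    # reward-to-go: cumulative sum taken from the end of the episode
    reward_to_go = np.cumsum(rewards[::-1])[::-1]
    return reward_to_go - np.asarray(baseline_values, dtype=float)

# Example: a 4-step episode.
# reward-to-go = [2, 1, 1, 1], so the advantages are [0.5, 0.0, 0.0, 0.5]
print(advantages_with_baseline([1.0, 0.0, 0.0, 1.0], [1.5, 1.0, 1.0, 0.5]))
```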

  14. Baseline b(s) Does Not Introduce Bias – Derivation
     $\mathbb{E}_\tau[\nabla_\theta \log \pi(a_t \mid s_t; \theta)\, b(s_t)]$
     $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[\mathbb{E}_{s_{(t+1):T}, a_{t:(T-1)}}[\nabla_\theta \log \pi(a_t \mid s_t; \theta)\, b(s_t)]\right]$

  15. Baseline b(s) Does Not Introduce Bias – Derivation
     $\mathbb{E}_\tau[\nabla_\theta \log \pi(a_t \mid s_t; \theta)\, b(s_t)]$
     $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[\mathbb{E}_{s_{(t+1):T}, a_{t:(T-1)}}[\nabla_\theta \log \pi(a_t \mid s_t; \theta)\, b(s_t)]\right]$  (break up expectation)
     $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t)\, \mathbb{E}_{s_{(t+1):T}, a_{t:(T-1)}}[\nabla_\theta \log \pi(a_t \mid s_t; \theta)]\right]$  (pull baseline term out)
     $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t)\, \mathbb{E}_{a_t}[\nabla_\theta \log \pi(a_t \mid s_t; \theta)]\right]$  (remove irrelevant variables)
     $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t) \sum_a \pi_\theta(a \mid s_t) \frac{\nabla_\theta \pi(a \mid s_t; \theta)}{\pi_\theta(a \mid s_t)}\right]$  (likelihood ratio)
     $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t) \sum_a \nabla_\theta \pi(a \mid s_t; \theta)\right]$
     $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t)\, \nabla_\theta \sum_a \pi(a \mid s_t; \theta)\right]$
     $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}[b(s_t)\, \nabla_\theta 1]$
     $= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}[b(s_t) \cdot 0] = 0$
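The crucial step above is that $\mathbb{E}_{a_t}[\nabla_\theta \log \pi(a_t \mid s_t; \theta)] = 0$. A quick numerical sanity check of that identity for a small softmax policy (the parameterization here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)            # logits of a softmax policy over 5 actions
probs = np.exp(theta - theta.max())
probs /= probs.sum()

# grad_theta log pi(a) for a softmax-over-logits policy is e_a - probs
grads = np.eye(5) - probs             # row a holds grad_theta log pi(a)

# E_a[grad log pi(a)] = sum_a pi(a) * grad log pi(a)
expectation = probs @ grads
print(np.allclose(expectation, 0.0))  # True: the baseline term has zero mean
```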

  16. "Vanilla" Policy Gradient Algorithm
     Initialize policy parameter $\theta$ and baseline $b$
     for iteration = 1, 2, ... do
         Collect a set of trajectories by executing the current policy
         At each timestep $t$ in each trajectory $\tau^i$, compute
             the return $G_t^i = \sum_{t'=t}^{T-1} r_{t'}^i$, and
             the advantage estimate $\hat{A}_t^i = G_t^i - b(s_t)$
         Re-fit the baseline by minimizing $\sum_i \sum_t \| b(s_t) - G_t^i \|^2$
         Update the policy using a policy gradient estimate $\hat{g}$,
             which is a sum of terms $\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, \hat{A}_t$
             (plug $\hat{g}$ into SGD or Adam)
     end for
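A compact sketch of this loop, assuming helper callables for collecting trajectories under the current policy and for computing $\nabla_\theta \log \pi$ (for instance the softmax-policy gradient sketched earlier); the constant scalar baseline, re-fit as the mean return, is a deliberately crude choice made only to keep the sketch short.

```python
import numpy as np

def returns_per_step(rewards):
    """G_t = sum_{t' >= t} r_{t'} for an (undiscounted) episodic trajectory."""
    return np.cumsum(np.asarray(rewards, dtype=float)[::-1])[::-1]

def vanilla_policy_gradient(theta, collect_trajectories, grad_log_pi,
                            n_iterations=100, alpha=0.01):
    """Sketch of the 'vanilla' policy gradient loop.
    collect_trajectories(theta) -> list of trajectories, each [(s_feats, a, r), ...]
    grad_log_pi(theta, s_feats, a) -> array shaped like theta
    Both callables are assumed to be supplied by the caller."""
    baseline = 0.0                                     # crude constant baseline
    for _ in range(n_iterations):
        trajs = collect_trajectories(theta)            # run the current policy
        g_hat = np.zeros_like(theta)
        all_returns = []
        for traj in trajs:
            G = returns_per_step([r for _, _, r in traj])
            all_returns.extend(G)
            A_hat = G - baseline                       # advantage estimates
            for (s, a, _), adv in zip(traj, A_hat):
                g_hat += grad_log_pi(theta, s, a) * adv
        g_hat /= len(trajs)
        baseline = float(np.mean(all_returns))         # re-fit the (constant) baseline
        theta = theta + alpha * g_hat                  # plain SGD ascent step
    return theta
```

Re-fitting a constant baseline by least squares reduces to taking the mean observed return, which is why the sketch uses np.mean; a state-dependent baseline would instead be fit by regression on $b(s_t)$.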

  17. Practical Implementation with Automatic Differentiation
     The usual formula $\sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t$ is inefficient: we want to batch data.
     Define a "surrogate" function using the data from the current batch:
     $L(\theta) = \sum_t \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t$
     Then the policy gradient estimator is $\hat{g} = \nabla_\theta L(\theta)$.
     We can also include the value function fit error:
     $L(\theta) = \sum_t \left( \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t - \| V(s_t) - \hat{G}_t \|^2 \right)$
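A sketch of this surrogate-loss trick using PyTorch autodiff; the library choice, the network sizes, and the batch contents are assumptions for illustration, not the course's reference code. Maximizing $L(\theta)$ is implemented as minimizing $-L(\theta)$.

```python
import torch
import torch.nn as nn

# Hypothetical tiny actor and critic over 8-dim state features and 4 discrete actions.
policy_net = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 4))
value_net = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(
    list(policy_net.parameters()) + list(value_net.parameters()), lr=3e-4)

def surrogate_loss(states, actions, returns, advantages):
    """L(theta) = sum_t [ log pi(a_t|s_t; theta) * A_hat_t - ||V(s_t) - G_hat_t||^2 ]."""
    log_probs = torch.log_softmax(policy_net(states), dim=-1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    value_error = (value_net(states).squeeze(1) - returns).pow(2)
    return (log_pi_a * advantages - value_error).sum()

# One update on a (hypothetical) batch of transitions.
states = torch.randn(64, 8)
actions = torch.randint(0, 4, (64,))
returns = torch.randn(64)
advantages = returns - value_net(states).squeeze(1).detach()   # G_t - V(s_t) as A_hat
loss = -surrogate_loss(states, actions, returns, advantages)   # ascend L by descending -L
optimizer.zero_grad()
loss.backward()
optimizer.step()
```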

  18. Other Choices for Baseline?
     Initialize policy parameter $\theta$ and baseline $b$
     for iteration = 1, 2, ... do
         Collect a set of trajectories by executing the current policy
         At each timestep $t$ in each trajectory $\tau^i$, compute
             the return $G_t^i = \sum_{t'=t}^{T-1} r_{t'}^i$, and
             the advantage estimate $\hat{A}_t^i = G_t^i - b(s_t)$
         Re-fit the baseline by minimizing $\sum_i \sum_t \| b(s_t) - G_t^i \|^2$
         Update the policy using a policy gradient estimate $\hat{g}$,
             which is a sum of terms $\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, \hat{A}_t$
             (plug $\hat{g}$ into SGD or Adam)
     end for

  19. Choosing the Baseline: Value Functions
     Recall the Q-function / state-action value function:
     $Q^\pi(s, a) = \mathbb{E}_\pi\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, a_0 = a\right]$
     The state-value function can serve as a great baseline:
     $V^\pi(s) = \mathbb{E}_\pi\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s\right] = \mathbb{E}_{a \sim \pi}\left[Q^\pi(s, a)\right]$
     Advantage function: combine Q with the baseline V:
     $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$
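A small sketch of advantage estimation with a state-value baseline. The Monte Carlo version matches $\hat{A}_t = G_t - \hat{V}(s_t)$ from the earlier slides; the one-step TD variant at the end is a common lower-variance (but biased) alternative, mentioned only as an aside rather than something this slide prescribes. The value estimates are assumed to come from any learned predictor.

```python
import numpy as np

def advantage_estimates(returns, v_hat):
    """A_hat(s_t, a_t) = G_t - V_hat(s_t): Monte Carlo return minus value baseline."""
    return np.asarray(returns, dtype=float) - np.asarray(v_hat, dtype=float)

def td_advantage(rewards, v_hat, gamma=0.99):
    """One-step advantage estimate: r_t + gamma * V(s_{t+1}) - V(s_t).
    v_hat holds V(s_0), ..., V(s_T), i.e. one more value than there are rewards."""
    rewards = np.asarray(rewards, dtype=float)
    v_hat = np.asarray(v_hat, dtype=float)
    return rewards + gamma * v_hat[1:] - v_hat[:-1]

# Example: a 3-step episode with value estimates at 4 states (terminal value 0).
print(advantage_estimates([2.0, 1.0, 1.0], [1.8, 0.9, 1.0]))
print(td_advantage([1.0, 0.0, 1.0], [1.8, 0.9, 1.0, 0.0]))
```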
