

  1. The Provable Effectiveness of Policy Gradient Methods in Reinforcement Learning 
 Sham Kakade, University of Washington & Microsoft Research (with Alekh Agarwal, Jason Lee, and Gaurav Mahajan)

  2. Policy Optimization in RL [AlphaZero, Silver et al. ’17] [OpenAI Five, ’18] [OpenAI, ’19]

  3. Markov Decision Processes: a framework for RL

  4. Markov Decision Processes: a framework for RL 
 • A policy: π : States → Actions

  5. Markov Decision Processes: a framework for RL 
 • A policy: π : States → Actions 
 • We execute π to obtain a trajectory: s_0, a_0, r_0, s_1, a_1, r_1, …

  6. Markov Decision Processes: a framework for RL 
 • A policy: π : States → Actions 
 • We execute π to obtain a trajectory: s_0, a_0, r_0, s_1, a_1, r_1, … 
 • Total γ-discounted reward: V^π(s_0) = E_π[ ∑_{t=0}^∞ γ^t r_t ]

  7. Markov Decision Processes: a framework for RL 
 • A policy: π : States → Actions 
 • We execute π to obtain a trajectory: s_0, a_0, r_0, s_1, a_1, r_1, … 
 • Total γ-discounted reward: V^π(s_0) = E_π[ ∑_{t=0}^∞ γ^t r_t ] 
 • Goal: Find a policy π that maximizes our value V^π(s_0)
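 To make the value definition above concrete, here is a minimal Monte Carlo sketch (an illustration, not from the talk): it estimates V^π(s_0) by rolling out a policy in a small, made-up tabular MDP. The random transition/reward arrays and the truncation horizon are assumptions.

# Minimal Monte Carlo sketch (an illustration, not from the talk): estimate V^pi(s_0) by
# rolling out a policy in a small, made-up tabular MDP with a truncated horizon.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states (assumed)
R = rng.uniform(size=(S, A))                 # R[s, a] = reward for taking a in s (assumed)

def rollout_return(policy, s0, horizon=200):
    """One truncated rollout s_0, a_0, r_0, s_1, ...; returns sum_t gamma^t r_t."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(A, p=policy[s])       # a_t ~ pi(. | s_t)
        total += discount * R[s, a]
        discount *= gamma
        s = rng.choice(S, p=P[s, a])         # s_{t+1} ~ P(. | s_t, a_t)
    return total

uniform_policy = np.full((S, A), 1.0 / A)    # an arbitrary stationary policy
V_hat = np.mean([rollout_return(uniform_policy, s0=0) for _ in range(2000)])
print(f"Monte Carlo estimate of V^pi(s_0): {V_hat:.3f}")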

  8. Challenges in RL

  9. Challenges in RL 1. Exploration 
 (the environment may be unknown)

  10. Challenges in RL 1. Exploration 
 (the environment may be unknown) 2. Credit assignment problem 
 (due to delayed rewards)

  11. Challenges in RL 
 1. Exploration (the environment may be unknown) 
 2. Credit assignment problem (due to delayed rewards) 
 3. Large state/action spaces, e.g. dexterous robotic hand manipulation [OpenAI, 2019]: 
  hand state: joint angles/velocities 
  cube state: configuration 
  actions: forces applied to actuators

  12. Part 0: Background RL, Deep RL, and Supervised Learning (SL)

  13. The “Tabular” Dynamic Programming approach 
 • “Tabular” dynamic programming approach (with a known model): 
  1. For every entry in the table, compute the state-action value: 
   Q^π(s, a) = E_π[ ∑_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a ] 
  2. Update the policy to be greedy: π(s) ← argmax_a Q^π(s, a)

  14. The “Tabular” Dynamic Programming approach 
 • “Tabular” dynamic programming approach (with a known model): 
  1. For every entry in the table, compute the state-action value: 
   Q^π(s, a) = E_π[ ∑_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a ] 
  2. Update the policy to be greedy: π(s) ← argmax_a Q^π(s, a) 
 • Generalization: how can we deal with this infinite table? 
  Use sampling/supervised learning + deep learning.

  15. The “Tabular” Dynamic Programming approach (“deep RL”?) 
 [Bertsekas & Tsitsiklis ’97] provides the first systematic analysis of RL with (worst case) “function approximation”. 
 • “Tabular” dynamic programming approach (with a known model): 
  1. For every entry in the table, compute the state-action value: 
   Q^π(s, a) = E_π[ ∑_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a ] 
  2. Update the policy to be greedy: π(s) ← argmax_a Q^π(s, a) 
 • Generalization: how can we deal with this infinite table? 
  Use sampling/supervised learning + deep learning.
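 As a concrete companion to the two steps above, here is a minimal tabular policy-iteration sketch (an illustration, not from the slides) on a small synthetic MDP with a known model: evaluate Q^π exactly, make the policy greedy, and repeat until the policy stops changing.

# Minimal tabular sketch (an illustration, not from the slides): with a known model, compute
# Q^pi exactly, update the policy greedily, and repeat; on a finite MDP this terminates.
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states (assumed)
R = rng.uniform(size=(S, A))                 # R[s, a] = expected immediate reward (assumed)

def q_values(pi):
    """Exact Q^pi: solve V^pi = R_pi + gamma * P_pi V^pi, then Q^pi = R + gamma * P V^pi."""
    P_pi = P[np.arange(S), pi]               # next-state distribution when a = pi(s)
    R_pi = R[np.arange(S), pi]
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
    return R + gamma * P @ V

pi = np.zeros(S, dtype=int)                  # start from an arbitrary deterministic policy
while True:
    greedy = q_values(pi).argmax(axis=1)     # step 2: update the policy to be greedy
    if np.array_equal(greedy, pi):
        break                                # greedy w.r.t. its own Q-values => optimal
    pi = greedy
print("greedy-optimal policy:", pi)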

  16. In practice, policy gradient methods rule…

  17. In practice, policy gradient methods rule… 
 • They are the most effective method for obtaining state of the art. 
  θ ← θ + η ∇_θ V^{π_θ}(s_0)

  18. In practice, policy gradient methods rule… 
 • They are the most effective method for obtaining state of the art. 
  θ ← θ + η ∇_θ V^{π_θ}(s_0) 
 • Why do we like them?

  19. In practice, policy gradient methods rule… 
 • They are the most effective method for obtaining state of the art. 
  θ ← θ + η ∇_θ V^{π_θ}(s_0) 
 • Why do we like them? 
 • They easily deal with large state/action spaces (through the neural net parameterization)

  20. In practice, policy gradient methods rule… 
 • They are the most effective method for obtaining state of the art. 
  θ ← θ + η ∇_θ V^{π_θ}(s_0) 
 • Why do we like them? 
 • They easily deal with large state/action spaces (through the neural net parameterization) 
 • We can estimate the gradient using only simulation of our current policy π_θ (the expectation is under the states/actions visited under π_θ)

  21. In practice, policy gradient methods rule… 
 • They are the most effective method for obtaining state of the art. 
  θ ← θ + η ∇_θ V^{π_θ}(s_0) 
 • Why do we like them? 
 • They easily deal with large state/action spaces (through the neural net parameterization) 
 • We can estimate the gradient using only simulation of our current policy π_θ (the expectation is under the states/actions visited under π_θ) 
 • They directly optimize the cost function of interest!
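 The "estimate the gradient using only simulation" bullet is usually made concrete with a score-function (REINFORCE-style) estimator. Below is a minimal sketch, an assumption for illustration rather than code from the talk: it samples a trajectory from a tabular softmax policy π_θ on a made-up MDP and forms ∇_θ V^{π_θ}(s_0) ≈ ∑_t γ^t ∇_θ log π_θ(a_t|s_t) · G_t, where G_t is the discounted return from time t.

# Minimal REINFORCE-style sketch (an illustration, not code from the talk): estimate
# grad_theta V^{pi_theta}(s_0) from a sampled trajectory of a tabular softmax policy.
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, horizon = 3, 2, 0.9, 100
P = rng.dirichlet(np.ones(S), size=(S, A))     # assumed random MDP, for illustration only
R = rng.uniform(size=(S, A))

def softmax(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def gradient_estimate(theta, s0=0):
    """One trajectory: sum_t gamma^t * grad_theta log pi(a_t|s_t) * (discounted return from t)."""
    pi = softmax(theta)
    s, traj = s0, []
    for _ in range(horizon):                   # roll out the current policy (simulation only)
        a = rng.choice(A, p=pi[s])
        traj.append((s, a, R[s, a]))
        s = rng.choice(S, p=P[s, a])
    grad, G = np.zeros_like(theta), 0.0
    for t in reversed(range(horizon)):         # returns-to-go, computed backwards
        s, a, r = traj[t]
        G = r + gamma * G
        score = -pi[s]                         # fresh array; grad_theta[s,:] log pi(a|s) = e_a - pi(.|s)
        score[a] += 1.0
        grad[s] += (gamma ** t) * score * G
    return grad

theta, eta = np.zeros((S, A)), 0.1
for _ in range(500):                           # theta <- theta + eta * (gradient estimate)
    theta += eta * gradient_estimate(theta)
# In practice one averages many trajectories per update and subtracts a baseline to cut variance.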

  22. The Optimization Landscape 
 Supervised Learning: 
 • Gradient descent tends to ‘just work’ in practice (not sensitive to initialization) 
 • Saddle points not a problem… 
 Reinforcement Learning: 
 • In many real RL problems, we have “very” flat regions. 
 • Gradients can be exponentially small in the “horizon” due to lack of exploration.

  23. The Optimization Landscape 
 Supervised Learning: 
 • Gradient descent tends to ‘just work’ in practice (not sensitive to initialization) 
 • Saddle points not a problem… 
 Reinforcement Learning: 
 • In many real RL problems, we have “very” flat regions. 
 • Gradients can be exponentially small in the “horizon” due to lack of exploration. 
 [figure: example MDP, Thrun ’92] 
 Lemma [Higher-order vanishing gradients]: Suppose there are S ≤ 1/(1 − γ) states in the MDP. With random initialization, all k-th higher-order gradients, for k < S/log(S), have spectral norm bounded by 2^(−S/2).

  24. The Optimization Landscape 
 Supervised Learning: 
 • Gradient descent tends to ‘just work’ in practice (not sensitive to initialization) 
 • Saddle points not a problem… 
 Reinforcement Learning: 
 • In many real RL problems, we have “very” flat regions. 
 • Gradients can be exponentially small in the “horizon” due to lack of exploration. 
 [figure: example MDP, Thrun ’92] 
 Lemma [Higher-order vanishing gradients]: Suppose there are S ≤ 1/(1 − γ) states in the MDP. With random initialization, all k-th higher-order gradients, for k < S/log(S), have spectral norm bounded by 2^(−S/2). 
 This talk: policy gradient methods are among the most widely used practical tools, so can we get any provable handle on them?
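 To see the flavor of this flat-landscape phenomenon numerically, here is a small sketch on a chain-style MDP (an illustrative construction assumed here, not the paper's exact example): at a uniform softmax initialization, the exact value and exact policy gradient at the start state are already exponentially small in the chain length.

# Sketch of the flat-landscape phenomenon (an illustrative chain construction, assumed here,
# not the paper's exact example): at a uniform softmax initialization, the exact value and
# exact policy gradient at the start state are exponentially small in the chain length.
import numpy as np

S, A = 15, 5                   # chain of S states; only action 0 advances, others reset
gamma = 1.0 - 1.0 / S          # effective horizon of order S
P = np.zeros((S, A, S))
R = np.zeros((S, A))
for s in range(S):
    P[s, 0, min(s + 1, S - 1)] = 1.0         # action 0 moves one step right
    for a in range(1, A):
        P[s, a, 0] = 1.0                     # any other action sends us back to the start
R[S - 1, 0] = 1.0                            # reward only at the far end of the chain

def value_and_gradient(theta, s0=0):
    pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,sat->st', pi, P)                            # P_pi[s, s'] under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(axis=1))
    Q = R + gamma * P @ V
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, np.eye(S)[s0])   # discounted visitation
    return V[s0], d[:, None] * pi * (Q - V[:, None])                 # exact policy gradient

V0, grad = value_and_gradient(np.zeros((S, A)))       # uniform initialization
print(f"V(s_0) = {V0:.2e}, max |grad| = {np.abs(grad).max():.2e}")   # both are already tiny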

  25. This talk: We provide provable global convergence and generalization guarantees for (nonconvex) policy gradient methods.

  26. This talk: We provide provable global convergence and generalization guarantees for (nonconvex) policy gradient methods. 
 • Part I: small state spaces + exact gradients (curvature + non-convexity) 
  • Vanilla PG 
  • PG with regularization 
  • Natural Policy Gradient

  27. This talk: We provide provable global convergence and generalization guarantees for (nonconvex) policy gradient methods. 
 • Part I: small state spaces + exact gradients (curvature + non-convexity) 
  • Vanilla PG 
  • PG with regularization 
  • Natural Policy Gradient 
 • Part II: large state spaces (generalization and distribution shift) 
  • Function approximation/deep nets? Why use PG?

  28. Part I: Small State Spaces (and the softmax policy class)

  29. Policy Optimization over the “softmax” policy class (let’s start simple!) 
 • Simplest way to parameterize the simplex, without constraints.

  30. Policy Optimization over the “softmax” policy class (let’s start simple!) 
 • Simplest way to parameterize the simplex, without constraints. 
 • π_θ(a|s) is the probability of action a given state s: 
  π_θ(a|s) = exp(θ_{s,a}) / ∑_{a′} exp(θ_{s,a′})
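 A minimal sketch of this tabular softmax parameterization (the array shapes here are illustrative assumptions):

# Minimal sketch of the tabular softmax policy class: one parameter theta[s, a] per state-action pair.
import numpy as np

def softmax_policy(theta):
    """pi_theta(a|s) = exp(theta[s, a]) / sum_{a'} exp(theta[s, a']); theta has shape (S, A)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))   # subtract the row max for stability
    return z / z.sum(axis=1, keepdims=True)

theta = np.zeros((4, 3))        # S = 4 states, A = 3 actions; all-zeros gives the uniform policy
print(softmax_policy(theta))    # each row sums to 1: a distribution over actions for that state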

  31. Policy Optimization over the “softmax” policy class (let’s start simple!) 
 • Simplest way to parameterize the simplex, without constraints. 
 • π_θ(a|s) is the probability of action a given state s: 
  π_θ(a|s) = exp(θ_{s,a}) / ∑_{a′} exp(θ_{s,a′}) 
 • Complete class: contains every stationary policy

  32. Policy Optimization over the “softmax” policy class (let’s start simple!) 
 • Simplest way to parameterize the simplex, without constraints. 
 • π_θ(a|s) is the probability of action a given state s: 
  π_θ(a|s) = exp(θ_{s,a}) / ∑_{a′} exp(θ_{s,a′}) 
 • Complete class: contains every stationary policy 
 The policy optimization problem max_θ V^{π_θ}(s_0) is non-convex. 
 Do we have global convergence?

  33. Global Convergence of PG for Softmax 
 Define V^θ(μ) = E_{s∼μ}[V^θ(s)], where μ is the starting state distribution, and update 
  θ ← θ + η ∇_θ V^θ(μ) 
 Theorem [Vanilla PG for the softmax policy class]: Suppose μ has full support over the state space. Then, for all states s, V^θ(s) → V^⋆(s).
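 To make the theorem's setting concrete, here is a minimal sketch (an illustration, not the authors' code) of vanilla PG with exact gradients for the tabular softmax class on a small random MDP; μ is taken to be uniform, so the full-support condition holds.

# Minimal sketch (not the authors' code): vanilla PG with *exact* gradients for the
# tabular softmax class on a small random MDP; mu is uniform, so it has full support.
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma, eta = 4, 3, 0.7, 3e-3           # small step size, of order (1 - gamma)^3
P = rng.dirichlet(np.ones(S), size=(S, A))   # assumed random MDP, for illustration only
R = rng.uniform(size=(S, A))
mu = np.full(S, 1.0 / S)                     # starting state distribution with full support

def value_and_exact_gradient(theta):
    pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,sat->st', pi, P)                        # P_pi[s, s'] under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(axis=1))
    Q = R + gamma * P @ V
    d_mu = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)       # discounted visitation from mu
    return V, d_mu[:, None] * pi * (Q - V[:, None])              # policy gradient theorem (softmax)

theta = np.zeros((S, A))
for _ in range(20000):                        # theta <- theta + eta * grad_theta V^theta(mu)
    V, grad = value_and_exact_gradient(theta)
    theta += eta * grad

V_star = np.zeros(S)                          # V* by value iteration, for comparison
for _ in range(2000):
    V_star = (R + gamma * P @ V_star).max(axis=1)
print("V^theta:", np.round(V, 3))             # per the theorem, V^theta(s) -> V*(s) for every s
print("V*:     ", np.round(V_star, 3))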
