

  1. Policy Gradient, Prof. Kuan-Ting Lai, 2020/5/22

  2. Advantages of Policy-based RL • Previously we focused on approximating the value or action-value function: V_θ(s) ≈ V^π(s), Q_θ(s, a) ≈ Q^π(s, a) • Policy gradient methods instead parameterize the policy directly: π_θ(s, a) = P[a | s, θ]

  3. 3 Types of Reinforcement Learning • Value-based − Learn value function − Implicit policy (e.g., DQN) • Policy-based − No value function − Learn policy directly (e.g., Policy Gradient) • Actor-critic − Learn both value function and policy

  4. Lex Fridman, MIT Deep Learning, https://deeplearning.mit.edu/

  5. Policy Objective Function • Goal: given a policy π_θ(s, a) with parameters θ, find the best θ • How to measure the quality of a policy? Use the value of the start state: J(θ) ≐ v_{π_θ}(s_0) = E[ Σ_a π(a|s) q_π(s, a) ]

  6. Short Corridor with Switched Actions (Example 13.1 in Sutton & Barto: a small gridworld whose best achievable policy is stochastic, motivating policy-based methods)

  7. Policy Optimization • Policy-based RL is an optimization problem that can be solved by: − Hill climbing − Simplex / amoeba / Nelder-Mead − Genetic algorithms − Gradient descent − Conjugate gradient − Quasi-Newton

  8. Computing Gradients by Finite Differences • Estimate the k-th partial derivative of the objective function w.r.t. θ by perturbing θ by a small amount ε in the k-th dimension: ∂J(θ)/∂θ_k ≈ (J(θ + ε u_k) − J(θ)) / ε, where u_k is the unit vector with 1 in the k-th component and 0 elsewhere • Simple, noisy, and inefficient, but sometimes works • Works for any kind of policy, even if the policy is not differentiable
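
  A minimal sketch of this estimator in Python (a sketch under assumptions, not the slides' code): J is any black-box evaluation of the policy, e.g., average return over a few rollouts, and all names are illustrative.

      import numpy as np

      def finite_difference_gradient(J, theta, eps=1e-4):
          # Estimate the gradient of a black-box policy objective J(theta)
          # by perturbing one parameter dimension at a time.
          grad = np.zeros_like(theta)
          base = J(theta)
          for k in range(len(theta)):
              u_k = np.zeros_like(theta)
              u_k[k] = 1.0                      # unit vector: 1 in the k-th component
              grad[k] = (J(theta + eps * u_k) - base) / eps
          return grad

      # Hypothetical usage: hill-climb the objective with the noisy estimate.
      # theta = theta + step_size * finite_difference_gradient(evaluate_policy, theta)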

  9. Score Function • Assume π_θ is differentiable wherever it is non-zero • The score function is ∇_θ log π_θ(s, a)
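
  The identity behind the score function (the likelihood-ratio trick) is worth writing out, since the estimators below rely on it:

      ∇_θ π_θ(s, a) = π_θ(s, a) · (∇_θ π_θ(s, a) / π_θ(s, a)) = π_θ(s, a) ∇_θ log π_θ(s, a)

  so gradients of expectations under π_θ can be estimated from samples of ∇_θ log π_θ(s, a) weighted by returns.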

  10. Softmax Policy • Weight actions by exponentiated action preferences (softmax) • Use a linear function of features for the preferences: h(s, a, θ) = θᵀ x(s, a), so π_θ(s, a) ∝ e^{θᵀ x(s, a)}
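
  A small sketch of the linear softmax policy and its score function (illustrative code, assuming a per-state feature matrix whose rows are x(s, a)):

      import numpy as np

      def softmax_policy(theta, features):
          # features: (num_actions, d) matrix whose rows are x(s, a).
          # Action preferences h(s, a, θ) = θᵀ x(s, a), turned into probabilities.
          prefs = features @ theta
          prefs = prefs - prefs.max()           # subtract max for numerical stability
          exp_prefs = np.exp(prefs)
          return exp_prefs / exp_prefs.sum()

      def softmax_score(theta, features, action):
          # Score function of the linear softmax policy:
          #   ∇_θ log π(a|s, θ) = x(s, a) − Σ_b π(b|s, θ) x(s, b)
          probs = softmax_policy(theta, features)
          return features[action] - probs @ features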

  11. Policy Gradient Theorem • Generalized policy gradient (proof in Sutton & Barto, 2nd ed., p. 325)
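
  In the notation of the cited chapter (Sutton & Barto, Ch. 13), the theorem states:

      ∇J(θ) ∝ Σ_s μ(s) Σ_a q_π(s, a) ∇_θ π(a|s, θ)

  where μ is the on-policy distribution over states under π.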

  12. Proof of Policy Gradient Theorem (2-1)

  13. Proof of Policy Gradient Theorem (2-2)

  14. REINFORCE: Monte Carlo Policy Gradient • REINFORCE update
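
  The update referred to here is the Monte Carlo policy gradient step from Sutton & Barto, Ch. 13.3:

      θ_{t+1} = θ_t + α γ^t G_t ∇_θ ln π(A_t | S_t, θ_t)

  where G_t is the return from time step t onward.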

  15. Pseudo Code of REINFORCE
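
  A Python rendering of that pseudocode (a sketch under assumptions, not the slides' code): it reuses softmax_policy / softmax_score from the earlier sketch, assumes a gym-style env whose step() returns (next_state, reward, done), and features_of(s) is a hypothetical helper returning the feature matrix for state s.

      import numpy as np

      def reinforce(env, features_of, theta, alpha=2e-4, gamma=1.0, num_episodes=1000):
          for _ in range(num_episodes):
              # 1) Generate an episode S_0, A_0, R_1, ... following π(·|·, θ).
              states, actions, rewards = [], [], []
              s, done = env.reset(), False
              while not done:
                  probs = softmax_policy(theta, features_of(s))
                  a = np.random.choice(len(probs), p=probs)
                  s_next, r, done = env.step(a)
                  states.append(s); actions.append(a); rewards.append(r)
                  s = s_next
              # 2) Compute the return G_t for every step of the episode.
              returns, G = [], 0.0
              for r in reversed(rewards):
                  G = r + gamma * G
                  returns.append(G)
              returns.reverse()
              # 3) Move θ along γ^t · G_t · ∇_θ ln π(A_t|S_t, θ) for each step.
              for t, (s, a, G_t) in enumerate(zip(states, actions, returns)):
                  theta = theta + alpha * (gamma ** t) * G_t * softmax_score(theta, features_of(s), a)
          return theta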

  16. REINFORCE on Short Corridor

  17. REINFORCE with Baseline • Include an arbitrary baseline function b(s) − The equation remains valid because subtracting b(s) does not change the gradient in expectation (the subtracted term is zero, as shown below)
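
  The subtracted baseline term has zero expected value, which is why it can be added without biasing the gradient:

      Σ_a b(s) ∇_θ π(a|s, θ) = b(s) ∇_θ Σ_a π(a|s, θ) = b(s) ∇_θ 1 = 0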

  18. Gradient of REINFORCE with Baseline
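
  Written out (following Sutton & Barto, Ch. 13.4), the gradient with a baseline and the resulting update are:

      ∇J(θ) ∝ Σ_s μ(s) Σ_a (q_π(s, a) − b(s)) ∇_θ π(a|s, θ)
      θ_{t+1} = θ_t + α γ^t (G_t − b(S_t)) ∇_θ ln π(A_t | S_t, θ_t)

  A common choice is b(s) = v̂(s, w), a learned estimate of the state value.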

  19. Baseline Can Help to Learn Faster

  20. Actor-Critic Methods • The REINFORCE baseline does not bootstrap − Using a learned state-value function both as the baseline and as a one-step bootstrapping target gives Actor-Critic methods
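
  The simplest case is one-step actor-critic (roughly following Sutton & Barto, Ch. 13.5): the critic learns v̂(s, w), and its TD error drives both the critic and the actor:

      δ_t = R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w)
      w ← w + α_w δ_t ∇_w v̂(S_t, w)
      θ ← θ + α_θ γ^t δ_t ∇_θ ln π(A_t | S_t, θ)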

  21. Policy Gradient for Continuing Problems • Continuing problems have no episode boundaries − Use the average reward per time step as the objective, with values learned by TD methods (e.g., TD(λ))
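
  In the average-reward formulation (following Sutton & Barto), the objective and the differential return are:

      r(π) ≐ lim_{h→∞} (1/h) Σ_{t=1}^{h} E[R_t | A_{0:t−1} ∼ π]
      G_t ≐ (R_{t+1} − r(π)) + (R_{t+2} − r(π)) + (R_{t+3} − r(π)) + …

  and the TD error uses differential values: δ_t = R_{t+1} − R̄ + v̂(S_{t+1}, w) − v̂(S_t, w), where R̄ is an estimate of r(π).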

  22. Actor-Critic with Eligibility Traces
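
  Sketched from Sutton & Barto, Ch. 13 (episodic case, with the γ^t factor folded into the actor's trace): each step updates decaying eligibility traces of the critic and actor gradients, then moves both parameter vectors along the TD error.

      δ ← R + γ v̂(S′, w) − v̂(S, w)
      z_w ← γ λ_w z_w + ∇_w v̂(S, w)
      z_θ ← γ λ_θ z_θ + γ^t ∇_θ ln π(A | S, θ)
      w ← w + α_w δ z_w
      θ ← θ + α_θ δ z_θ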

  23. Policy Parameterization for Continuous Actions
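
  A common parameterization for continuous actions (used in the cited chapter) is a Gaussian whose mean and standard deviation are functions of the state:

      π(a | s, θ) ≐ (1 / (σ(s, θ) √(2π))) · exp( −(a − μ(s, θ))² / (2 σ(s, θ)²) )
      with, e.g., μ(s, θ) = θ_μᵀ x_μ(s) and σ(s, θ) = exp(θ_σᵀ x_σ(s))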

  24. References 1. David Silver, Lecture 7: Policy Gradient 2. Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction," 2nd edition, Nov. 2018, Chapter 13
