  1. Dynamic Programming
  Prof. Kuan-Ting Lai
  2020/4/10

  2. Dynamic Programming
  • Dynamic Programming is for problems with two properties:
    1. Optimal substructure
    • The optimal solution can be decomposed into subproblems
    2. Overlapping subproblems
    • Subproblems recur many times
    • Solutions can be cached and reused
  • Examples:
    − Shortest path, Tower of Hanoi, …
    − Markov Decision Processes
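To make the two properties concrete, here is a minimal, illustrative sketch (not from the slides): a memoized Fibonacci, where the recursion shows optimal substructure and the cache exploits the overlapping subproblems.

```python
# Illustrative sketch only: memoized Fibonacci showing both DP properties.
from functools import lru_cache

@lru_cache(maxsize=None)   # overlapping subproblems: cache and reuse solutions
def fib(n: int) -> int:
    if n < 2:              # base cases
        return n
    # optimal substructure: fib(n) decomposes into smaller subproblems,
    # and those subproblems recur (fib(n-1) itself needs fib(n-2))
    return fib(n - 1) + fib(n - 2)

print(fib(50))             # 12586269025, computed in linear time thanks to the cache
```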

  3. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning series), p. 189

  4. Dynamic Programming for MDPs
  • The Bellman equation gives a recursive decomposition
  • The value function stores and reuses solutions
  • Dynamic programming assumes full knowledge of the MDP
  • Used for model-based planning

  5. Policy Evaluation (Prediction)
  • Calculate the state-value function $v_\pi$ for an arbitrary policy $\pi$
  • Can be solved iteratively:
    $v_{k+1}(s) \leftarrow \mathbb{E}_\pi[R_{t+1} + \gamma\, v_k(S_{t+1}) \mid S_t = s]$
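A minimal sketch of this iterative update, assuming a tabular MDP stored as P[s][a] = [(prob, next_state, reward), ...] and a stochastic policy given as action probabilities per state; this representation and all names are illustrative, not from the slides.

```python
import numpy as np

def policy_evaluation(P, policy, gamma=1.0, theta=1e-8):
    """Sweep v(s) <- sum_a pi(a|s) sum_{s'} p * (r + gamma * v(s')) until stable."""
    v = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = sum(pi_a * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                        for a, pi_a in policy[s].items())
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new
        if delta < theta:        # stop once the largest update is negligible
            return v
```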

  6. Policy Evaluation in a Small Grid World
  • One terminal state (shown twice, as the shaded squares)
  • Actions leading out of the grid leave the state unchanged
  • Reward is -1 until the terminal state is reached
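Putting the update to work on this grid world, a sketch under the slide's assumptions (4x4 grid, uniform random policy, undiscounted, reward -1 per step); the code itself is illustrative.

```python
import numpy as np

N = 4                                    # 4x4 grid; states 0 and 15 are terminal
TERMINAL = {0, N * N - 1}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, a):
    r, c = divmod(s, N)
    dr, dc = a
    r2, c2 = r + dr, c + dc
    if not (0 <= r2 < N and 0 <= c2 < N):      # moving off the grid leaves the state unchanged
        r2, c2 = r, c
    return r2 * N + c2, -1.0                   # reward is -1 on every transition

v = np.zeros(N * N)
for _ in range(1000):                          # sweep until (approximately) converged
    v_new = np.zeros_like(v)
    for s in range(N * N):
        if s in TERMINAL:
            continue                           # terminal value stays 0
        # uniform random policy: each action with probability 1/4, undiscounted
        v_new[s] = sum(0.25 * (r + v[s2]) for s2, r in (step(s, a) for a in ACTIONS))
    v = v_new
print(v.reshape(N, N).round(1))                # first row matches the classic result: 0, -14, -20, -22
```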

  7. How to Improve a Policy
  1. Evaluate the policy:
     $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \cdots \mid S_t = s]$
  2. Improve the policy by acting greedily with respect to $v_\pi$:
     $\pi' = \mathrm{greedy}(v_\pi)$
  • This process of policy iteration always converges to the optimal policy $\pi^*$
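A sketch of step 2, greedy improvement, reusing the illustrative P[s][a] = [(prob, next_state, reward), ...] format assumed above; the function name greedy mirrors the slide's notation, but the implementation is a sketch.

```python
def greedy(P, v, gamma=1.0):
    """pi'(s) = argmax_a q(s, a), with q(s, a) = sum_{s'} p * (r + gamma * v(s'))."""
    policy = {}
    for s in range(len(P)):
        q = {a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s]}
        policy[s] = max(q, key=q.get)   # act greedily with respect to v
    return policy
```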

  8. Policy Iteration
  • Policy evaluation: estimate $v_\pi$
  • Policy improvement: generate $\pi' \geq \pi$
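Combining the two steps gives the full loop. This sketch reuses the policy_evaluation and greedy functions from the sketches above and stops when the policy no longer changes, at which point it is optimal.

```python
def policy_iteration(P, gamma=1.0):
    # start from an arbitrary deterministic policy: the first action in each state
    policy = {s: next(iter(P[s])) for s in range(len(P))}
    while True:
        probs = {s: {a: 1.0} for s, a in policy.items()}   # deterministic -> probabilities
        v = policy_evaluation(P, probs, gamma)             # evaluate pi
        improved = greedy(P, v, gamma)                     # improve greedily w.r.t. v_pi
        if improved == policy:                             # stable policy => pi*, v*
            return policy, v
        policy = improved
```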

  9. Jack’s Car Rental

  10. Policy Improvement (1)

  11. Policy Improvement (2)

  12. Modified Policy Iteration
  • Do we need to evaluate iteratively until $v_\pi$ converges?
  • Can we simply stop after k iterations?
    − Example: the small grid world reaches the optimal policy after k = 3 iterations
  • Update the policy every iteration? ⇒ Value Iteration
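Under the assumption that evaluation is simply cut off after k sweeps, the only change to the earlier policy_evaluation sketch is the stopping rule:

```python
def truncated_evaluation(P, policy, gamma=1.0, k=3):
    v = np.zeros(len(P))
    for _ in range(k):             # k sweeps instead of sweeping to convergence
        for s in range(len(P)):
            v[s] = sum(pi_a * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                       for a, pi_a in policy[s].items())
    return v
```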

  13. Value Iteration
  • Update the value function $v$ only; the policy function $\pi$ is not computed explicitly
  • The policy is implicitly built from $v$
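A minimal sketch of value iteration over the same illustrative MDP format: only v is updated, using the Bellman optimality backup, and a policy is read off greedily only at the end.

```python
import numpy as np

def value_iteration(P, gamma=1.0, theta=1e-8):
    v = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Bellman optimality backup: v(s) <- max_a sum_{s'} p * (r + gamma * v(s'))
            best = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s])
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    # the policy is implicit: read it off greedily from the converged v
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * v[s2])
                                             for p, s2, r in P[s][a]))
              for s in range(len(P))}
    return v, policy
```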

  14. Shortest Path Example

  15. Policy Iteration vs. Value Iteration
  • Policy iteration: alternates full policy evaluation with greedy policy improvement, maintaining an explicit policy
  • Value iteration: applies the Bellman optimality backup directly to $v$, with no explicit policy until the end

  16. References
  • David Silver, Lecture 3: Planning by Dynamic Programming (https://www.youtube.com/watch?v=Nd1-UUMVfz4&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=3)
  • Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction," 2nd edition, Nov. 2018, Chapter 4
