

  1. Finite Markov Decision Processes (MDP) Prof. Kuan-Ting Lai 2020/3/20

  2. Markov Decision Process (MDP) https://en.wikipedia.org/wiki/Markov_decision_process

  3. Markov Property • The current state captures all relevant information from past states • i.e. memoryless • Let bygones be bygones
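
In symbols, the Markov property (standard definition) says that the distribution of the next state depends only on the current state:

    \Pr[S_{t+1} \mid S_t] = \Pr[S_{t+1} \mid S_1, \dots, S_t]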

  4. Markov Process • A Markov process is a memoryless random process, i.e. a sequence of random states S_1, S_2, … with the Markov property • The transition probability P(s, s') is the probability of moving from state s to state s'
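
In the standard notation, the transition probability from state s to a successor state s' is

    \mathcal{P}_{ss'} = \Pr[S_{t+1} = s' \mid S_t = s]

and collecting these values for all state pairs gives the transition matrix \mathcal{P}, in which every row sums to 1.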

  5. Student Markov Chain

  6. Student Markov Chain Episodes

  7. Example: Student Markov Chain Transition Matrix
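
As a concrete sketch of the transition matrix, the student Markov chain can be encoded and sampled in a few lines of Python. The state names and probabilities below are taken from the example in David Silver's Lecture 2 (see the references), so treat the exact numbers as illustrative:

    import numpy as np

    # States of the student Markov chain (names from Silver's Lecture 2 example).
    states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]

    # P[i, j] = probability of moving from state i to state j; each row sums to 1.
    # Probabilities follow Silver's slides and are illustrative.
    P = np.array([
        # C1   C2   C3  Pass  Pub   FB  Sleep
        [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
        [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
        [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
        [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
        [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (terminal)
    ])

    def sample_episode(start="C1", seed=0):
        """Sample one episode until the terminal Sleep state is reached."""
        rng = np.random.default_rng(seed)
        s = states.index(start)
        episode = [states[s]]
        while states[s] != "Sleep":
            s = rng.choice(len(states), p=P[s])
            episode.append(states[s])
        return episode

    print(sample_episode())  # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']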

  8. Adding Reward to Markov Process • A Markov reward process is a Markov chain with values.
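
Formally, a Markov reward process is a tuple ⟨S, P, R, γ⟩ (the standard definition), where R is a reward function and γ is a discount factor:

    \mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]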

  9. Student MRP

  10. Discounted Future Return G_t • The discount factor γ ∈ [0, 1] determines the present value of future rewards − γ close to 0 leads to “short-sighted” evaluation − γ close to 1 leads to “far-sighted” evaluation
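
The discounted return G_t is the total discounted reward from time step t (standard definition):

    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}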

  11. Why add a discount factor γ? • Uncertainty about the future • Avoids infinite returns in cyclic Markov processes • Animal/human behaviour shows a preference for immediate reward

  12. Value Function • The value function v(s) estimates the long-term value of state s
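
Formally, the value function of an MRP is the expected return starting from state s:

    v(s) = \mathbb{E}[G_t \mid S_t = s]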

  13. Student MRP Returns • γ = 1/2
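
To make this concrete, a minimal sketch that computes the return of one episode with γ = 1/2; the per-state rewards (each class −2, Facebook −1, Pub +1, Pass +10, Sleep 0) follow Silver's student MRP example and are illustrative:

    # Rewards for leaving each state, per Silver's student MRP example (illustrative).
    R = {"C1": -2, "C2": -2, "C3": -2, "Pass": 10, "Pub": 1, "FB": -1, "Sleep": 0}

    def discounted_return(episode, gamma=0.5):
        """G = R_1 + gamma*R_2 + gamma^2*R_3 + ... over an episode of state names."""
        return sum(R[s] * gamma**k for k, s in enumerate(episode))

    # Episode C1 -> C2 -> C3 -> Pass -> Sleep:
    print(discounted_return(["C1", "C2", "C3", "Pass", "Sleep"]))
    # -2 - 2*(1/2) - 2*(1/4) + 10*(1/8) = -2.25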

  14. State-Value Function for Student MRP (1)

  15. State-Value Function for Student MRP (2)

  16. State-Value Function for Student MRP (3)

  17. Bellman Equation for MRPs • The value function can be decomposed into two parts: − immediate reward R_{t+1} − discounted value of the next state γ v(S_{t+1})
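
Written out, the Bellman equation for an MRP is

    v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]
         = \mathcal{R}_s + \gamma \sum_{s'} \mathcal{P}_{ss'} v(s')

or, in matrix form, v = \mathcal{R} + \gamma \mathcal{P} v.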

  18. Backup Diagram for Bellman Equation

  19. Calculating the Student MRP using the Bellman Equation
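
Because the Bellman equation for an MRP is linear, a small problem like the student MRP can be solved in closed form, v = (I − γP)⁻¹R. A minimal sketch reusing the illustrative `states` and `P` defined above:

    # Reward vector aligned with the state ordering of P (illustrative values).
    R_vec = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])
    gamma = 0.9  # any gamma < 1 keeps (I - gamma*P) invertible

    # Solve (I - gamma*P) v = R, i.e. the matrix form of the Bellman equation.
    v = np.linalg.solve(np.eye(len(R_vec)) - gamma * P, R_vec)
    for name, value in zip(states, v):
        print(f"{name:>5}: {value:7.2f}")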

  20. Markov Decision Process • A Markov decision process (MDP) is a Markov reward process with decisions.
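
Formally, an MDP is a tuple ⟨S, A, P, R, γ⟩ (standard definition), where the dynamics and rewards are conditioned on the chosen action:

    \mathcal{P}^a_{ss'} = \Pr[S_{t+1} = s' \mid S_t = s, A_t = a], \qquad
    \mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]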

  21. Student MDP with Actions

  22. Policy • MDP policies depend only on the current state, i.e. they are stationary
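
A (stochastic) policy is a distribution over actions given the current state:

    \pi(a \mid s) = \Pr[A_t = a \mid S_t = s]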

  23. Policies

  24. Value Function
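
For an MDP, the state-value and action-value functions are defined under a policy π (standard definitions):

    v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s], \qquad
    q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]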

  25. State-Value Function for Student MDP

  26. Backup Diagram for v_π and q_π

  27. Bellman Expectation Equation for Student MDP
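
The Bellman expectation equations relate v_π and q_π (standard form):

    v_\pi(s) = \sum_{a} \pi(a \mid s) \, q_\pi(s, a), \qquad
    q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \, v_\pi(s')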

  28. Optimal Value Function
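
The optimal value functions take the best achievable value over all policies, and satisfy the Bellman optimality equations:

    v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)

    v_*(s) = \max_a q_*(s, a), \qquad
    q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \, v_*(s')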

  29. Optimal Value Function for Student MDP

  30. Optimal Action-Value Function for Student MDP
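
Unlike the MRP case, the Bellman optimality equation is nonlinear (because of the max), so it is usually solved iteratively. Below is a minimal value-iteration sketch for a generic finite MDP; the tabular P[s, a, s'] / R[s, a] layout is an assumption for illustration, not notation from these slides:

    import numpy as np

    def value_iteration(P, R, gamma=0.9, tol=1e-8):
        """Value iteration for a finite MDP.

        P: array of shape (S, A, S), P[s, a, s'] = transition probability
        R: array of shape (S, A),    R[s, a]     = expected immediate reward
        Returns the optimal state values v* and a greedy policy.
        """
        n_states = P.shape[0]
        v = np.zeros(n_states)
        while True:
            # Bellman optimality backup: q(s,a) = R(s,a) + gamma * sum_s' P * v(s')
            q = R + gamma * (P @ v)        # shape (S, A)
            v_new = q.max(axis=1)
            if np.max(np.abs(v_new - v)) < tol:
                break
            v = v_new
        return v, q.argmax(axis=1)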

  31. References • David Silver, Lecture 2: Markov Decision Processes, Reinforcement Learning (https://www.youtube.com/watch?v=lfHX2hHRMVQ&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=2) • Chapter 3, Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An Introduction,” 2nd edition, Nov. 2018
