

  1. Markov Decision Processes — CS60077: Reinforcement Learning. Abir Das, IIT Kharagpur. July 26, Aug 01, 02, 08, 2019

  2. Agenda
  § Understand definitions and notation to be used in the course.
  § Understand the definition and setup of sequential decision problems.

  3. Resources
  § Reinforcement Learning by David Silver [Link]
  § Deep Reinforcement Learning by Sergey Levine [Link]
  § SB (Sutton and Barto): Chapter 3

  4–13. Terminology and Notation
  [Figure sequence building up agent–environment terminology; the only textual content that survives is a list of candidate actions: 1. run away, 2. ignore, 3. pet.]
  Figure credit: S. Levine - CS 294-112 Course, UC Berkeley

  14. Markov Property
  The future is independent of the past given the present.
  Definition: A state $S_t$ is Markov if and only if
  $$\mathbb{P}(S_{t+1} \mid S_t) = \mathbb{P}(S_{t+1} \mid S_t, S_{t-1}, S_{t-2}, \cdots, S_1)$$
  [Portrait: Andrey Markov]
  § Once the present state is known, the history may be thrown away.
  § The current state is a sufficient statistic of the future.

  15. Markov Chain
  A Markov Chain or Markov Process is a temporal process, i.e., a sequence of random states $S_1, S_2, \cdots$ where the states obey the Markov property.
  Definition: A Markov Process is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$, where
  § $\mathcal{S}$ is the state space (can be continuous or discrete)
  § $\mathcal{P}$ is the state transition probability matrix ($\mathcal{P}$ is also called an operator):
  $$\mathcal{P} = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{pmatrix}, \quad \text{where } P_{ss'} = \mathbb{P}(S_{t+1} = s' \mid S_t = s)$$
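  A Markov process is fully specified by $\mathcal{S}$ and $\mathcal{P}$. The following is a minimal sketch in Python; the three states and the matrix values are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-state Markov process; each row of P must sum to 1.
states = ["s1", "s2", "s3"]            # assumed example state names
P = np.array([[0.1, 0.6, 0.3],         # P[i, j] = P(S_{t+1} = s_j | S_t = s_i)
              [0.4, 0.4, 0.2],
              [0.0, 0.5, 0.5]])

def step(s: int) -> int:
    """Sample the next state given only the current state (Markov property)."""
    return rng.choice(len(states), p=P[s])

# Sample a short trajectory starting from s1.
s, traj = 0, [0]
for _ in range(5):
    s = step(s)
    traj.append(s)
print([states[i] for i in traj])
```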

  16–17. Markov Chain
  Let $\mu_{t,i} = \mathbb{P}(S_t = s_i)$ and $\mu_t = \left( \mu_{t,1}, \mu_{t,2}, \cdots, \mu_{t,n} \right)^T$, i.e., $\mu_t$ is a vector of probabilities. Then
  $$\mu_{t+1} = \mathcal{P}^T \mu_t$$
  $$\begin{pmatrix} \mu_{t+1,1} \\ \mu_{t+1,2} \\ \vdots \\ \mu_{t+1,n} \end{pmatrix} = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{pmatrix}^T \begin{pmatrix} \mu_{t,1} \\ \mu_{t,2} \\ \vdots \\ \mu_{t,n} \end{pmatrix}$$
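  The update $\mu_{t+1} = \mathcal{P}^T \mu_t$ is just a matrix–vector product. A quick numerical check, reusing the same illustrative matrix as in the sketch above:

```python
import numpy as np

P = np.array([[0.1, 0.6, 0.3],     # same illustrative 3-state chain as above
              [0.4, 0.4, 0.2],
              [0.0, 0.5, 0.5]])

mu = np.array([1.0, 0.0, 0.0])     # start deterministically in s1
for t in range(3):
    mu = P.T @ mu                  # mu_{t+1} = P^T mu_t
    print(f"t={t + 1}:", mu, "sum =", mu.sum())  # stays a valid distribution
```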

  18. Student Markov Process
  [Figure: transition diagram of the Student Markov process with states Class 1, Class 2, Class 3, Pass, Pub, Facebook, Sleep; the edge probabilities are listed in the transition matrix on slide 20.]
  Figure credit: David Silver, DeepMind

  19. Student Markov Process - Episodes
  Sample episodes for the Student Markov process starting from $S_1 = $ C1:
  § C1 C2 C3 Pass Sleep
  § C1 FB FB C1 C2 Sleep
  § C1 C2 C3 Pub C2 C3 Pass Sleep
  § C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
  Figure credit: David Silver, DeepMind

  20. Student Markov Process - Transition Matrix
  Rows are the current state, columns the next state; blank entries are 0.

```
        C1    C2    C3    Pass  Pub   FB    Sleep
C1            0.5                     0.5
C2                  0.8                     0.2
C3                        0.6   0.4
Pass                                        1.0
Pub     0.2   0.4   0.4
FB      0.1                           0.9
Sleep                                       1.0
```

  Figure credit: David Silver, DeepMind
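  The matrix above is enough to reproduce episodes like those on slide 19. A sketch (the state ordering follows the matrix; Sleep is absorbing, so the rollout stops there):

```python
import numpy as np

rng = np.random.default_rng(1)

names = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1   -> C2 or FB
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2   -> C3 or Sleep
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3   -> Pass or Pub
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass -> Sleep
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub  -> C1, C2 or C3
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB   -> C1 or FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep is absorbing
])

def sample_episode(start=0):
    """Roll the chain forward from C1 until it reaches Sleep."""
    s, episode = start, [names[start]]
    while names[s] != "Sleep":
        s = rng.choice(len(names), p=P[s])
        episode.append(names[s])
    return episode

print(sample_episode())   # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']
```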

  21–22. Markov Reward Process
  A Markov reward process is a Markov process with rewards.
  Definition: A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where
  § $\mathcal{S}$ is the state space (can be continuous or discrete)
  § $\mathcal{P}$ is the state transition probability matrix ($\mathcal{P}$ is also called an operator), $P_{ss'} = \mathbb{P}(S_{t+1} = s' \mid S_t = s)$
  § $\mathcal{R}$ is a reward function, $\mathcal{R} = \mathbb{E}\left[ R_{t+1} \mid S_t = s \right] = R(s)$
  § $\gamma$ is a discount factor, $\gamma \in [0, 1]$

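  An MRP only adds a reward signal to the chain: with a state-indexed reward vector $R$, each transition out of $S_t$ also emits $R(S_t)$. A minimal sketch; the matrix, rewards, and $\gamma$ below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy 3-state MRP; P, R, and gamma are assumed values for illustration.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2],
              [0.0, 0.5, 0.5]])
R = np.array([-1.0, 0.0, 2.0])   # R(s) = E[R_{t+1} | S_t = s]
gamma = 0.9

def rollout(s, horizon):
    """Collect (state, reward) pairs from a finite-horizon rollout."""
    pairs = []
    for _ in range(horizon):
        pairs.append((s, R[s]))                 # reward for leaving s
        s = rng.choice(len(R), p=P[s])
    return pairs

print(rollout(0, 5))
```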

  23. Student Markov Reward Process
  [Figure: the Student Markov process of slide 18 augmented with rewards: R = -2 in Class 1, Class 2 and Class 3, R = -1 in Facebook, R = +1 in Pub, R = +10 in Pass, R = 0 in Sleep.]
  Figure credit: David Silver, DeepMind

  24. Return
  Definition: The return $G_t$ is the total discounted reward from timestep $t$:
  $$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \quad (1)$$
  § The discount $\gamma \in [0, 1]$ determines the present value of future rewards.
  § Immediate rewards are valued above delayed rewards.
  ◮ $\gamma$ close to 0 leads to “myopic” evaluation.
  ◮ $\gamma$ close to 1 leads to “far-sighted” evaluation.
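  Applying Equation (1) to the Student MRP: for the episode C1 C2 C3 Pass Sleep, with $\gamma = 1/2$ assumed for illustration, the return from C1 is $G_1 = -2 - 2\gamma - 2\gamma^2 + 10\gamma^3 = -2.25$. A sketch using the rewards from slide 23:

```python
# Discounted return of one Student MRP episode, per Equation (1).
# gamma = 0.5 is an assumed value for illustration.
rewards = {"C1": -2, "C2": -2, "C3": -2, "Pass": 10,
           "Pub": 1, "FB": -1, "Sleep": 0}
episode = ["C1", "C2", "C3", "Pass", "Sleep"]
gamma = 0.5

# G_t = sum_k gamma^k R_{t+k+1}, where R_{t+1} is the reward for leaving S_t;
# the terminal state Sleep emits no further reward.
G = sum(gamma**k * rewards[s] for k, s in enumerate(episode[:-1]))
print(G)   # -2.25
```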

  25–26. Why Discount?
  Most Markov reward and decision processes are discounted. Why?
  § Uncertainty about the future may not be fully represented
