Markov Decision Processes
CS60077: Reinforcement Learning
Abir Das, IIT Kharagpur
Sep 14 and 15, 2020
Agenda

§ Understand definitions and notation to be used in the course.
§ Understand the definition and setup of sequential decision problems.
Resources

§ Reinforcement Learning by David Silver [Link]
§ Deep Reinforcement Learning by Sergey Levine [Link]
§ SB: Chapter 3
Terminology and Notation

[Figure sequence: an agent choosing among candidate actions given its observation; the actions in the example are 1. run away, 2. ignore, 3. pet]
Figure credit: S. Levine - CS 294-112 Course, UC Berkeley
Markov Property

"The future is independent of the past given the present."

Definition
A state $S_t$ is Markov if and only if
$$\mathbb{P}(S_{t+1} \mid S_t) = \mathbb{P}(S_{t+1} \mid S_t, S_{t-1}, S_{t-2}, \cdots, S_1)$$

§ Once the present state is known, the history may be thrown away.
§ The current state is a sufficient statistic of the future.
Markov Chain

A Markov Chain or Markov Process is a temporal process, i.e., a sequence of random states $S_1, S_2, \cdots$ where the states obey the Markov property.

Definition
A Markov Process is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$, where
§ $\mathcal{S}$ is the state space (can be continuous or discrete)
§ $\mathcal{P}$ is the state transition probability matrix ($\mathcal{P}$ is also called an operator),
$$\mathcal{P} = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{pmatrix}, \qquad \text{where } P_{ss'} = \mathbb{P}(S_{t+1} = s' \mid S_t = s)$$
Markov Chain

Let $\mu_{t,i} = \mathbb{P}(S_t = s_i)$ and $\mu_t = \left[\mu_{t,1}, \mu_{t,2}, \cdots, \mu_{t,n}\right]^T$, i.e., $\mu_t$ is a vector of probabilities over states. Then
$$\mu_{t+1} = \mathcal{P}^T \mu_t$$
i.e.,
$$\begin{pmatrix} \mu_{t+1,1} \\ \mu_{t+1,2} \\ \vdots \\ \mu_{t+1,n} \end{pmatrix} = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{pmatrix}^{T} \begin{pmatrix} \mu_{t,1} \\ \mu_{t,2} \\ \vdots \\ \mu_{t,n} \end{pmatrix}$$
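As an aside not on the slides, the update $\mu_{t+1} = \mathcal{P}^T \mu_t$ is one line of NumPy; a minimal sketch, where the two-state chain and its probabilities are made-up illustrative values:

```python
import numpy as np

# Propagate a state distribution one step: mu_{t+1} = P^T mu_t.
# The 2-state transition matrix below uses made-up illustrative values.
P = np.array([[0.9, 0.1],   # P[i, j] = P(S_{t+1} = s_j | S_t = s_i)
              [0.5, 0.5]])

mu = np.array([1.0, 0.0])   # start deterministically in state s_1
for t in range(5):
    mu = P.T @ mu           # one application of the operator P^T
    print(f"t = {t + 1}: mu = {mu}")
```

Iterating this update converges to the chain's stationary distribution when one exists, which is why $\mathcal{P}^T$ is usefully viewed as an operator on distributions.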
Student Markov Process

[Figure: transition diagram over the states Class 1, Class 2, Class 3, Pass, Pub, Facebook, and Sleep; the edge labels are the transition probabilities tabulated on the Transition Matrix slide below]
Figure credit: David Silver, DeepMind
Student Markov Process - Episodes

Sample episodes for the Student Markov process starting from $S_1 =$ C1:
§ C1 C2 C3 Pass Sleep
§ C1 FB FB C1 C2 Sleep
§ C1 C2 C3 Pub C2 C3 Pass Sleep
§ C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep

[Figure: the Student Markov process transition diagram, shown alongside the episodes]
Figure credit: David Silver, DeepMind
Student Markov Process - Transition Matrix

Rows are source states, columns are destination states; blank entries are 0.
$$\mathcal{P} = \begin{array}{c|ccccccc}
 & C1 & C2 & C3 & Pass & Pub & FB & Sleep \\ \hline
C1 & & 0.5 & & & & 0.5 & \\
C2 & & & 0.8 & & & & 0.2 \\
C3 & & & & 0.6 & 0.4 & & \\
Pass & & & & & & & 1.0 \\
Pub & 0.2 & 0.4 & 0.4 & & & & \\
FB & 0.1 & & & & & 0.9 & \\
Sleep & & & & & & & 1.0
\end{array}$$
Figure credit: David Silver, DeepMind
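Given this matrix, episodes like those on the previous slide can be sampled with a few lines of NumPy. A sketch, assuming the state ordering above; the helper `sample_episode` is my own construction, not from the slides:

```python
import numpy as np

# Sample episodes from the Student Markov process using the
# transition matrix above. State order matches the matrix rows/columns.
states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = np.array([
    #  C1   C2   C3  Pass  Pub   FB  Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (absorbing)
])

def sample_episode(rng, start="C1"):
    """Roll out one episode until the absorbing Sleep state is reached."""
    episode, s = [start], states.index(start)
    while states[s] != "Sleep":
        s = rng.choice(len(states), p=P[s])   # draw next state from row s
        episode.append(states[s])
    return episode

rng = np.random.default_rng(0)
for _ in range(3):
    print(" ".join(sample_episode(rng)))
```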
Markov Reward Process

A Markov reward process is a Markov process with rewards.

Definition
A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where
§ $\mathcal{S}$ is the state space (can be continuous or discrete)
§ $\mathcal{P}$ is the state transition probability matrix ($\mathcal{P}$ is also called an operator), $P_{ss'} = \mathbb{P}(S_{t+1} = s' \mid S_t = s)$
§ $\mathcal{R}$ is a reward function, $\mathcal{R}(s) = \mathbb{E}\left[R_{t+1} \mid S_t = s\right]$
§ $\gamma$ is a discount factor, $\gamma \in [0, 1]$
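The tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ maps directly onto a small data structure; a minimal sketch, with field names that are illustrative choices rather than anything fixed by the slides:

```python
from dataclasses import dataclass
import numpy as np

# Minimal container mirroring the MRP tuple <S, P, R, gamma>.
# Field names are illustrative, not from the slides.
@dataclass
class MRP:
    states: list          # S: the (discrete) state space, e.g. state names
    P: np.ndarray         # P[i, j] = P(S_{t+1} = s_j | S_t = s_i)
    R: np.ndarray         # R[i] = E[R_{t+1} | S_t = s_i]
    gamma: float          # discount factor, in [0, 1]
```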
Student Markov Reward Process

[Figure: the Student Markov process transition diagram with a reward attached to each state: R = -2 for Class 1, Class 2, and Class 3; R = +1 for Pub; R = -1 for Facebook; R = +10 for Pass; R = 0 for Sleep]
Figure credit: David Silver, DeepMind
Return

Definition
The return $G_t$ is the total discounted reward from timestep $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \tag{1}$$

§ The discount $\gamma \in [0, 1]$ sets the present value of future rewards: a reward received $k$ steps ahead is weighted by $\gamma^k$.
§ Immediate rewards are valued above delayed rewards.
  ◮ $\gamma$ close to 0 leads to "myopic" evaluation.
  ◮ $\gamma$ close to 1 leads to "far-sighted" evaluation.
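To make Eq. (1) concrete, here is a short sketch computing the return for the first sample episode of the Student MRP, using the per-state rewards from the previous slide; the choice $\gamma = 0.5$ is illustrative, not fixed by the slides:

```python
# Return G_1 = R_2 + gamma*R_3 + ... for one Student MRP episode, under the
# convention that leaving state S_t yields reward R_{t+1} = R(S_t).
# Per-state rewards are from the slide above; gamma = 0.5 is illustrative.
R = {"C1": -2, "C2": -2, "C3": -2, "Pass": 10, "Pub": 1, "FB": -1, "Sleep": 0}

def discounted_return(episode, gamma):
    """Sum gamma^k * R(S_{1+k}) along the episode (Eq. 1 with t = 1)."""
    return sum(gamma**k * R[s] for k, s in enumerate(episode))

episode = ["C1", "C2", "C3", "Pass", "Sleep"]   # first sample episode above
print(discounted_return(episode, gamma=0.5))    # -2 - 1 - 0.5 + 1.25 = -2.25
```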
Why Discount?

Most Markov reward and decision processes are discounted. Why?
§ Uncertainty about the future may not be fully represented.
§ Immediate rewards are valued above delayed rewards.