Markov Decision Processes
CS60077: Reinforcement Learning
Abir Das, IIT Kharagpur
July 26, Aug 01, 02, 08, 2019
Agenda
§ Understand definitions and notation to be used in the course.
§ Understand the definition and setup of sequential decision problems.
Resources
§ Reinforcement Learning by David Silver [Link]
§ Deep Reinforcement Learning by Sergey Levine [Link]
§ SB: Chapter 3
Terminology and Notation
(Figure: a running example in which an agent observes the world and must choose among actions: 1. run away, 2. ignore, 3. pet.)
Figure credit: S. Levine - CS 294-112 Course, UC Berkeley
Markov Property
"The future is independent of the past given the present."

Definition: A state $S_t$ is Markov if and only if
$$\mathbb{P}(S_{t+1} \mid S_t) = \mathbb{P}(S_{t+1} \mid S_t, S_{t-1}, S_{t-2}, \cdots, S_1)$$

§ Once the present state is known, the history may be thrown away.
§ The current state is a sufficient statistic of the future.
Markov Chain
A Markov chain or Markov process is a temporal process, i.e., a sequence of random states $S_1, S_2, \cdots$ where the states obey the Markov property.

Definition: A Markov process is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$, where
§ $\mathcal{S}$ is the state space (can be continuous or discrete)
§ $\mathcal{P}$ is the state transition probability matrix ($\mathcal{P}$ is also called an operator),
$$\mathcal{P} = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix}, \quad \text{where } P_{ss'} = \mathbb{P}(S_{t+1} = s' \mid S_t = s)$$
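As a concrete aside (not part of the original slides), such a transition matrix is easy to represent and sanity-check in NumPy. A minimal sketch, assuming a made-up 3-state chain:

```python
import numpy as np

# Hypothetical 3-state chain: P[s, s'] = P(S_{t+1} = s' | S_t = s)
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.5, 0.0, 0.5],
])

# Each row is a probability distribution over successor states.
assert np.allclose(P.sum(axis=1), 1.0)
```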
Markov Chain
$$\mathcal{P} = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix}$$

Let $\mu_{t,i} = \mathbb{P}(S_t = s_i)$ and $\mu_t = \left[ \mu_{t,1}, \mu_{t,2}, \cdots, \mu_{t,n} \right]^T$, i.e., $\mu_t$ is a vector of probabilities. Then $\mu_{t+1} = \mathcal{P}^T \mu_t$:
$$\begin{bmatrix} \mu_{t+1,1} \\ \mu_{t+1,2} \\ \vdots \\ \mu_{t+1,n} \end{bmatrix} = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix}^T \begin{bmatrix} \mu_{t,1} \\ \mu_{t,2} \\ \vdots \\ \mu_{t,n} \end{bmatrix}$$
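A one-step distribution update, continuing the hypothetical 3-state chain above (my own sketch, not from the slides):

```python
import numpy as np

P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.5, 0.0, 0.5],
])

mu_t = np.array([1.0, 0.0, 0.0])  # start deterministically in state 0

mu_next = P.T @ mu_t              # mu_{t+1} = P^T mu_t
print(mu_next)                    # [0.9 0.1 0. ]
assert np.isclose(mu_next.sum(), 1.0)  # still a probability distribution
```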
Student Markov Process
(Figure: transition diagram of the student Markov process with states Class 1, Class 2, Class 3, Pass, Pub, Facebook, and Sleep; edges carry the transition probabilities, e.g., Class 1 goes to Class 2 with probability 0.5 and to Facebook with probability 0.5.)
Figure credit: David Silver, DeepMind
Student Markov Process - Episodes
Sample episodes for the student Markov process starting from $S_1 =$ C1:
§ C1 C2 C3 Pass Sleep
§ C1 FB FB C1 C2 Sleep
§ C1 C2 C3 Pub C2 C3 Pass Sleep
§ C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
Figure credit: David Silver, DeepMind
Student Markov Process - Transition Matrix
$$\mathcal{P} = \begin{array}{c|ccccccc}
 & C1 & C2 & C3 & Pass & Pub & FB & Sleep \\ \hline
C1 & & 0.5 & & & & 0.5 & \\
C2 & & & 0.8 & & & & 0.2 \\
C3 & & & & 0.6 & 0.4 & & \\
Pass & & & & & & & 1.0 \\
Pub & 0.2 & 0.4 & 0.4 & & & & \\
FB & 0.1 & & & & & 0.9 & \\
Sleep & & & & & & & 1.0
\end{array}$$
Figure credit: David Silver, DeepMind
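The sketch below (my own illustration, not part of the original slides) encodes this matrix in NumPy and samples episodes like the ones shown two slides back:

```python
import numpy as np

states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = np.array([
    #  C1   C2   C3  Pass  Pub   FB  Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Sleep (absorbing)
])

rng = np.random.default_rng(0)

def sample_episode(start="C1"):
    """Sample one episode, ending at the absorbing Sleep state."""
    s = states.index(start)
    episode = [states[s]]
    while states[s] != "Sleep":
        s = rng.choice(len(states), p=P[s])
        episode.append(states[s])
    return episode

print(sample_episode())  # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']
```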
Markov Reward Process
A Markov reward process is a Markov process with rewards.

Definition: A Markov reward process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where
§ $\mathcal{S}$ is the state space (can be continuous or discrete)
§ $\mathcal{P}$ is the state transition probability matrix ($\mathcal{P}$ is also called an operator), $P_{ss'} = \mathbb{P}(S_{t+1} = s' \mid S_t = s)$
§ $\mathcal{R}$ is a reward function, $\mathcal{R}(s) = \mathbb{E}\left[ R_{t+1} \mid S_t = s \right]$
§ $\gamma$ is a discount factor, $\gamma \in [0, 1]$
Student Markov Reward Process
(Figure: the student Markov process augmented with rewards: R = -2 in Class 1, Class 2, and Class 3; R = -1 in Facebook; R = +1 in Pub; R = +10 in Pass; R = 0 in Sleep.)
Figure credit: David Silver, DeepMind
Return
Definition: The return $G_t$ is the total discounted reward from timestep $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \qquad (1)$$

§ The discount $\gamma \in [0, 1]$ determines the present value of future rewards.
§ Immediate rewards are valued above delayed rewards.
  ◮ $\gamma$ close to 0 leads to "myopic" evaluation.
  ◮ $\gamma$ close to 1 leads to "far-sighted" evaluation.
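A small sketch (my own, using the student MRP rewards above) computing Eq. (1) for a finite episode, e.g., C1 C2 C3 Pass Sleep with rewards -2, -2, -2, +10:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} over a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [-2, -2, -2, 10]  # rewards along C1 -> C2 -> C3 -> Pass -> Sleep
print(discounted_return(rewards, gamma=0.5))  # -2 - 1 - 0.5 + 1.25 = -2.25
print(discounted_return(rewards, gamma=1.0))  # undiscounted sum: 4
```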
Why Discount?
Most Markov reward and decision processes are discounted. Why?
§ Uncertainty about the future may not be fully represented.