Markov Decision Processes CS60077: Reinforcement Learning Abir Das

Markov Decision Processes CS60077: Reinforcement Learning Abir Das IIT Kharagpur Sep 14 and 15, 2020

Agenda Terminology Markov Decision Process
Agenda
§ Understand the definitions and notation to be used in the course.
§ Understand the definition and setup of sequential decision problems.
Abir Das (IIT Kharagpur) CS60077 Sep 14 and 15, 2020 2 / 43

Resources
§ Reinforcement Learning by David Silver
§ Deep Reinforcement Learning by Sergey Levine
§ SB (Sutton and Barto): Chapter 3

Terminology and Notation
[Figure: terminology and notation diagram; example actions shown: 1. run away, 2. ignore, 3. pet. Figure credit: S. Levine - CS 294-112 Course, UC Berkeley]

Markov Property
The future is independent of the past given the present.
Definition: A state $S_t$ is Markov if and only if
$$P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_t, S_{t-1}, S_{t-2}, \cdots, S_1)$$
§ Once the present state is known, the history may be thrown away.
§ The current state is a sufficient statistic of the future.
[Photo: Andrey Markov]
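One consequence worth spelling out (a standard identity, not stated on the slide): applying the chain rule of probability and then the Markov property to each factor, the joint distribution of a trajectory $S_1, \dots, S_T$ factorizes into one-step transition terms. The horizon $T$ is introduced here only for illustration.

```latex
\begin{align*}
P(S_1, S_2, \dots, S_T)
  &= P(S_1) \prod_{t=1}^{T-1} P(S_{t+1} \mid S_t, S_{t-1}, \dots, S_1)
     && \text{(chain rule)} \\
  &= P(S_1) \prod_{t=1}^{T-1} P(S_{t+1} \mid S_t)
     && \text{(Markov property)}
\end{align*}
```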

Markov Chain
A Markov Chain or Markov Process is a temporal process, i.e., a sequence of random states $S_1, S_2, \cdots$ where the states obey the Markov property.
Definition: A Markov Process is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$, where
§ $\mathcal{S}$ is the state space (can be continuous or discrete)
§ $\mathcal{P}$ is the state transition probability matrix. $\mathcal{P}$ is also called an operator.
$$\mathcal{P} = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix}$$
where $P_{ss'} = P(S_{t+1} = s' \mid S_t = s)$.
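As a minimal sketch of the definition, the snippet below stores a transition matrix as a list of rows, checks that every row is a probability distribution, and samples a state sequence in which each next state is drawn using only the current state. The 3-state matrix, the state names, and the helper names `is_row_stochastic` and `sample_chain` are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical 3-state chain used only for illustration (not from the slides).
STATES = ["s1", "s2", "s3"]
P = [
    [0.5, 0.5, 0.0],    # row s: probabilities P_{ss'} of moving from s to each s'
    [0.25, 0.5, 0.25],
    [0.0, 0.0, 1.0],    # s3 is absorbing
]


def is_row_stochastic(matrix, tol=1e-9):
    """Each row must be a probability distribution over next states."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def sample_chain(matrix, start, steps, rng):
    """Sample a path of steps + 1 state indices; each draw uses only the current state."""
    path = [start]
    for _ in range(steps):
        current = path[-1]
        nxt = rng.choices(range(len(matrix)), weights=matrix[current])[0]
        path.append(nxt)
    return path


rng = random.Random(0)  # fixed seed so the run is reproducible
path = sample_chain(P, start=0, steps=10, rng=rng)
print([STATES[i] for i in path])
```

Storing only the current state in the sampling loop is the Markov property in code form: nothing earlier in `path` is consulted when drawing the next state.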

Markov Chain
$$\mathcal{P} = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix}$$
Let $\mu_{t,i} = P(S_t = s_i)$ and $\mu_t = [\mu_{t,1}, \mu_{t,2}, \cdots, \mu_{t,n}]^T$, i.e., $\mu_t$ is a vector of probabilities. Then $\mu_{t+1} = \mathcal{P}^T \mu_t$:
$$\begin{bmatrix} \mu_{t+1,1} \\ \mu_{t+1,2} \\ \vdots \\ \mu_{t+1,n} \end{bmatrix} = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix}^T \begin{bmatrix} \mu_{t,1} \\ \mu_{t,2} \\ \vdots \\ \mu_{t,n} \end{bmatrix}$$
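The update $\mu_{t+1} = P^T \mu_t$ can be sketched in a few lines: component $j$ of the new vector sums, over all current states $i$, the probability of being in $s_i$ times the probability of moving from $s_i$ to $s_j$. The 2-state matrix and the function name `step_distribution` are assumptions made for illustration.

```python
def step_distribution(P, mu):
    """Return mu_{t+1} = P^T mu_t, i.e. mu_next[j] = sum_i P[i][j] * mu[i]."""
    n = len(P)
    return [sum(P[i][j] * mu[i] for i in range(n)) for j in range(n)]


# Hypothetical 2-state chain, chosen only for illustration.
P = [
    [0.5, 0.5],
    [0.25, 0.75],
]
mu = [1.0, 0.0]  # start in state s_1 with probability 1

for t in range(1, 4):
    mu = step_distribution(P, mu)
    print(t, mu)
```

Note that the sum runs down column $j$ of `P`, which is why the transpose appears in the matrix form.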

Student Markov Process
[Figure: state transition diagram over the states Class 1, Class 2, Class 3, Pass, Pub, Facebook, and Sleep, with transition probabilities on the edges. Figure credit: David Silver, DeepMind]

Student Markov Process - Episodes
Sample episodes for the Student Markov process starting from $S_1$ = C1:
§ C1 C2 C3 Pass Sleep
§ C1 FB FB C1 C2 Sleep
§ C1 C2 C3 Pub C2 C3 Pass Sleep
§ C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
[Figure: Student Markov Process transition diagram. Figure credit: David Silver, DeepMind]

Student Markov Process - Transition Matrix
Rows are the current state, columns are the next state; blank entries are zero.
$$\mathcal{P} = \begin{array}{c|ccccccc}
 & C1 & C2 & C3 & Pass & Pub & FB & Sleep \\ \hline
C1 & & 0.5 & & & & 0.5 & \\
C2 & & & 0.8 & & & & 0.2 \\
C3 & & & & 0.6 & 0.4 & & \\
Pass & & & & & & & 1.0 \\
Pub & 0.2 & 0.4 & 0.4 & & & & \\
FB & 0.1 & & & & & 0.9 & \\
Sleep & & & & & & & 1.0
\end{array}$$
[Figure: Student Markov Process transition diagram. Figure credit: David Silver, DeepMind]
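To connect the transition matrix with the sample episodes, here is a rough sketch that stores the nonzero entries of the Student Markov Process matrix sparsely and samples episodes from C1. Treating Sleep as the terminal state (it only transitions to itself) and the helper name `sample_episode` are assumptions made for this illustration.

```python
import random

# Nonzero transition probabilities of the Student Markov Process,
# keyed by current state, then by next state.
P = {
    "C1": {"C2": 0.5, "FB": 0.5},
    "C2": {"C3": 0.8, "Sleep": 0.2},
    "C3": {"Pass": 0.6, "Pub": 0.4},
    "Pass": {"Sleep": 1.0},
    "Pub": {"C1": 0.2, "C2": 0.4, "C3": 0.4},
    "FB": {"C1": 0.1, "FB": 0.9},
    "Sleep": {"Sleep": 1.0},
}


def sample_episode(rng, start="C1", terminal="Sleep", max_steps=10_000):
    """Follow the chain from `start` until `terminal` is reached (assumed absorbing)."""
    episode = [start]
    while episode[-1] != terminal and len(episode) < max_steps:
        successors = P[episode[-1]]
        nxt = rng.choices(list(successors), weights=list(successors.values()))[0]
        episode.append(nxt)
    return episode


rng = random.Random(0)  # fixed seed so the run is reproducible
for _ in range(3):
    print(" ".join(sample_episode(rng)))
```

Each printed line has the same form as the sample episodes listed on the slide: it starts at C1 and stops on first reaching Sleep.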

Markov Reward Process
A Markov reward process is a Markov process with rewards.
Definition: A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where
§ $\mathcal{S}$ is the state space (can be continuous or discrete)
§ $\mathcal{P}$ is the state transition probability matrix. $\mathcal{P}$ is also called an operator. $P_{ss'} = P(S_{t+1} = s' \mid S_t = s)$
§ $\mathcal{R}$ is a reward function, $\mathcal{R} = \mathbb{E}\left[R_{t+1} \mid S_t = s\right] = R(s)$
§ $\gamma$ is a discount factor, $\gamma \in [0, 1]$

Student Markov Reward Process
[Figure: Student Markov Process transition diagram annotated with state rewards: Class 1 R=-2, Class 2 R=-2, Class 3 R=-2, Pass R=+10, Pub R=+1, Facebook R=-1, Sleep R=0. Figure credit: David Silver, DeepMind]

Return
Definition: The return $G_t$ is the total discounted reward from timestep $t$.
$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \tag{1}$$
§ The discount factor $\gamma \in [0, 1]$ determines the present value of future rewards.
§ Immediate rewards are valued above delayed rewards.
  ◮ $\gamma$ close to 0 leads to “myopic” evaluation.
  ◮ $\gamma$ close to 1 leads to “far-sighted” evaluation.
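As a worked instance of Equation (1), the sketch below computes the return of the first sample episode (C1 C2 C3 Pass Sleep) using the state rewards annotated on the Student Markov Reward Process slide. The choice $\gamma = 0.5$ and the function name `discounted_return` are assumptions made for illustration; the reward collected at each step is taken to be the reward of the state being left, matching $\mathbb{E}[R_{t+1} \mid S_t = s]$.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Work backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# State rewards from the Student Markov Reward Process slide.
REWARD = {"C1": -2, "C2": -2, "C3": -2, "Pass": 10, "Pub": 1, "FB": -1, "Sleep": 0}

episode = ["C1", "C2", "C3", "Pass", "Sleep"]
rewards = [REWARD[s] for s in episode]

# -2 + 0.5*(-2) + 0.25*(-2) + 0.125*10 + 0.0625*0 = -2.25
print(discounted_return(rewards, gamma=0.5))  # -2.25
```

With `gamma=0` the same episode returns only the first reward (-2, the myopic case), and with `gamma=1` it returns the undiscounted sum (4, the far-sighted case).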

Why Discount?
Most Markov reward and decision processes are discounted. Why?
§ Uncertainty about the future may not be fully represented.
§ Immediate rewards are valued above delayed rewards.


