Markov Decision Process (MDP)
Robert Platt, Northeastern University
The RL Setting

[Figure: agent-world loop. The agent sends an Action to the world; the world returns an Observation and a Reward.]

On a single time step, the agent does the following:
1. observe some information
2. select an action to execute
3. take note of any reward

Goal of agent: select actions that maximize cumulative reward in the long run
Let’s turn this into an MDP

[Figure: the same agent-world loop, now labeled with State in addition to Action, Observation, and Reward. The state/action/reward loop is the MDP.]

On a single time step, the agent does the following:
1. observe state
2. select an action to execute
3. take note of any reward

Goal of agent: select actions that maximize cumulative reward in the long run
Example: Grid world

Grid world:
– agent lives on a grid
– always occupies a single cell
– can move left, right, up, down
– gets zero reward unless in the “+1” or “-1” cells
States and actions

State set: S = the set of grid cells the agent can occupy
Action set: A = {up, down, left, right}
Reward function

Reward function: r(s) = +1 if s is the “+1” cell, r(s) = -1 if s is the “-1” cell
Otherwise: r(s) = 0

In general: r(s, a) = E[ R_t | S_{t-1} = s, A_{t-1} = a ]
– the expected reward on this time step given that the agent takes action a from state s
Transition function

Transition model: P(s' | s, a) = Pr( S_t = s' | S_{t-1} = s, A_{t-1} = a )
– the probability of this transition: ending up in state s' after taking action a from state s
For example: the probability that the agent ends up in the cell to its right when it executes the “right” action.
– This entire probability distribution can be written as a table over (state, action, next state).
Definition of an MDP

An MDP is a tuple ( S, A, r, P ), where
State set: S
Action set: A
Reward function: r(s, a) = E[ R_t | S_{t-1} = s, A_{t-1} = a ]
Transition model: P(s' | s, a) = Pr( S_t = s' | S_{t-1} = s, A_{t-1} = a )
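To make the tuple concrete, here is a minimal sketch (not from the slides) of how a grid-world MDP could be stored as plain Python tables; the cell coordinates, the terminal cells, and the deterministic moves are all illustrative assumptions.

    # A tiny, hypothetical grid-world MDP stored as explicit tables.
    # States are cell coordinates; the "+1" and "-1" cells are assumed terminal.
    STATES = [(x, y) for x in range(4) for y in range(3)]
    ACTIONS = ["up", "down", "left", "right"]
    TERMINAL = {(3, 2): +1.0, (3, 1): -1.0}   # assumed locations of the +1 / -1 cells

    def reward(s, a):
        # r(s, a): in this sketch it depends only on the cell the agent is in.
        return TERMINAL.get(s, 0.0)

    def transition(s, a):
        # P(s' | s, a) as a dict {next_state: probability}.
        # Moves are deterministic here; bumping the edge leaves the agent in place.
        if s in TERMINAL:
            return {s: 1.0}                   # terminal cells are absorbing
        dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[a]
        s2 = (s[0] + dx, s[1] + dy)
        return {s2 if s2 in STATES else s: 1.0}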
Example: Frozen Lake

Frozen Lake is this 4x4 grid.
State set: S = the 16 cells of the grid
Action set: A = {left, down, right, up}
Reward function: r = 1 if the agent reaches the goal cell, r = 0 otherwise
Transition model: only one third chance of going in the specified direction
– one third chance of moving +90deg
– one third chance of moving -90deg
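For hands-on experimentation, this MDP is available as a ready-made environment; the sketch below assumes the gymnasium package is installed (an assumption about your setup, not something the slides require) and simply runs one episode under a random policy.

    import gymnasium as gym   # assumes the gymnasium package is installed

    # is_slippery=True gives the 1/3 intended, 1/3 +90deg, 1/3 -90deg transition noise.
    env = gym.make("FrozenLake-v1", is_slippery=True)
    obs, info = env.reset(seed=0)

    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()   # uniformly random policy, for illustration
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated

    print("return from this episode:", total_reward)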
Example: Recycling Robot

See Example 3.4 in Sutton & Barto (SB), 2nd Ed.
Think-pair-share

Mobile robot:
– the robot moves on a flat surface
– the robot can execute point turns either left or right; it can also go forward or back with fixed velocity
– it must reach a goal while avoiding obstacles

Express the mobile robot control problem as an MDP.
Definition of an MDP

An MDP is a tuple ( S, A, r, P ), where
State set: S
Action set: A
Reward function: r(s, a) = E[ R_t | S_{t-1} = s, A_{t-1} = a ]
Transition model: P(s' | s, a) = Pr( S_t = s' | S_{t-1} = s, A_{t-1} = a )

Why is it called a Markov decision process? Because we’re making the following assumption:
Pr( S_t, R_t | S_{t-1}, A_{t-1}, S_{t-2}, A_{t-2}, ..., S_0, A_0 ) = Pr( S_t, R_t | S_{t-1}, A_{t-1} )
– this is called the “Markov” assumption
The Markov Assumption

Suppose the agent starts in some cell and follows this path to state s:

[Figure: a path through the grid world ending at state s.]

Notice that the probability of arriving in s' if the agent executes the right action from s does not depend on the path taken to get to s:
Pr( S_t = s' | S_{t-1} = s, A_{t-1} = right ) is the same no matter which history led to s.
Think-pair-share

Cart-pole robot:
– state is the position of the cart and the orientation of the pole
– cart can execute a constant acceleration either left or right

1. Is this system Markov?
2. Why / why not?
3. If not, how do you change it to make it Markov?
Policy

A policy is a rule for selecting actions:
a = π(s)
– if the agent is in this state, then take this action

A policy can be stochastic:
π(a | s) = Pr( A_t = a | S_t = s )

Question: why would we want to use a stochastic policy?
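As a concrete illustration (a sketch, with made-up state and action names), a deterministic policy is just a lookup table, while a stochastic policy stores a distribution over actions for each state:

    import random

    # Deterministic policy: a = pi(s), a plain lookup table.
    pi_det = {"s1": "right", "s2": "right", "s3": "up"}

    # Stochastic policy: pi(a | s), a probability distribution over actions per state.
    pi_stoch = {
        "s1": {"right": 0.8, "up": 0.2},
        "s2": {"right": 1.0},
        "s3": {"up": 0.5, "left": 0.5},
    }

    def act(state, stochastic=False):
        # Select an action for `state` under the chosen policy.
        if not stochastic:
            return pi_det[state]
        actions, probs = zip(*pi_stoch[state].items())
        return random.choices(actions, weights=probs)[0]

    print(act("s1"), act("s1", stochastic=True))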
Episodic vs Continuing Process

Episodic process: execution ends at some point and starts over
– after a fixed number of time steps
– upon reaching a terminal state

[Figure: grid world with the terminal state marked.]

Example of an episodic task:
– execution ends upon reaching the terminal state OR after 15 time steps
Episodic vs Continuing Process

Continuing process: execution goes on forever.

Example of a continuing task:
– process doesn’t stop
– keep getting rewards
Rewards and Return

On each time step, the agent gets a reward R_t:
– could have positive reward at goal, zero reward elsewhere
– could have negative reward on every time step
– could have an arbitrary reward function

Return can be a simple sum of rewards:
G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... + R_T

But it is often a discounted sum of rewards:
G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... + γ^{T-t-1} R_T

What effect does gamma have? Reward received k time steps in the future is only worth γ^{k-1} of what it would have been worth immediately.

Return is often evaluated over an infinite horizon:
G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}
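As a quick numeric sketch of the discounted return (the reward list and the value of gamma below are arbitrary example inputs):

    def discounted_return(rewards, gamma=0.9):
        # G_t = sum over k of gamma**k * R_{t+k+1}, for a finite list of rewards.
        g = 0.0
        for k, r in enumerate(rewards):
            g += (gamma ** k) * r
        return g

    # Example: zero reward for four steps, then +10 upon reaching the goal.
    print(discounted_return([0, 0, 0, 0, 10], gamma=0.9))   # 10 * 0.9**4 = 6.561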
Think-pair-share
Value Function

Value of state s when acting according to policy π:
v_π(s) = E_π[ G_t | S_t = s ]
– value of a state == expected return from that state if the agent follows policy π

Value of taking action a from state s when acting according to policy π:
q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]
– value of a state/action pair == expected return when taking action a from state s and following π after that
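These definitions can be read off directly as a Monte Carlo procedure: sample many episodes and average the returns. The sketch below estimates v_π for the start state of FrozenLake under a uniformly random policy; the gymnasium package, the discount of 0.99, and the episode count are all assumptions made for the example.

    import gymnasium as gym

    env = gym.make("FrozenLake-v1", is_slippery=True)
    gamma, episodes, returns = 0.99, 2000, []

    for ep in range(episodes):
        obs, info = env.reset(seed=ep)
        g, discount, done = 0.0, 1.0, False
        while not done:
            obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
            g += discount * reward
            discount *= gamma
            done = terminated or truncated
        returns.append(g)

    # Monte Carlo estimate of v_pi(start state) for the random policy.
    print(sum(returns) / len(returns))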
Value function example 1

Policy: (shown on the grid)
Discount factor: γ = 0.9
Value fn: 6.9, 6.6, 7.3, 8.1, 9, 10
Value function example 2

Policy: (shown on the grid)
Discount factor: γ = 0.9
Value fn: 1, 0.9, 0.81, 0.73, 0.66, 10.66

Notice that the value function can help us compare two different policies – how?
Value function example 3

Policy: (shown on the grid)
Discount factor:
Value fn: 11, 10, 10, 10, 10, 10
Think-pair-share

Policy: (shown on the grid)
Discount factor:
Value fn: ?, ?, ?, ?, ?, ?
Value Function Revisited

Value of state s when acting according to policy π:
v_π(s) = E_π[ G_t | S_t = s ]
       = E_π[ R_{t+1} + γ G_{t+1} | S_t = s ]
       = E_π[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s ]

[Backup diagram: from state s, branch over the actions a that π might take, then over the (s', r) pairs the world might produce. This is called a “backup diagram”.]
Think-pair-share 1

Value of state s when acting according to policy π:
v_π(s) = E_π[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s ]

Write this expectation in terms of P(s', r | s, a) for a deterministic policy, a = π(s).
Think-pair-share 2

Value of state s when acting according to policy π:
v_π(s) = E_π[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s ]

Write this expectation in terms of P(s', r | s, a) for a stochastic policy, π(a | s).
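For completeness, here is a minimal sketch of iterative policy evaluation, which repeatedly applies the one-step backup above until the values stop changing; the three-state MDP, the policy, and the discount are all made up for the example.

    # Iterative policy evaluation on a tiny, made-up 3-state MDP.
    # P[s][a] is a list of (probability, next_state, reward) triples.
    P = {
        "A": {"go": [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
        "B": {"go": [(1.0, "C", 1.0)]},
        "C": {"go": [(1.0, "C", 0.0)]},          # C is absorbing
    }
    pi = {"A": {"go": 1.0}, "B": {"go": 1.0}, "C": {"go": 1.0}}   # only one action here
    gamma, theta = 0.9, 1e-8

    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # One backup: average over actions under pi, then over (s', r) outcomes.
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for a, outcomes in P[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break

    print(V)   # approximate v_pi for each state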
Think-pair-share
Value Function Revisited

Can we calculate Q in terms of V?
q_π(s, a) = E[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a ]
Think-pair-share

Can we calculate Q in terms of V?
Write this expectation in terms of P(s', r | s, a) and v_π(s').
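One way to check your answer in code (again with a hand-made transition table and made-up values; nothing here comes from the slides):

    gamma = 0.9

    # P[s][a]: list of (probability, next_state, reward) triples for a made-up MDP.
    P = {"A": {"go": [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
         "B": {"go": [(1.0, "B", 1.0)]}}
    V = {"A": 2.0, "B": 5.0}   # pretend these came from policy evaluation

    def q_from_v(s, a):
        # q(s, a) = sum over (s', r) of P(s', r | s, a) * (r + gamma * V[s']).
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    print(q_from_v("A", "go"))   # 0.8*(0 + 0.9*5) + 0.2*(0 + 0.9*2) = 3.96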