Markov Decision Processes
2/23/18
Recall: State Space Search Problems
• A set of discrete states
• A distinguished start state
• A set of actions available to the agent in each state
• An action function that, given a state and an action, returns a new state
• A set of goal states, often specified as a function
• A way to measure solution quality
What if actions aren’t perfect?
• We might not know exactly which next state will result from an action.
• We can model this as a probability distribution over next states.
Search with Non-Deterministic Actions
• A set of discrete states
• A distinguished start state
• A set of actions available to the agent in each state
• An action function that, given a state and an action, returns a probability distribution over next states (instead of a single new state)
• A set of terminal states (instead of goal states)
• A reward function that gives a utility for each state (instead of a single measure of solution quality)
Markov Decision Processes (MDPs)
Named after the “Markov property”: if you know the state, you don’t need to remember history.
• We still represent states and actions.
• Actions no longer lead to a single next state.
• Instead they lead to one of several possible states, determined randomly.
• We’re now working with utilities instead of goals.
• Expected utility works well for handling randomness.
• We need to plan for unintended consequences.
• We need to plan over an indefinite horizon.
• Even an optimal agent may run forever!
State Space Search vs. MDPs

  State Space Search           | MDPs
  States: S                    | States: S
  Actions: A_s                 | Actions: A_s
  Transition function:         | Transition probabilities:
    F(s, a) = s'               |   P(s' | s, a)
  Start ∈ S                    | Start ∈ S
  Goals ⊂ S                    | Terminal ⊂ S (can be empty)
  Action Costs: C(a)           | State Rewards: R(s), or action costs: C(a)
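To make the right-hand column concrete, here is one minimal way to package an MDP in code. This is only a sketch: the class name and field layout are my own, the transition model is assumed to be a function returning a dict of next-state probabilities, and the discount factor gamma is introduced on a later slide.

    from typing import Callable, Dict, Hashable, List, Set

    State = Hashable
    Action = str

    class MDP:
        """A minimal MDP container (hypothetical layout, not from the slides)."""
        def __init__(self,
                     states: List[State],
                     actions: Callable[[State], List[Action]],          # A_s
                     P: Callable[[State, Action], Dict[State, float]],  # P(s' | s, a)
                     R: Callable[[State], float],                       # R(s)
                     terminal: Set[State],                              # Terminal ⊂ S (may be empty)
                     gamma: float = 0.9):                               # discount factor (see Discounting)
            self.states = states
            self.actions = actions
            self.P = P
            self.R = R
            self.terminal = terminal
            self.gamma = gamma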
We can’t rely on a single plan!
Actions might not have the outcome we expect, so our plans need to include contingencies for states we could end up in.
Instead of searching for a plan, we devise a policy. A policy is a function that maps states to actions.
• For each state we could end up in, the policy tells us which action to take.
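For a small, discrete state space, a policy can literally be stored as a lookup table. A small hypothetical sketch (the states and moves here are made up for illustration):

    # A policy maps every state we could end up in to an action.
    policy = {
        (0, 0): "up",
        (0, 1): "up",
        (0, 2): "right",
        (1, 2): "right",
        (2, 2): "right",
    }

    def pi(s):
        # pi(s) = a: the action the agent takes whenever it finds itself in state s.
        return policy[s]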
A simple example: Grid World
[Grid World figure: a small grid with a start cell and two terminal cells labeled +1 and -1]
If actions were deterministic, we could solve this with state space search.
• (3,2) would be a goal state
• (3,1) would be a dead end
A simple example: Grid World
• Suppose instead that the move we try to make only works correctly 80% of the time.
• 10% of the time, we go in each perpendicular direction, e.g. try to go right, go up instead.
• If impossible, stay in place.
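As a concrete sketch of this transition model: the code below assumes a 4x3 grid indexed (column, row) from (0, 0), with (3,2) and (3,1) as the +1 and -1 terminals; the blocked cell at (1,1) is my assumption (the wall cell in the usual version of this example) and is not stated on the slide.

    MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    # The two perpendicular "slip" directions for each intended move.
    PERP = {"up": ("left", "right"), "down": ("left", "right"),
            "left": ("up", "down"), "right": ("up", "down")}

    WIDTH, HEIGHT = 4, 3
    BLOCKED = {(1, 1)}            # assumed wall cell (not on the slide)
    TERMINAL = {(3, 2), (3, 1)}   # the +1 and -1 end states

    def step(s, direction):
        # Deterministic move; stay in place if it would leave the grid or hit the wall.
        x, y = s
        dx, dy = MOVES[direction]
        ns = (x + dx, y + dy)
        if ns in BLOCKED or not (0 <= ns[0] < WIDTH and 0 <= ns[1] < HEIGHT):
            return s
        return ns

    def P(s, a):
        # P(s' | s, a): 80% intended direction, 10% each perpendicular direction.
        # Terminal states have no outgoing transitions (a modeling convenience).
        if s in TERMINAL:
            return {}
        dist = {}
        for direction, prob in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
            ns = step(s, direction)
            dist[ns] = dist.get(ns, 0.0) + prob
        return dist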
A simple example: Grid World
• Before, we had two equally-good alternatives.
• Which path is better when actions are uncertain?
• What should we do if we find ourselves in (2,1)?
New Objective: Find an optimal policy.
We can’t just rely on a single plan, since we might end up in an unintended state.
A policy is a function that maps every state to an action: π(s) = a
We want policies that yield high reward:
• In expectation (since transitions are random).
• Over time (we may be willing to accept low reward now to achieve high reward later, but not always).
Expected Value
• Since future states are uncertain, we can’t perfectly maximize future reward.
• Instead, we maximize expected reward, a probability-weighted average over state rewards.

  E(R_t) = Σ_s Pr(s_t = s) · R(s)

where Pr(s_t = s) is the probability of being in state s at time t, and R(s) is the reward of state s.
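For example, with made-up probabilities: if at time t the agent is in the +1 terminal with probability 0.8 and in the -1 terminal with probability 0.2, then E(R_t) = 0.8 · (+1) + 0.2 · (−1) = 0.6.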
Discounting
How do we trade off short-term vs. long-term reward?
Key idea: reward now is better than reward later.
• Rewards in the future are exponentially decayed.
• Reward t steps in the future is discounted by γ^t:

  V = γ^t · R_t,   where 0 < γ < 1

(V is the value now; R_t is the reward received at time step t.)
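For example, with γ = 0.9, a reward of +1 received three steps from now is worth γ³ · 1 = 0.9³ ≈ 0.73 today, while the same reward received immediately is worth the full +1.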
Value of a policy
• Value depends on what state the agent is in.
• Value depends on what the policy tells the agent to do in the future.

  V^π(s) = Σ_{t=0}^{∞} Σ_{s'} γ^t · Pr^π(s_t = s' | s_0 = s) · R(s')

Value is the sum over all timesteps of the expected discounted reward at that timestep.
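In practice, V^π is usually computed by iterating the equivalent one-step (Bellman) recurrence V^π(s) = R(s) + γ · Σ_{s'} P(s' | s, π(s)) · V^π(s') rather than summing over timesteps directly. A minimal sketch, assuming P(s, a) returns a dict of next-state probabilities (as in the earlier snippet) and pi is a policy defined on every state:

    def evaluate_policy(states, P, R, pi, gamma, tol=1e-6):
        # Iteratively approximate V_pi(s) for every state under the fixed policy pi.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                new_v = R(s) + gamma * sum(p * V[ns] for ns, p in P(s, pi(s)).items())
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:
                return V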
Optimal Policy
• The optimal policy is the one that maximizes value.
• If we knew the optimal policy, we could easily find the true value of any state.
• If we knew the true value of every state, we could easily find the optimal policy.

  V*(s) = V^{π*}(s)
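The third bullet is easy to make concrete: given a table of state values V, acting greedily with respect to expected next-state value recovers the policy. A minimal sketch, assuming the P(s, a) and actions(s) interfaces used in the earlier snippets:

    def greedy_policy(states, actions, P, V):
        # pi(s) = the action whose expected next-state value is highest.
        # (R(s) and the discount factor don't depend on the action,
        #  so they can be dropped from the argmax.)
        return {s: max(actions(s),
                       key=lambda a: sum(p * V[ns] for ns, p in P(s, a).items()))
                for s in states}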
Value Iteration
• The value of state s depends on the value of other states s'.
• The value of s' may depend on the value of s.
We can iteratively approximate the value using dynamic programming.
• Initialize all values to the immediate rewards.
• Update values based on the best next-state.
• Repeat until convergence (values don’t change).
Value Iteration Pseudocode
(Here actions(s) lists the actions available in s, P(s, a) returns a dict mapping each possible next state to its probability, and R(s) is the state reward.)

    def value_iteration(states, actions, P, R, gamma, tol=1e-6):
        # Initialize all values to the immediate rewards.
        values = {s: R(s) for s in states}
        while True:                                  # until values don't change
            prev = dict(values)                      # copy of values
            for s in states:
                best_EV = float("-inf")              # initialize best_EV
                for a in actions(s):
                    # Expected value of taking action a, using last iteration's values.
                    EV = sum(prob * prev[ns] for ns, prob in P(s, a).items())
                    best_EV = max(EV, best_EV)
                values[s] = R(s) + gamma * best_EV
            if all(abs(values[s] - prev[s]) < tol for s in states):
                return values
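Putting the pieces together on the grid world. This assumes the P(s, a) sketched earlier and the value_iteration above are in scope; the rewards (+1 and -1 at the terminals, 0 everywhere else) and γ = 0.9 are illustrative choices, not values given on the slides.

    # All grid cells except the assumed wall at (1, 1).
    GRID = [(x, y) for x in range(4) for y in range(3) if (x, y) != (1, 1)]

    def R(s):
        # Illustrative rewards: only the terminals carry reward.
        return {(3, 2): +1.0, (3, 1): -1.0}.get(s, 0.0)

    def actions(s):
        return ["up", "down", "left", "right"]

    V = value_iteration(GRID, actions, P, R, gamma=0.9)
    print(V[(0, 0)])   # value of the (assumed) start state
    print(V[(2, 1)])   # the state asked about on the Grid World slide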