Non-Deterministic Search

CSE 473: Artificial Intelligence
Markov Decision Processes
Dieter Fox, University of Washington
[Slides originally created by Dan Klein & Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Example: Grid World
§ A maze-like problem
  § The agent lives in a grid
  § Walls block the agent's path
§ Noisy movement: actions do not always go as planned
  § 80% of the time, the action North takes the agent North (if there is no wall there)
  § 10% of the time, North takes the agent West; 10% East
  § If there is a wall in the direction the agent would have been taken, the agent stays put
§ The agent receives rewards each time step
  § Small "living" reward each step (can be negative)
  § Big rewards come at the end (good or bad)
§ Goal: maximize the sum of rewards

Grid World Actions
[Figure: action outcomes in the deterministic Grid World vs. the stochastic Grid World]

Markov Decision Processes
§ An MDP is defined by:
  § A set of states s in S
  § A set of actions a in A
  § A transition function T(s, a, s')
    § Probability that a from s leads to s', i.e., P(s' | s, a)
    § Also called the model or the dynamics
  § A reward function R(s, a, s')

T is a Big Table!
§ 11 x 4 x 11 = 484 entries
§ Example entries:
    T(s31, N, s11) = 0
    T(s31, N, s32) = 0.8
    T(s31, N, s21) = 0.1
    T(s31, N, s41) = 0.1
    T(s11, E, …) = …
§ For now, we give this as input to the agent

R is also a Big Table!
§ Example entries ("the cost of breathing"):
    R(s32, N, s33) = -0.01
    R(s32, N, s42) = -1.01
    R(s33, E, s43) = 0.99
§ For now, we also give this to the agent
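As a concrete illustration of T and R being "big tables", here is a minimal Python sketch using the entries quoted above; the state names (s31, s32, ...) come from the slide, but the dictionary layout and helper function are illustrative assumptions, not the course's actual code.

    # T[(s, a)] maps each successor state s' to P(s' | s, a)
    T = {
        ("s31", "N"): {"s32": 0.8, "s21": 0.1, "s41": 0.1},  # action North from s31
    }

    # R[(s, a, s')] is the reward received on that transition
    R = {
        ("s32", "N", "s33"): -0.01,  # the small "cost of breathing"
        ("s32", "N", "s42"): -1.01,  # presumably the living cost plus the -1 pit
        ("s33", "E", "s43"):  0.99,  # presumably the living cost plus the +1 exit
    }

    def transition_prob(s, a, s_next):
        """T(s, a, s') = P(s' | s, a); unlisted entries are 0."""
        return T.get((s, a), {}).get(s_next, 0.0)

    print(transition_prob("s31", "N", "s32"))  # 0.8
    print(transition_prob("s31", "N", "s11"))  # 0.0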

Markov Decision Processes (continued)
§ An MDP is defined by:
  § A set of states s in S
  § A set of actions a in A
  § A transition function T(s, a, s')
    § Probability that a from s leads to s', i.e., P(s' | s, a)
    § Also called the model or the dynamics
  § A reward function R(s, a, s')
    § Sometimes just R(s) or R(s'), e.g. R(s33) = -0.01, R(s42) = -1.01, R(s43) = 0.99
  § A start state
  § Maybe a terminal state
§ MDPs are non-deterministic search problems
  § One way to solve them is with expectimax search
  § We'll have a new tool soon

What is Markov about MDPs?
§ "Markov" generally means that given the present state, the future and the past are independent
§ For Markov decision processes, "Markov" means action outcomes depend only on the current state
§ This is just like search, where the successor function could only depend on the current state (not the history)
§ Andrey Markov (1856-1922)

Policies
§ In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal
§ For MDPs, we want an optimal policy π*: S → A
  § A policy π gives an action for each state
  § An optimal policy is one that maximizes expected utility if followed
  § An explicit policy defines a reflex agent (see the code sketch below)
§ Expectimax didn't compute entire policies; it computed the action for a single state only
[Figure: optimal policy when R(s, a, s') = -0.03 for all non-terminal states s]

Optimal Policies
[Figure: optimal Grid World policies for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0]
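To make the "an explicit policy defines a reflex agent" point concrete, here is a minimal sketch; the grid coordinates, action names, and helper function are my own illustrative assumptions.

    # A policy is just a table mapping each state to an action.
    policy = {
        (0, 0): "N", (0, 1): "N", (0, 2): "E",
        (1, 0): "N", (1, 2): "E",
        (2, 0): "W", (2, 1): "W", (2, 2): "E",
    }

    def reflex_policy_agent(state):
        """A reflex agent: simply look up the policy's action for the current state."""
        return policy[state]

    print(reflex_policy_agent((0, 2)))  # "E"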

Example: Racing
§ A robot car wants to travel far, quickly
§ Three states: Cool, Warm, Overheated
§ Two actions: Slow, Fast
§ Going faster gets double reward
[Figure: racing MDP transition diagram with probabilities 0.5 / 1.0 and rewards +1, +2, -10]

Racing Search Tree
[Figure: expectimax-style search tree for the racing MDP]

MDP Search Trees
§ Each MDP state projects an expectimax-like search tree
§ In the tree: s is a state, (s, a) is a q-state, and (s, a, s') is called a transition, with T(s, a, s') = P(s' | s, a) and reward R(s, a, s')

Utilities of Sequences
§ What preferences should an agent have over reward sequences?
§ More or less? [1, 2, 2] or [2, 3, 4]
§ Now or later? [0, 0, 1] or [1, 0, 0]

Discounting
§ It's reasonable to maximize the sum of rewards
§ It's also reasonable to prefer rewards now to rewards later
§ One solution: values of rewards decay exponentially
[Figure: a reward is worth 1 now, γ one step from now, and γ² two steps from now]
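The racing MDP is small enough to write out in full. The sketch below encodes its transition diagram in the same table layout as before; the edge probabilities and rewards are my reading of the figure (consistent with the 0.5 / 1.0 and +1 / +2 / -10 labels on the slide), so treat them as a reconstruction rather than quoted text.

    # Racing MDP: states Cool / Warm / Overheated, actions Slow / Fast.
    STATES = ["Cool", "Warm", "Overheated"]
    ACTIONS = {"Cool": ["Slow", "Fast"], "Warm": ["Slow", "Fast"], "Overheated": []}

    # T[(s, a)] maps each successor state to its probability
    T = {
        ("Cool", "Slow"): {"Cool": 1.0},
        ("Cool", "Fast"): {"Cool": 0.5, "Warm": 0.5},
        ("Warm", "Slow"): {"Cool": 0.5, "Warm": 0.5},
        ("Warm", "Fast"): {"Overheated": 1.0},
    }

    # R[(s, a, s')] is the reward on that transition ("going faster gets double reward")
    R = {
        ("Cool", "Slow", "Cool"): +1,
        ("Cool", "Fast", "Cool"): +2,
        ("Cool", "Fast", "Warm"): +2,
        ("Warm", "Slow", "Cool"): +1,
        ("Warm", "Slow", "Warm"): +1,
        ("Warm", "Fast", "Overheated"): -10,
    }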

Discounting
§ How to discount? Each time we descend a level, we multiply in the discount once
§ Why discount?
  § Sooner rewards probably do have higher utility than later rewards
  § Discounting also helps our algorithms converge
§ Example: discount of 0.5
  § U([1, 2, 3]) = 1*1 + 0.5*2 + 0.25*3
  § U([1, 2, 3]) < U([3, 2, 1])

Stationary Preferences
§ Theorem: if we assume stationary preferences, i.e. [a1, a2, ...] is preferred to [b1, b2, ...] exactly when [r, a1, a2, ...] is preferred to [r, b1, b2, ...]
§ Then there are only two ways to define utilities:
  § Additive utility: U([r0, r1, r2, ...]) = r0 + r1 + r2 + ...
  § Discounted utility: U([r0, r1, r2, ...]) = r0 + γ*r1 + γ^2*r2 + ...

Quiz: Discounting
§ Given: a row of states a, b, c, d, e; exiting at state a pays 10 and exiting at state e pays 1
§ Actions: East, West, and Exit (only available in the exit states a and e)
§ Transitions: deterministic
§ Quiz 1: For γ = 1, what is the optimal policy?
§ Quiz 2: For γ = 0.1, what is the optimal policy?
§ Quiz 3: For which γ are West and East equally good when in state d?
  § From the slide's worked answer: 10*γ^3 = 1*γ, so 10*γ^2 = 1 and γ = 1/√10

Infinite Utilities?!
§ Problem: What if the game lasts forever? Do we get infinite rewards?
§ Solutions:
  § Finite horizon (similar to depth-limited search): terminate episodes after a fixed T steps (e.g. life)
    § Gives nonstationary policies (π depends on the time left)
  § Discounting: use 0 < γ < 1
    § U([r0, ..., r∞]) = Σ_t γ^t r_t ≤ R_max / (1 - γ)
    § Smaller γ means a smaller "horizon" - shorter-term focus
  § Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "overheated" for racing)

Recap: Defining MDPs
§ Markov decision processes:
  § Set of states S
  § Start state s0
  § Set of actions A
  § Transitions P(s'|s,a) (or T(s,a,s'))
  § Rewards R(s,a,s') (and discount γ)
§ MDP quantities so far:
  § Policy = choice of action for each state
  § Utility = sum of (discounted) rewards

Solving MDPs
§ Value Iteration
§ Policy Iteration
§ Reinforcement Learning
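A few lines of Python make the discounting arithmetic above concrete; the function name is my own, but the numbers are the slide's.

    import math

    def discounted_utility(rewards, gamma):
        """U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ..."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Discount of 0.5: U([1,2,3]) = 1 + 0.5*2 + 0.25*3 = 2.75
    print(discounted_utility([1, 2, 3], 0.5))   # 2.75
    print(discounted_utility([3, 2, 1], 0.5))   # 4.25, so [3,2,1] is preferred

    # Quiz 3 check: from state d, West is worth 10*gamma^3 and East 1*gamma
    gamma = 1 / math.sqrt(10)
    print(10 * gamma ** 3, 1 * gamma)           # both ~0.316: the two actions tie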

Optimal Quantities
§ The value (utility) of a state s:
  V*(s) = expected utility starting in s and acting optimally
§ The value (utility) of a q-state (s, a):
  Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
§ The optimal policy:
  π*(s) = optimal action from state s
§ (In the search-tree picture: s is a state, (s, a) is a q-state, and (s, a, s') is a transition)

Snapshot of Demo - Gridworld V Values
[Figure: Gridworld state values; Noise = 0.2, Discount = 0.9, Living reward = 0]

Snapshot of Demo - Gridworld Q Values
[Figure: Gridworld Q-values; Noise = 0.2, Discount = 0.9, Living reward = 0]

Values of States
§ Fundamental operation: compute the (expectimax) value of a state
  § Expected utility under optimal action
  § Average sum of (discounted) rewards
  § This is just what expectimax computed!
§ Recursive definition of value:
  V*(s) = max_a Q*(s, a)
  Q*(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]
  V*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]

Racing Search Tree
§ We're doing way too much work with expectimax!
§ Problem: states are repeated
  § Idea: only compute needed quantities once
§ Problem: the tree goes on forever
  § Idea: do a depth-limited computation, but with increasing depths until the change is small
  § Note: deep parts of the tree eventually don't matter if γ < 1
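The recursive definition translates directly into code. Here is a minimal sketch, assuming the same hypothetical table layout as the earlier snippets (T keyed by (s, a), R keyed by (s, a, s'), and V a dict of current value estimates):

    def q_value(s, a, V, T, R, gamma=0.9):
        """Q(s,a) = sum over s' of T(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
        return sum(p * (R.get((s, a, s2), 0.0) + gamma * V.get(s2, 0.0))
                   for s2, p in T.get((s, a), {}).items())

    def state_value(s, actions, V, T, R, gamma=0.9):
        """V(s) = max over available actions of Q(s,a); 0 for terminal states."""
        if not actions[s]:
            return 0.0
        return max(q_value(s, a, V, T, R, gamma) for a in actions[s])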

Time-Limited Values
§ Key idea: time-limited values
§ Define V_k(s) to be the optimal value of s if the game ends in k more time steps
§ Equivalently, it's what a depth-k expectimax would give from s

Computing Time-Limited Values
[Figure: the expectimax tree viewed in layers, each layer corresponding to one more time step]

Value Iteration

The Bellman Equations
§ How to be optimal:
  Step 1: Take correct first action
  Step 2: Keep being optimal
§ The definition of "optimal utility" via the expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values:
  V*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]
§ These are the Bellman equations, and they characterize optimal values in a way we'll use over and over

Value Iteration
§ The Bellman equations characterize the optimal values:
  V*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]
§ Value iteration computes them:
  V_{k+1}(s) ← max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
§ Value iteration is just a fixed-point solution method
  § ... though the V_k vectors are also interpretable as time-limited values
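A sketch of the "depth-k expectimax" reading of V_k, reusing the hypothetical table layout from the earlier snippets:

    def time_limited_value(s, k, actions, T, R, gamma=0.9):
        """V_k(s): optimal value of s if the game ends in k more time steps.
        Plain depth-k expectimax with no caching, so repeated states get
        re-expanded - exactly the wasted work value iteration avoids."""
        if k == 0 or not actions[s]:       # out of time, or a terminal state
            return 0.0
        return max(
            sum(p * (R.get((s, a, s2), 0.0)
                     + gamma * time_limited_value(s2, k - 1, actions, T, R, gamma))
                for s2, p in T[(s, a)].items())
            for a in actions[s])

    # e.g. on the racing MDP sketched earlier: time_limited_value("Cool", 2, ACTIONS, T, R)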

Value Iteration Algorithm
§ Start with V_0(s) = 0 for all states (no time steps left means an expected reward sum of zero)
§ Given the vector of V_k(s) values, do one ply of expectimax from each state:
  V_{k+1}(s) ← max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
§ Repeat until convergence
§ Complexity of each iteration: O(S^2 A)
§ Number of iterations: poly(|S|, |A|, 1/(1-γ))
§ Theorem: value iteration will converge to unique optimal values

[Figures: value iteration snapshots on Gridworld for k = 0 through k = 4; Noise = 0.2, Discount = 0.9, Living reward = 0]
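Putting the pieces together, here is a compact value-iteration sketch over the same hypothetical table representation (STATES, ACTIONS, T, R as in the racing snippet); it illustrates the update above and is not the course's reference implementation.

    def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
        """Iterate V_{k+1}(s) = max_a sum_s' T(s,a,s')*(R(s,a,s') + gamma*V_k(s'))
        until no value changes by more than tol."""
        V = {s: 0.0 for s in states}              # V_0(s) = 0
        while True:
            V_next = {}
            for s in states:
                if not actions[s]:                # terminal / absorbing state
                    V_next[s] = 0.0
                else:
                    V_next[s] = max(
                        sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                            for s2, p in T[(s, a)].items())
                        for a in actions[s])
            if max(abs(V_next[s] - V[s]) for s in states) < tol:
                return V_next
            V = V_next

    # Example on the racing MDP sketched earlier:
    # print(value_iteration(STATES, ACTIONS, T, R, gamma=0.9))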

[Figures: value iteration snapshots on Gridworld for k = 5 through k = 10; Noise = 0.2, Discount = 0.9, Living reward = 0]
