Announcements
• Homework 3: Game Trees (lead TA: Zhaoqing)
  • Due Tue 1 Oct at 11:59pm (deadline extended)
• Homework 4: MDPs (lead TA: Iris)
  • Due Mon 7 Oct at 11:59pm
• Project 2: Multi-Agent Search (lead TA: Zhaoqing)
  • Due Thu 10 Oct at 11:59pm
• Office Hours
  • Iris: Mon 10.00am-noon, RI 237
  • JW: Tue 1.40pm-2.40pm, DG 111
  • Zhaoqing: Thu 9.00am-11.00am, HS 202
  • Eli: Fri 10.00am-noon, RY 207
CS 4100: Artificial Intelligence Markov Decision Processes Jan-Willem van de Meent Northeastern University [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Non-Deterministic Search
Example: Grid World
• A maze-like problem
  • The agent lives in a grid
  • Walls block the agent’s path
• Noisy movement: actions do not always go as planned
  • 80% of the time, the action North takes the agent North (if there is no wall there)
  • 10% of the time, North takes the agent West; 10% East
  • If there is a wall in the direction the agent would have been taken, the agent stays put
• The agent receives rewards each time step
  • Small “living” reward each step (can be negative)
  • Big rewards come at the end (good or bad)
• Goal: maximize sum of rewards
Grid World Actions
• Deterministic Grid World
• Stochastic Grid World
Markov Decision Processes
• An MDP is defined by:
  • A set of states s ∈ S
  • A set of actions a ∈ A
  • A transition function T(s, a, s’)
    • Probability that a from s leads to s’, i.e., P(s’ | s, a)
    • Also called the model or the dynamics
  • A reward function R(s, a, s’)
    • Sometimes just R(s) or R(s’)
  • A start state
  • Maybe a terminal state
• MDPs are non-deterministic search problems
  • One way to solve them is with expectimax search
  • We’ll have a new tool soon
[Demo – gridworld manual intro (L8D1)]
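As an illustration of these ingredients, here is a minimal sketch of an MDP container in Python. The names and structure are illustrative only (this is not the course project’s API); the transition function is assumed to return a distribution over next states.

```python
# Minimal MDP container (illustrative sketch, not the course project's API).
class MDP:
    def __init__(self, states, actions, transition, reward, start, terminals=()):
        self.states = states             # set of states S
        self.actions = actions           # actions(s) -> list of actions available in s
        self.transition = transition     # transition(s, a) -> [(s', P(s'|s,a)), ...]
        self.reward = reward             # reward(s, a, s') -> R(s, a, s')
        self.start = start               # start state
        self.terminals = set(terminals)  # terminal states (possibly empty)
```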
What is Markov about MDPs?
• “Markov” generally means that given the current state, the future and the past are independent
• For Markov decision processes, “Markov” means action outcomes depend only on the current state
• This is just like search, where the successor function could only depend on the current state (not the history)
[Andrey Markov (1856-1922)]
Policies
• In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal
• For MDPs, we want an optimal policy π*: S → A
  • A policy π gives an action for each state
  • An optimal policy is one that maximizes expected utility
  • An explicit policy defines a reflex agent
• Expectimax didn’t compute entire policies
  • It computed the action for a single state only
[Figure: Optimal policy when R(s, a, s’) = -0.03]
Optimal Policies
[Figure: four gridworld policies, for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, R(s) = -2.0]
Example: Racing
Example: Racing
• A robot car wants to travel far, quickly
• Three states: Cool, Warm, Overheated
• Two actions: Slow, Fast
• Going faster gets double reward
[Transition diagram: probabilities 0.5/1.0 and rewards +1, +2, -10 on the Slow/Fast arcs between Cool, Warm, and Overheated]
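A possible encoding of this MDP in Python, assuming the dynamics shown in the diagram (Slow from Cool stays Cool for +1; Fast from Cool moves to Cool or Warm with probability 0.5 each for +2; Slow from Warm moves to Cool or Warm with probability 0.5 each for +1; Fast from Warm overheats with certainty for -10):

```python
# Racing MDP transitions: (state, action) -> list of (next_state, probability, reward).
# Read off the transition diagram above; Overheated is terminal.
racing_T = {
    ("Cool", "Slow"): [("Cool", 1.0, +1)],
    ("Cool", "Fast"): [("Cool", 0.5, +2), ("Warm", 0.5, +2)],
    ("Warm", "Slow"): [("Cool", 0.5, +1), ("Warm", 0.5, +1)],
    ("Warm", "Fast"): [("Overheated", 1.0, -10)],
}
racing_terminals = {"Overheated"}
```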
Racing Search Tree
MDP Search Trees
• Each MDP state projects an expectimax-like search tree
  • s is a state
  • (s, a) is a q-state
  • (s, a, s’) is called a transition, with T(s, a, s’) = P(s’ | s, a) and reward R(s, a, s’)
Utilities of Sequences
Utilities of Sequences
• What preferences should an agent have over reward sequences?
• More or less? [1, 2, 2] or [2, 3, 4]
• Now or later? [0, 0, 1] or [1, 0, 0]
Discounting
• It’s reasonable to maximize the sum of rewards
• It’s also reasonable to prefer rewards now to rewards later
• One solution: values of rewards decay exponentially
[Figure: Worth Now • Worth Next Step • Worth In Two Steps]
Discounting
• How to discount?
  • Each time we descend a level, we multiply in the discount once
• Why discount?
  • Sooner rewards probably do have higher utility than later rewards
  • Also helps our algorithms converge
• Example: discount of 0.5
  • U([1, 2, 3]) = 1*1 + 0.5*2 + 0.25*3
  • U([1, 2, 3]) < U([3, 2, 1])
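A quick check of this example in Python; `discounted_utility` is just a helper name used here for illustration:

```python
# Discounted utility of a reward sequence: U([r0, r1, r2, ...]) = sum_t gamma^t * r_t
def discounted_utility(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))  # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1], 0.5))  # 3*1 + 0.5*2 + 0.25*1 = 4.25, so U([1,2,3]) < U([3,2,1])
```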
Stationary Preferences
• Theorem: if we assume stationary preferences, then there are only two ways to define utilities:
  • Additive utility
  • Discounted utility
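The statements this slide refers to appear as images in the original deck; written out in the standard form, the stationarity assumption and the two resulting utility definitions are:

```latex
% Stationary preferences: prepending the same reward r preserves the ordering
\[
[a_1, a_2, \ldots] \succ [b_1, b_2, \ldots]
\;\Longleftrightarrow\;
[r, a_1, a_2, \ldots] \succ [r, b_1, b_2, \ldots]
\]
% Additive utility
\[
U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots
\]
% Discounted utility, with discount factor gamma
\[
U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots
\]
```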
Exercise: Discounting
• Given:
  • Actions: East, West, and Exit (only available in exit states a, e)
  • Transitions: deterministic
• Quiz 1: For γ = 1, what is the optimal policy?
• Quiz 2: For γ = 0.1, what is the optimal policy?
• Quiz 3: For which γ are West and East equally good when in state d?
Exercise: Discounting (solution)
• Quiz 3: West and East are equally good in state d when γ³ · 10 = γ · 1, i.e., γ = √(1/10) ≈ 0.32
Infinite Utilities?!
• Problem: What if the game lasts forever? Do we get infinite rewards?
• Solutions:
  • Finite horizon: (similar to depth-limited search)
    • Terminate episodes after a fixed T steps (e.g. life)
    • Gives nonstationary policies (π depends on time left)
  • Discounting: use 0 < γ < 1
    • Smaller γ means smaller “horizon” (shorter term focus)
  • Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)
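A short justification for why discounting fixes the infinite-reward problem (a standard geometric-series argument, assuming rewards are bounded by R_max; this bound is not stated on the slide):

```latex
\[
\left|\sum_{t=0}^{\infty} \gamma^t r_t\right|
\;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma},
\qquad 0 \le \gamma < 1,\ |r_t| \le R_{\max}.
\]
```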
Recap: Defining MDPs
• Markov decision processes:
  • Set of states S
  • Start state s_0
  • Set of actions A
  • Transitions P(s’ | s, a) (or T(s, a, s’))
  • Rewards R(s, a, s’) (and discount γ)
• MDP quantities so far:
  • Policy = choice of action for each state
  • Utility = sum of (discounted) rewards
Solving MDPs
Optimal Quantities
• The value (utility) of a state s:
  • V*(s) = expected utility starting in s and acting optimally
• The value (utility) of a q-state (s, a):
  • Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
• The optimal policy:
  • π*(s) = optimal action from state s
[Search-tree labels: s is a state, (s, a) is a q-state, (s, a, s’) is a transition]
[Demo – gridworld values (L8D4)]
Gridworld V Values (Noise = 0.2, Discount = 0.9, Living reward = 0)
Gridworld Q Values (Noise = 0.2, Discount = 0.9, Living reward = 0)
Values of States
• Fundamental operation: compute the expectimax value of a state
  • Expected utility under optimal action
  • Average sum of (discounted) rewards
  • This is just what expectimax computed!
• Recursive definition of value (Bellman equations), written out below
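The recursive definition referenced above, in the standard Bellman optimality form, consistent with the V* and Q* definitions from the Optimal Quantities slide:

```latex
\[
V^*(s) \;=\; \max_a Q^*(s, a)
\]
\[
Q^*(s, a) \;=\; \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma\, V^*(s') \right]
\]
\[
V^*(s) \;=\; \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma\, V^*(s') \right]
\]
```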
Racing Search Tree
Racing Search Tree
• We’re doing way too much work with expectimax!
• Problem: States are repeated
  • Idea: Only compute needed quantities once
• Problem: Tree goes on forever
  • Idea: Do a depth-limited computation, but with increasing depths until change is small
  • Note: deep parts of the tree eventually don’t matter if γ < 1
Time-Limited Values
• Key idea: time-limited values
• Define V_k(s) to be the optimal value of s if the game ends in k more time steps
  • Equivalently, it’s what a depth-k expectimax would give from s
[Demo – time-limited values (L8D6)]
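A small sketch of the time-limited value V_k(s) computed as a depth-k expectimax, using the dictionary encoding of the racing MDP from earlier; the function name and layout are illustrative:

```python
# V_k(s): optimal value of s if the game ends in k more time steps.
# T maps (state, action) -> list of (next_state, probability, reward), as in racing_T above.
def v_k(s, k, T, gamma, terminals=()):
    if k == 0 or s in terminals:
        return 0.0
    q_values = [
        sum(p * (r + gamma * v_k(s2, k - 1, T, gamma, terminals))
            for (s2, p, r) in outcomes)
        for (state, action), outcomes in T.items()
        if state == s
    ]
    return max(q_values) if q_values else 0.0

# Example with the racing MDP and gamma = 1:
# v_k("Cool", 2, racing_T, 1.0, {"Overheated"})  -> 3.5
```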