Announcements
• Homework 4: MDPs (lead TA: Iris)
  • Due Mon 7 Oct at 11:59pm
• Project 2: Multi-Agent Search (lead TA: Zhaoqing)
  • Due Thu 10 Oct at 11:59pm
• Office Hours
  • Iris: Mon 10.00am-noon, RI 237
  • JW: Tue 1.40pm-2.40pm, DG 111
  • Zhaoqing: Thu 9.00am-11.00am, HS 202
  • Eli: Fri 10.00am-noon, RY 207

CS 4100: Artificial Intelligence
Markov Decision Processes II
Jan-Willem van de Meent, Northeastern University
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Example: Grid World
• A maze-like problem
  • The agent lives in a grid
  • Walls block the agent's path
• Noisy movement: actions do not always go as planned
  • 80% of the time, the action North takes the agent North (if there is no wall there)
  • 10% of the time, North takes the agent West; 10% East
  • If there is a wall in the direction the agent would have been taken, the agent stays put
• The agent receives rewards each time step
  • Small "living" reward each step (can be negative)
  • Big rewards come at the end (good or bad)
• Goal: maximize the sum of rewards

Recap: MDPs
• Markov decision processes:
  • Set of states S
  • Start state s0
  • Set of actions A
  • Transitions P(s'|s,a) (or T(s,a,s'))
  • Rewards R(s,a,s') (and discount γ)
• MDP quantities so far:
  • Policy = choice of action for each state
  • Utility = sum of (discounted) rewards

Optimal Quantities
• The value (utility) of a state s:
  V*(s) = expected utility starting in s and acting optimally
• The value (utility) of a q-state (s,a):
  Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
• The optimal policy:
  π*(s) = optimal action from state s
• (In the expectimax tree: s is a state, (s,a) is a q-state, and (s,a,s') is a transition)
[Demo – gridworld values (L8D4)]

Gridworld V*(s) values
Gridworld Q*(s,a) values
[Figures: gridworld showing the optimal state values and optimal q-values]

The Bellman Equations
How to be optimal:
• Step 1: Take correct first action
• Step 2: Keep being optimal
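To make the recap concrete, the sketch below shows one way the grid world above could be represented in Python. The class name, method names, and the convention that the exit reward is collected on entering a terminal cell are assumptions for illustration, not the course's Gridworld code; later sketches in this section reuse this hypothetical interface (states, actions, transitions, reward, gamma).

from typing import Dict, List, Tuple

State = Tuple[int, int]   # (row, col) grid cell
Action = str              # 'N', 'S', 'E', 'W'

class GridWorldMDP:
    # 80% of the time the chosen action is applied; 10% of the time the agent
    # slips to each perpendicular direction. Moves into walls or off the grid stay put.
    MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
    SLIPS = {'N': ('W', 'E'), 'S': ('E', 'W'), 'E': ('N', 'S'), 'W': ('S', 'N')}

    def __init__(self, rows: int, cols: int, walls: set,
                 terminal_rewards: Dict[State, float],
                 living_reward: float = 0.0, gamma: float = 0.9,
                 noise: float = 0.2):
        self.rows, self.cols = rows, cols
        self.walls = walls
        self.terminal_rewards = terminal_rewards
        self.living_reward = living_reward
        self.gamma = gamma
        self.noise = noise

    def states(self) -> List[State]:
        return [(r, c) for r in range(self.rows) for c in range(self.cols)
                if (r, c) not in self.walls]

    def actions(self, s: State) -> List[Action]:
        # Terminal cells have no further actions in this simplified model.
        return [] if s in self.terminal_rewards else list(self.MOVES)

    def _step(self, s: State, a: Action) -> State:
        r, c = s
        dr, dc = self.MOVES[a]
        nxt = (r + dr, c + dc)
        in_grid = 0 <= nxt[0] < self.rows and 0 <= nxt[1] < self.cols
        return nxt if in_grid and nxt not in self.walls else s  # blocked: stay put

    def transitions(self, s: State, a: Action) -> List[Tuple[State, float]]:
        # Returns (next_state, probability) pairs for the noisy movement model.
        left, right = self.SLIPS[a]
        return [(self._step(s, a), 1.0 - self.noise),
                (self._step(s, left), self.noise / 2),
                (self._step(s, right), self.noise / 2)]

    def reward(self, s: State, a: Action, s2: State) -> float:
        # Big reward when entering an exit cell, small living reward otherwise
        # (a common simplification of the course's "exit action" convention).
        return self.terminal_rewards.get(s2, self.living_reward)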
The Bellman Equations
• Bellman equations characterize the optimal values:
  • Definition of "optimal utility" via the expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values:
    V*(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
• These are the Bellman equations, and they characterize optimal values in a way we'll use over and over

Value Iteration
• Value iteration computes updates:
  V_{k+1}(s) ← max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V_k(s') ]
• Value iteration is just a fixed-point solution method
  • … though the V_k vectors are also interpretable as time-limited values
  • (a code sketch of this update follows below)

Convergence*
• How do we know the V_k vectors are going to converge?
• Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values
• Case 2: If the discount is less than 1
  • Sketch: For any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results in nearly identical search trees
  • The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
  • That last layer is at best all R_max; it is at worst R_min
  • But everything is discounted by γ^k that far out
  • So V_k and V_{k+1} are at most γ^k (R_max - R_min) different
  • So as k increases, the values converge

Policy Methods

Fixed Policies
• Do the optimal action vs. do what π says to do
• Expectimax: compute the max over all actions to compute the optimal values
• For a fixed policy π(s), the tree would be simpler – only one action per state
  • … though the tree's value would depend on which policy we use

Utilities for a Fixed Policy
• Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π(s)
• Define the utility of a state s, under a fixed policy π:
  V^π(s) = expected total discounted rewards starting in s and following π
• Recursive relation (one-step look-ahead / Bellman equation):
  V^π(s) = Σ_s' T(s,π(s),s') [ R(s,π(s),s') + γ V^π(s') ]
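Following the Value Iteration slide above, here is a minimal sketch of the V_{k+1} update as code, written against the hypothetical GridWorldMDP interface sketched earlier; the stopping tolerance and iteration cap are assumptions, and this is not the course's reference implementation.

def value_iteration(mdp, tol=1e-6, max_iters=1000):
    # Start from V_0(s) = 0 and repeatedly apply the Bellman update
    #   V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V_k(s') ]
    # until successive value vectors stop changing appreciably.
    V = {s: 0.0 for s in mdp.states()}
    for _ in range(max_iters):
        V_new = {}
        for s in mdp.states():
            actions = mdp.actions(s)
            if not actions:            # terminal states keep value 0
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (mdp.reward(s, a, s2) + mdp.gamma * V[s2])
                    for s2, p in mdp.transitions(s, a))
                for a in actions)
        if max(abs(V_new[s] - V[s]) for s in V) < tol:
            return V_new               # (numerically) converged
        V = V_new
    return V

The convergence argument from the slide shows up here as the stopping test: once γ^k (R_max - R_min) drops below the tolerance, successive V_k vectors differ by less than tol and the loop exits.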
Example: Policy Evaluation
• Always Go Right vs. Always Go Forward
[Figures: gridworld V^π values under each of the two fixed policies]

Policy Evaluation
• How do we calculate the V's for a fixed policy π?
• Idea 1: Turn the recursive Bellman equations into updates (like value iteration):
  V^π_{k+1}(s) ← Σ_s' T(s,π(s),s') [ R(s,π(s),s') + γ V^π_k(s') ]
  • Efficiency: O(S²) per iteration
• Idea 2: Without the maxes, the Bellman equations are just a linear system
  • Solve with Matlab (or your favorite linear system solver)

Policy Extraction

Computing Actions from Values
• Let's imagine we have the optimal values V*(s)
• How should we act?
  • It's not obvious!
• We need to do a mini-expectimax (one step):
  π*(s) = argmax_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
• This is called policy extraction, since it finds the policy implied by the values

Computing Actions from Q-Values
• Let's imagine we have the optimal q-values Q*(s,a)
• How should we act?
  • Completely trivial to decide: π*(s) = argmax_a Q*(s,a)
• Important lesson: actions are easier to select from q-values than from values!
  • (code sketches of policy evaluation and policy extraction follow below)

Policy Iteration

Problems with Value Iteration
• Value iteration repeats the Bellman updates:
  V_{k+1}(s) ← max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V_k(s') ]
• Problem 1: It's slow – O(S²A) per iteration
• Problem 2: The "max" at each state rarely changes
• Problem 3: The policy often converges long before the values
[Demo: value iteration (L9D2); gridworld snapshot of V_0 (Noise = 0.2, Discount = 0.9, Living reward = 0)]
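Before the remaining value-iteration snapshots, here are sketches of the slides above: policy evaluation solved directly as Idea 2's linear system (using numpy in place of "Matlab or your favorite linear system solver"), plus the one-step mini-expectimax that extracts a policy from values. Both assume the hypothetical GridWorldMDP interface sketched earlier, and all names are illustrative.

import numpy as np

def policy_evaluation_linear(mdp, policy):
    # Idea 2: without the max, V^pi satisfies the linear system
    #   V^pi = R^pi + gamma * T^pi V^pi   =>   (I - gamma T^pi) V^pi = R^pi
    states = mdp.states()
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T = np.zeros((n, n))   # T[i, j] = P(s_j | s_i, policy(s_i))
    R = np.zeros(n)        # R[i]   = expected one-step reward under the policy
    for s in states:
        if not mdp.actions(s):        # terminal: value 0, no outgoing transitions
            continue
        a = policy[s]
        for s2, p in mdp.transitions(s, a):
            T[idx[s], idx[s2]] += p
            R[idx[s]] += p * mdp.reward(s, a, s2)
    V = np.linalg.solve(np.eye(n) - mdp.gamma * T, R)
    return {s: V[idx[s]] for s in states}

def q_value(mdp, s, a, V):
    # One-step lookahead value of taking action a in state s.
    return sum(p * (mdp.reward(s, a, s2) + mdp.gamma * V[s2])
               for s2, p in mdp.transitions(s, a))

def extract_policy(mdp, V):
    # Policy extraction: pi(s) = argmax_a Q(s, a), with Q computed from the values.
    # (If q-values are already available, this argmax is all that is needed.)
    return {s: max(mdp.actions(s), key=lambda a: q_value(mdp, s, a, V))
            for s in mdp.states() if mdp.actions(s)}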
[Gridworld value-iteration snapshots of V_k for k = 1 through 8 (Noise = 0.2, Discount = 0.9, Living reward = 0)]
[Gridworld value-iteration snapshots of V_k for k = 9 through 12 and k = 100 (Noise = 0.2, Discount = 0.9, Living reward = 0)]

Policy Iteration
• Alternative approach for optimal values:
  • Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
  • Step 2: Policy improvement: update the policy using one-step look-ahead with the converged (but not optimal!) utilities as future values
  • Repeat steps until the policy converges
• This is policy iteration
  • It's still optimal!
  • Can converge (much) faster under some conditions

Policy Iteration
• Evaluation: For the fixed current policy π, find values V^π (with policy evaluation):
  • Iterate until values converge:
    V^π_{k+1}(s) ← Σ_s' T(s,π(s),s') [ R(s,π(s),s') + γ V^π_k(s') ]
• Improvement: For fixed values, get a better policy (using policy extraction)
  • One-step look-ahead:
    π_new(s) = argmax_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V^π(s') ]

Value Iteration vs Policy Iteration
• Both value iteration and policy iteration compute the same thing (all optimal values)
• In value iteration:
  • Every iteration updates both the values and (implicitly) the policy
  • We don't extract the policy, but taking the max over actions implicitly (re)computes it
• In policy iteration:
  • We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
  • After the policy is evaluated, we update the policy (slow like a value iteration pass)
  • The new policy will be better (or we're done)
• Both are dynamic programs for solving MDPs
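Putting the pieces together, here is a minimal policy-iteration sketch that alternates the evaluation and improvement steps described above. It reuses the hypothetical GridWorldMDP, policy_evaluation_linear, and extract_policy sketches from earlier in this section; the initial policy and the 4x3 grid layout in the usage example are illustrative assumptions.

def policy_iteration(mdp):
    # Start from an arbitrary policy (here: the first available action per state).
    policy = {s: mdp.actions(s)[0] for s in mdp.states() if mdp.actions(s)}
    while True:
        V = policy_evaluation_linear(mdp, policy)   # Step 1: policy evaluation
        new_policy = extract_policy(mdp, V)         # Step 2: policy improvement
        if new_policy == policy:                    # unchanged policy => done
            return policy, V
        policy = new_policy

# Illustrative usage on a 4x3 grid with one wall and two exit cells,
# mirroring the lecture's parameters (noise 0.2, discount 0.9, living reward 0):
mdp = GridWorldMDP(rows=3, cols=4, walls={(1, 1)},
                   terminal_rewards={(0, 3): +1.0, (1, 3): -1.0},
                   living_reward=0.0, gamma=0.9, noise=0.2)
pi_star, V_star = policy_iteration(mdp)

Each improvement step here is one full extraction pass (the "slow like a value iteration pass" step), while each evaluation considers only one action per state, which is why policy iteration can converge in far fewer outer iterations.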