
CS 4100: Artificial Intelligence: Markov Decision Processes II



1. Announcements
   CS 4100: Artificial Intelligence: Markov Decision Processes II
   • Homework 4: MDPs (lead TA: Iris)
     • Due Mon 7 Oct at 11:59pm
   • Project 2: Multi-Agent Search (lead TA: Zhaoqing)
     • Due Thu 10 Oct at 11:59pm
   • Office Hours
     • Iris: Mon 10.00am-noon, RI 237
     • JW: Tue 1.40pm-2.40pm, DG 111
     • Zhaoqing: Thu 9.00am-11.00am, HS 202
     • Eli: Fri 10.00am-noon, RY 207
   Jan-Willem van de Meent, Northeastern University
   [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

   Example: Grid World
   • A maze-like problem
     • The agent lives in a grid
     • Walls block the agent's path
   • Noisy movement: actions do not always go as planned
     • 80% of the time, the action North takes the agent North (if there is no wall there)
     • 10% of the time, North takes the agent West; 10% East
     • If there is a wall in the direction the agent would have been taken, the agent stays put
   • The agent receives rewards each time step
     • Small "living" reward each step (can be negative)
     • Big rewards come at the end (good or bad)
   • Goal: maximize sum of rewards
   (A small sketch of this transition model follows this slide.)

   Recap: MDPs
   • Markov decision processes:
     • Set of states S
     • Start state s_0
     • Set of actions A
     • Transitions P(s'|s,a) (or T(s,a,s'))
     • Rewards R(s,a,s') (and discount γ)
   • MDP quantities so far:
     • Policy = choice of action for each state
     • Utility = sum of (discounted) rewards
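The noisy movement model above can be made concrete with a short sketch. This is an illustrative stand-in, not the course's Gridworld code; the function name transition_probs and the wall representation are assumptions.

```python
# Illustrative sketch of the noisy grid-world transition model described above:
# 80% intended direction, 10% each perpendicular direction, and the agent
# stays put if the resulting move would run into a wall.
NORTH, SOUTH, EAST, WEST = (0, 1), (0, -1), (1, 0), (-1, 0)
PERPENDICULAR = {NORTH: (WEST, EAST), SOUTH: (EAST, WEST),
                 EAST: (NORTH, SOUTH), WEST: (SOUTH, NORTH)}

def transition_probs(state, action, walls):
    """Return {next_state: probability} for taking `action` in `state`."""
    probs = {}
    for (dx, dy), p in [(action, 0.8),
                        (PERPENDICULAR[action][0], 0.1),
                        (PERPENDICULAR[action][1], 0.1)]:
        x, y = state
        nxt = (x + dx, y + dy)
        if nxt in walls:            # blocked: the agent stays put
            nxt = state
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```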

2. Optimal Quantities
   • The value (utility) of a state s:
     V*(s) = expected utility starting in s and acting optimally
   • The value (utility) of a q-state (s,a):
     Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
   • The optimal policy:
     π*(s) = optimal action from state s
   • In the look-ahead tree: s is a state, (s,a) is a q-state, and (s,a,s') is a transition
   (These quantities are written out as equations after this slide.)
   [Demo: gridworld values (L8D4)]

   Gridworld V*(s) values
   Gridworld Q*(s,a) values

   The Bellman Equations
   How to be optimal:
   • Step 1: Take correct first action
   • Step 2: Keep being optimal
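For reference, the recursive form of these quantities, shown only as expectimax diagrams on the original slides, can be written out in the deck's notation (transitions T(s,a,s'), rewards R(s,a,s'), discount γ):

```latex
\begin{align*}
V^*(s)    &= \max_{a} Q^*(s, a) \\
Q^*(s, a) &= \sum_{s'} T(s, a, s') \bigl[ R(s, a, s') + \gamma\, V^*(s') \bigr] \\
V^*(s)    &= \max_{a} \sum_{s'} T(s, a, s') \bigl[ R(s, a, s') + \gamma\, V^*(s') \bigr]
\end{align*}
```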

3. The Bellman Equations
   • Definition of "optimal utility" via an expectimax recurrence gives a simple one-step look-ahead relationship amongst optimal utility values
   • These are the Bellman equations, and they characterize optimal values in a way we'll use over and over

   Value Iteration
   • The Bellman equations characterize the optimal values:
     V*(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
   • Value iteration computes updates:
     V_{k+1}(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V_k(s') ]
   • Value iteration is just a fixed-point solution method
     • ... though the V_k vectors are also interpretable as time-limited values
   (A small value-iteration sketch follows this slide.)

   Convergence*
   • How do we know the V_k vectors are going to converge?
   • Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values
   • Case 2: If the discount is less than 1
     • Sketch: For any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results in nearly identical search trees
     • The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
     • That last layer is at best all R_max
     • It is at worst R_min
     • But everything is discounted by γ^k that far out
     • So V_k and V_{k+1} are at most γ^k (R_max - R_min) different
     • So as k increases, the values converge

   Policy Methods
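A minimal value-iteration sketch in Python, assuming an MDP supplied as plain functions and dictionaries (states, actions(s), T(s, a), R(s, a, s2)); these names are illustrative, not the course's project API.

```python
def value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    """Batch (synchronous) Bellman updates starting from V_0 = 0.

    states:      iterable of states
    actions(s):  actions available in state s (empty for terminal states)
    T(s, a):     dict {next_state: probability}
    R(s, a, s2): reward for the transition (s, a, s2)
    """
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V_new = {}
        for s in states:
            acts = actions(s)
            if not acts:                       # terminal state: no update
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2])
                    for s2, p in T(s, a).items())
                for a in acts)
        V = V_new
    return V
```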

4. Fixed Policies
   • Two look-ahead trees: "Do the optimal action" (expectimax over all actions) vs. "Do what π says to do" (a single action π(s) per state)
   • Expectimax: compute the max over all actions to compute the optimal values
   • For a fixed policy π(s), the tree would be simpler: only one action per state
     • ... though the tree's value would depend on which policy we use

   Policy Evaluation

   Utilities for a Fixed Policy
   • Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π(s)
   • Define the utility of a state s, under a fixed policy π:
     V^π(s) = expected total discounted rewards starting in s and following π
   • Recursive relation (one-step look-ahead / Bellman equation): written out after this slide

   Example: Policy Evaluation
   • Always Go Right vs. Always Go Forward
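The recursive relation referenced in the last bullet, written out in the same notation as the Bellman equations above:

```latex
\begin{align*}
V^{\pi}(s) &= \mathbb{E}\Bigl[\textstyle\sum_{t \ge 0} \gamma^{t} R\bigl(s_t, \pi(s_t), s_{t+1}\bigr) \,\Big|\, s_0 = s\Bigr] \\
V^{\pi}(s) &= \sum_{s'} T\bigl(s, \pi(s), s'\bigr) \bigl[ R\bigl(s, \pi(s), s'\bigr) + \gamma\, V^{\pi}(s') \bigr]
\end{align*}
```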

5. Example: Policy Evaluation
   • Always Go Right vs. Always Go Forward (resulting values under each fixed policy)

   Policy Evaluation
   • How do we calculate the V's for a fixed policy π?
   • Idea 1: Turn the recursive Bellman equations into updates (like value iteration):
     V^π_{k+1}(s) = Σ_s' T(s,π(s),s') [ R(s,π(s),s') + γ V^π_k(s') ]
     • Efficiency: O(S^2) per iteration
   • Idea 2: Without the maxes, the Bellman equations are just a linear system
     • Solve with Matlab (or your favorite linear system solver)
   (A sketch of both ideas follows this slide.)

   Policy Extraction

   Computing Actions from Values
   • Let's imagine we have the optimal values V*(s)
   • How should we act?
     • It's not obvious!
   • We need to do a mini-expectimax (one step):
     π*(s) = argmax_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
   • This is called policy extraction, since it finds the policy implied by the values
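A sketch of both evaluation routes, using NumPy in place of the Matlab solver mentioned on the slide. It assumes the fixed policy has been baked into a transition matrix P_pi (with P_pi[i, j] = T(s_i, π(s_i), s_j)) and an expected one-step reward vector R_pi; these names are illustrative.

```python
import numpy as np

def evaluate_policy_linear(P_pi, R_pi, gamma=0.9):
    """Idea 2: solve (I - gamma * P_pi) V = R_pi as a linear system."""
    P_pi, R_pi = np.asarray(P_pi, float), np.asarray(R_pi, float)
    return np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)

def evaluate_policy_iterative(P_pi, R_pi, gamma=0.9, iterations=1000):
    """Idea 1: repeated Bellman updates with the policy fixed (no max)."""
    P_pi, R_pi = np.asarray(P_pi, float), np.asarray(R_pi, float)
    V = np.zeros_like(R_pi)
    for _ in range(iterations):
        V = R_pi + gamma * P_pi @ V
    return V
```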

6. Computing Actions from Q-Values
   • Let's imagine we have the optimal q-values
   • How should we act?
     • Completely trivial to decide: π*(s) = argmax_a Q*(s,a)
   • Important lesson: actions are easier to select from q-values than from values!
   (A comparison sketch follows this slide.)

   Policy Iteration

   Problems with Value Iteration
   • Value iteration repeats the Bellman updates:
     V_{k+1}(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V_k(s') ]
   • Problem 1: It's slow, O(S^2 A) per iteration
   • Problem 2: The "max" at each state rarely changes
   • Problem 3: The policy often converges long before the values
   [Demo: value iteration (L9D2)]

   k=0 (Noise = 0.2, Discount = 0.9, Living reward = 0)
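To make the "important lesson" concrete, here is an illustrative comparison (function names are assumptions): selecting actions from Q-values is a plain argmax, while selecting them from V-values requires the full model (T, R) for a one-step look-ahead.

```python
def policy_from_q(Q):
    """Q: dict {(state, action): value}. A plain argmax per state."""
    policy = {}
    for (s, a), q in Q.items():
        if s not in policy or q > Q[(s, policy[s])]:
            policy[s] = a
    return policy

def policy_from_v(V, states, actions, T, R, gamma=0.9):
    """Needs the model (T, R): one-step look-ahead (mini-expectimax)."""
    return {
        s: max(actions(s),
               key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                 for s2, p in T(s, a).items()))
        for s in states if actions(s)
    }
```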

7. Value iteration demo frames for k=1, k=2, k=3, k=4 (Noise = 0.2, Discount = 0.9, Living reward = 0)

8. Value iteration demo frames for k=5, k=6, k=7, k=8 (Noise = 0.2, Discount = 0.9, Living reward = 0)

9. Value iteration demo frames for k=9, k=10, k=11, k=12 (Noise = 0.2, Discount = 0.9, Living reward = 0)

10. Value iteration demo frame for k=100 (Noise = 0.2, Discount = 0.9, Living reward = 0)

    Policy Iteration
    • Alternative approach for optimal values:
      • Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
      • Step 2: Policy improvement: update the policy using one-step look-ahead with the converged (but not optimal!) utilities as future values
      • Repeat steps until the policy converges
    • This is policy iteration
      • It's still optimal!
      • Can converge (much) faster under some conditions
    (A small sketch follows this slide.)

    Policy Iteration
    • Evaluation: For the fixed current policy π, find the values V^π (with policy evaluation):
      • Iterate until the values converge:
        V^π_{k+1}(s) = Σ_s' T(s,π(s),s') [ R(s,π(s),s') + γ V^π_k(s') ]
    • Improvement: For fixed values, get a better policy (using policy extraction)
      • One-step look-ahead:
        π_{new}(s) = argmax_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V^π(s') ]

    Value Iteration vs Policy Iteration
    • Both value iteration and policy iteration compute the same thing (all optimal values)
    • In value iteration:
      • Every iteration updates both the values and (implicitly) the policy
      • We don't extract the policy, but taking the max over actions implicitly (re)computes it
    • In policy iteration:
      • We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
      • After the policy is evaluated, we update the policy (slow like a value iteration pass)
      • The new policy will be better (or we're done)
    • Both are dynamic programs for solving MDPs
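A minimal policy-iteration sketch, reusing the illustrative MDP interface from the value-iteration sketch earlier (states, actions(s), T(s, a), R(s, a, s2)); the evaluation step here uses a fixed number of update sweeps rather than a convergence test.

```python
def policy_iteration(states, actions, T, R, gamma=0.9, eval_iters=100):
    # Start from an arbitrary policy (first listed action in each state).
    policy = {s: next(iter(actions(s))) for s in states if actions(s)}
    while True:
        # Step 1: policy evaluation (no max, policy held fixed).
        V = {s: 0.0 for s in states}
        for _ in range(eval_iters):
            V = {s: sum(p * (R(s, policy[s], s2) + gamma * V[s2])
                        for s2, p in T(s, policy[s]).items())
                    if s in policy else 0.0
                 for s in states}
        # Step 2: policy improvement via one-step look-ahead.
        new_policy = {
            s: max(actions(s),
                   key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                     for s2, p in T(s, a).items()))
            for s in states if actions(s)}
        if new_policy == policy:   # policy stable: we're done
            return policy, V
        policy = new_policy
```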
