Announcements
• Homework 4: MDPs (lead TA: Iris)
  • Due Mon 7 Oct at 11:59pm
• Project 2: Multi-Agent Search (lead TA: Zhaoqing)
  • Due Thu 10 Oct at 11:59pm
• Office Hours
  • Iris: Mon 10.00am-noon, RI 237
  • JW: Tue 1.40pm-2.40pm, DG 111
  • Zhaoqing: Thu 9.00am-11.00am, HS 202
  • Eli: Fri 10.00am-noon, RY 207

CS 4100: Artificial Intelligence
Markov Decision Processes II
Jan-Willem van de Meent, Northeastern University
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Example: Grid World
• A maze-like problem
  • The agent lives in a grid
  • Walls block the agent's path
• Noisy movement: actions do not always go as planned
  • 80% of the time, the action North takes the agent North (if there is no wall there)
  • 10% of the time, North takes the agent West; 10% East
  • If there is a wall in the direction the agent would have been taken, the agent stays put
• The agent receives rewards each time step
  • Small "living" reward each step (can be negative)
  • Big rewards come at the end (good or bad)
• Goal: maximize the sum of rewards

Recap: MDPs
• Markov decision processes:
  • Set of states S
  • Start state s0
  • Set of actions A
  • Transitions P(s'|s,a) (or T(s,a,s'))
  • Rewards R(s,a,s') (and discount γ)
• MDP quantities so far:
  • Policy = choice of action for each state
  • Utility = sum of (discounted) rewards

Optimal Quantities
• The value (utility) of a state s:
  V*(s) = expected utility starting in s and acting optimally
• The value (utility) of a q-state (s,a):
  Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
• The optimal policy:
  π*(s) = optimal action from state s
• (In the expectimax tree: s is a state, (s,a) is a q-state, and (s,a,s') is a transition)
[Demo – gridworld values (L8D4)]

Gridworld V*(s) values
Gridworld Q*(s,a) values
[Figures: gridworld showing the optimal state values and optimal q-values]

The Bellman Equations
How to be optimal:
• Step 1: Take correct first action
• Step 2: Keep being optimal
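To make the recap concrete, the sketch below shows one way the grid world above could be represented in Python. The class name, method names, and the convention that the exit reward is collected on entering a terminal cell are assumptions for illustration, not the course's Gridworld code; later sketches in this section reuse this hypothetical interface (states, actions, transitions, reward, gamma).

from typing import Dict, List, Tuple

State = Tuple[int, int]   # (row, col) grid cell
Action = str              # 'N', 'S', 'E', 'W'

class GridWorldMDP:
    # 80% of the time the chosen action is applied; 10% of the time the agent
    # slips to each perpendicular direction. Moves into walls or off the grid stay put.
    MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
    SLIPS = {'N': ('W', 'E'), 'S': ('E', 'W'), 'E': ('N', 'S'), 'W': ('S', 'N')}

    def __init__(self, rows: int, cols: int, walls: set,
                 terminal_rewards: Dict[State, float],
                 living_reward: float = 0.0, gamma: float = 0.9,
                 noise: float = 0.2):
        self.rows, self.cols = rows, cols
        self.walls = walls
        self.terminal_rewards = terminal_rewards
        self.living_reward = living_reward
        self.gamma = gamma
        self.noise = noise

    def states(self) -> List[State]:
        return [(r, c) for r in range(self.rows) for c in range(self.cols)
                if (r, c) not in self.walls]

    def actions(self, s: State) -> List[Action]:
        # Terminal cells have no further actions in this simplified model.
        return [] if s in self.terminal_rewards else list(self.MOVES)

    def _step(self, s: State, a: Action) -> State:
        r, c = s
        dr, dc = self.MOVES[a]
        nxt = (r + dr, c + dc)
        in_grid = 0 <= nxt[0] < self.rows and 0 <= nxt[1] < self.cols
        return nxt if in_grid and nxt not in self.walls else s  # blocked: stay put

    def transitions(self, s: State, a: Action) -> List[Tuple[State, float]]:
        # Returns (next_state, probability) pairs for the noisy movement model.
        left, right = self.SLIPS[a]
        return [(self._step(s, a), 1.0 - self.noise),
                (self._step(s, left), self.noise / 2),
                (self._step(s, right), self.noise / 2)]

    def reward(self, s: State, a: Action, s2: State) -> float:
        # Big reward when entering an exit cell, small living reward otherwise
        # (a common simplification of the course's "exit action" convention).
        return self.terminal_rewards.get(s2, self.living_reward)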
The Bellman Equations
• Bellman equations characterize the optimal values:
  • Definition of "optimal utility" via the expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values:
    V*(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
• These are the Bellman equations, and they characterize optimal values in a way we'll use over and over

Value Iteration
• Value iteration computes updates:
  V_{k+1}(s) ← max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V_k(s') ]
• Value iteration is just a fixed-point solution method
  • … though the V_k vectors are also interpretable as time-limited values
  • (a code sketch of this update follows below)

Convergence*
• How do we know the V_k vectors are going to converge?
• Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values
• Case 2: If the discount is less than 1
  • Sketch: For any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results in nearly identical search trees
  • The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
  • That last layer is at best all R_max; it is at worst R_min
  • But everything is discounted by γ^k that far out
  • So V_k and V_{k+1} are at most γ^k (R_max - R_min) different
  • So as k increases, the values converge

Policy Methods

Fixed Policies
• Do the optimal action vs. do what π says to do
• Expectimax: compute the max over all actions to compute the optimal values
• For a fixed policy π(s), the tree would be simpler – only one action per state
  • … though the tree's value would depend on which policy we use

Utilities for a Fixed Policy
• Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π(s)
• Define the utility of a state s, under a fixed policy π:
  V^π(s) = expected total discounted rewards starting in s and following π
• Recursive relation (one-step look-ahead / Bellman equation):
  V^π(s) = Σ_s' T(s,π(s),s') [ R(s,π(s),s') + γ V^π(s') ]
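Following the Value Iteration slide above, here is a minimal sketch of the V_{k+1} update as code, written against the hypothetical GridWorldMDP interface sketched earlier; the stopping tolerance and iteration cap are assumptions, and this is not the course's reference implementation.

def value_iteration(mdp, tol=1e-6, max_iters=1000):
    # Start from V_0(s) = 0 and repeatedly apply the Bellman update
    #   V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V_k(s') ]
    # until successive value vectors stop changing appreciably.
    V = {s: 0.0 for s in mdp.states()}
    for _ in range(max_iters):
        V_new = {}
        for s in mdp.states():
            actions = mdp.actions(s)
            if not actions:            # terminal states keep value 0
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (mdp.reward(s, a, s2) + mdp.gamma * V[s2])
                    for s2, p in mdp.transitions(s, a))
                for a in actions)
        if max(abs(V_new[s] - V[s]) for s in V) < tol:
            return V_new               # (numerically) converged
        V = V_new
    return V

The convergence argument from the slide shows up here as the stopping test: once γ^k (R_max - R_min) drops below the tolerance, successive V_k vectors differ by less than tol and the loop exits.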
Example: Policy Evaluation
• Always Go Right vs. Always Go Forward
[Figures: gridworld V^π values under each of the two fixed policies]

Policy Evaluation
• How do we calculate the V's for a fixed policy π?
• Idea 1: Turn the recursive Bellman equations into updates (like value iteration):
  V^π_{k+1}(s) ← Σ_s' T(s,π(s),s') [ R(s,π(s),s') + γ V^π_k(s') ]
  • Efficiency: O(S²) per iteration
• Idea 2: Without the maxes, the Bellman equations are just a linear system
  • Solve with Matlab (or your favorite linear system solver)

Policy Extraction

Computing Actions from Values
• Let's imagine we have the optimal values V*(s)
• How should we act?
  • It's not obvious!
• We need to do a mini-expectimax (one step):
  π*(s) = argmax_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
• This is called policy extraction, since it finds the policy implied by the values

Computing Actions from Q-Values
• Let's imagine we have the optimal q-values Q*(s,a)
• How should we act?
  • Completely trivial to decide: π*(s) = argmax_a Q*(s,a)
• Important lesson: actions are easier to select from q-values than from values!
  • (code sketches of policy evaluation and policy extraction follow below)

Policy Iteration

Problems with Value Iteration
• Value iteration repeats the Bellman updates:
  V_{k+1}(s) ← max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V_k(s') ]
• Problem 1: It's slow – O(S²A) per iteration
• Problem 2: The "max" at each state rarely changes
• Problem 3: The policy often converges long before the values
[Demo: value iteration (L9D2); gridworld snapshot of V_0 (Noise = 0.2, Discount = 0.9, Living reward = 0)]
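Before the remaining value-iteration snapshots, here are sketches of the slides above: policy evaluation solved directly as Idea 2's linear system (using numpy in place of "Matlab or your favorite linear system solver"), plus the one-step mini-expectimax that extracts a policy from values. Both assume the hypothetical GridWorldMDP interface sketched earlier, and all names are illustrative.

import numpy as np

def policy_evaluation_linear(mdp, policy):
    # Idea 2: without the max, V^pi satisfies the linear system
    #   V^pi = R^pi + gamma * T^pi V^pi   =>   (I - gamma T^pi) V^pi = R^pi
    states = mdp.states()
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T = np.zeros((n, n))   # T[i, j] = P(s_j | s_i, policy(s_i))
    R = np.zeros(n)        # R[i]   = expected one-step reward under the policy
    for s in states:
        if not mdp.actions(s):        # terminal: value 0, no outgoing transitions
            continue
        a = policy[s]
        for s2, p in mdp.transitions(s, a):
            T[idx[s], idx[s2]] += p
            R[idx[s]] += p * mdp.reward(s, a, s2)
    V = np.linalg.solve(np.eye(n) - mdp.gamma * T, R)
    return {s: V[idx[s]] for s in states}

def q_value(mdp, s, a, V):
    # One-step lookahead value of taking action a in state s.
    return sum(p * (mdp.reward(s, a, s2) + mdp.gamma * V[s2])
               for s2, p in mdp.transitions(s, a))

def extract_policy(mdp, V):
    # Policy extraction: pi(s) = argmax_a Q(s, a), with Q computed from the values.
    # (If q-values are already available, this argmax is all that is needed.)
    return {s: max(mdp.actions(s), key=lambda a: q_value(mdp, s, a, V))
            for s in mdp.states() if mdp.actions(s)}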
[Gridworld value-iteration snapshots of V_k for k = 1 through 8 (Noise = 0.2, Discount = 0.9, Living reward = 0)]
[Gridworld value-iteration snapshots of V_k for k = 9 through 12 and k = 100 (Noise = 0.2, Discount = 0.9, Living reward = 0)]

Policy Iteration
• Alternative approach for optimal values:
  • Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
  • Step 2: Policy improvement: update the policy using one-step look-ahead with the converged (but not optimal!) utilities as future values
  • Repeat steps until the policy converges
• This is policy iteration
  • It's still optimal!
  • Can converge (much) faster under some conditions

Policy Iteration
• Evaluation: For the fixed current policy π, find values V^π (with policy evaluation):
  • Iterate until values converge:
    V^π_{k+1}(s) ← Σ_s' T(s,π(s),s') [ R(s,π(s),s') + γ V^π_k(s') ]
• Improvement: For fixed values, get a better policy (using policy extraction)
  • One-step look-ahead:
    π_new(s) = argmax_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V^π(s') ]

Value Iteration vs Policy Iteration
• Both value iteration and policy iteration compute the same thing (all optimal values)
• In value iteration:
  • Every iteration updates both the values and (implicitly) the policy
  • We don't extract the policy, but taking the max over actions implicitly (re)computes it
• In policy iteration:
  • We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
  • After the policy is evaluated, we update the policy (slow like a value iteration pass)
  • The new policy will be better (or we're done)
• Both are dynamic programs for solving MDPs
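Putting the pieces together, here is a minimal policy-iteration sketch that alternates the evaluation and improvement steps described above. It reuses the hypothetical GridWorldMDP, policy_evaluation_linear, and extract_policy sketches from earlier in this section; the initial policy and the 4x3 grid layout in the usage example are illustrative assumptions.

def policy_iteration(mdp):
    # Start from an arbitrary policy (here: the first available action per state).
    policy = {s: mdp.actions(s)[0] for s in mdp.states() if mdp.actions(s)}
    while True:
        V = policy_evaluation_linear(mdp, policy)   # Step 1: policy evaluation
        new_policy = extract_policy(mdp, V)         # Step 2: policy improvement
        if new_policy == policy:                    # unchanged policy => done
            return policy, V
        policy = new_policy

# Illustrative usage on a 4x3 grid with one wall and two exit cells,
# mirroring the lecture's parameters (noise 0.2, discount 0.9, living reward 0):
mdp = GridWorldMDP(rows=3, cols=4, walls={(1, 1)},
                   terminal_rewards={(0, 3): +1.0, (1, 3): -1.0},
                   living_reward=0.0, gamma=0.9, noise=0.2)
pi_star, V_star = policy_iteration(mdp)

Each improvement step here is one full extraction pass (the "slow like a value iteration pass" step), while each evaluation considers only one action per state, which is why policy iteration can converge in far fewer outer iterations.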