Introduction to Artificial Intelligence
V22.0472-001 Fall 2009
Lecture 9: Markov Decision Processes
Rob Fergus – Dept of Computer Science, Courant Institute, NYU
Many slides from Dan Klein, Stuart Russell or Andrew Moore

Announcements
• Assignment 1 graded
• Come and see me after class if you have questions

Reinforcement Learning
• Basic idea:
  • Receive feedback in the form of rewards
  • Agent's utility is defined by the reward function
  • Must learn to act so as to maximize expected rewards

Grid World
• The agent lives in a grid
• Walls block the agent's path
• The agent's actions do not always go as planned:
  • 80% of the time, the action North takes the agent North (if there is no wall there)
  • 10% of the time, North takes the agent West; 10% East
  • If there is a wall in the direction the agent would have been taken, the agent stays put
• Small "living" reward each step
• Big rewards come at the end
• Goal: maximize sum of rewards*

Markov Decision Processes
• An MDP is defined by:
  • A set of states s ∈ S
  • A set of actions a ∈ A
  • A transition function T(s,a,s')
    • Probability that a from s leads to s', i.e., P(s' | s,a)
    • Also called the model
  • A reward function R(s, a, s')
    • Sometimes just R(s) or R(s')
  • A start state (or distribution)
  • Maybe a terminal state
• MDPs are a family of non-deterministic search problems
• Reinforcement learning: MDPs where we don't know the transition or reward functions

What is Markov about MDPs?
• Andrey Markov (1856-1922)
• "Markov" generally means that given the present state, the future and the past are independent
• For Markov decision processes, "Markov" means the dynamics are first-order Markov: the next state depends only on the current state and action
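To make the Grid World transition model concrete, here is a minimal sketch of how T(s, a, s') could be encoded. The coordinate convention, wall representation, and function names are hypothetical (not from the lecture); only the 80/10/10 noise model and the stay-put-at-walls rule come from the slides.

```python
# Minimal sketch of the noisy Grid World transition model described above.
# Grid layout, coordinates, and names are assumptions for illustration.

NORTH, SOUTH, EAST, WEST = (0, 1), (0, -1), (1, 0), (-1, 0)

# Perpendicular "slip" directions for each intended action.
SLIPS = {
    NORTH: (WEST, EAST),
    SOUTH: (EAST, WEST),
    EAST:  (NORTH, SOUTH),
    WEST:  (SOUTH, NORTH),
}

def transition(state, action, walls, width, height):
    """Return a dict {s': P(s' | state, action)} for the noisy grid world."""
    def move(s, d):
        nxt = (s[0] + d[0], s[1] + d[1])
        # Moving into a wall or off the grid leaves the agent where it is.
        if nxt in walls or not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
            return s
        return nxt

    probs = {}
    slip_left, slip_right = SLIPS[action]
    for direction, p in [(action, 0.8), (slip_left, 0.1), (slip_right, 0.1)]:
        s_next = move(state, direction)
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs

# Example: from (1, 1) with a wall directly to the north.
print(transition((1, 1), NORTH, walls={(1, 2)}, width=4, height=3))
# -> {(1, 1): 0.8, (0, 1): 0.1, (2, 1): 0.1}
```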
Solving MDPs
• In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from start to a goal
• In an MDP, we want an optimal policy π*: S → A
  • A policy π gives an action for each state
  • An optimal policy maximizes expected utility if followed
  • Defines a reflex agent

Example Optimal Policies
[Figure: optimal grid-world policies for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0; the highlighted optimal policy is for R(s, a, s') = -0.03 for all non-terminal states s]

Example: High-Low
• Three card types: 2, 3, 4
• Infinite deck, twice as many 2's
• Start with 3 showing
• After each card, you say "high" or "low"
• New card is flipped
• If you're right, you win the points shown on the new card
• Ties are no-ops
• If you're wrong, game ends
• Differences from expectimax:
  • #1: get rewards as you go
  • #2: you might play forever!

High-Low as an MDP
• States: 2, 3, 4, done
• Actions: High, Low
• Model: T(s, a, s'):
  • P(s'=4 | 4, Low) = 1/4
  • P(s'=3 | 4, Low) = 1/4
  • P(s'=2 | 4, Low) = 1/2
  • P(s'=done | 4, Low) = 0
  • P(s'=4 | 4, High) = 1/4
  • P(s'=3 | 4, High) = 0
  • P(s'=2 | 4, High) = 0
  • P(s'=done | 4, High) = 3/4
  • …
• Rewards: R(s, a, s'):
  • Number shown on s' if s ≠ s'
  • 0 otherwise
• Start: 3

Example: High-Low
[Figure: expectimax-style search tree for High-Low starting from 3, with High and Low branches; outcome branches labelled with transition probabilities T = 0.5, 0.25, 0.25, 0 and rewards R = 2, 3, 4, 0]

MDP Search Trees
• Each MDP state gives an expectimax-like search tree:
  • s is a state
  • (s, a) is a q-state
  • (s, a, s') is called a transition, with T(s,a,s') = P(s'|s,a) and reward R(s,a,s')
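The High-Low model above can also be generated directly from the card distribution rather than tabulated by hand. The sketch below is an illustration, not lecture code; the function name and the 'done' state label are assumptions, but it reproduces the P(· | 4, Low) and P(· | 4, High) entries listed above.

```python
# Hypothetical encoding of the High-Low MDP described above.
# Infinite deck with twice as many 2's -> P(2)=1/2, P(3)=1/4, P(4)=1/4.
CARD_PROBS = {2: 0.5, 3: 0.25, 4: 0.25}

def high_low_transitions(s, a):
    """Return (next_state, probability, reward) triples for state s
    (the card currently showing) and action a in {'High', 'Low'}."""
    outcomes = {}
    for card, p in CARD_PROBS.items():
        if card == s:                       # tie: no-op, no reward, keep playing
            key, r = s, 0
        elif (a == 'High') == (card > s):   # correct guess: win the card's value
            key, r = card, card
        else:                               # wrong guess: game ends
            key, r = 'done', 0
        outcomes[(key, r)] = outcomes.get((key, r), 0.0) + p
    return [(s2, p, r) for (s2, r), p in outcomes.items()]

print(high_low_transitions(4, 'Low'))
# -> [(2, 0.5, 2), (3, 0.25, 3), (4, 0.25, 0)]      (so P(done | 4, Low) = 0)
print(high_low_transitions(4, 'High'))
# -> [('done', 0.75, 0), (4, 0.25, 0)]              (so P(done | 4, High) = 3/4)
```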
Utilities of Sequences
• In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards
• Typically consider stationary preferences
• Theorem: there are only two ways to define stationary utilities:
  • Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + …
  • Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ² r_2 + …

Infinite Utilities?!
• Problem: infinite state sequences have infinite rewards
• Solutions:
  • Finite horizon: terminate episodes after a fixed T steps (e.g. life); gives nonstationary policies (π depends on the time left)
  • Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "done" for High-Low)
  • Discounting: for 0 < γ < 1, the discounted sum Σ_t γ^t r_t stays finite
    • Smaller γ means a smaller "horizon" (shorter-term focus)

Discounting
• Typically discount rewards by γ < 1 each time step
• Sooner rewards have higher utility than later rewards
• Also helps the algorithms converge

Recap: Defining MDPs
• Markov decision processes:
  • States S
  • Start state s_0
  • Actions A
  • Transitions P(s'|s,a) (or T(s,a,s'))
  • Rewards R(s,a,s') (and discount γ)
• MDP quantities so far:
  • Policy = choice of action for each state
  • Utility (or return) = sum of discounted rewards

Optimal Utilities
• Fundamental operation: compute the values (optimal expectimax utilities) of states
• Why? Optimal values define optimal policies!
• Define the value of a state s:
  V*(s) = expected utility starting in s and acting optimally
• Define the value of a q-state (s, a):
  Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally
• Define the optimal policy:
  π*(s) = optimal action from state s

The Bellman Equations
• Definition of "optimal utility" leads to a simple one-step lookahead relationship amongst optimal utility values:
  Optimal rewards = maximize over the first action and then follow the optimal policy
• Formally:
  V*(s) = max_a Q*(s,a)
  Q*(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
  V*(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
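The one-step lookahead above translates directly into code. This is a generic sketch under an assumed MDP interface (a transitions(s, a) function returning (s', P(s'|s,a), R(s,a,s')) triples, as in the High-Low sketch earlier); it is not code from the course.

```python
# Sketch of the Bellman one-step lookahead, assuming the MDP is given as a
# transitions(s, a) function returning (s', P(s'|s,a), R(s,a,s')) triples.

def q_value(transitions, V, s, a, gamma):
    """Q*(s,a) = sum_{s'} T(s,a,s') * [ R(s,a,s') + gamma * V*(s') ]"""
    return sum(p * (r + gamma * V.get(s2, 0.0))
               for s2, p, r in transitions(s, a))

def state_value(transitions, V, s, actions, gamma):
    """V*(s) = max_a Q*(s,a): the one-step lookahead over optimal values."""
    return max(q_value(transitions, V, s, a, gamma) for a in actions)

# With V identically zero, q_value reduces to the expected immediate reward;
# e.g. for the High-Low sketch above, Q(4, 'Low') = 0.5*2 + 0.25*3 + 0.25*0 = 1.75.
```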
Solving MDPs
• We want to find the optimal policy π*
• Proposal 1: modified expectimax search, starting from each state s
  [Figure: expectimax-style tree rooted at s, through q-states (s, a) and transitions (s, a, s') to successor states s']

Why Not Search Trees?
• Why not solve with expectimax?
• Problems:
  • This tree is usually infinite
  • The same states appear over and over
  • We would search once per state
• Idea: value iteration
  • Compute optimal values for all states all at once using successive approximations
  • Will be a bottom-up dynamic program similar in cost to memoization
  • Do all planning offline, no replanning needed!

Value Estimates
• Calculate estimates V_k*(s)
  • Not the optimal value of s!
  • The optimal value considering only the next k time steps (k rewards)
  • As k → ∞, it approaches the optimal value
• Why:
  • If discounting, distant rewards become negligible
  • If terminal states are reachable from everywhere, the fraction of episodes not ending becomes negligible
  • Otherwise, we can get infinite expected utility and then this approach actually won't work
• What happened to the evaluation function?

Memoized Recursion?
• Recurrences:
  V_0*(s) = 0
  V_k*(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V_{k−1}*(s') ]
• Cache all function call results so you never repeat work

Value Iteration
• Problems with the recursive computation:
  • Have to keep all the V_k*(s) around all the time
  • Don't know which depth π_k(s) to ask for when planning
• Solution: value iteration
  • Calculate values for all states, bottom-up
  • Keep increasing k until convergence

Value Iteration
• Idea:
  • Start with V_0*(s) = 0, which we know is right (why?)
  • Given V_i*, calculate the values for all states for depth i+1:
    V_{i+1}*(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V_i*(s') ]
  • This is called a value update or Bellman update
  • Repeat until convergence
• Theorem: will converge to unique optimal values
  • Basic idea: approximations get refined towards the optimal values
  • The policy may converge long before the values do
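Below is a sketch of value iteration itself, using the same assumed transitions(s, a) interface. The function name, tolerance-based stopping test, and terminal-state handling are implementation choices for illustration, not prescribed by the lecture.

```python
# Sketch of value iteration over a generic finite MDP, assuming a hypothetical
# transitions(s, a) -> [(s', P, R), ...] representation as in the earlier sketches.

def value_iteration(states, actions, transitions, gamma, tol=1e-6):
    """Repeat the Bellman update
        V_{i+1}(s) = max_a sum_{s'} T(s,a,s') * [ R(s,a,s') + gamma * V_i(s') ]
    for all states until the largest change falls below tol."""
    V = {s: 0.0 for s in states}            # V_0(s) = 0 for all s
    while True:
        new_V = {}
        for s in states:
            acts = actions(s)
            if not acts:                    # terminal / absorbing state
                new_V[s] = 0.0
                continue
            new_V[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in transitions(s, a))
                for a in acts
            )
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V

# Hypothetical usage with the High-Low MDP sketched earlier (gamma < 1 keeps the
# possibly infinite game's expected return finite):
#   V = value_iteration(states=[2, 3, 4, 'done'],
#                       actions=lambda s: [] if s == 'done' else ['High', 'Low'],
#                       transitions=high_low_transitions, gamma=0.9)
```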
Example: Bellman Updates
[Figure: one round of Bellman updates on the grid world with γ = 0.9, living reward = 0, noise = 0.2; the max is achieved by a = right, other actions not shown]

Example: Value Iteration
[Figure: value-iteration estimates V_2 and V_3 on the grid world]
• Information propagates outward from the terminal states and eventually all states have correct value estimates
[DEMO]

Convergence*
• Define the max-norm: ||V|| = max_s |V(s)|
• Theorem: for any two approximations U and V,
  ||V_{i+1} − U_{i+1}|| ≤ γ ||V_i − U_i||
  • I.e. any two distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true values, and value iteration converges to a unique, stable, optimal solution
• Theorem: if ||V_{i+1} − V_i|| < ε, then ||V_{i+1} − V*|| < ε γ / (1 − γ)
  • I.e. once the change in our approximation is small, it must also be close to correct

Practice: Computing Actions
• Which action should we choose from state s?
  • Given optimal values V?
    π*(s) = argmax_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
  • Given optimal q-values Q?
    π*(s) = argmax_a Q*(s,a)
• Lesson: actions are easier to select from Q's!

Recap: MDPs
• Markov decision processes:
  • States S
  • Actions A
  • Transitions P(s'|s,a) (or T(s,a,s'))
  • Rewards R(s,a,s') (and discount γ)
  • Start state s_0
• Quantities:
  • Returns = sum of discounted rewards
  • Values = expected future returns from a state (optimal, or for a fixed policy)
  • Q-Values = expected future returns from a q-state (optimal, or for a fixed policy)

Utilities for Fixed Policies
• Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π
• Define the utility of a state s under a fixed policy π:
  V^π(s) = expected total discounted rewards (return) starting in s and following π
• Recursive relation (one-step look-ahead / Bellman equation):
  V^π(s) = Σ_s' T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
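The two operations above, extracting a greedy action from V versus from Q and evaluating a fixed policy, can be sketched as follows. Again the transitions(s, a) interface and function names are assumptions for illustration, not lecture code.

```python
# Choosing actions from values, and evaluating a fixed policy, under the same
# hypothetical transitions(s, a) -> [(s', P, R), ...] representation as before.

def greedy_action_from_V(s, actions, transitions, V, gamma):
    """From V we still need the model for a one-step lookahead over T and R."""
    return max(actions,
               key=lambda a: sum(p * (r + gamma * V.get(s2, 0.0))
                                 for s2, p, r in transitions(s, a)))

def greedy_action_from_Q(s, actions, Q):
    """From Q we just take an argmax: actions are easier to select from Q's."""
    return max(actions, key=lambda a: Q[(s, a)])

def evaluate_policy(states, policy, transitions, gamma, tol=1e-6):
    """Iteratively apply the fixed-policy Bellman equation
        V^pi(s) = sum_{s'} T(s, pi(s), s') * [ R(s, pi(s), s') + gamma * V^pi(s') ]
    (no max over actions, since the policy fixes the action in each state)."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {}
        for s in states:
            a = policy(s)
            if a is None:                   # terminal state
                new_V[s] = 0.0
                continue
            new_V[s] = sum(p * (r + gamma * V[s2])
                           for s2, p, r in transitions(s, a))
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V
```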