Introduction to Artificial Intelligence
V22.0472-001 Fall 2009
Lecture 9: Markov Decision Processes
Rob Fergus – Dept of Computer Science, Courant Institute, NYU
Many slides from Dan Klein, Stuart Russell or Andrew Moore

Announcements
• Assignment 1 graded
• Come and see me after class if you have questions

Reinforcement Learning
• Basic idea:
  • Receive feedback in the form of rewards
  • Agent's utility is defined by the reward function
  • Must learn to act so as to maximize expected rewards

Grid World
• The agent lives in a grid
• Walls block the agent's path
• The agent's actions do not always go as planned:
  • 80% of the time, the action North takes the agent North (if there is no wall there)
  • 10% of the time, North takes the agent West; 10% East
  • If there is a wall in the direction the agent would have been taken, the agent stays put
• Small "living" reward each step
• Big rewards come at the end
• Goal: maximize sum of rewards*

Markov Decision Processes
• An MDP is defined by:
  • A set of states s ∈ S
  • A set of actions a ∈ A
  • A transition function T(s,a,s')
    • Probability that a from s leads to s', i.e., P(s' | s,a)
    • Also called the model
  • A reward function R(s, a, s')
    • Sometimes just R(s) or R(s')
  • A start state (or distribution)
  • Maybe a terminal state
• MDPs are a family of non-deterministic search problems
• Reinforcement learning: MDPs where we don't know the transition or reward functions

What is Markov about MDPs?
• Andrey Markov (1856-1922)
• "Markov" generally means that given the present state, the future and the past are independent
• For Markov decision processes, "Markov" means the dynamics are first-order Markov: the next state depends only on the current state and action
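To make the Grid World transition model concrete, here is a minimal sketch of how T(s, a, s') could be encoded. The coordinate convention, wall representation, and function names are hypothetical (not from the lecture); only the 80/10/10 noise model and the stay-put-at-walls rule come from the slides.

```python
# Minimal sketch of the noisy Grid World transition model described above.
# Grid layout, coordinates, and names are assumptions for illustration.

NORTH, SOUTH, EAST, WEST = (0, 1), (0, -1), (1, 0), (-1, 0)

# Perpendicular "slip" directions for each intended action.
SLIPS = {
    NORTH: (WEST, EAST),
    SOUTH: (EAST, WEST),
    EAST:  (NORTH, SOUTH),
    WEST:  (SOUTH, NORTH),
}

def transition(state, action, walls, width, height):
    """Return a dict {s': P(s' | state, action)} for the noisy grid world."""
    def move(s, d):
        nxt = (s[0] + d[0], s[1] + d[1])
        # Moving into a wall or off the grid leaves the agent where it is.
        if nxt in walls or not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
            return s
        return nxt

    probs = {}
    slip_left, slip_right = SLIPS[action]
    for direction, p in [(action, 0.8), (slip_left, 0.1), (slip_right, 0.1)]:
        s_next = move(state, direction)
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs

# Example: from (1, 1) with a wall directly to the north.
print(transition((1, 1), NORTH, walls={(1, 2)}, width=4, height=3))
# -> {(1, 1): 0.8, (0, 1): 0.1, (2, 1): 0.1}
```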
Solving MDPs
• In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from start to a goal
• In an MDP, we want an optimal policy π*: S → A
  • A policy π gives an action for each state
  • An optimal policy maximizes expected utility if followed
  • Defines a reflex agent

Example Optimal Policies
[Figure: optimal grid-world policies for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0; the highlighted optimal policy is for R(s, a, s') = -0.03 for all non-terminal states s]

Example: High-Low
• Three card types: 2, 3, 4
• Infinite deck, twice as many 2's
• Start with 3 showing
• After each card, you say "high" or "low"
• New card is flipped
• If you're right, you win the points shown on the new card
• Ties are no-ops
• If you're wrong, game ends
• Differences from expectimax:
  • #1: get rewards as you go
  • #2: you might play forever!

High-Low as an MDP
• States: 2, 3, 4, done
• Actions: High, Low
• Model: T(s, a, s'):
  • P(s'=4 | 4, Low) = 1/4
  • P(s'=3 | 4, Low) = 1/4
  • P(s'=2 | 4, Low) = 1/2
  • P(s'=done | 4, Low) = 0
  • P(s'=4 | 4, High) = 1/4
  • P(s'=3 | 4, High) = 0
  • P(s'=2 | 4, High) = 0
  • P(s'=done | 4, High) = 3/4
  • …
• Rewards: R(s, a, s'):
  • Number shown on s' if s ≠ s'
  • 0 otherwise
• Start: 3

Example: High-Low
[Figure: expectimax-style search tree for High-Low starting from 3, with High and Low branches; outcome branches labelled with transition probabilities T = 0.5, 0.25, 0.25, 0 and rewards R = 2, 3, 4, 0]

MDP Search Trees
• Each MDP state gives an expectimax-like search tree:
  • s is a state
  • (s, a) is a q-state
  • (s, a, s') is called a transition, with T(s,a,s') = P(s'|s,a) and reward R(s,a,s')
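The High-Low model above can also be generated directly from the card distribution rather than tabulated by hand. The sketch below is an illustration, not lecture code; the function name and the 'done' state label are assumptions, but it reproduces the P(· | 4, Low) and P(· | 4, High) entries listed above.

```python
# Hypothetical encoding of the High-Low MDP described above.
# Infinite deck with twice as many 2's -> P(2)=1/2, P(3)=1/4, P(4)=1/4.
CARD_PROBS = {2: 0.5, 3: 0.25, 4: 0.25}

def high_low_transitions(s, a):
    """Return (next_state, probability, reward) triples for state s
    (the card currently showing) and action a in {'High', 'Low'}."""
    outcomes = {}
    for card, p in CARD_PROBS.items():
        if card == s:                       # tie: no-op, no reward, keep playing
            key, r = s, 0
        elif (a == 'High') == (card > s):   # correct guess: win the card's value
            key, r = card, card
        else:                               # wrong guess: game ends
            key, r = 'done', 0
        outcomes[(key, r)] = outcomes.get((key, r), 0.0) + p
    return [(s2, p, r) for (s2, r), p in outcomes.items()]

print(high_low_transitions(4, 'Low'))
# -> [(2, 0.5, 2), (3, 0.25, 3), (4, 0.25, 0)]      (so P(done | 4, Low) = 0)
print(high_low_transitions(4, 'High'))
# -> [('done', 0.75, 0), (4, 0.25, 0)]              (so P(done | 4, High) = 3/4)
```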
Utilities of Sequences
• In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards
• Typically consider stationary preferences
• Theorem: there are only two ways to define stationary utilities:
  • Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + …
  • Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ² r_2 + …

Infinite Utilities?!
• Problem: infinite state sequences have infinite rewards
• Solutions:
  • Finite horizon: terminate episodes after a fixed T steps (e.g. life); gives nonstationary policies (π depends on the time left)
  • Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "done" for High-Low)
  • Discounting: for 0 < γ < 1, the discounted sum Σ_t γ^t r_t stays finite
    • Smaller γ means a smaller "horizon" (shorter-term focus)

Discounting
• Typically discount rewards by γ < 1 each time step
• Sooner rewards have higher utility than later rewards
• Also helps the algorithms converge

Recap: Defining MDPs
• Markov decision processes:
  • States S
  • Start state s_0
  • Actions A
  • Transitions P(s'|s,a) (or T(s,a,s'))
  • Rewards R(s,a,s') (and discount γ)
• MDP quantities so far:
  • Policy = choice of action for each state
  • Utility (or return) = sum of discounted rewards

Optimal Utilities
• Fundamental operation: compute the values (optimal expectimax utilities) of states
• Why? Optimal values define optimal policies!
• Define the value of a state s:
  V*(s) = expected utility starting in s and acting optimally
• Define the value of a q-state (s, a):
  Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally
• Define the optimal policy:
  π*(s) = optimal action from state s

The Bellman Equations
• Definition of "optimal utility" leads to a simple one-step lookahead relationship amongst optimal utility values:
  Optimal rewards = maximize over the first action and then follow the optimal policy
• Formally:
  V*(s) = max_a Q*(s,a)
  Q*(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
  V*(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
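The one-step lookahead above translates directly into code. This is a generic sketch under an assumed MDP interface (a transitions(s, a) function returning (s', P(s'|s,a), R(s,a,s')) triples, as in the High-Low sketch earlier); it is not code from the course.

```python
# Sketch of the Bellman one-step lookahead, assuming the MDP is given as a
# transitions(s, a) function returning (s', P(s'|s,a), R(s,a,s')) triples.

def q_value(transitions, V, s, a, gamma):
    """Q*(s,a) = sum_{s'} T(s,a,s') * [ R(s,a,s') + gamma * V*(s') ]"""
    return sum(p * (r + gamma * V.get(s2, 0.0))
               for s2, p, r in transitions(s, a))

def state_value(transitions, V, s, actions, gamma):
    """V*(s) = max_a Q*(s,a): the one-step lookahead over optimal values."""
    return max(q_value(transitions, V, s, a, gamma) for a in actions)

# With V identically zero, q_value reduces to the expected immediate reward;
# e.g. for the High-Low sketch above, Q(4, 'Low') = 0.5*2 + 0.25*3 + 0.25*0 = 1.75.
```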
Solving MDPs
• We want to find the optimal policy π*
• Proposal 1: modified expectimax search, starting from each state s
  [Figure: expectimax-style tree rooted at s, through q-states (s, a) and transitions (s, a, s') to successor states s']

Why Not Search Trees?
• Why not solve with expectimax?
• Problems:
  • This tree is usually infinite
  • The same states appear over and over
  • We would search once per state
• Idea: value iteration
  • Compute optimal values for all states all at once using successive approximations
  • Will be a bottom-up dynamic program similar in cost to memoization
  • Do all planning offline, no replanning needed!

Value Estimates
• Calculate estimates V_k*(s)
  • Not the optimal value of s!
  • The optimal value considering only the next k time steps (k rewards)
  • As k → ∞, it approaches the optimal value
• Why:
  • If discounting, distant rewards become negligible
  • If terminal states are reachable from everywhere, the fraction of episodes not ending becomes negligible
  • Otherwise, we can get infinite expected utility and then this approach actually won't work
• What happened to the evaluation function?

Memoized Recursion?
• Recurrences:
  V_0*(s) = 0
  V_k*(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V_{k−1}*(s') ]
• Cache all function call results so you never repeat work

Value Iteration
• Problems with the recursive computation:
  • Have to keep all the V_k*(s) around all the time
  • Don't know which depth π_k(s) to ask for when planning
• Solution: value iteration
  • Calculate values for all states, bottom-up
  • Keep increasing k until convergence

Value Iteration
• Idea:
  • Start with V_0*(s) = 0, which we know is right (why?)
  • Given V_i*, calculate the values for all states for depth i+1:
    V_{i+1}*(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V_i*(s') ]
  • This is called a value update or Bellman update
  • Repeat until convergence
• Theorem: will converge to unique optimal values
  • Basic idea: approximations get refined towards the optimal values
  • The policy may converge long before the values do
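Below is a sketch of value iteration itself, using the same assumed transitions(s, a) interface. The function name, tolerance-based stopping test, and terminal-state handling are implementation choices for illustration, not prescribed by the lecture.

```python
# Sketch of value iteration over a generic finite MDP, assuming a hypothetical
# transitions(s, a) -> [(s', P, R), ...] representation as in the earlier sketches.

def value_iteration(states, actions, transitions, gamma, tol=1e-6):
    """Repeat the Bellman update
        V_{i+1}(s) = max_a sum_{s'} T(s,a,s') * [ R(s,a,s') + gamma * V_i(s') ]
    for all states until the largest change falls below tol."""
    V = {s: 0.0 for s in states}            # V_0(s) = 0 for all s
    while True:
        new_V = {}
        for s in states:
            acts = actions(s)
            if not acts:                    # terminal / absorbing state
                new_V[s] = 0.0
                continue
            new_V[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in transitions(s, a))
                for a in acts
            )
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V

# Hypothetical usage with the High-Low MDP sketched earlier (gamma < 1 keeps the
# possibly infinite game's expected return finite):
#   V = value_iteration(states=[2, 3, 4, 'done'],
#                       actions=lambda s: [] if s == 'done' else ['High', 'Low'],
#                       transitions=high_low_transitions, gamma=0.9)
```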
Example: Bellman Updates
[Figure: one round of Bellman updates on the grid world with γ = 0.9, living reward = 0, noise = 0.2; the max is achieved by a = right, other actions not shown]

Example: Value Iteration
[Figure: value-iteration estimates V_2 and V_3 on the grid world]
• Information propagates outward from the terminal states and eventually all states have correct value estimates
[DEMO]

Convergence*
• Define the max-norm: ||V|| = max_s |V(s)|
• Theorem: for any two approximations U and V,
  ||V_{i+1} − U_{i+1}|| ≤ γ ||V_i − U_i||
  • I.e. any two distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true values, and value iteration converges to a unique, stable, optimal solution
• Theorem: if ||V_{i+1} − V_i|| < ε, then ||V_{i+1} − V*|| < ε γ / (1 − γ)
  • I.e. once the change in our approximation is small, it must also be close to correct

Practice: Computing Actions
• Which action should we choose from state s?
  • Given optimal values V?
    π*(s) = argmax_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
  • Given optimal q-values Q?
    π*(s) = argmax_a Q*(s,a)
• Lesson: actions are easier to select from Q's!

Recap: MDPs
• Markov decision processes:
  • States S
  • Actions A
  • Transitions P(s'|s,a) (or T(s,a,s'))
  • Rewards R(s,a,s') (and discount γ)
  • Start state s_0
• Quantities:
  • Returns = sum of discounted rewards
  • Values = expected future returns from a state (optimal, or for a fixed policy)
  • Q-Values = expected future returns from a q-state (optimal, or for a fixed policy)

Utilities for Fixed Policies
• Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π
• Define the utility of a state s under a fixed policy π:
  V^π(s) = expected total discounted rewards (return) starting in s and following π
• Recursive relation (one-step look-ahead / Bellman equation):
  V^π(s) = Σ_s' T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
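The two operations above, extracting a greedy action from V versus from Q and evaluating a fixed policy, can be sketched as follows. Again the transitions(s, a) interface and function names are assumptions for illustration, not lecture code.

```python
# Choosing actions from values, and evaluating a fixed policy, under the same
# hypothetical transitions(s, a) -> [(s', P, R), ...] representation as before.

def greedy_action_from_V(s, actions, transitions, V, gamma):
    """From V we still need the model for a one-step lookahead over T and R."""
    return max(actions,
               key=lambda a: sum(p * (r + gamma * V.get(s2, 0.0))
                                 for s2, p, r in transitions(s, a)))

def greedy_action_from_Q(s, actions, Q):
    """From Q we just take an argmax: actions are easier to select from Q's."""
    return max(actions, key=lambda a: Q[(s, a)])

def evaluate_policy(states, policy, transitions, gamma, tol=1e-6):
    """Iteratively apply the fixed-policy Bellman equation
        V^pi(s) = sum_{s'} T(s, pi(s), s') * [ R(s, pi(s), s') + gamma * V^pi(s') ]
    (no max over actions, since the policy fixes the action in each state)."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {}
        for s in states:
            a = policy(s)
            if a is None:                   # terminal state
                new_V[s] = 0.0
                continue
            new_V[s] = sum(p * (r + gamma * V[s2])
                           for s2, p, r in transitions(s, a))
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V
```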