Today's Outline: Reinforcement Learning II (Dan Weld, CSE 573: Artificial Intelligence, 10/26/2012)

CSE 573: Artificial Intelligence
Reinforcement Learning II
Dan Weld

Today's Outline
• Review Reinforcement Learning
• Review MDPs
• New MDP algorithm: Q-value iteration
• Review Q-learning
• Large MDPs
  • Linear function approximation
  • Policy gradient
Many slides adapted from either Alan Fern, Dan Klein, Stuart Russell, Luke Zettlemoyer or Andrew Moore.

Applications / Demos
• http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html
• Robotic control: helicopter maneuvering, autonomous vehicles
• Mars rover: path planning, oversubscription planning
• Elevator planning
• Game playing: backgammon, tetris, checkers
• Neuroscience
• Computational finance, sequential auctions
• Assisting the elderly in simple tasks
• Spoken dialog management
• Communication networks: switching, routing, flow control
• War planning, evacuation planning

Small vs. Huge MDPs
• First we cover RL methods for small MDPs
  • The number of states and actions is reasonably small
  • E.g., the policy can be represented as an explicit table
  • These algorithms will inspire more advanced methods
• Later we will cover algorithms for huge MDPs
  • Function approximation methods
  • Policy gradient methods
  • Least-squares policy iteration

Agent Assets
[Figure: how the agent's assets relate to solution methods, contrasting planning (Value Iteration, Policy Iteration) with Monte Carlo methods and Reinforcement Learning.]

Passive vs. Active Learning
• Passive learning
  • The agent has a fixed policy and tries to learn the utilities of states by observing the world go by
  • Analogous to policy evaluation
  • Often serves as a component of active learning algorithms
  • Often inspires active learning algorithms
• Active learning
  • The agent attempts to find an optimal (or at least good) policy by acting in the world
  • Analogous to solving the underlying MDP, but without first being given the MDP model

Model-Based vs. Model-Free RL
• Model-based approach to RL:
  • Learn the MDP model, or an approximation of it
  • Use it for policy evaluation or to find the optimal policy
• Model-free approach to RL:
  • Derive the optimal policy without explicitly learning the model
  • Useful when the model is difficult to represent and/or learn
• We will consider both types of approaches

Comparison (supposing 100 states, 4 actions)
• Model-based approaches: learn T + R; |S|^2 |A| + |S||A| parameters (40,400)
• Model-free approach: learn Q; |S||A| parameters (400)

RL Dimensions
[Figure: RL methods laid out on two axes, passive vs. active and uses-model vs. model-free; entries include ADP, TD learning, direct estimation, optimistic Q-learning / RMax (explore), and ε-greedy Q-learning (exploit).]

Recap: MDPs
• Markov decision processes:
  • States S
  • Actions A
  • Transitions T(s,a,s') aka P(s'|s,a)
  • Rewards R(s,a,s') (and discount γ)
  • Start state s_0 (or distribution P_0)
• Algorithms:
  • Value iteration
  • Q-value iteration
• Quantities:
  • Policy = map from states to actions
  • Utility = sum of discounted future rewards
  • Q-value = expected utility from a q-state, i.e. from a state/action pair
(Andrey Markov, 1856-1922)

Bellman Equations
  Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q*(s',a') ]
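To make the Bellman equation above concrete, here is a minimal Python sketch that evaluates one backup on a hypothetical two-state MDP. The dict-based representation, the state and action names, and the constants are illustrative assumptions, not anything from the slides.

```python
# Minimal sketch of the Bellman equation for Q* on a made-up toy MDP.
# T[s][a] is a list of (next_state, probability) pairs and R[s][a][s'] is the
# reward; these names and the two-state MDP are illustrative only.

GAMMA = 0.9

T = {
    "s1": {"go": [("s2", 0.8), ("s1", 0.2)]},
    "s2": {"go": [("s2", 1.0)]},
}
R = {
    "s1": {"go": {"s2": 1.0, "s1": 0.0}},
    "s2": {"go": {"s2": 0.0}},
}

def q_backup(Q, s, a):
    """One backup: Q*(s,a) = sum_s' T(s,a,s') [R(s,a,s') + gamma * max_a' Q(s',a')]."""
    total = 0.0
    for s_next, prob in T[s][a]:
        best_next = max(Q[s_next].values())          # max_a' Q(s', a')
        total += prob * (R[s][a][s_next] + GAMMA * best_next)
    return total

Q0 = {s: {"go": 0.0} for s in T}                     # Q_0(s,a) = 0
print(q_backup(Q0, "s1", "go"))                      # 0.8*(1.0 + 0) + 0.2*(0.0 + 0) = 0.8
```

Repeating this backup for every (s,a) pair is exactly the Q-value iteration algorithm on the next page.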

Q-Value Iteration
• Regular value iteration: find successively better approximations of the optimal values
  • Start with V_0*(s) = 0
  • Given V_i*, calculate the values for all states at depth i+1:
    V_{i+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_i(s') ]
• But storing Q-values is more useful!
  • Start with Q_0*(s,a) = 0
  • Given Q_i*, calculate the q-values for all q-states at depth i+1:
    Q_{i+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_i(s',a') ]

Q-Value Iteration (algorithm)
  Initialize each q-state: Q_0(s,a) = 0
  Repeat
    For all q-states s,a:
      Compute Q_{i+1}(s,a) from Q_i by a Bellman backup at s,a
  Until max_{s,a} |Q_{i+1}(s,a) - Q_i(s,a)| < ε

Reinforcement Learning
• Markov decision processes:
  • States S
  • Actions A
  • Transitions T(s,a,s') aka P(s'|s,a)
  • Rewards R(s,a,s') (and discount γ)
  • Start state s_0 (or distribution P_0)
• Algorithms:
  • Q-value iteration
  • Q-learning

Recap: Sampling Expectations
• Want to compute an expectation weighted by P(x):
  E[f(x)] = Σ_x P(x) f(x)
• Model-based: estimate P(x) from samples, then compute the expectation
• Model-free: estimate the expectation directly from samples:
  E[f(x)] ≈ (1/N) Σ_i f(x_i)
• Why does this work? Because samples appear with the right frequencies!

Q-Learning Update
• Q-learning = sample-based Q-value iteration
• How do we learn the Q*(s,a) values?
  • Receive a sample (s, a, s', r)
  • Consider your old estimate: Q(s,a)
  • Consider your new sample estimate: sample = r + γ max_{a'} Q(s',a')
  • Incorporate the new estimate into a running average:
    Q(s,a) ← (1-α) Q(s,a) + α · sample

Recap: Exponential Moving Average
• Exponential moving average: x̄_n = (1-α) x̄_{n-1} + α x_n
  • Makes recent samples more important
  • Forgets about the past (distant past values were wrong anyway)
  • Easy to compute from the running average
  • A decreasing learning rate can give converging averages
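Below is a compact sketch of the two updates described on this page: the Q-value iteration loop with the ε stopping test, and the sample-based Q-learning update written as an exponential moving average. The function names, the dict-based MDP encoding, and the default α, γ, ε values are my own illustrative choices, not code from the course.

```python
# Sketches of Q-value iteration and the Q-learning update, using the same
# dict-based MDP encoding as the earlier toy example (T[s][a] = list of
# (next_state, probability), R[s][a][s'] = reward).

def q_value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Repeat Bellman backups for all q-states until the largest change is < eps."""
    Q = {s: {a: 0.0 for a in actions} for s in states}        # Q_0(s,a) = 0
    while True:
        delta = 0.0
        new_Q = {s: dict(Q[s]) for s in states}
        for s in states:
            for a in actions:
                backup = sum(p * (R[s][a][s2] + gamma * max(Q[s2].values()))
                             for s2, p in T[s][a])
                delta = max(delta, abs(backup - Q[s][a]))
                new_Q[s][a] = backup
        Q = new_Q
        if delta < eps:                                        # max_{s,a} |Q_{i+1} - Q_i| < eps
            return Q

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One sample-based update: a running (exponential moving) average of samples."""
    sample = r + gamma * max(Q[s_next].values())               # r + gamma * max_a' Q(s',a')
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * sample           # Q <- (1-a)Q + a*sample
```

For example, calling q_value_iteration with the toy T and R from the earlier sketch reproduces those backups automatically; q_learning_update needs only observed transitions, never T or R.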

Q-Learning Update (continued)
• Alternatively, the same update can be written in terms of a difference:
  difference = sample - Q(s,a)
  Q(s,a) ← Q(s,a) + α · difference

Exploration / Exploitation
• ε-greedy
  • Every time step, flip a coin: with probability ε, act randomly
  • With probability 1-ε, act according to the current policy
• Exploration function
  • Explore areas whose badness is not (yet) established
  • Takes a value estimate and a count, and returns an optimistic utility, e.g. f(u,n) = u + k/n (exact form not important)
  • Exploration policy: π(s') = argmax_a f(Q(s',a), N(s',a)), vs. the greedy policy argmax_a Q(s',a)

Q-Learning: Final Solution
• Q-learning produces tables of q-values
[Video demo]

Q-Learning Properties
• Amazing result: Q-learning converges to the optimal policy
  • If you explore enough
  • If you make the learning rate small enough
  • … but don't decrease it too quickly!
  • Not too sensitive to how you select actions (!)
• Neat property: off-policy learning
  • Learn the optimal policy without following it (some caveats)

Q-Learning: Small Problem
• The tabular approach doesn't work in realistic situations: we can't possibly learn about every single state!
  • Too many states to visit them all in training
  • Too many states to hold the q-tables in memory
• Instead, we need to generalize:
  • Learn about a few states from experience
  • Generalize that experience to new, similar states
  (a fundamental idea in machine learning)
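The exploration strategies above are short to express in code. Here is a hedged sketch of ε-greedy action selection and an exploration-function variant; the function names, the k constant, and the +1 in the visit count (added to avoid dividing by zero) are illustrative assumptions rather than the slides' exact formulation.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[s][a])        # exploit: argmax_a Q(s,a)

def optimistic_policy(Q, N, s, actions, k=1.0):
    """Exploration-function variant: act greedily w.r.t. an optimistic utility
    f(u, n) = u + k / (n + 1), so rarely-tried actions look temporarily better."""
    return max(actions, key=lambda a: Q[s][a] + k / (N[s][a] + 1))
```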

RL Dimensions
[Figure: the RL-dimensions diagram again, now with a third axis for problem size ("many states").]

Example: Pacman
• Let's say we discover through experience that this state is bad:
• In naïve Q-learning, we know nothing about related states and their Q-values:
• Or even this third one!
[Figure: three similar Pacman states]

Feature-Based Representations
• Solution: describe a state using a vector of features (properties)
  • Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  • Example features:
    • Distance to the closest ghost
    • Distance to the closest dot
    • Number of ghosts
    • 1 / (distance to dot)^2
    • Is Pacman in a tunnel? (0/1)
    • … etc.
  • Can also describe a q-state (s,a) with features (e.g. "action moves closer to food")

Linear Feature Functions
• Using a feature representation, we can write a q-function (or value function) for any state using a linear combination of a few weights:
  Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + … + w_n f_n(s,a)
• Advantage: our experience is summed up in a few powerful numbers (compare |S|^2 |A| or |S||A| table entries)
• Disadvantage: states may share features but actually be very different in value!

Function Approximation
• Q-learning with linear q-functions: on each transition (s, a, r, s')
  difference = [ r + γ max_{a'} Q(s',a') ] - Q(s,a)
  Exact Q's:        Q(s,a) ← Q(s,a) + α · difference
  Approximate Q's:  w_i ← w_i + α · difference · f_i(s,a)
• Intuitive interpretation:
  • Adjust the weights of the active features
  • E.g., if something unexpectedly bad happens, disprefer all states with that state's features
• Formal justification: online least squares

Example: Q-Pacman
[Figure: a worked numeric example of the linear q-function update on a concrete Pacman transition.]
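As a sketch of the linear function approximation update (not the actual code behind the Pacman demo), the following snippet stores weights in a dict keyed by feature name and applies the "adjust weights of active features" rule from the Function Approximation slide. The feature encoding, function names, and learning-rate value are assumptions.

```python
# Minimal sketch of Q-learning with a linear q-function.

def q_value(weights, features):
    """Q(s,a) = sum_i w_i * f_i(s,a), with features given as a dict {name: value}."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def approx_q_update(weights, feats_sa, r, next_action_feats, alpha=0.05, gamma=0.9):
    """Update the weights (not a table entry) from one transition (s, a, r, s').

    feats_sa          -- features f_i(s, a) of the state-action pair just taken
    next_action_feats -- list of feature dicts, one per action available in s'
    """
    best_next = max((q_value(weights, f) for f in next_action_feats), default=0.0)
    difference = (r + gamma * best_next) - q_value(weights, feats_sa)
    for name, value in feats_sa.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
```

Because the weights are shared across states, one bad surprise lowers the predicted value of every state-action pair with the same active features, which is exactly the generalization the slides ask for.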

Q-Learning with Linear Approximators
1. Start with initial parameter values
2. Take action a according to an explore/exploit policy (should converge to a greedy policy, i.e. GLIE), transitioning from s to s'
3. Perform the TD update for each parameter
4. Go to 2
• Q-learning with linear approximators can diverge; it converges under some conditions.

Q-learning, no features, 50 learning trials:
[Video demo]

Q-learning, no features, 1000 learning trials:
[Video demo]

Q-learning, simple features, 50 learning trials:
[Video demo]

Linear Regression
[Figure: linear regression fits in one and two dimensions, with predicted values shown.]

Why Does This Work?
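One way to see the "online least squares" justification mentioned earlier: the per-weight TD update is exactly one stochastic gradient step on the squared TD error, with the target held fixed. The tiny numeric check below uses made-up weights, features, and a made-up target purely for illustration.

```python
# The update w_i <- w_i + alpha * (target - Q_w(s,a)) * f_i(s,a) is a single
# gradient-descent step on 1/2 * (target - w . f)^2, treating target as constant.
# All numbers here are invented for illustration.

ALPHA = 0.1
w = [0.5, -0.2]          # current weights
f = [1.0, 2.0]           # feature values f_i(s, a) for one transition
target = 1.5             # stands in for r + gamma * max_a' Q(s', a')

q_sa = sum(wi * fi for wi, fi in zip(w, f))      # current prediction w . f
error = target - q_sa

# TD-style update from the "Function Approximation" slide
w_td = [wi + ALPHA * error * fi for wi, fi in zip(w, f)]

# Gradient of 1/2 * (target - w . f)^2 w.r.t. w_i is -(error) * f_i,
# so a gradient-descent step gives exactly the same new weights.
w_gd = [wi - ALPHA * (-(error) * fi) for wi, fi in zip(w, f)]

assert all(abs(a - b) < 1e-12 for a, b in zip(w_td, w_gd))
```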
