CSE 473: Artificial Intelligence
Reinforcement Learning
Dan Weld (5/7/2012)

Today's Outline
- Reinforcement learning
- Q-value iteration
- Q-learning
- Exploration / exploitation
- Linear function approximation
Many slides adapted from either Dan Klein, Stuart Russell, Luke Zettlemoyer or Andrew Moore.

Recap: MDPs and Bellman Equations
- Markov decision processes:
  - States S
  - Actions A
  - Transitions T(s,a,s'), aka P(s'|s,a)
  - Rewards R(s,a,s') (and discount γ)
  - Start state s0 (or distribution P0)
- Algorithms: value iteration, Q-value iteration
- Quantities:
  - Policy = map from states to actions
  - Utility = sum of discounted future rewards
  - Q-value = expected utility from a q-state, i.e. from a state/action pair
- Bellman equation for q-values: Q*(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ max_a' Q*(s',a') ]
(Portrait: Andrey Markov, 1856-1922)

Bellman Backup (worked example)
- From state s0, with successor values V4(s1) = 0, V4(s2) = 1, V4(s3) = 2:
  - Q5(s0,a1) = 2 + γ·0 ≈ 2
  - Q5(s0,a2) = 5 + γ·(0.9·1 + 0.1·2) ≈ 6.1
  - Q5(s0,a3) = 4.5 + γ·2 ≈ 6.5
  - V5(s0) = max over actions ≈ 6.5

Q-Value Iteration
- Regular value iteration: find successive approximations to the optimal values
  - Start with V0*(s) = 0
  - Given Vi*, calculate the values for all states at depth i+1:
    V_{i+1}(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ Vi(s') ]
- Storing Q-values is more useful!
  - Start with Q0*(s,a) = 0
  - Given Qi*, calculate the q-values for all q-states at depth i+1:
    Q_{i+1}(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ max_a' Qi(s',a') ]
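The Q-value iteration update above translates directly into code. Below is a minimal sketch, assuming a toy MDP given as explicit transition and reward tables; the dictionaries `T` and `R`, the discount `gamma`, and the fixed iteration count are illustrative assumptions, not part of the slides.

```python
# Minimal Q-value iteration sketch on an assumed toy MDP (not from the slides).
# T[s][a] is a list of (next_state, probability); R[(s, a, s2)] is the reward.

gamma = 0.9  # assumed discount factor

T = {
    's0': {'a1': [('s1', 1.0)],
           'a2': [('s1', 0.9), ('s2', 0.1)]},
    's1': {'a1': [('s1', 1.0)]},
    's2': {'a1': [('s2', 1.0)]},
}
R = {('s0', 'a1', 's1'): 2.0, ('s0', 'a2', 's1'): 5.0, ('s0', 'a2', 's2'): 5.0,
     ('s1', 'a1', 's1'): 0.0, ('s2', 'a1', 's2'): 1.0}

# Start with Q_0(s, a) = 0 for every q-state.
Q = {(s, a): 0.0 for s in T for a in T[s]}

for _ in range(100):
    # One round of Bellman backups:
    # Q_{i+1}(s,a) = sum_s' T(s,a,s') [ R(s,a,s') + gamma * max_a' Q_i(s',a') ]
    newQ = {}
    for (s, a) in Q:
        newQ[(s, a)] = sum(
            p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in T[s2]))
            for (s2, p) in T[s][a]
        )
    Q = newQ

print(max(Q[('s0', a)] for a in T['s0']))  # V*(s0) under this toy model
```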

Q-Value Iteration (algorithm)
- Initialize each q-state: Q0(s,a) = 0
- Repeat:
  - For all q-states s,a: compute Q_{i+1}(s,a) from Qi by a Bellman backup at s,a:
    Q_{i+1}(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ max_a' Qi(s',a') ]
- Until max_{s,a} |Q_{i+1}(s,a) − Qi(s,a)| < ε

Reinforcement Learning
- Markov decision processes:
  - States S
  - Actions A
  - Transitions T(s,a,s'), aka P(s'|s,a)
  - Rewards R(s,a,s') (and discount γ)
  - Start state s0 (or distribution P0)
- Algorithms:
  - Q-value iteration
  - Q-learning
- Approaches for mixing exploration & exploitation:
  - ε-greedy
  - Exploration functions

Applications (Stanford Autonomous Helicopter, http://heli.stanford.edu/)
- Robotic control: helicopter maneuvering, autonomous vehicles
- Mars rover: path planning, oversubscription planning
- Elevator planning
- Game playing: backgammon, tetris, checkers
- Neuroscience
- Computational finance, sequential auctions
- Assisting the elderly in simple tasks
- Spoken dialog management
- Communication networks: switching, routing, flow control
- War planning, evacuation planning

Two Main Reinforcement Learning Approaches
- Model-based approaches: learn T + R
  - Explore the environment & learn a model, T = P(s'|s,a) and R(s,a), (almost) everywhere
  - |S|²|A| + |S||A| parameters (e.g. 40,000)
  - Use the model to plan a policy, MDP-style
  - Leads to the strongest theoretical results
  - Often works well when the state space is manageable
- Model-free approach: learn Q
  - Don't learn a model; learn a value function or policy directly
  - |S||A| parameters (e.g. 400)
  - Weaker theoretical results
  - Often works better when the state space is large
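To make the model-based bullet concrete, here is a minimal sketch of the "learn T + R" step: count observed transitions and average observed rewards. The function name, the (s, a, s', r) sample format, and the dictionary layout are assumptions for illustration, not the lecture's code.

```python
from collections import defaultdict

def estimate_model(samples):
    """Estimate T and R from experience.

    samples: iterable of (s, a, s2, r) transition samples (assumed format).
    """
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s2]
    reward_sums = defaultdict(float)                 # total reward per (s, a, s2)

    for (s, a, s2, r) in samples:
        counts[(s, a)][s2] += 1
        reward_sums[(s, a, s2)] += r

    T_hat, R_hat = {}, {}
    for (s, a), nexts in counts.items():
        total = sum(nexts.values())
        for s2, n in nexts.items():
            T_hat[(s, a, s2)] = n / total                    # empirical P(s'|s,a)
            R_hat[(s, a, s2)] = reward_sums[(s, a, s2)] / n  # average observed reward
    return T_hat, R_hat

# The learned T_hat, R_hat can then be handed to value / Q-value iteration,
# which is the MDP-style planning step described on the slide.
```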

Recap: Sampling Expectations
- Want to compute an expectation weighted by P(x): E[f(x)] = Σ_x P(x) f(x)
- Model-based: estimate P(x) from samples, then compute the expectation
- Model-free: estimate the expectation directly from samples
- Why does this work? Because samples appear with the right frequencies!

Recap: Exponential Moving Average
- Exponential moving average: x̄_n = (1 − α)·x̄_{n−1} + α·x_n
- Makes recent samples more important
- Forgets about the past (distant past values were wrong anyway)
- Easy to compute from the running average
- A decreasing learning rate can give converging averages

Q-Learning Update
- Q-learning = sample-based Q-value iteration
- How to learn the Q*(s,a) values? (a code sketch follows these notes)
  - Receive a sample (s, a, s', r)
  - Consider your old estimate: Q(s,a)
  - Consider your new sample estimate: sample = r + γ max_a' Q(s',a')
  - Incorporate the new estimate into a running average:
    Q(s,a) ← (1 − α)·Q(s,a) + α·sample

Exploration-Exploitation Tradeoff
- You have visited part of the state space and found a reward of 100. Is this the best you can hope for?
- Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge?
  - At the risk of missing out on a better reward somewhere
- Exploration: should I look for states with more reward?
  - At the risk of wasting time & getting some negative reward

Q-Learning: ε-Greedy Exploration / Exploitation
- Several schemes for action selection
- Simplest: random actions (ε-greedy)
  - Every time step, flip a coin
  - With probability ε, act randomly
  - With probability 1 − ε, act according to the current policy
- Problems with random actions?
  - You do explore the space, but keep thrashing around once learning is done
  - One solution: lower ε over time
  - Another solution: exploration functions
[Video demo]
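The Q-learning update described above is the exponential moving average applied to q-values. Here is a minimal sketch, assuming a table of q-values, a fixed learning rate `alpha`, and a helper `actions_in(s)` that lists legal actions; these names and constants are assumptions, not the lecture's code.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9          # assumed learning rate and discount
Q = defaultdict(float)           # Q[(s, a)], initialized to 0

def q_update(s, a, s2, r, actions_in):
    """One Q-learning update from a sample (s, a, s', r).

    actions_in(s2) returns the actions available in s2 (an assumed helper).
    """
    sample = r + gamma * max(Q[(s2, a2)] for a2 in actions_in(s2))
    # Exponential moving average: blend the old estimate with the new sample.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```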

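The ε-greedy coin-flip scheme from the previous slide can be sketched in a few lines; the value of `epsilon` and the argument names are assumptions.

```python
import random

epsilon = 0.1  # assumed exploration rate; the slide suggests lowering it over time

def epsilon_greedy(s, actions, Q):
    """With probability epsilon act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)              # explore
    return max(actions, key=lambda a: Q[(s, a)])   # exploit the current policy
```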
Exploration Functions
- When to explore:
  - Random actions: explore a fixed amount
  - Better idea: explore areas whose badness is not (yet) established
- Exploration function: takes a value estimate and a count, and returns an optimistic utility (exact form not important; a sketch follows these notes)
- Exploration policy: π(s') = argmax_a f(Q(s',a), N(s',a)) vs. plain argmax_a Q(s',a)

Q-Learning: Final Solution
- Q-learning produces tables of q-values
[Image: a table of learned q-values for Pacman]

Q-Learning Properties
- Amazing result: Q-learning converges to the optimal policy
  - If you explore enough
  - If you make the learning rate small enough
  - ... but not decrease it too quickly!
  - Not too sensitive to how you select actions (!)
- Neat property: off-policy learning
  - Learn the optimal policy without following it (some caveats)

Q-Learning: A Small Problem
- Doesn't work as-is: in realistic situations, we can't possibly learn about every single state!
  - Too many states to visit them all in training
  - Too many states to hold the q-tables in memory
- Instead, we need to generalize:
  - Learn about a few states from experience
  - Generalize that experience to new, similar states
  - (Fundamental idea in machine learning)

Example: Pacman
- Let's say we discover through experience that this state is bad: [image]
- In naïve Q-learning, we know nothing about related states and their Q-values: [image]
- Or even this third one! [image]

Feature-Based Representations
- Solution: describe a state using a vector of features (properties)
  - Features are functions from states to real numbers (often 0/1) that capture important properties of the state
- Example features:
  - Distance to the closest ghost
  - Distance to the closest dot
  - Number of ghosts
  - 1 / (distance to dot)²
  - Is Pacman in a tunnel? (0/1)
  - ... etc.
- Can also describe a q-state (s, a) with features (e.g. "action moves closer to food")
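The exploration-function slide above leaves the exact form open; one common optimistic choice is a visit-count bonus, sketched below. The constant `k` and the form u + k/(n+1) are assumptions, not the lecture's definition.

```python
k = 1.0  # assumed optimism constant

def exploration_value(u, n):
    """Optimistic utility: a value estimate u plus a bonus that shrinks as the
    visit count n grows (one possible form; the slide says the exact form is
    not important)."""
    return u + k / (n + 1)

# Action selection then prefers rarely tried q-states:
#   pi(s) = argmax_a exploration_value(Q[(s, a)], N[(s, a)])
```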

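As a companion to the feature-based representation slide, here is a sketch of a feature extractor for a Pacman-like q-state. Only the feature list mirrors the slide; the dictionary keys standing in for the game state are hypothetical.

```python
def features(state, action):
    """Map a q-state (s, a) to a small vector of numbers.

    `state` is assumed to be a dict holding the quantities named on the slide
    after taking `action`; the keys below are hypothetical.
    """
    return [
        1.0,                                            # bias feature
        state['dist_to_closest_ghost'],
        state['dist_to_closest_dot'],
        float(state['num_ghosts']),
        1.0 / (state['dist_to_closest_dot'] ** 2 + 1),  # 1 / (dist to dot)^2
        1.0 if state['in_tunnel'] else 0.0,
    ]

# Example:
print(features({'dist_to_closest_ghost': 4, 'dist_to_closest_dot': 2,
                'num_ghosts': 2, 'in_tunnel': False}, 'North'))
```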
Linear Feature Functions
- Using a feature representation, we can write a q function (or value function) for any state using a linear combination of a few weights:
  Q(s,a) = w1·f1(s,a) + w2·f2(s,a) + ... + wn·fn(s,a)
- Advantage: our experience is summed up in a few powerful numbers
  (compare |S|²|A| or |S||A| parameters for table-based methods)
- Disadvantage: states may share features but actually be very different in value!

Function Approximation
- Q-learning with linear q-functions (a code sketch follows these notes):
  - Exact Q's: Q(s,a) ← Q(s,a) + α·[ r + γ max_a' Q(s',a') − Q(s,a) ]
  - Approximate Q's: w_i ← w_i + α·[ r + γ max_a' Q(s',a') − Q(s,a) ]·f_i(s,a)
- Intuitive interpretation:
  - Adjust the weights of active features
  - E.g. if something unexpectedly bad happens, disprefer all states with that state's features
- Formal justification: online least squares

Example: Q-Pacman
[Image: a worked Q-Pacman update with a small set of features and weights]

Linear Regression
[Plots: fitting a line (and a plane) to data points by linear regression]

Ordinary Least Squares (OLS)
- Error, or "residual" = observation − prediction
- OLS picks the weights that minimize the total squared residual
[Plot: observations, the fitted prediction line, and the residuals]

Minimizing Error
- Imagine we had only one point x with features f(x): take a gradient step that reduces the squared error between the observation y ("target") and the prediction Σ_i w_i·f_i(x)
- The approximate q update has the same form, with "target" = r + γ max_a' Q(s',a') and "prediction" = Q(s,a)
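Here is the approximate Q-learning weight update from the function-approximation slide written out as code; the step size, discount, and the six-weight vector (matching the hypothetical feature sketch earlier) are assumptions.

```python
alpha, gamma = 0.01, 0.9     # assumed step size and discount
weights = [0.0] * 6          # one weight per feature in the sketch above (assumed)

def q_value(f_sa):
    """Linear q-function: Q(s,a) = sum_i w_i * f_i(s,a)."""
    return sum(w * f for w, f in zip(weights, f_sa))

def approx_q_update(f_sa, r, best_next_q):
    """One approximate Q-learning update.

    f_sa:        feature vector f(s, a) for the visited q-state
    r:           observed reward
    best_next_q: max_a' Q(s', a') under the current weights
    """
    difference = (r + gamma * best_next_q) - q_value(f_sa)   # "target" - "prediction"
    for i, f_i in enumerate(f_sa):
        weights[i] += alpha * difference * f_i               # adjust active features
```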

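The "formal justification: online least squares" bullet can be seen in the one-point gradient step below, which has the same shape as the approximate q update; the example numbers are made up.

```python
# Online least squares for a single point (x, y): a gradient step on
# error(w) = 0.5 * (y - sum_k w_k * f_k(x))**2, where y plays the role
# of the "target" and the weighted feature sum is the "prediction".

def ols_step(weights, f_x, y, alpha=0.01):
    prediction = sum(w * f for w, f in zip(weights, f_x))
    residual = y - prediction                        # the "error" on the slide
    return [w + alpha * residual * f for w, f in zip(weights, f_x)]

# Example: one step toward fitting y = 2*x with feature f(x) = [x].
print(ols_step([0.0], [3.0], 6.0))                   # -> [0.18]
```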
Overfitting
[Plot: a degree-15 polynomial fit to sample data, illustrating overfitting]

Which Algorithm?
- Q-learning, no features, 50 learning trials [video]
- Q-learning, no features, 1000 learning trials [video]
- Q-learning, simple features, 50 learning trials [video]

Partially Observable MDPs
- Markov decision processes:
  - States S
  - Actions A
  - Transitions P(s'|s,a) (or T(s,a,s'))
  - Rewards R(s,a,s') (and discount γ)
  - Start state distribution b0 = P(s0)
- POMDPs, just add:
  - Observations O
  - Observation model P(o|s,a) (or O(s,a,o))
[Diagram: belief state b, action a, observation o, updated belief b']

A POMDP: Ghost Hunter
[Video demo]
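The b, a, o, b' diagram on the POMDP slide corresponds to a belief-state update. Below is a minimal sketch, assuming explicit transition and observation probability tables T and O; the table formats and function signature are assumptions, not the lecture's notation.

```python
# Belief update for the b --(a, o)--> b' step in the POMDP diagram.
# T[(s, a, s2)] = P(s2 | s, a); O[(s2, a, o)] = P(o | s2, a). Both are
# assumed toy tables, not from the slides.

def belief_update(b, a, o, states, T, O):
    """Return b'(s') proportional to O(s',a,o) * sum_s T(s,a,s') * b(s)."""
    new_b = {}
    for s2 in states:
        new_b[s2] = O.get((s2, a, o), 0.0) * sum(
            T.get((s, a, s2), 0.0) * b[s] for s in states
        )
    norm = sum(new_b.values())
    if norm > 0:
        for s2 in new_b:
            new_b[s2] /= norm   # renormalize to get a probability distribution
    return new_b
```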
