Today's Outline: Reinforcement Learning II (Dan Weld, CSE 573: Artificial Intelligence, 10/26/2012)

CSE 573: Artificial Intelligence
Reinforcement Learning II
Dan Weld

Today's Outline
• Review Reinforcement Learning
• Review MDPs
• New MDP algorithm: Q-value iteration
• Review Q-learning
• Large MDPs
  • Linear function approximation
  • Policy gradient
Many slides adapted from either Alan Fern, Dan Klein, Stuart Russell, Luke Zettlemoyer or Andrew Moore.

Applications / Demos
• http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html
• Robotic control: helicopter maneuvering, autonomous vehicles
• Mars rover: path planning, oversubscription planning
• Elevator planning
• Game playing: backgammon, tetris, checkers
• Neuroscience
• Computational finance, sequential auctions
• Assisting the elderly in simple tasks
• Spoken dialog management
• Communication networks: switching, routing, flow control
• War planning, evacuation planning

Small vs. Huge MDPs
• First we cover RL methods for small MDPs
  • The number of states and actions is reasonably small
  • E.g., the policy can be represented as an explicit table
  • These algorithms will inspire more advanced methods
• Later we will cover algorithms for huge MDPs
  • Function approximation methods
  • Policy gradient methods
  • Least-squares policy iteration

Agent Assets
[Figure: how the agent's assets relate to solution methods, contrasting planning (Value Iteration, Policy Iteration) with Monte Carlo methods and Reinforcement Learning.]

Passive vs. Active Learning
• Passive learning
  • The agent has a fixed policy and tries to learn the utilities of states by observing the world go by
  • Analogous to policy evaluation
  • Often serves as a component of active learning algorithms
  • Often inspires active learning algorithms
• Active learning
  • The agent attempts to find an optimal (or at least good) policy by acting in the world
  • Analogous to solving the underlying MDP, but without first being given the MDP model

Model-Based vs. Model-Free RL
• Model-based approach to RL:
  • Learn the MDP model, or an approximation of it
  • Use it for policy evaluation or to find the optimal policy
• Model-free approach to RL:
  • Derive the optimal policy without explicitly learning the model
  • Useful when the model is difficult to represent and/or learn
• We will consider both types of approaches

Comparison (supposing 100 states, 4 actions)
• Model-based approaches: learn T + R; |S|^2 |A| + |S||A| parameters (40,400)
• Model-free approach: learn Q; |S||A| parameters (400)

RL Dimensions
[Figure: RL methods laid out on two axes, passive vs. active and uses-model vs. model-free; entries include ADP, TD learning, direct estimation, optimistic Q-learning / RMax (explore), and ε-greedy Q-learning (exploit).]

Recap: MDPs
• Markov decision processes:
  • States S
  • Actions A
  • Transitions T(s,a,s') aka P(s'|s,a)
  • Rewards R(s,a,s') (and discount γ)
  • Start state s_0 (or distribution P_0)
• Algorithms:
  • Value iteration
  • Q-value iteration
• Quantities:
  • Policy = map from states to actions
  • Utility = sum of discounted future rewards
  • Q-value = expected utility from a q-state, i.e. from a state/action pair
(Andrey Markov, 1856-1922)

Bellman Equations
  Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q*(s',a') ]
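To make the Bellman equation above concrete, here is a minimal Python sketch that evaluates one backup on a hypothetical two-state MDP. The dict-based representation, the state and action names, and the constants are illustrative assumptions, not anything from the slides.

```python
# Minimal sketch of the Bellman equation for Q* on a made-up toy MDP.
# T[s][a] is a list of (next_state, probability) pairs and R[s][a][s'] is the
# reward; these names and the two-state MDP are illustrative only.

GAMMA = 0.9

T = {
    "s1": {"go": [("s2", 0.8), ("s1", 0.2)]},
    "s2": {"go": [("s2", 1.0)]},
}
R = {
    "s1": {"go": {"s2": 1.0, "s1": 0.0}},
    "s2": {"go": {"s2": 0.0}},
}

def q_backup(Q, s, a):
    """One backup: Q*(s,a) = sum_s' T(s,a,s') [R(s,a,s') + gamma * max_a' Q(s',a')]."""
    total = 0.0
    for s_next, prob in T[s][a]:
        best_next = max(Q[s_next].values())          # max_a' Q(s', a')
        total += prob * (R[s][a][s_next] + GAMMA * best_next)
    return total

Q0 = {s: {"go": 0.0} for s in T}                     # Q_0(s,a) = 0
print(q_backup(Q0, "s1", "go"))                      # 0.8*(1.0 + 0) + 0.2*(0.0 + 0) = 0.8
```

Repeating this backup for every (s,a) pair is exactly the Q-value iteration algorithm on the next page.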

Q-Value Iteration
• Regular value iteration: find successively better approximations of the optimal values
  • Start with V_0*(s) = 0
  • Given V_i*, calculate the values for all states at depth i+1:
    V_{i+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_i(s') ]
• But storing Q-values is more useful!
  • Start with Q_0*(s,a) = 0
  • Given Q_i*, calculate the q-values for all q-states at depth i+1:
    Q_{i+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_i(s',a') ]

Q-Value Iteration (algorithm)
  Initialize each q-state: Q_0(s,a) = 0
  Repeat
    For all q-states s,a:
      Compute Q_{i+1}(s,a) from Q_i by a Bellman backup at s,a
  Until max_{s,a} |Q_{i+1}(s,a) - Q_i(s,a)| < ε

Reinforcement Learning
• Markov decision processes:
  • States S
  • Actions A
  • Transitions T(s,a,s') aka P(s'|s,a)
  • Rewards R(s,a,s') (and discount γ)
  • Start state s_0 (or distribution P_0)
• Algorithms:
  • Q-value iteration
  • Q-learning

Recap: Sampling Expectations
• Want to compute an expectation weighted by P(x):
  E[f(x)] = Σ_x P(x) f(x)
• Model-based: estimate P(x) from samples, then compute the expectation
• Model-free: estimate the expectation directly from samples:
  E[f(x)] ≈ (1/N) Σ_i f(x_i)
• Why does this work? Because samples appear with the right frequencies!

Q-Learning Update
• Q-learning = sample-based Q-value iteration
• How do we learn the Q*(s,a) values?
  • Receive a sample (s, a, s', r)
  • Consider your old estimate: Q(s,a)
  • Consider your new sample estimate: sample = r + γ max_{a'} Q(s',a')
  • Incorporate the new estimate into a running average:
    Q(s,a) ← (1-α) Q(s,a) + α · sample

Recap: Exponential Moving Average
• Exponential moving average: x̄_n = (1-α) x̄_{n-1} + α x_n
  • Makes recent samples more important
  • Forgets about the past (distant past values were wrong anyway)
  • Easy to compute from the running average
  • A decreasing learning rate can give converging averages
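Below is a compact sketch of the two updates described on this page: the Q-value iteration loop with the ε stopping test, and the sample-based Q-learning update written as an exponential moving average. The function names, the dict-based MDP encoding, and the default α, γ, ε values are my own illustrative choices, not code from the course.

```python
# Sketches of Q-value iteration and the Q-learning update, using the same
# dict-based MDP encoding as the earlier toy example (T[s][a] = list of
# (next_state, probability), R[s][a][s'] = reward).

def q_value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Repeat Bellman backups for all q-states until the largest change is < eps."""
    Q = {s: {a: 0.0 for a in actions} for s in states}        # Q_0(s,a) = 0
    while True:
        delta = 0.0
        new_Q = {s: dict(Q[s]) for s in states}
        for s in states:
            for a in actions:
                backup = sum(p * (R[s][a][s2] + gamma * max(Q[s2].values()))
                             for s2, p in T[s][a])
                delta = max(delta, abs(backup - Q[s][a]))
                new_Q[s][a] = backup
        Q = new_Q
        if delta < eps:                                        # max_{s,a} |Q_{i+1} - Q_i| < eps
            return Q

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One sample-based update: a running (exponential moving) average of samples."""
    sample = r + gamma * max(Q[s_next].values())               # r + gamma * max_a' Q(s',a')
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * sample           # Q <- (1-a)Q + a*sample
```

For example, calling q_value_iteration with the toy T and R from the earlier sketch reproduces those backups automatically; q_learning_update needs only observed transitions, never T or R.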

Q-Learning Update (continued)
• Alternatively, the same update can be written in terms of a difference:
  difference = sample - Q(s,a)
  Q(s,a) ← Q(s,a) + α · difference

Exploration / Exploitation
• ε-greedy
  • Every time step, flip a coin: with probability ε, act randomly
  • With probability 1-ε, act according to the current policy
• Exploration function
  • Explore areas whose badness is not (yet) established
  • Takes a value estimate and a count, and returns an optimistic utility, e.g. f(u,n) = u + k/n (exact form not important)
  • Exploration policy: π(s') = argmax_a f(Q(s',a), N(s',a)), vs. the greedy policy argmax_a Q(s',a)

Q-Learning: Final Solution
• Q-learning produces tables of q-values
[Video demo]

Q-Learning Properties
• Amazing result: Q-learning converges to the optimal policy
  • If you explore enough
  • If you make the learning rate small enough
  • … but don't decrease it too quickly!
  • Not too sensitive to how you select actions (!)
• Neat property: off-policy learning
  • Learn the optimal policy without following it (some caveats)

Q-Learning: Small Problem
• The tabular approach doesn't work in realistic situations: we can't possibly learn about every single state!
  • Too many states to visit them all in training
  • Too many states to hold the q-tables in memory
• Instead, we need to generalize:
  • Learn about a few states from experience
  • Generalize that experience to new, similar states
  (a fundamental idea in machine learning)
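The exploration strategies above are short to express in code. Here is a hedged sketch of ε-greedy action selection and an exploration-function variant; the function names, the k constant, and the +1 in the visit count (added to avoid dividing by zero) are illustrative assumptions rather than the slides' exact formulation.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[s][a])        # exploit: argmax_a Q(s,a)

def optimistic_policy(Q, N, s, actions, k=1.0):
    """Exploration-function variant: act greedily w.r.t. an optimistic utility
    f(u, n) = u + k / (n + 1), so rarely-tried actions look temporarily better."""
    return max(actions, key=lambda a: Q[s][a] + k / (N[s][a] + 1))
```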

RL Dimensions
[Figure: the RL-dimensions diagram again, now with a third axis for problem size ("many states").]

Example: Pacman
• Let's say we discover through experience that this state is bad:
• In naïve Q-learning, we know nothing about related states and their Q-values:
• Or even this third one!
[Figure: three similar Pacman states]

Feature-Based Representations
• Solution: describe a state using a vector of features (properties)
  • Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  • Example features:
    • Distance to the closest ghost
    • Distance to the closest dot
    • Number of ghosts
    • 1 / (distance to dot)^2
    • Is Pacman in a tunnel? (0/1)
    • … etc.
  • Can also describe a q-state (s,a) with features (e.g. "action moves closer to food")

Linear Feature Functions
• Using a feature representation, we can write a q-function (or value function) for any state using a linear combination of a few weights:
  Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + … + w_n f_n(s,a)
• Advantage: our experience is summed up in a few powerful numbers (compare |S|^2 |A| or |S||A| table entries)
• Disadvantage: states may share features but actually be very different in value!

Function Approximation
• Q-learning with linear q-functions: on each transition (s, a, r, s')
  difference = [ r + γ max_{a'} Q(s',a') ] - Q(s,a)
  Exact Q's:        Q(s,a) ← Q(s,a) + α · difference
  Approximate Q's:  w_i ← w_i + α · difference · f_i(s,a)
• Intuitive interpretation:
  • Adjust the weights of the active features
  • E.g., if something unexpectedly bad happens, disprefer all states with that state's features
• Formal justification: online least squares

Example: Q-Pacman
[Figure: a worked numeric example of the linear q-function update on a concrete Pacman transition.]
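As a sketch of the linear function approximation update (not the actual code behind the Pacman demo), the following snippet stores weights in a dict keyed by feature name and applies the "adjust weights of active features" rule from the Function Approximation slide. The feature encoding, function names, and learning-rate value are assumptions.

```python
# Minimal sketch of Q-learning with a linear q-function.

def q_value(weights, features):
    """Q(s,a) = sum_i w_i * f_i(s,a), with features given as a dict {name: value}."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def approx_q_update(weights, feats_sa, r, next_action_feats, alpha=0.05, gamma=0.9):
    """Update the weights (not a table entry) from one transition (s, a, r, s').

    feats_sa          -- features f_i(s, a) of the state-action pair just taken
    next_action_feats -- list of feature dicts, one per action available in s'
    """
    best_next = max((q_value(weights, f) for f in next_action_feats), default=0.0)
    difference = (r + gamma * best_next) - q_value(weights, feats_sa)
    for name, value in feats_sa.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
```

Because the weights are shared across states, one bad surprise lowers the predicted value of every state-action pair with the same active features, which is exactly the generalization the slides ask for.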

Q-Learning with Linear Approximators
1. Start with initial parameter values
2. Take action a according to an explore/exploit policy (should converge to a greedy policy, i.e. GLIE), transitioning from s to s'
3. Perform the TD update for each parameter
4. Go to 2
• Q-learning with linear approximators can diverge; it converges under some conditions.

Q-learning, no features, 50 learning trials:
[Video demo]

Q-learning, no features, 1000 learning trials:
[Video demo]

Q-learning, simple features, 50 learning trials:
[Video demo]

Linear Regression
[Figure: linear regression fits in one and two dimensions, with predicted values shown.]

Why Does This Work?
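One way to see the "online least squares" justification mentioned earlier: the per-weight TD update is exactly one stochastic gradient step on the squared TD error, with the target held fixed. The tiny numeric check below uses made-up weights, features, and a made-up target purely for illustration.

```python
# The update w_i <- w_i + alpha * (target - Q_w(s,a)) * f_i(s,a) is a single
# gradient-descent step on 1/2 * (target - w . f)^2, treating target as constant.
# All numbers here are invented for illustration.

ALPHA = 0.1
w = [0.5, -0.2]          # current weights
f = [1.0, 2.0]           # feature values f_i(s, a) for one transition
target = 1.5             # stands in for r + gamma * max_a' Q(s', a')

q_sa = sum(wi * fi for wi, fi in zip(w, f))      # current prediction w . f
error = target - q_sa

# TD-style update from the "Function Approximation" slide
w_td = [wi + ALPHA * error * fi for wi, fi in zip(w, f)]

# Gradient of 1/2 * (target - w . f)^2 w.r.t. w_i is -(error) * f_i,
# so a gradient-descent step gives exactly the same new weights.
w_gd = [wi - ALPHA * (-(error) * fi) for wi, fi in zip(w, f)]

assert all(abs(a - b) < 1e-12 for a, b in zip(w_td, w_gd))
```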
