10/20/2009

Introduction to Artificial Intelligence
V22.0472-001 Fall 2009
Lecture 11: Reinforcement Learning 2
Rob Fergus – Dept of Computer Science, Courant Institute, NYU
Slides from Alan Fern, Daniel Weld, Dan Klein, John DeNero

Announcements
• Assignment 2 due next Monday at midnight
• Please send email to me about the final exam

Last Time: Q-Learning
• In realistic situations, we cannot possibly learn about every single state!
  • Too many states to visit them all in training
  • Too many states to hold the Q-tables in memory
• Instead, we want to generalize:
  • Learn about some small number of training states from experience
  • Generalize that experience to new, similar states
• This is a fundamental idea in machine learning, and we'll see it over and over again

Example: Pacman
• Let's say we discover through experience that this state is bad
• In naïve Q-learning, we know nothing about this state or its Q-states
• Or even this one!

Function Approximation
• Never enough training data!
  • Must generalize what is learned from one situation to other "similar" new situations
• Idea:
  • Instead of using a large table to represent V or Q, use a parameterized function
    • The number of parameters should be small compared to the number of states (generally exponentially fewer parameters)
  • Learn parameters from experience
  • When we update the parameters based on observations in one state, our V or Q estimate will also change for other similar states
    • I.e. the parameterization facilitates generalization of experience

Feature-Based Representations
• Solution: describe a state using a vector of features
  • Features are functions from states to real numbers (often 0/1) that capture important properties of the state
• Example features:
  • Distance to closest ghost
  • Distance to closest dot
  • Number of ghosts
  • 1 / (distance to dot)^2
  • Is Pacman in a tunnel? (0/1)
  • … etc.
• Can also describe a Q-state (s, a) with features (e.g. action moves closer to food)
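To make the feature idea concrete, here is a minimal sketch (not from the lecture) of a feature extractor for a Pacman-like state. The state attributes (`pacman`, `ghosts`, `food`) and the particular feature names are assumptions for illustration only.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def features(state):
    """Map a state to a small dictionary of real-valued features.

    Assumes a hypothetical state object with .pacman (position),
    .ghosts (list of positions), and .food (list of positions).
    """
    ghost_dists = [manhattan(state.pacman, g) for g in state.ghosts]
    food_dists = [manhattan(state.pacman, f) for f in state.food]
    closest_food = min(food_dists) if food_dists else 0
    return {
        "bias": 1.0,
        "dist-to-closest-ghost": min(ghost_dists) if ghost_dists else 0.0,
        "num-ghosts": float(len(state.ghosts)),
        "inv-sq-dist-to-food": 1.0 / (closest_food ** 2) if closest_food else 0.0,
    }
```

Because the feature vector is tiny compared to the number of board configurations, any experience gathered in one state immediately informs every other state with similar feature values.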
Linear Function Approximation
• Define a set of features f_1(s), …, f_n(s)
  • The features are used as our representation of states
  • States with similar feature values will be treated similarly
  • More complex functions require more complex features

  V̂_θ(s) = θ_0 + θ_1·f_1(s) + θ_2·f_2(s) + … + θ_n·f_n(s)

• Our goal is to learn good parameter values (i.e. feature weights) that approximate the value function well
• How can we do this?
  • Use TD-based RL and somehow update parameters based on each experience

TD-based RL for Linear Approximators
1. Start with initial parameter values
2. Take action according to an explore/exploit policy
3. Update estimated model
4. Perform TD update for each parameter
     θ_i ← ?
5. Goto 2
What is a "TD update" for a parameter?

Aside: Gradient Descent
• Given a function f(θ_1, …, θ_n) of n real values θ = (θ_1, …, θ_n), suppose we want to minimize f with respect to θ
• A common approach to doing this is gradient descent
• The gradient of f at point θ, denoted by ∇_θ f(θ), is an n-dimensional vector that points in the direction where f increases most steeply at point θ
• Vector calculus tells us that ∇_θ f(θ) is just the vector of partial derivatives

  ∇_θ f(θ) = [ ∂f(θ)/∂θ_1, …, ∂f(θ)/∂θ_n ]

  where ∂f(θ)/∂θ_i = lim_{ε→0} [ f(θ_1, …, θ_{i−1}, θ_i + ε, θ_{i+1}, …, θ_n) − f(θ) ] / ε

• We can decrease f by moving in the negative gradient direction

Aside: Gradient Descent for Squared Error
• Suppose that we have a sequence of states and a target value for each state
    ⟨s_1, v(s_1)⟩, ⟨s_2, v(s_2)⟩, …
  • E.g. produced by the TD-based RL loop
• Our goal is to minimize the squared error between our estimated function and each target value:

  E_j(θ) = ½ ( V̂_θ(s_j) − v(s_j) )²

  where E_j(θ) is the squared error of example j, V̂_θ(s_j) is our estimated value for the j'th state, and v(s_j) is the target value for the j'th state
• After seeing the j'th state, the gradient descent rule tells us that we can decrease the error by updating the parameters by:

  θ_i ← θ_i − α · ∂E_j/∂θ_i = θ_i − α · (∂E_j/∂V̂_θ(s_j)) · (∂V̂_θ(s_j)/∂θ_i)

  where α is the learning rate

Aside: continued
• The update above becomes

  θ_i ← θ_i − α · ( V̂_θ(s_j) − v(s_j) ) · ∂V̂_θ(s_j)/∂θ_i

  where ∂V̂_θ(s_j)/∂θ_i depends on the form of the approximator
• For a linear approximation function

  V̂_θ(s) = θ_1·f_1(s) + θ_2·f_2(s) + … + θ_n·f_n(s)

  we have ∂V̂_θ(s_j)/∂θ_i = f_i(s_j)
• Thus the update becomes:

  θ_i ← θ_i + α · ( v(s_j) − V̂_θ(s_j) ) · f_i(s_j)

• For linear functions this update is guaranteed to converge to the best approximation for a suitable learning-rate schedule

TD-based RL for Linear Approximators
1. Start with initial parameter values
2. Take action according to an explore/exploit policy, transitioning from s to s'
3. Update estimated model
4. Perform TD update for each parameter
     θ_i ← θ_i + α · ( v(s) − V̂_θ(s) ) · f_i(s)
5. Goto 2
• What should we use for the "target value" v(s)?
  • Use the TD prediction based on the next state s':
      v(s) = R(s) + β·V̂_θ(s')
  • This is the same as the previous TD method, only with approximation
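The loop above fits in a few lines of code. Below is a minimal sketch (not from the lecture) that applies the TD update θ_i ← θ_i + α·(R(s) + β·V̂_θ(s') − V̂_θ(s))·f_i(s) to all parameters at once; `features(s)` is assumed to return a fixed-length NumPy vector, and `env.reset`, `env.step`, and `policy` are hypothetical environment/policy hooks.

```python
import numpy as np

def v_hat(theta, f_s):
    """Linear value estimate: V_theta(s) = theta . f(s)."""
    return float(np.dot(theta, f_s))

def td_linear(env, policy, features, n_features,
              alpha=0.01, beta=0.9, episodes=1000):
    theta = np.zeros(n_features)              # 1. initial parameter values
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s, theta)               # 2. explore/exploit action
            s_next, reward, done = env.step(a)  # transition from s to s'
            f_s = features(s)
            # TD target: v(s) = R(s) + beta * V_theta(s'), with V(terminal) = 0
            target = reward + (0.0 if done else beta * v_hat(theta, features(s_next)))
            td_error = target - v_hat(theta, f_s)
            theta += alpha * td_error * f_s    # 4. TD update for every theta_i
            s = s_next
    return theta
```

Because the approximator is linear, the gradient of V̂_θ(s) with respect to θ is just the feature vector f(s), which is why the whole parameter update is a single vector operation.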
TD-based RL for Linear Approximators
1. Start with initial parameter values
2. Take action according to an explore/exploit policy, transitioning from s to s'
3. Update estimated model
4. Perform TD update for each parameter
     θ_i ← θ_i + α · ( R(s) + β·V̂_θ(s') − V̂_θ(s) ) · f_i(s)
5. Goto 2
• Step 2 requires a model to select the action
  • For applications such as Backgammon it is easy to get a simulation-based model
  • For others it is difficult to get a good model
• But we can do the same thing for model-free Q-learning

Q-learning with Linear Approximators

  Q̂_θ(s, a) = θ_0 + θ_1·f_1(s, a) + θ_2·f_2(s, a) + … + θ_n·f_n(s, a)

  Features are a function of states and actions.

1. Start with initial parameter values
2. Take action a according to an explore/exploit policy, transitioning from s to s'
3. Perform TD update for each parameter
     θ_i ← θ_i + α · ( R(s) + β·max_{a'} Q̂_θ(s', a') − Q̂_θ(s, a) ) · f_i(s, a)
4. Goto 2
• For both Q and V, these algorithms converge to the closest linear approximation to the optimal Q or V
• (A code sketch of this update appears at the end of this section)

Example: Tactical Battles in Wargus
• Wargus is a real-time strategy (RTS) game
  • Tactical battles are a key aspect of the game
  [Figures: screenshots of 10 vs. 10 and 5 vs. 5 battles]
• RL task: learn a policy to control n friendly agents in a battle against m enemy agents
  • Policy should be applicable to tasks with different sets and numbers of agents

Example: Tactical Battles in Wargus
• States: contain information about the locations, health, and current activity of all friendly and enemy agents
• Actions: Attack(F, E)
  • Causes friendly agent F to attack enemy E
• Policy: represented via a Q-function Q(s, Attack(F, E))
  • Each decision cycle, loop through each friendly agent F and select the enemy E to attack that maximizes Q(s, Attack(F, E))
  • Q(s, Attack(F, E)) generalizes over any friendly and enemy agents F and E
• We used a linear function approximator with Q-learning

Example: Tactical Battles in Wargus

  Q̂_θ(s, a) = θ_1·f_1(s, a) + θ_2·f_2(s, a) + … + θ_n·f_n(s, a)

• Engineered a set of relational features {f_1(s, Attack(F, E)), …, f_n(s, Attack(F, E))}
• Example features:
  • Number of other friendly agents that are currently attacking E
  • Health of friendly agent F
  • Health of enemy agent E
  • Difference in health values
  • Walking distance between F and E
  • Is E the enemy agent that F is currently attacking?
  • Is F the closest friendly agent to E?
  • Is E the closest enemy agent to F?
  • …
• Features are well defined for any number of agents

Example: Tactical Battles in Wargus
• Linear Q-learning in a 5 vs. 5 battle
  [Figure: learning curve of damage differential (roughly −100 to 700) versus training episodes]
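As referenced above, here is a minimal sketch (not the Wargus implementation) of model-free Q-learning with a linear approximator and an ε-greedy explore/exploit policy. It implements the update θ_i ← θ_i + α·(R(s) + β·max_{a'} Q̂_θ(s', a') − Q̂_θ(s, a))·f_i(s, a); `features(s, a)` (returning a NumPy vector) and the `env.reset`, `env.step`, `env.legal_actions` interfaces are hypothetical.

```python
import numpy as np

def q_hat(theta, features, s, a):
    """Linear Q estimate: Q_theta(s, a) = theta . f(s, a)."""
    return float(np.dot(theta, features(s, a)))

def q_learning_linear(env, features, n_features, alpha=0.01, beta=0.9,
                      epsilon=0.1, episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)                  # 1. initial parameter values
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            actions = env.legal_actions(s)
            # 2. epsilon-greedy explore/exploit action selection
            if rng.random() < epsilon:
                a = actions[rng.integers(len(actions))]
            else:
                a = max(actions, key=lambda act: q_hat(theta, features, s, act))
            s_next, reward, done = env.step(a)    # transition from s to s'
            # TD target uses the greedy value of the next state (0 at terminal states)
            if done:
                target = reward
            else:
                target = reward + beta * max(q_hat(theta, features, s_next, a2)
                                             for a2 in env.legal_actions(s_next))
            td_error = target - q_hat(theta, features, s, a)
            theta += alpha * td_error * features(s, a)   # 3. TD update for every theta_i
            s = s_next
    return theta
```

No model of the game is needed: the max over next-state actions is computed from the learned Q̂_θ itself, which is what makes this variant usable when a simulator like the one available for Backgammon is hard to obtain.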