Approximate Q-Learning (11/9/16)
Dan Weld / University of Washington
[Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley – materials available at http://ai.berkeley.edu.]

Q Learning
For all s, a: initialize Q(s, a) = 0
Repeat forever:
  Observe the current state s
  Choose some action a
  Execute it in the real world; observe (s, a, r, s’)
  Do update:  Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_a’ Q(s’, a’) ]
  Equivalently:  Q(s, a) ← Q(s, a) + α [ r + γ max_a’ Q(s’, a’) − Q(s, a) ]
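Below is a minimal sketch of this loop in Python. The `env` object with `reset()` and `step(action)` methods, and the fixed `actions` list, are hypothetical interfaces used only for illustration:

```python
import random
from collections import defaultdict

def tabular_q_learning(env, actions, episodes=1000,
                       alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q starts at 0 for every (s, a)."""
    Q = defaultdict(float)                      # Q[(s, a)] defaults to 0

    def best_action(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()                         # where are you? s
        done = False
        while not done:
            # choose some action a (epsilon-greedy exploration)
            a = random.choice(actions) if random.random() < epsilon else best_action(s)
            s2, r, done = env.step(a)           # execute it: (s, a, r, s')
            target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
            # do update: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```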
Example: Pacman
Let’s say we discover through experience that this state is bad:
[image: Pacman adjacent to two ghosts]
In naive Q-learning, we know nothing about this similar state:
[image: a nearly identical state]
Or even this one!
[image: another nearly identical state]
Q-learning, no features, 50 learning trials:
[video demo]
Q-learning, no features, 1000 learning trials:
[video demo]
Feature-Based Representations
Solution: describe states with a vector of features (aka “properties”)
– Features are functions from states to R (often 0/1) capturing important properties of the state
– Examples (see the feature-extractor sketch after these slides):
  • Distance to closest ghost or dot
  • Number of ghosts
  • 1 / (distance to dot)²
  • Is Pacman in a tunnel? (0/1)
  • Is the state the exact state on this slide?
  • … etc.
– Can also describe a q-state (s, a) with features (e.g., does the action move closer to food?)

How to Use Features?
Using features we can represent V and/or Q as follows:
  V(s) = g(f1(s), f2(s), …, fn(s))
  Q(s, a) = g(f1(s, a), f2(s, a), …, fn(s, a))
What should we use for g? (and f?)
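As an illustration, here is a minimal feature-extractor sketch in Python. The `GameState`-style interface (`next_state`, `pacman_position`, `ghost_positions`, `food_positions`) is a hypothetical stand-in, not the real Berkeley Pacman API:

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def extract_features(state, action):
    """Map a q-state (s, a) to a small dict of numeric features f_i(s, a)."""
    nxt = state.next_state(action)              # hypothetical: state after taking `action`
    pac = nxt.pacman_position()
    ghost_dists = [manhattan(pac, g) for g in nxt.ghost_positions()]
    food_dists = [manhattan(pac, f) for f in nxt.food_positions()]
    return {
        "bias": 1.0,
        "dist-to-closest-ghost": min(ghost_dists) if ghost_dists else 0.0,
        "inverse-dist-to-dot": 1.0 / (min(food_dists) ** 2)
                               if food_dists and min(food_dists) > 0 else 0.0,
        "num-ghosts": float(len(ghost_dists)),
        "ghost-one-step-away": 1.0 if ghost_dists and min(ghost_dists) <= 1 else 0.0,
    }
```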
Linear Combination
• Using a feature representation, we can write a q function (or value function) for any state using a few weights:
    V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
    Q(s, a) = w1 f1(s, a) + w2 f2(s, a) + … + wn fn(s, a)
• Advantage: our experience is summed up in a few powerful numbers
• Disadvantage: states sharing features may actually have very different values!

Approximate Q-Learning
• Q-learning with linear Q-functions:
    difference = [ r + γ max_a’ Q(s’, a’) ] − Q(s, a)
    Exact Q’s:        Q(s, a) ← Q(s, a) + α · difference
    Approximate Q’s:  wi ← wi + α · difference · fi(s, a)
• Intuitive interpretation:
  – Adjust weights of active features
  – E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features
• Formal justification: in a few slides! (A runnable sketch of this weight update follows below.)
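A minimal sketch of the approximate update, building on the hypothetical `extract_features` above; the weights live in a dict keyed by feature name:

```python
from collections import defaultdict

class LinearQAgent:
    """Approximate Q-learning with a linear Q-function over named features."""

    def __init__(self, feature_fn, actions, alpha=0.01, gamma=0.9):
        self.feature_fn = feature_fn        # maps (s, a) -> {name: value}
        self.actions = actions
        self.alpha = alpha
        self.gamma = gamma
        self.w = defaultdict(float)         # weights start at 0

    def q_value(self, s, a):
        # Q(s, a) = sum_i w_i * f_i(s, a)
        return sum(self.w[name] * v for name, v in self.feature_fn(s, a).items())

    def update(self, s, a, r, s2, done=False):
        # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
        future = 0.0 if done else max(self.q_value(s2, a2) for a2 in self.actions)
        difference = (r + self.gamma * future) - self.q_value(s, a)
        # w_i <- w_i + alpha * difference * f_i(s, a): adjust weights of active features
        for name, v in self.feature_fn(s, a).items():
            self.w[name] += self.alpha * difference * v
```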
Example: Pacman Features
  Q(s, a) = w1 · f_DOT(s, a) + w2 · f_GST(s, a)
  f_DOT(s, a) = distance to closest food after taking action a
  f_DOT(s, NORTH) = 0.5
  f_GST(s, a) = distance to closest ghost after taking action a
  f_GST(s, NORTH) = 1.0

Example: Q-Pacman
  α = 0.004
[Demo: approximate Q-learning Pacman]
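The concrete numbers in the Q-Pacman example live in the slide images rather than the text, so the following worked update is a sketch with assumed values: initial weights w1 = 4.0, w2 = −1.0, an observed reward r = −500 (Pacman is eaten), and a terminal next state so max_a’ Q(s’, a’) = 0.

```latex
% Assumed values, not from the transcript: w1 = 4.0, w2 = -1.0, r = -500, Q(s', .) = 0
\begin{align*}
Q(s,\mathrm{NORTH}) &= 4.0 \cdot 0.5 + (-1.0) \cdot 1.0 = 1.0 \quad\text{(prediction)}\\
\text{difference} &= \bigl[\, r + \gamma \max_{a'} Q(s',a') \,\bigr] - Q(s,\mathrm{NORTH})
                   = [-500 + 0] - 1.0 = -501\\
w_1 &\leftarrow 4.0 + 0.004 \cdot (-501) \cdot f_{\mathrm{DOT}}(s,\mathrm{NORTH})
     = 4.0 - 1.002 \approx 3.0\\
w_2 &\leftarrow -1.0 + 0.004 \cdot (-501) \cdot f_{\mathrm{GST}}(s,\mathrm{NORTH})
     = -1.0 - 2.004 \approx -3.0
\end{align*}
```

One bad experience shifts both weights sharply, and every state sharing those features is now scored lower.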
Video of Demo: Approximate Q-Learning – Pacman

Sidebar: Q-Learning and Least Squares
Linear Approximation: Regression
[plots: fitting a line to data points in one and two feature dimensions]
  Prediction: ŷ = w0 + w1 f1(x)
  Prediction: ŷ = w0 + w1 f1(x) + w2 f2(x)

Optimization: Least Squares
[plot: observations, predictions, and the error or “residual” between them]
  total error = Σi ( yi − ŷi )² = Σi ( yi − Σk wk fk(xi) )²
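As a concrete sketch, here is one gradient step that reduces this squared error for a single data point, using made-up feature values and a made-up target; it shows that the step reduces to the familiar “α · (target − prediction) · f” form:

```python
alpha = 0.1
f = {"bias": 1.0, "dist": 0.5}      # made-up features f(x)
w = {"bias": 0.2, "dist": -0.4}     # current weights
y = 2.0                             # made-up target value

prediction = sum(w[k] * f[k] for k in f)    # 0.0 for these numbers
error = y - prediction                      # target minus prediction = 2.0

# One gradient-descent step on 1/2 * (y - w.f)^2 is exactly:
for k in f:
    w[k] += alpha * error * f[k]

print(w)   # roughly {'bias': 0.4, 'dist': -0.3} for these numbers
```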
Minimizing Error
Imagine we had only one point x, with features f(x), target value y, and weights w:
  error(w) = ½ ( y − Σk wk fk(x) )²
  ∂error(w)/∂wm = −( y − Σk wk fk(x) ) fm(x)
  wm ← wm + α ( y − Σk wk fk(x) ) fm(x)
Approximate q update explained:
  wm ← wm + α [ ( r + γ max_a’ Q(s’, a’) ) − Q(s, a) ] fm(s, a)
                        “target”              “prediction”

Overfitting: Why Limiting Capacity Can Help
[plot: a degree-15 polynomial oscillating wildly between its data points]
Simple Problem
Given: features of the current state
Predict: will Pacman die on the next step?

Just one feature. See a pattern?
• Ghost one step away, Pacman dies
• Ghost one step away, Pacman dies
• Ghost one step away, Pacman dies
• Ghost one step away, Pacman dies
• Ghost one step away, Pacman lives
• Ghost more than one step away, Pacman lives
• Ghost more than one step away, Pacman lives
• Ghost more than one step away, Pacman lives
• Ghost more than one step away, Pacman lives
• Ghost more than one step away, Pacman lives
• Ghost more than one step away, Pacman lives
Learn: ghost one step away → Pacman dies!
What if we add more features?
• Ghost one step away, score 211, Pacman dies
• Ghost one step away, score 341, Pacman dies
• Ghost one step away, score 231, Pacman dies
• Ghost one step away, score 121, Pacman dies
• Ghost one step away, score 301, Pacman lives
• Ghost more than one step away, score 205, Pacman lives
• Ghost more than one step away, score 441, Pacman lives
• Ghost more than one step away, score 219, Pacman lives
• Ghost more than one step away, score 199, Pacman lives
• Ghost more than one step away, score 331, Pacman lives
• Ghost more than one step away, score 251, Pacman lives
Learn: ghost one step away AND score is NOT a prime number → Pacman dies!

There’s fitting, and there’s…
[plot: a degree-1 polynomial fit to the data points]
There’s fitting, and there’s…
[plot: a degree-2 polynomial fit to the same points]

Overfitting
[plot: a degree-15 polynomial that passes through every point but oscillates wildly]
Approximating the Q Function
• Linear approximation
• Could also use a deep neural network
  – https://www.nervanasys.com/demystifying-deep-reinforcement-learning/
[diagram: a network mapping the state to Q(s, a) for each action]

DeepMind Atari
https://www.youtube.com/watch?v=V1eYniJ0Rnk
DQN Results on Atari
Slide adapted from David Silver

Approximating the Q Function
Linear approximation: Q is a weighted sum of the inputs f1(s, a), f2(s, a), …, fm(s, a)
Neural approximation (nonlinear): the inputs f1(s, a), …, fm(s, a) feed a hidden layer with sigmoid activation
  h(z) = 1 / (1 + e^(−z))
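A sketch of the two function forms side by side in plain NumPy; the layer sizes and weight names are illustrative, not taken from the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def q_linear(f, w):
    """Linear approximation: Q(s,a) = w . f(s,a)."""
    return float(np.dot(w, f))

def q_neural(f, W1, b1, w2, b2):
    """One hidden sigmoid layer: Q(s,a) = w2 . sigmoid(W1 f + b1) + b2."""
    h = sigmoid(W1 @ f + b1)
    return float(w2 @ h + b2)

# Example with m = 3 features and 4 hidden units (arbitrary sizes)
rng = np.random.default_rng(0)
f = np.array([0.5, 1.0, 0.0])                 # f1(s,a), f2(s,a), f3(s,a)
print(q_linear(f, w=np.array([4.0, -1.0, 0.0])))
print(q_neural(f, W1=rng.normal(size=(4, 3)), b1=np.zeros(4),
               w2=rng.normal(size=4), b2=0.0))
```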
Deep Representations
• A deep representation is a composition of many functions:
    x → h1 → … → hn → y → l,   with weights w1, …, wn
• Its gradient can be backpropagated by the chain rule:
    ∂l/∂hn = (∂y/∂hn)(∂l/∂y),  …,  ∂l/∂h1 = (∂h2/∂h1)(∂l/∂h2),  ∂l/∂x = (∂h1/∂x)(∂l/∂h1)
    ∂l/∂wi = (∂hi/∂wi)(∂l/∂hi)   for each layer’s weights
Slide adapted from David Silver

Multi-Layer Perceptron
• Multiple layers: inputs [X1, X2, X3] → hidden units → outputs [Y1, Y2]
• Feed forward
• Connected weights: each hidden unit j computes zj = Σi xi wij and activation a = 1 / (1 + e^(−z)); the output layer repeats this with weights wjk
• 1-of-N output
Training via Stochastic Gradient Descent
• Sample the gradient of the expected loss L(w) = E[l]:
    E[ ∂l/∂w ] = ∂L(w)/∂w,  so  ∂l/∂w  is an unbiased sample of  ∂L(w)/∂w
• Adjust w down the sampled gradient:
    Δw ∝ −∂l/∂w
Slide adapted from David Silver

Aka … Backpropagation
• Minimize the error of the calculated output
• Adjust the weights by gradient descent
• Procedure (see the sketch below):
  – Forward phase
  – Backpropagation of errors
  – For each sample, multiple epochs
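Continuing the NumPy sketch above, here is one stochastic-gradient/backpropagation step for a one-hidden-layer network, with hand-derived gradients rather than a framework; the target value `y` is whatever the learning rule supplies (for Q-learning, r + γ max_a’ Q(s’, a’)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(f, y, W1, b1, w2, b2, lr=0.01):
    """One SGD/backprop step on l = 1/2 (y - Q(f))^2 for a 1-hidden-layer net."""
    # Forward phase
    z = W1 @ f + b1
    h = sigmoid(z)
    q = w2 @ h + b2

    # Backpropagation of errors (chain rule, layer by layer)
    dq = -(y - q)                      # dl/dq
    dw2 = dq * h                       # dl/dw2
    db2 = dq
    dh = dq * w2                       # dl/dh
    dz = dh * h * (1.0 - h)            # sigmoid'(z) = h * (1 - h)
    dW1 = np.outer(dz, f)              # dl/dW1
    db1 = dz

    # Adjust w down the sampled gradient
    W1 -= lr * dW1
    b1 -= lr * db1
    w2 -= lr * dw2
    b2 -= lr * db2
    return W1, b1, w2, b2
```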
Weight Sharing
• A recurrent neural network shares weights between time-steps:
    … → ht → ht+1 → …,  with inputs xt, xt+1, outputs yt, yt+1, and the same weights w at every step
• A convolutional neural network shares weights between local regions of the input
Slide adapted from David Silver

Recap: Approx Q-Learning
• Optimal Q-values should obey the Bellman equation:
    Q*(s, a) = E_s’ [ r + γ max_a’ Q*(s’, a’) | s, a ]
• Treat the right-hand side, r + γ max_a’ Q(s’, a’, w), as a target
• Minimise the MSE loss by stochastic gradient descent:
    l = ( r + γ max_a’ Q(s’, a’, w) − Q(s, a, w) )²
• Converges to Q* using a table-lookup representation
• But diverges using neural networks due to:
  – Correlations between samples
  – Non-stationary targets
Slide adapted from David Silver
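A sketch of this loss and the usual semi-gradient update, assuming generic `q_fn(s, a, w)` and `grad_q(s, a, w)` callables (placeholders, not a real library API); note the target is treated as a constant when differentiating:

```python
def td_loss_and_update(q_fn, grad_q, w, s, a, r, s2, actions,
                       gamma=0.99, lr=1e-3, done=False):
    """Semi-gradient Q-learning step on l = (target - Q(s,a,w))^2."""
    # Target: r + gamma * max_a' Q(s', a', w); held fixed (no gradient through it)
    future = 0.0 if done else max(q_fn(s2, a2, w) for a2 in actions)
    target = r + gamma * future

    td_error = target - q_fn(s, a, w)
    loss = td_error ** 2

    # dl/dw = -2 * td_error * dQ(s,a,w)/dw; step down the gradient
    w = w + lr * 2.0 * td_error * grad_q(s, a, w)
    return w, loss
```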
Deep Q-Networks (DQN): Experience Replay
• To remove correlations, build a data-set from the agent’s own experience:
    s1, a1, r2, s2
    s2, a2, r3, s3
    s3, a3, r4, s4        →  store each transition (s, a, r, s’)
    …
    st, at, rt+1, st+1
• Sample experiences from the data-set and apply the update:
    l = ( r + γ max_a’ Q(s’, a’, w⁻) − Q(s, a, w) )²
• To deal with non-stationarity, the target parameters w⁻ are held fixed (a replay-buffer sketch follows these slides)
Slide adapted from David Silver

DQN in Atari
• End-to-end learning of values Q(s, a) from pixels s
• Input state s is a stack of raw pixels from the last 4 frames
• Output is Q(s, a) for 18 joystick/button positions
• Reward is the change in score for that step
• Network architecture and hyperparameters fixed across all games
Slide adapted from David Silver
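A minimal replay-buffer sketch illustrating both ideas: transitions are stored and sampled out of order, and the target uses a separate, periodically copied parameter set w⁻. The `q_fn`, `update_fn`, and `copy_params` hooks are placeholders, not DeepMind’s actual implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

def dqn_training_loop(env, actions, q_fn, update_fn, copy_params, params,
                      episodes=100, batch_size=32, gamma=0.99, target_sync=1000):
    target_params = copy_params(params)       # w-: held fixed between syncs
    buffer, step = ReplayBuffer(), 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = max(actions, key=lambda act: q_fn(s, act, params))  # greedy for brevity
            s2, r, done = env.step(a)
            buffer.add(s, a, r, s2, done)
            if len(buffer) >= batch_size:
                batch = buffer.sample(batch_size)
                # Minimise (r + gamma * max_a' Q(s',a',w-) - Q(s,a,w))^2 on the batch
                params = update_fn(params, target_params, batch, gamma)
            step += 1
            if step % target_sync == 0:
                target_params = copy_params(params)   # refresh w- periodically
            s = s2
    return params
```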
DeepMind Resources
See also: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

That’s All for Reinforcement Learning!
Data (experiences with environment) → Reinforcement Learning Agent → Policy (how to act in the future)
• Very tough problem: how to perform any task well in an unknown, noisy environment!
• Traditionally used mostly for robotics, but…
  – Google DeepMind: RL applied to data-center power usage
That’s All for Reinforcement Learning!
Data (experiences with environment) → Reinforcement Learning Agent → Policy (how to act in the future)
Lots of open research areas:
– How to best balance exploration and exploitation?
– How to deal with cases where we don’t know a good state/feature representation?

Conclusion
• We’re done with Part I: Search and Planning!
• We’ve seen how AI methods can solve problems in:
  – Search
  – Constraint Satisfaction Problems
  – Games
  – Markov Decision Problems
  – Reinforcement Learning
• Next up, Part II: Uncertainty and Learning!