Approximate Q-Learning (11/9/16)
Dan Weld / University of Washington
[Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley – materials available at http://ai.berkeley.edu.]

Q Learning
For all s, a: initialize Q(s, a) = 0
Repeat forever:
  Observe the current state s
  Choose some action a
  Execute it in the real world; observe (s, a, r, s’)
  Do update:  Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_a’ Q(s’, a’) ]
  Equivalently:  Q(s, a) ← Q(s, a) + α [ r + γ max_a’ Q(s’, a’) − Q(s, a) ]
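Below is a minimal sketch of this loop in Python. The `env` object with `reset()` and `step(action)` methods, and the fixed `actions` list, are hypothetical interfaces used only for illustration:

```python
import random
from collections import defaultdict

def tabular_q_learning(env, actions, episodes=1000,
                       alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q starts at 0 for every (s, a)."""
    Q = defaultdict(float)                      # Q[(s, a)] defaults to 0

    def best_action(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()                         # where are you? s
        done = False
        while not done:
            # choose some action a (epsilon-greedy exploration)
            a = random.choice(actions) if random.random() < epsilon else best_action(s)
            s2, r, done = env.step(a)           # execute it: (s, a, r, s')
            target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
            # do update: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```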
Example: Pacman
Let’s say we discover through experience that this state is bad:
[image: Pacman adjacent to two ghosts]
In naive Q-learning, we know nothing about this similar state:
[image: a nearly identical state]
Or even this one!
[image: another nearly identical state]
Q-learning, no features, 50 learning trials:
[video demo]
Q-learning, no features, 1000 learning trials:
[video demo]
Feature-Based Representations
Solution: describe states with a vector of features (aka “properties”)
– Features are functions from states to R (often 0/1) capturing important properties of the state
– Examples (see the feature-extractor sketch after these slides):
  • Distance to closest ghost or dot
  • Number of ghosts
  • 1 / (distance to dot)²
  • Is Pacman in a tunnel? (0/1)
  • Is the state the exact state on this slide?
  • … etc.
– Can also describe a q-state (s, a) with features (e.g., does the action move closer to food?)

How to Use Features?
Using features we can represent V and/or Q as follows:
  V(s) = g(f1(s), f2(s), …, fn(s))
  Q(s, a) = g(f1(s, a), f2(s, a), …, fn(s, a))
What should we use for g? (and f?)
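As an illustration, here is a minimal feature-extractor sketch in Python. The `GameState`-style interface (`next_state`, `pacman_position`, `ghost_positions`, `food_positions`) is a hypothetical stand-in, not the real Berkeley Pacman API:

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def extract_features(state, action):
    """Map a q-state (s, a) to a small dict of numeric features f_i(s, a)."""
    nxt = state.next_state(action)              # hypothetical: state after taking `action`
    pac = nxt.pacman_position()
    ghost_dists = [manhattan(pac, g) for g in nxt.ghost_positions()]
    food_dists = [manhattan(pac, f) for f in nxt.food_positions()]
    return {
        "bias": 1.0,
        "dist-to-closest-ghost": min(ghost_dists) if ghost_dists else 0.0,
        "inverse-dist-to-dot": 1.0 / (min(food_dists) ** 2)
                               if food_dists and min(food_dists) > 0 else 0.0,
        "num-ghosts": float(len(ghost_dists)),
        "ghost-one-step-away": 1.0 if ghost_dists and min(ghost_dists) <= 1 else 0.0,
    }
```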
Linear Combination
• Using a feature representation, we can write a q function (or value function) for any state using a few weights:
    V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
    Q(s, a) = w1 f1(s, a) + w2 f2(s, a) + … + wn fn(s, a)
• Advantage: our experience is summed up in a few powerful numbers
• Disadvantage: states sharing features may actually have very different values!

Approximate Q-Learning
• Q-learning with linear Q-functions:
    difference = [ r + γ max_a’ Q(s’, a’) ] − Q(s, a)
    Exact Q’s:        Q(s, a) ← Q(s, a) + α · difference
    Approximate Q’s:  wi ← wi + α · difference · fi(s, a)
• Intuitive interpretation:
  – Adjust weights of active features
  – E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features
• Formal justification: in a few slides! (A runnable sketch of this weight update follows below.)
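A minimal sketch of the approximate update, building on the hypothetical `extract_features` above; the weights live in a dict keyed by feature name:

```python
from collections import defaultdict

class LinearQAgent:
    """Approximate Q-learning with a linear Q-function over named features."""

    def __init__(self, feature_fn, actions, alpha=0.01, gamma=0.9):
        self.feature_fn = feature_fn        # maps (s, a) -> {name: value}
        self.actions = actions
        self.alpha = alpha
        self.gamma = gamma
        self.w = defaultdict(float)         # weights start at 0

    def q_value(self, s, a):
        # Q(s, a) = sum_i w_i * f_i(s, a)
        return sum(self.w[name] * v for name, v in self.feature_fn(s, a).items())

    def update(self, s, a, r, s2, done=False):
        # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
        future = 0.0 if done else max(self.q_value(s2, a2) for a2 in self.actions)
        difference = (r + self.gamma * future) - self.q_value(s, a)
        # w_i <- w_i + alpha * difference * f_i(s, a): adjust weights of active features
        for name, v in self.feature_fn(s, a).items():
            self.w[name] += self.alpha * difference * v
```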
Example: Pacman Features
  Q(s, a) = w1 · f_DOT(s, a) + w2 · f_GST(s, a)
  f_DOT(s, a) = distance to closest food after taking action a
  f_DOT(s, NORTH) = 0.5
  f_GST(s, a) = distance to closest ghost after taking action a
  f_GST(s, NORTH) = 1.0

Example: Q-Pacman
  α = 0.004
[Demo: approximate Q-learning Pacman]
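The concrete numbers in the Q-Pacman example live in the slide images rather than the text, so the following worked update is a sketch with assumed values: initial weights w1 = 4.0, w2 = −1.0, an observed reward r = −500 (Pacman is eaten), and a terminal next state so max_a’ Q(s’, a’) = 0.

```latex
% Assumed values, not from the transcript: w1 = 4.0, w2 = -1.0, r = -500, Q(s', .) = 0
\begin{align*}
Q(s,\mathrm{NORTH}) &= 4.0 \cdot 0.5 + (-1.0) \cdot 1.0 = 1.0 \quad\text{(prediction)}\\
\text{difference} &= \bigl[\, r + \gamma \max_{a'} Q(s',a') \,\bigr] - Q(s,\mathrm{NORTH})
                   = [-500 + 0] - 1.0 = -501\\
w_1 &\leftarrow 4.0 + 0.004 \cdot (-501) \cdot f_{\mathrm{DOT}}(s,\mathrm{NORTH})
     = 4.0 - 1.002 \approx 3.0\\
w_2 &\leftarrow -1.0 + 0.004 \cdot (-501) \cdot f_{\mathrm{GST}}(s,\mathrm{NORTH})
     = -1.0 - 2.004 \approx -3.0
\end{align*}
```

One bad experience shifts both weights sharply, and every state sharing those features is now scored lower.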
Video of Demo: Approximate Q-Learning – Pacman

Sidebar: Q-Learning and Least Squares
Linear Approximation: Regression
[plots: fitting a line to data points in one and two feature dimensions]
  Prediction: ŷ = w0 + w1 f1(x)
  Prediction: ŷ = w0 + w1 f1(x) + w2 f2(x)

Optimization: Least Squares
[plot: observations, predictions, and the error or “residual” between them]
  total error = Σi ( yi − ŷi )² = Σi ( yi − Σk wk fk(xi) )²
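As a concrete sketch, here is one gradient step that reduces this squared error for a single data point, using made-up feature values and a made-up target; it shows that the step reduces to the familiar “α · (target − prediction) · f” form:

```python
alpha = 0.1
f = {"bias": 1.0, "dist": 0.5}      # made-up features f(x)
w = {"bias": 0.2, "dist": -0.4}     # current weights
y = 2.0                             # made-up target value

prediction = sum(w[k] * f[k] for k in f)    # 0.0 for these numbers
error = y - prediction                      # target minus prediction = 2.0

# One gradient-descent step on 1/2 * (y - w.f)^2 is exactly:
for k in f:
    w[k] += alpha * error * f[k]

print(w)   # roughly {'bias': 0.4, 'dist': -0.3} for these numbers
```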
Minimizing Error
Imagine we had only one point x, with features f(x), target value y, and weights w:
  error(w) = ½ ( y − Σk wk fk(x) )²
  ∂error(w)/∂wm = −( y − Σk wk fk(x) ) fm(x)
  wm ← wm + α ( y − Σk wk fk(x) ) fm(x)
Approximate q update explained:
  wm ← wm + α [ ( r + γ max_a’ Q(s’, a’) ) − Q(s, a) ] fm(s, a)
                        “target”              “prediction”

Overfitting: Why Limiting Capacity Can Help
[plot: a degree-15 polynomial oscillating wildly between its data points]
Simple Problem
Given: features of the current state
Predict: will Pacman die on the next step?

Just one feature. See a pattern?
• Ghost one step away, Pacman dies
• Ghost one step away, Pacman dies
• Ghost one step away, Pacman dies
• Ghost one step away, Pacman dies
• Ghost one step away, Pacman lives
• Ghost more than one step away, Pacman lives
• Ghost more than one step away, Pacman lives
• Ghost more than one step away, Pacman lives
• Ghost more than one step away, Pacman lives
• Ghost more than one step away, Pacman lives
• Ghost more than one step away, Pacman lives
Learn: ghost one step away → Pacman dies!
What if we add more features?
• Ghost one step away, score 211, Pacman dies
• Ghost one step away, score 341, Pacman dies
• Ghost one step away, score 231, Pacman dies
• Ghost one step away, score 121, Pacman dies
• Ghost one step away, score 301, Pacman lives
• Ghost more than one step away, score 205, Pacman lives
• Ghost more than one step away, score 441, Pacman lives
• Ghost more than one step away, score 219, Pacman lives
• Ghost more than one step away, score 199, Pacman lives
• Ghost more than one step away, score 331, Pacman lives
• Ghost more than one step away, score 251, Pacman lives
Learn: ghost one step away AND score is NOT a prime number → Pacman dies!

There’s fitting, and there’s…
[plot: a degree-1 polynomial fit to the data points]
There’s fitting, and there’s…
[plot: a degree-2 polynomial fit to the same points]

Overfitting
[plot: a degree-15 polynomial that passes through every point but oscillates wildly]
Approximating the Q Function
• Linear approximation
• Could also use a deep neural network
  – https://www.nervanasys.com/demystifying-deep-reinforcement-learning/
[diagram: a network mapping the state to Q(s, a) for each action]

DeepMind Atari
https://www.youtube.com/watch?v=V1eYniJ0Rnk
DQN Results on Atari
Slide adapted from David Silver

Approximating the Q Function
Linear approximation: Q is a weighted sum of the inputs f1(s, a), f2(s, a), …, fm(s, a)
Neural approximation (nonlinear): the inputs f1(s, a), …, fm(s, a) feed a hidden layer with sigmoid activation
  h(z) = 1 / (1 + e^(−z))
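A sketch of the two function forms side by side in plain NumPy; the layer sizes and weight names are illustrative, not taken from the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def q_linear(f, w):
    """Linear approximation: Q(s,a) = w . f(s,a)."""
    return float(np.dot(w, f))

def q_neural(f, W1, b1, w2, b2):
    """One hidden sigmoid layer: Q(s,a) = w2 . sigmoid(W1 f + b1) + b2."""
    h = sigmoid(W1 @ f + b1)
    return float(w2 @ h + b2)

# Example with m = 3 features and 4 hidden units (arbitrary sizes)
rng = np.random.default_rng(0)
f = np.array([0.5, 1.0, 0.0])                 # f1(s,a), f2(s,a), f3(s,a)
print(q_linear(f, w=np.array([4.0, -1.0, 0.0])))
print(q_neural(f, W1=rng.normal(size=(4, 3)), b1=np.zeros(4),
               w2=rng.normal(size=4), b2=0.0))
```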
Deep Representations
• A deep representation is a composition of many functions:
    x → h1 → … → hn → y → l,   with weights w1, …, wn
• Its gradient can be backpropagated by the chain rule:
    ∂l/∂hn = (∂y/∂hn)(∂l/∂y),  …,  ∂l/∂h1 = (∂h2/∂h1)(∂l/∂h2),  ∂l/∂x = (∂h1/∂x)(∂l/∂h1)
    ∂l/∂wi = (∂hi/∂wi)(∂l/∂hi)   for each layer’s weights
Slide adapted from David Silver

Multi-Layer Perceptron
• Multiple layers: inputs [X1, X2, X3] → hidden units → outputs [Y1, Y2]
• Feed forward
• Connected weights: each hidden unit j computes zj = Σi xi wij and activation a = 1 / (1 + e^(−z)); the output layer repeats this with weights wjk
• 1-of-N output
Training via Stochastic Gradient Descent
• Sample the gradient of the expected loss L(w) = E[l]:
    E[ ∂l/∂w ] = ∂L(w)/∂w,  so  ∂l/∂w  is an unbiased sample of  ∂L(w)/∂w
• Adjust w down the sampled gradient:
    Δw ∝ −∂l/∂w
Slide adapted from David Silver

Aka … Backpropagation
• Minimize the error of the calculated output
• Adjust the weights by gradient descent
• Procedure (see the sketch below):
  – Forward phase
  – Backpropagation of errors
  – For each sample, multiple epochs
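Continuing the NumPy sketch above, here is one stochastic-gradient/backpropagation step for a one-hidden-layer network, with hand-derived gradients rather than a framework; the target value `y` is whatever the learning rule supplies (for Q-learning, r + γ max_a’ Q(s’, a’)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(f, y, W1, b1, w2, b2, lr=0.01):
    """One SGD/backprop step on l = 1/2 (y - Q(f))^2 for a 1-hidden-layer net."""
    # Forward phase
    z = W1 @ f + b1
    h = sigmoid(z)
    q = w2 @ h + b2

    # Backpropagation of errors (chain rule, layer by layer)
    dq = -(y - q)                      # dl/dq
    dw2 = dq * h                       # dl/dw2
    db2 = dq
    dh = dq * w2                       # dl/dh
    dz = dh * h * (1.0 - h)            # sigmoid'(z) = h * (1 - h)
    dW1 = np.outer(dz, f)              # dl/dW1
    db1 = dz

    # Adjust w down the sampled gradient
    W1 -= lr * dW1
    b1 -= lr * db1
    w2 -= lr * dw2
    b2 -= lr * db2
    return W1, b1, w2, b2
```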
Weight Sharing
• A recurrent neural network shares weights between time-steps:
    … → ht → ht+1 → …,  with inputs xt, xt+1, outputs yt, yt+1, and the same weights w at every step
• A convolutional neural network shares weights between local regions of the input
Slide adapted from David Silver

Recap: Approx Q-Learning
• Optimal Q-values should obey the Bellman equation:
    Q*(s, a) = E_s’ [ r + γ max_a’ Q*(s’, a’) | s, a ]
• Treat the right-hand side, r + γ max_a’ Q(s’, a’, w), as a target
• Minimise the MSE loss by stochastic gradient descent:
    l = ( r + γ max_a’ Q(s’, a’, w) − Q(s, a, w) )²
• Converges to Q* using a table-lookup representation
• But diverges using neural networks due to:
  – Correlations between samples
  – Non-stationary targets
Slide adapted from David Silver
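A sketch of this loss and the usual semi-gradient update, assuming generic `q_fn(s, a, w)` and `grad_q(s, a, w)` callables (placeholders, not a real library API); note the target is treated as a constant when differentiating:

```python
def td_loss_and_update(q_fn, grad_q, w, s, a, r, s2, actions,
                       gamma=0.99, lr=1e-3, done=False):
    """Semi-gradient Q-learning step on l = (target - Q(s,a,w))^2."""
    # Target: r + gamma * max_a' Q(s', a', w); held fixed (no gradient through it)
    future = 0.0 if done else max(q_fn(s2, a2, w) for a2 in actions)
    target = r + gamma * future

    td_error = target - q_fn(s, a, w)
    loss = td_error ** 2

    # dl/dw = -2 * td_error * dQ(s,a,w)/dw; step down the gradient
    w = w + lr * 2.0 * td_error * grad_q(s, a, w)
    return w, loss
```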
Deep Q-Networks (DQN): Experience Replay
• To remove correlations, build a data-set from the agent’s own experience:
    s1, a1, r2, s2
    s2, a2, r3, s3
    s3, a3, r4, s4        →  store each transition (s, a, r, s’)
    …
    st, at, rt+1, st+1
• Sample experiences from the data-set and apply the update:
    l = ( r + γ max_a’ Q(s’, a’, w⁻) − Q(s, a, w) )²
• To deal with non-stationarity, the target parameters w⁻ are held fixed (a replay-buffer sketch follows these slides)
Slide adapted from David Silver

DQN in Atari
• End-to-end learning of values Q(s, a) from pixels s
• Input state s is a stack of raw pixels from the last 4 frames
• Output is Q(s, a) for 18 joystick/button positions
• Reward is the change in score for that step
• Network architecture and hyperparameters fixed across all games
Slide adapted from David Silver
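A minimal replay-buffer sketch illustrating both ideas: transitions are stored and sampled out of order, and the target uses a separate, periodically copied parameter set w⁻. The `q_fn`, `update_fn`, and `copy_params` hooks are placeholders, not DeepMind’s actual implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

def dqn_training_loop(env, actions, q_fn, update_fn, copy_params, params,
                      episodes=100, batch_size=32, gamma=0.99, target_sync=1000):
    target_params = copy_params(params)       # w-: held fixed between syncs
    buffer, step = ReplayBuffer(), 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = max(actions, key=lambda act: q_fn(s, act, params))  # greedy for brevity
            s2, r, done = env.step(a)
            buffer.add(s, a, r, s2, done)
            if len(buffer) >= batch_size:
                batch = buffer.sample(batch_size)
                # Minimise (r + gamma * max_a' Q(s',a',w-) - Q(s,a,w))^2 on the batch
                params = update_fn(params, target_params, batch, gamma)
            step += 1
            if step % target_sync == 0:
                target_params = copy_params(params)   # refresh w- periodically
            s = s2
    return params
```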
DeepMind Resources
See also: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

That’s All for Reinforcement Learning!
Data (experiences with environment) → Reinforcement Learning Agent → Policy (how to act in the future)
• Very tough problem: how to perform any task well in an unknown, noisy environment!
• Traditionally used mostly for robotics, but…
  – Google DeepMind: RL applied to data-center power usage
That’s All for Reinforcement Learning!
Data (experiences with environment) → Reinforcement Learning Agent → Policy (how to act in the future)
Lots of open research areas:
– How to best balance exploration and exploitation?
– How to deal with cases where we don’t know a good state/feature representation?

Conclusion
• We’re done with Part I: Search and Planning!
• We’ve seen how AI methods can solve problems in:
  – Search
  – Constraint Satisfaction Problems
  – Games
  – Markov Decision Problems
  – Reinforcement Learning
• Next up, Part II: Uncertainty and Learning!