CS885 Reinforcement Learning Lecture 4a: May 11, 2018 - Deep Neural Networks
  1. CS885 Reinforcement Learning Lecture 4a: May 11, 2018. Deep Neural Networks [GBC] Chap. 6, 7, 8. University of Waterloo, CS885 Spring 2018, Pascal Poupart

  2. Quick recap
  • Markov Decision Processes: value iteration
    V(s) ← max_a [ R(s, a) + γ ∑_{s'} Pr(s' | s, a) V(s') ]
  • Reinforcement Learning: Q-learning
    Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
  • Complexity depends on the number of states and actions
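A minimal sketch of the tabular Q-learning update recapped above, assuming a small discrete environment; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Illustrative usage with a tiny table (3 states, 2 actions)
Q = np.zeros((3, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```

The table has one entry per (state, action) pair, which is why the complexity of this approach grows with the number of states and actions.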

  3. Large State Spaces
  • Computer Go: 3^361 states
  • Inverted pendulum: (x, ẋ, θ, θ̇), a 4-dimensional continuous state space
  • Atari: 210×160×3 dimensions (pixel values)

  4. Functions to be Approximated
  • Policy: π(s) → a
  • Q-function: Q(s, a) ∈ ℝ
  • Value function: V(s) ∈ ℝ

  5. Q-function Approximation
  • Let s = (x_1, x_2, …, x_n)^T
  • Linear: Q(s, a) ≈ ∑_i w_{ai} x_i
  • Non-linear (e.g., neural network): Q(s, a) ≈ g(x; w)
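A small sketch of the two approximation schemes above, assuming the state is already given as a feature vector x and that there is one weight row (or one network output) per action; the feature values and layer sizes are made up for illustration:

```python
import numpy as np

def q_linear(x, a, W):
    """Linear approximation: Q(s,a) ~ sum_i w_{a,i} * x_i (one weight row per action)."""
    return W[a] @ x

def q_mlp(x, a, W1, b1, W2, b2):
    """Non-linear approximation: one-hidden-layer network producing a Q-value per action."""
    hidden = np.tanh(W1 @ x + b1)
    q_all = W2 @ hidden + b2   # vector of Q(s, .) over all actions
    return q_all[a]

x = np.array([0.2, -1.0, 0.5, 0.0])   # state features
W = np.random.randn(2, 4) * 0.1       # 2 actions, 4 features
print(q_linear(x, a=0, W=W))
```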

  6. Traditional Neural Network
  • Network of units (computational neurons) linked by weighted edges
  • Each unit computes: z = h(w^T x + b)
    – Inputs: x
    – Output: z
    – Weights (parameters): w
    – Bias: b
    – Activation function (usually non-linear): h
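The unit computation z = h(w^T x + b) in one line, with sigmoid as the activation; a sketch with illustrative names and values:

```python
import numpy as np

def unit(x, w, b, h=lambda a: 1.0 / (1.0 + np.exp(-a))):
    """One neuron: weighted sum of inputs plus bias, passed through activation h."""
    return h(w @ x + b)

print(unit(np.array([1.0, 2.0]), np.array([0.5, -0.3]), b=0.1))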

  7. One Hidden Layer Architecture
  • Feed-forward neural network
    [Diagram: input units feed a layer of hidden units via weights w^(1), which feed the output units via weights w^(2)]
  • Hidden units: z_j = h^(1)( (w_j^(1))^T x + b_j^(1) )
  • Output units: y_k = h^(2)( (w_k^(2))^T z + b_k^(2) )
  • Overall: y_k = h^(2)( ∑_j w_{kj}^(2) h^(1)( ∑_i w_{ji}^(1) x_i + b_j^(1) ) + b_k^(2) )
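A sketch of the overall forward pass defined above; the layer sizes, seed, and input values are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2, h1=sigmoid, h2=sigmoid):
    """Feed-forward pass through one hidden layer: z = h1(W1 x + b1), y = h2(W2 z + b2)."""
    z = h1(W1 @ x + b1)   # hidden units
    y = h2(W2 @ z + b2)   # output units
    return y, z

# Illustrative shapes: 2 inputs, 2 hidden units, 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((2, 2)), np.zeros(2)
W2, b2 = rng.standard_normal((2, 2)), np.zeros(2)
y, _ = forward(np.array([1.0, -1.0]), W1, b1, W2, b2)
```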

  8. Traditional activation functions h
  • Threshold: h(a) = 1 if a ≥ 0, −1 if a < 0
  • Sigmoid: h(a) = σ(a) = 1 / (1 + e^{−a})
  • Gaussian: h(a) = e^{−(a − μ)² / (2σ²)}
  • Tanh: h(a) = tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a})
  • Identity: h(a) = a
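The same activation functions written out directly; the Gaussian's mean and width are free parameters, set to 0 and 1 here only for illustration:

```python
import numpy as np

def threshold(a):                 return np.where(a >= 0, 1.0, -1.0)
def sigmoid(a):                   return 1.0 / (1.0 + np.exp(-a))
def gaussian(a, mu=0.0, s=1.0):   return np.exp(-0.5 * ((a - mu) / s) ** 2)
def tanh(a):                      return np.tanh(a)
def identity(a):                  return a
```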

  9. Universal function approximation
  • Theorem: Neural networks with at least one hidden layer of sufficiently many sigmoid/tanh/Gaussian units can approximate any function arbitrarily closely.
  • Picture: [figure omitted]

  10. Minimize least squared error
  • Minimize the error function
    E(W) = ½ ∑_n E_n(W)² = ½ ∑_n ( f(x_n; W) − y_n )²
    where f is the function encoded by the neural net
  • Train by gradient descent (a.k.a. backpropagation)
    – For each example (x_n, y_n), adjust the weights as follows:
      w_{jk} ← w_{jk} − η ∂E_n / ∂w_{jk}
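A minimal sketch of this per-example gradient descent, using a single linear unit so the gradient ∂E_n/∂w has a simple closed form (a full network would apply the chain rule layer by layer, i.e. backpropagation); the data, step size, and epoch count are illustrative:

```python
import numpy as np

def sgd_linear_unit(X, y, eta=0.05, epochs=500):
    """Minimize E(w) = 1/2 * sum_n (f(x_n; w) - y_n)^2 for f(x; w) = w @ x
    by per-example gradient steps."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            error = w @ x_n - y_n      # f(x_n; w) - y_n
            w -= eta * error * x_n     # dE_n/dw = (f(x_n; w) - y_n) * x_n
    return w

X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])   # last column acts as a bias input
w = sgd_linear_unit(X, np.array([2.0, 3.0, 4.0]))
print(w)   # approaches [1, 1], i.e. y = x + 1
```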

  11. Deep Neural Networks
  • Definition: neural network with many hidden layers
  • Advantage: high expressivity
  • Challenges:
    – How should we train a deep neural network?
    – How can we avoid overfitting?

  12. Mixture of Gaussians
  • Deep neural network (hierarchical mixture)
  • Shallow neural network (flat mixture)

  13. Image Classification
  • ImageNet Large Scale Visual Recognition Challenge
  [Chart: ILSVRC classification error (%) by year. Features + SVMs: 28.2 (2010), 25.8 (2011). Deep convolutional neural nets: 16.4 (2012, AlexNet), 11.7 (2013, ZFNet), 7.3 (2014, VGG), 6.7 (2014, GoogLeNet), 3.57 (2015, ResNet), 3.07 (2016), with network depth growing from 8 to 152 layers; human error is roughly 5.1]

  14. Vanishing Gradients
  • Deep neural networks of sigmoid and hyperbolic units often suffer from vanishing gradients
  [Figure: gradient magnitude shrinks from large near the output layer to small near the input layer]

  15. Sigmoid and hyperbolic units
  • Derivative is always less than 1
  [Plots: sigmoid and hyperbolic tangent functions]

  16. Simple Example
  • Chain of sigmoid units: y = σ(w_4 σ(w_3 σ(w_2 σ(w_1 x))))
    x → h_1 → h_2 → h_3 → y, with weights w_1, w_2, w_3, w_4
  • Common weight initialization in (−1, 1)
  • Sigmoid function and its derivative are always less than 1
  • This leads to vanishing gradients (see the numeric check after this slide):
    ∂y/∂w_4 = σ′(a_4) σ(a_3)
    ∂y/∂w_3 = σ′(a_4) w_4 σ′(a_3) σ(a_2) ≤ ∂y/∂w_4
    ∂y/∂w_2 = σ′(a_4) w_4 σ′(a_3) w_3 σ′(a_2) σ(a_1) ≤ ∂y/∂w_3
    ∂y/∂w_1 = σ′(a_4) w_4 σ′(a_3) w_3 σ′(a_2) w_2 σ′(a_1) x ≤ ∂y/∂w_2
    where a_i is the pre-activation of the i-th unit and h_i = σ(a_i)
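A quick numeric check of these chain-rule products: multiply σ′(a_i) and w_i along a chain of sigmoid units and watch the gradient shrink toward the input. The depth, seed, and input value are arbitrary; the weights are drawn from (−1, 1) as on the slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
depth = 10
w = rng.uniform(-1, 1, size=depth)   # common initialization in (-1, 1)

# Forward pass through a chain x -> h_1 -> ... -> h_depth
h, pre_activations = 1.0, []
for w_i in w:
    a_i = w_i * h
    h = sigmoid(a_i)
    pre_activations.append(a_i)

# Backward pass: gradient of the output w.r.t. each layer's pre-activation
grad = 1.0
for a_i, w_i in zip(reversed(pre_activations), reversed(w)):
    grad *= sigmoid(a_i) * (1 - sigmoid(a_i))   # sigma'(a_i) <= 0.25
    print(grad)                                 # shrinks as we move toward the input
    grad *= w_i                                 # |w_i| < 1 shrinks it further
```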

  17. Mitigating Vanishing Gradients
  • Some popular solutions:
    – Pre-training
    – Rectified linear units
    – Batch normalization
    – Skip connections

  18. Rectified Linear Units
  • Rectified linear: h(a) = max(0, a)
    – Gradient is 0 or 1
    – Sparse computation
  • Soft version ("softplus"): h(a) = log(1 + e^a)
    [Plots: softplus and rectified linear]
  • Warning: softplus does not prevent vanishing gradients (gradient < 1)
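The two activations and their gradients side by side, illustrating the warning above: the softplus gradient is the sigmoid, which stays strictly below 1, while the rectified linear gradient is exactly 0 or 1. A sketch with arbitrary test inputs:

```python
import numpy as np

def relu(a):            return np.maximum(0.0, a)
def relu_grad(a):       return (a > 0).astype(float)        # exactly 0 or 1
def softplus(a):        return np.log1p(np.exp(a))
def softplus_grad(a):   return 1.0 / (1.0 + np.exp(-a))     # sigmoid, always < 1

a = np.array([-2.0, 0.0, 2.0])
print(relu_grad(a), softplus_grad(a))
```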
