CS480/680 Lecture 15: June 26, 2019 - Deep Neural Networks [GBC]


  1. CS480/680 Lecture 15: June 26, 2019 - Deep Neural Networks [GBC], Chap. 6, 7, 8. University of Waterloo, CS480/680 Spring 2019, Pascal Poupart

  2. Outline
  • Deep Neural Networks
    – Gradient vanishing
      • Rectified linear units
    – Overfitting
      • Dropout
  • Breakthroughs
    – Acoustic modeling in speech recognition
    – Image recognition

  3. Deep Neural Networks
  • Definition: neural network with many hidden layers
  • Advantage: high expressivity
  • Challenges:
    – How should we train a deep neural network?
    – How can we avoid overfitting?

  4. Expressiveness
  • Neural networks with one hidden layer of sigmoid/hyperbolic units can approximate arbitrarily closely neural networks with several layers of sigmoid/hyperbolic units
  • However, as we increase the number of layers, the number of units needed may decrease exponentially (with the number of layers)

  5. Example – Parity Function
  • Single layer of hidden nodes:
    – y = 1 if the number of inputs set to 1 is odd, y = −1 if it is even
    – the construction uses one hidden unit per odd-sized subset of the n inputs, i.e. 2^(n−1) hidden units (exponential in n)
  [Figure: network with n inputs x_1, …, x_n, one hidden unit per odd subset, and a single output y]

  6. Example – Parity Function
  • With 2n − 2 layers of hidden nodes, only a constant number of units per layer is needed: the parity is accumulated one input at a time, each stage checking whether the number of 1s in the subset seen so far is odd
  • Output: y = 1 if the number of inputs set to 1 is odd, y = −1 if it is even
  [Figure: chain of small sub-networks, each combining one more input x_i with the parity of the previous subset]
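As a concrete illustration of this depth-versus-width trade-off, here is a minimal NumPy sketch (not from the slides) that computes parity with a chain of two-input XOR blocks built from threshold units, roughly 2(n − 1) small layers instead of one exponentially wide layer. The `step` unit and the specific weights are assumptions chosen for clarity.

```python
import numpy as np

def step(z):
    """Threshold unit: 1 if z > 0 else 0."""
    return (z > 0).astype(float)

def xor_block(a, b):
    """Two-input XOR built from one hidden layer of two threshold units."""
    h_or  = step(a + b - 0.5)        # fires if at least one input is 1
    h_and = step(a + b - 1.5)        # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # OR and not AND = XOR

def deep_parity(x):
    """Parity of n bits via a chain of n - 1 XOR blocks (~2(n - 1) layers of units)."""
    p = x[0]
    for xi in x[1:]:
        p = xor_block(p, xi)
    return 2 * p - 1                 # map {0, 1} -> {-1, +1} as in the slides

x = np.array([1.0, 0.0, 1.0, 1.0])  # three ones -> odd parity
print(deep_parity(x))                # 1.0
```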

  7. The power of depth (practice)
  • Challenge: how to train deep NNs?

  8. Speech
  • 2006 (Hinton et al.): first effective algorithm for training deep NNs
    – layerwise training of Stacked Restricted Boltzmann Machines (SRBMs)
  • 2009: breakthrough in acoustic modeling
    – replace Gaussian Mixture Models by SRBMs
    – improved speech recognition at Google, Microsoft, IBM
  • 2013-today: recurrent neural nets (LSTM)
    – Google error rate: 23% (2013) → 8% (2015)
    – Microsoft error rate: 5.9% (Oct 17, 2016), same as human performance

  9. Image Classification
  • ImageNet Large Scale Visual Recognition Challenge, classification error (%):
    – NEC (2010), features + SVMs: 28.2
    – XRCE (2011), features + SVMs: 25.8
    – AlexNet (2012), deep convolutional neural net, depth 8: 16.4
    – ZF (2013): 11.7
    – VGG (2014), depth 19: 7.3
    – GoogleLeNet (2014), depth 22: 6.7
    – ResNet (2015), depth 152: 3.57
    – GoogleLeNet-v4 (2016): 3.07
    – Human: 5.1

  10. Vanishing Gradients
  • Deep neural networks of sigmoid and hyperbolic units often suffer from vanishing gradients
  [Figure: deep network in which the gradient is large near the output, medium in the middle layers, and small near the input]

  11. Sigmoid and hyperbolic units
  • Derivative is never greater than 1 (at most 0.25 for the sigmoid, at most 1 for the hyperbolic tangent)
  [Plots: sigmoid and hyperbolic tangent units]

  12. Simple Example
  • Network: y = σ(w_4 σ(w_3 σ(w_2 σ(w_1 x)))) with hidden units h_1 = σ(w_1 x), h_2 = σ(w_2 h_1), h_3 = σ(w_3 h_2)
  • Common weight initialization in (-1, 1)
  • Sigmoid function and its derivative always less than 1
  • This leads to vanishing gradients:
    – ∂y/∂w_4 = σ'(w_4 h_3) h_3
    – ∂y/∂w_3 = σ'(w_4 h_3) w_4 σ'(w_3 h_2) h_2
    – ∂y/∂w_2 = σ'(w_4 h_3) w_4 σ'(w_3 h_2) w_3 σ'(w_2 h_1) h_1
    – ∂y/∂w_1 = σ'(w_4 h_3) w_4 σ'(w_3 h_2) w_3 σ'(w_2 h_1) w_2 σ'(w_1 x) x
  • As the product of factors less than 1 gets longer, the gradient vanishes
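A minimal NumPy sketch of this chain (an illustration, not the lecture's code): it evaluates the products above for weights drawn from (-1, 1) and shows the gradient with respect to earlier weights shrinking. All variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # always <= 0.25

rng = np.random.default_rng(0)
w1, w2, w3, w4 = rng.uniform(-1, 1, size=4)   # common initialization in (-1, 1)
x = 1.0

# Forward pass: y = sigmoid(w4 * sigmoid(w3 * sigmoid(w2 * sigmoid(w1 * x))))
h1 = sigmoid(w1 * x)
h2 = sigmoid(w2 * h1)
h3 = sigmoid(w3 * h2)
y  = sigmoid(w4 * h3)

# Chain rule: each extra layer multiplies in another sigmoid'(.) * w factor < 1
dy_dw4 = dsigmoid(w4 * h3) * h3
dy_dw3 = dsigmoid(w4 * h3) * w4 * dsigmoid(w3 * h2) * h2
dy_dw2 = dsigmoid(w4 * h3) * w4 * dsigmoid(w3 * h2) * w3 * dsigmoid(w2 * h1) * h1
dy_dw1 = (dsigmoid(w4 * h3) * w4 * dsigmoid(w3 * h2) * w3
          * dsigmoid(w2 * h1) * w2 * dsigmoid(w1 * x) * x)

print(dy_dw4, dy_dw3, dy_dw2, dy_dw1)   # magnitudes shrink toward the input
```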

  13. Avoiding Vanishing Gradients
  • Several popular solutions:
    – Pre-training
    – Rectified linear units and maxout units
    – Skip connections (a small sketch follows this slide)
    – Batch normalization
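As a side illustration (not from the slides), here is a minimal sketch of a skip connection, one of the remedies listed above: the block's output is added to its input, so gradients can flow through the identity path even when the block's own gradient is small. The layer sizes, weights, and names are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """y = x + F(x): the identity path lets gradients bypass F during backprop."""
    h = relu(W1 @ x)
    return x + W2 @ h        # skip connection: add the input back

rng = np.random.default_rng(1)
d = 4
x = rng.normal(size=d)
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))
print(residual_block(x, W1, W2))
```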

  14. Rectified Linear Units
  • Rectified linear: h(a) = max(0, a)
    – Gradient is 0 or 1
    – Sparse computation
  • Soft version ("Softplus"): h(a) = log(1 + e^a)
  [Plots: softplus and rectified linear functions]
  • Warning: softplus does not prevent gradient vanishing (gradient < 1)
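A small NumPy sketch (illustrative, not part of the slides) of the two activations and their derivatives: the ReLU gradient is exactly 0 or 1, while the softplus gradient is the sigmoid, which is strictly less than 1, matching the warning above.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def relu_grad(a):
    return (a > 0).astype(float)          # exactly 0 or 1

def softplus(a):
    return np.log1p(np.exp(a))            # log(1 + e^a)

def softplus_grad(a):
    return 1.0 / (1.0 + np.exp(-a))       # sigmoid(a), always < 1

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(a), relu_grad(a))
print(softplus(a), softplus_grad(a))
```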

  15. Maxout Units
  • Generalization of rectified linear units:
    h(x) = max( Σ_i w_i1 x_i, Σ_i w_i2 x_i, …, Σ_i w_ik x_i )
  [Figure: several identity (linear) units over the inputs x_1, …, x_n feeding a max unit]
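A minimal sketch (illustrative only) of a maxout unit: k linear functions of the input combined by a max. With one weight vector and one all-zero row it reduces to max(w·x, 0), i.e. a ReLU, which is the sense in which maxout generalizes rectified linear units.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: max over k linear functions of the input x.
    W has shape (k, d), b has shape (k,)."""
    return np.max(W @ x + b, axis=0)

rng = np.random.default_rng(2)
d, k = 3, 4
x = rng.normal(size=d)
W = rng.normal(size=(k, d))
b = rng.normal(size=k)
print(maxout(x, W, b))

# Special case: rows [w, 0] and zero bias give maxout(x) = max(w.x, 0) = ReLU
w = rng.normal(size=d)
W2 = np.vstack([w, np.zeros(d)])
print(maxout(x, W2, np.zeros(2)), max(float(w @ x), 0.0))
```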

  16. Overfitting
  • High expressivity increases the risk of overfitting
    – # of parameters is often larger than the amount of data
  • Some solutions:
    – Regularization
    – Dropout
    – Data augmentation

  17. Dropout
  • Idea: randomly "drop" some units from the network when training
  • Training: at each iteration of gradient descent
    – Each input unit is dropped with probability p_1 (e.g., 0.2)
    – Each hidden unit is dropped with probability p_2 (e.g., 0.5)
  • Prediction (testing):
    – Multiply each input unit by 1 − p_1
    – Multiply each hidden unit by 1 − p_2

  18. Dropout Algorithm
  • Training: let ⨀ denote elementwise multiplication
  • Repeat
    – For each training example (x_n, y_n) do
      • Sample a binary mask r_n^(l) ~ Bernoulli(1 − p_l) for each unit of layer l, 1 ≤ l ≤ L
      • Neural network with dropout applied:
        ŷ(x_n, r_n; W) = h_L( W_L ( … h_2( W_2 ( h_1( W_1 (x_n ⨀ r_n^(1)) ) ⨀ r_n^(2) ) ) … ⨀ r_n^(L) ) )
      • Loss: Err( y_n, ŷ(x_n, r_n; W) )
      • Update: w_ij ← w_ij − η ∂Loss/∂w_ij
    – End for
  • Until convergence
  • Prediction: ŷ(x_n; W) = h_L( W_L ( … h_2( W_2 ( h_1( W_1 x_n (1 − p_1) ) (1 − p_2) ) ) … (1 − p_L) ) )
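A minimal NumPy sketch of the two regimes (illustrative, not the lecture's code): at training time each layer's activations are multiplied by a Bernoulli mask, while at test time they are scaled by the keep probability instead. Layer sizes and probabilities are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W1, W2, p_in=0.2, p_hid=0.5, train=True):
    """Two-layer net with dropout on the input and hidden units."""
    if train:
        x = x * rng.binomial(1, 1.0 - p_in, size=x.shape)      # drop inputs
        h = relu(W1 @ x)
        h = h * rng.binomial(1, 1.0 - p_hid, size=h.shape)     # drop hidden units
    else:
        x = x * (1.0 - p_in)                                   # rescale at test time
        h = relu(W1 @ x) * (1.0 - p_hid)
    return W2 @ h

d, m = 8, 16
x = rng.normal(size=d)
W1 = rng.normal(scale=0.1, size=(m, d))
W2 = rng.normal(scale=0.1, size=(1, m))
print(forward(x, W1, W2, train=True))    # noisy: a random subnetwork
print(forward(x, W1, W2, train=False))   # deterministic, scaled prediction
```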

  19. Intuition
  • Dropout can be viewed as an approximate form of ensemble learning
  • In each training iteration, a different subnetwork is trained
  • At test time, these subnetworks are "merged" by averaging their weights

  20. Applications of Deep Neural Networks
  • Speech recognition
  • Image recognition
  • Machine translation
  • Control
  • Any application of shallow neural networks

  21. Acoustic Modeling in Speech Recognition

  22. Acoustic Modeling in Speech Recognition

  23. Image Recognition
  • Convolutional Neural Network
    – With rectified linear units and dropout
    – Data augmentation for transformation invariance (a small sketch follows)
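A minimal sketch of the kind of data augmentation meant here (illustrative; the crop size and transforms are assumptions, though random crops and horizontal flips are standard choices): each training image is randomly transformed so the network sees label-preserving variations.

```python
import numpy as np

rng = np.random.default_rng(4)

def augment(image, crop=24):
    """Random crop plus random horizontal flip of an HxWxC image array."""
    h, w, _ = image.shape
    top  = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop, :]
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]        # horizontal flip
    return patch

image = rng.random((32, 32, 3))          # stand-in for a training image
print(augment(image).shape)              # (24, 24, 3)
```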

  24. ImageNet Breakthrough
  • Results: ILSVRC-2012
  • From Krizhevsky, Sutskever, Hinton

  25. ImageNet Breakthrough
  • From Krizhevsky, Sutskever, Hinton
