  1. Lecture 10 recap • Prof. Leal-Taixé and Prof. Niessner

  2. LeNet • 60k parameters • Digit recognition: 10 classes • Conv -> Pool -> Conv -> Pool -> Conv -> FC • As we go deeper: width and height shrink, the number of filters grows
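
As a concrete reference, here is a minimal PyTorch sketch of a LeNet-style network for 10-class digit recognition; the exact filter counts (6, 16, 120) and the tanh/average-pool choices are assumptions in the spirit of LeNet-5, chosen so that the parameter count lands near the 60k mentioned above.

```python
import torch.nn as nn

# LeNet-style CNN for 10-class digit recognition (a sketch; the filter
# counts below are assumptions chosen to land near ~60k parameters).
class LeNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),    # 32x32x1 -> 28x28x6
            nn.AvgPool2d(2),                              # -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),   # -> 10x10x16
            nn.AvgPool2d(2),                              # -> 5x5x16
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh()  # -> 1x1x120
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(sum(p.numel() for p in LeNet().parameters()))  # ~61.7k parameters
```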

  3. AlexNet [Krizhevsky et al. 2012] • Softmax for 1000 classes

  4. VGGNet [Simonyan and Zisserman 2014] • Striving for simplicity • CONV = 3x3 filters with stride 1, same convolutions • MAXPOOL = 2x2 filters with stride 2

  5. VGGNet • Conv = 3x3, s = 1, same • Maxpool = 2x2, s = 2

  6. VGGNet • Conv -> Pool -> Conv -> Pool -> Conv -> FC • As we go deeper: width and height shrink, the number of filters grows • Called VGG-16: 16 layers that have weights, 138M parameters • Large, but its simplicity makes it appealing
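
Under the stated VGG rules (3x3 same convolutions with stride 1, 2x2 max pooling with stride 2), a block can be sketched as below; this is an illustrative fragment of the early VGG-16 stages under those assumptions, not the full 138M-parameter network.

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    # VGG rule: 3x3 convs, stride 1, padding 1 ("same"), then 2x2 max pool, stride 2.
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# First stages of a VGG-16-style feature extractor (sketch only):
features = nn.Sequential(
    vgg_block(3, 64, 2),     # 224x224x3   -> 112x112x64
    vgg_block(64, 128, 2),   # 112x112x64  -> 56x56x128
    vgg_block(128, 256, 3),  # 56x56x128   -> 28x28x256
)
```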

  7. The problem of depth • As we add more and more layers, training becomes harder • Vanishing and exploding gradients • How can we train very deep nets?

  8. Residual block • Two layers: input x_{L-1}, linear step W_L x_{L-1} + b_L, non-linearity x_L = f(W_L x_{L-1} + b_L), then x_{L+1} = f(W_{L+1} x_L + b_{L+1})

  9. Residual block • Two layers x_{L-1} -> x_L -> x_{L+1} • Main path: Linear -> Linear • Skip connection from the input x_{L-1} around the two layers

  10. Residual block • Two layers • Without the skip connection: x_{L+1} = f(W_{L+1} x_L + b_{L+1}) • With the skip connection: x_{L+1} = f(W_{L+1} x_L + b_{L+1} + x_{L-1})

  11. Residual block • Usually use a same convolution, since we need the same dimensions for the addition • Otherwise we need to convert the dimensions with a matrix of learned weights or with zero padding

  12. Why do ResNets work? • The identity is easy for the residual block to learn • Guaranteed it will not hurt performance, it can only improve
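
A minimal PyTorch sketch of the residual block described above, computing f(W_{L+1} x_L + b_{L+1} + x_{L-1}); the optional 1x1 projection corresponds to the "matrix of learned weights" option for mismatched dimensions, and its exact form here is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # Two conv layers plus a skip connection: out = f(conv2(f(conv1(x))) + x).
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1)
        # If dimensions change, convert the skip path with learned 1x1 weights.
        self.proj = None
        if stride != 1 or in_ch != out_ch:
            self.proj = nn.Conv2d(in_ch, out_ch, 1, stride=stride)

    def forward(self, x):
        identity = x if self.proj is None else self.proj(x)
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + identity)  # add the skip, then the non-linearity

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64, 64)(x).shape)  # torch.Size([1, 64, 32, 32])
```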

  13. 1x1 convolution • Image (5x5):
      -5  3  2 -5  3
       4  3  2  1 -3
       1  0  3  3  5
      -2  0  1  4  4
       5  6  7  9 -1
      Kernel (1x1): 2 • What is the output size?

  14. 1x1 convolution • Applying the 1x1 kernel (value 2) to the top-left pixel: -5 * 2 = -10

  15. 1x1 convolution • Output (5x5), every pixel multiplied by the kernel value 2:
      -10   6   4 -10   6
        8   6   4   2  -6
        2   0   6   6  10
       -4   0   2   8   8
       10  12  14  18  -2
      e.g. the bottom-right pixel: -1 * 2 = -2

  16. 1x1 convolution • For 1 kernel or filter, it keeps the spatial dimensions and just scales the input by a number

  17. Using 1x1 convolutions • Use it to shrink the number of channels • Further adds a non-linearity, so one can learn more complex functions • Example: a 32x32x200 input passed through a Conv 1x1x200 with 32 filters + ReLU becomes 32x32x32
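
The channel-shrinking example can be checked directly; a small sketch using the tensor sizes from the slide (32x32x200 input, 32 filters of size 1x1x200, ReLU):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 200, 32, 32)              # 32x32 input with 200 channels
conv1x1 = nn.Conv2d(200, 32, kernel_size=1)  # 32 filters of size 1x1x200
y = torch.relu(conv1x1(x))
print(y.shape)  # torch.Size([1, 32, 32, 32]): spatial size kept, channels shrunk
```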

  18. Inception layer • Tired of choosing filter sizes? Use them all! • All same convolutions • The 3x3 max pooling is with stride 1

  19. Inception layer: computational cost • Input 32x32x200 • Conv 1x1x200, 16 filters + ReLU -> 32x32x16 • Conv 5x5x16, 92 filters + ReLU -> 32x32x92 • Multiplications: 1x1x200x32x32x16 + 5x5x16x32x32x92 ~ 40 million • Reduction of multiplications by ~1/10 compared to a direct 5x5 convolution
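
The multiplication counts can be reproduced with plain arithmetic; the ~470M figure for a direct 5x5 convolution is computed here for comparison and is implied by the "1/10" reduction rather than stated on the slide.

```python
# Bottleneck path: 1x1 conv down to 16 channels, then 5x5 conv up to 92 channels.
bottleneck = 1 * 1 * 200 * 32 * 32 * 16 + 5 * 5 * 16 * 32 * 32 * 92
# Direct path: a single 5x5 convolution from 200 to 92 channels.
direct = 5 * 5 * 200 * 32 * 32 * 92

print(bottleneck)           # 40,960,000   (~40 million)
print(direct)               # 471,040,000  (~470 million)
print(direct / bottleneck)  # ~11.5x, i.e. roughly a 1/10 reduction
```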

  20. Inception layer

  21. Semantic Segmentation (FCN) • Fully Convolutional Networks for Semantic Segmentation [Long et al. 15]

  22. Transfer learning • Take a network trained on ImageNet • FROZEN: keep the pretrained layers fixed • TRAIN: a new final layer on the new dataset with C classes [Donahue 2014, Razavian 2014]
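
A common way to set this up in PyTorch is sketched below: freeze an ImageNet-pretrained backbone and train only a new final layer for C classes. The choice of torchvision's ResNet-18 and C = 5 are illustrative assumptions, not from the slide.

```python
import torch.nn as nn
from torchvision import models

C = 5  # number of classes in the new dataset (example value)

# Load an ImageNet-pretrained backbone (ResNet-18 chosen for illustration).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# FROZEN: keep the pretrained features fixed.
for param in model.parameters():
    param.requires_grad = False

# TRAIN: replace the last fully connected layer with a fresh one for C classes.
model.fc = nn.Linear(model.fc.in_features, C)
```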

  23. Now you are: • Ready to perform image classification on any dataset • Ready to design your own architecture • Ready to deal with other problems such as semantic segmentation (Fully Convolutional Network)

  24. Recurrent Neural Networks

  25. RNNs are flexible • Classic Neural Networks for Image Classification

  26. RNNs are flexible • Image captioning

  27. RNNs are flexible • Language recognition

  28. RNNs are flexible • Machine translation

  29. RNNs are flexible • Event classification

  30. Basic structure of an RNN • Multi-layer RNN: inputs, hidden states, outputs

  31. Basic structure of an RNN • Multi-layer RNN • The hidden state will have its own internal dynamics • A more expressive model!

  32. Basic structure of an RNN • We want to have a notion of “time” or “sequence” • The hidden state is computed from the previous hidden state and the current input [Christopher Olah] Understanding LSTMs

  33. Basic structure of an RNN • We want to have a notion of “time” or “sequence” • The hidden-state update has parameters (weights) to be learned

  34. Basic structure of an RNN • We want to have a notion of “time” or “sequence” • The output is computed from the hidden state • Note: non-linearities ignored for now

  35. Basic structure of an RNN • We want to have a notion of “time” or “sequence” • The same parameters are used for each time step = generalization!
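
A bare-bones sketch of one recurrent step, with the same weights reused at every time step; the weight names (W_xh, W_hh, W_hy) and the tanh non-linearity are assumptions, since the slide equations are not legible in this transcript.

```python
import torch

# Dimensions chosen for illustration.
input_size, hidden_size, output_size = 8, 16, 4

W_xh = torch.randn(hidden_size, input_size) * 0.1   # input  -> hidden
W_hh = torch.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden
W_hy = torch.randn(output_size, hidden_size) * 0.1  # hidden -> output

def rnn_step(x_t, h_prev):
    # New hidden state from the previous hidden state and the current input;
    # the same weights are applied at every time step.
    h_t = torch.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

h = torch.zeros(hidden_size)
for x_t in torch.randn(5, input_size):  # a sequence of 5 inputs
    h, y = rnn_step(x_t, h)
```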

  36. Basic structure of an RNN • Unrolling RNNs • The hidden state is the same [Christopher Olah] Understanding LSTMs

  37. Basic structure of an RNN • Unrolling RNNs [Christopher Olah] Understanding LSTMs

  38. Basic structure of an RNN • Unrolling RNNs as feedforward nets • The weights w1, w2, w3, w4 are the same at every time step (inputs x_t, x_{t+1}, x_{t+2})

  39. Backprop through an RNN • Unrolling RNNs as feedforward nets • Chain rule, all the way back to t = 0 • Add the derivatives at the different time steps for each weight
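
Written out, "add the derivatives at different times" means summing the per-time-step contributions; a sketch of the total gradient for a shared parameter θ with hidden states A_t (notation assumed to match the recurrence used on the following slides):

```latex
\frac{\partial L}{\partial \theta}
  = \sum_{t} \frac{\partial L_t}{\partial \theta}
  = \sum_{t} \sum_{k \le t}
      \frac{\partial L_t}{\partial A_t}
      \left( \prod_{j=k+1}^{t} \frac{\partial A_j}{\partial A_{j-1}} \right)
      \frac{\partial A_k}{\partial \theta}
```

The product of Jacobians over time steps is exactly where the repeated multiplication by the same weights enters.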

  40. Long-term dependencies • “I moved to Germany … so I speak German fluently”

  41. Long-term dependencies • Simple recurrence: if we forget the input, A_t = θ^t A_0 • The same weights are multiplied over and over again

  42. Long-term dependencies • Simple recurrence A_t = θ^t A_0 • What happens with small weights? Vanishing gradient • What happens with large weights? Exploding gradient

  43. Long-term dependencies • Simple recurrence A_t = θ^t A_0 • If θ admits an eigendecomposition θ = Q Λ Q^{-1}, where Q is the matrix of eigenvectors and Λ is a diagonal matrix whose entries are the eigenvalues

  44. Long-term dependencies • Simple recurrence A_t = θ^t A_0 • If θ admits an eigendecomposition and Q is orthogonal, the recurrence simplifies to A_t = Q Λ^t Q^T A_0

  45. Long-term dependencies • Simple recurrence A_t = Q Λ^t Q^T A_0 • What happens to eigenvalues with magnitude less than one? Vanishing gradient • What happens to eigenvalues with magnitude larger than one? Exploding gradient -> gradient clipping
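
A quick numerical check of why the eigenvalue magnitudes matter in A_t = Q Λ^t Q^T A_0 (illustrative values):

```python
import numpy as np

eigvals = np.array([0.9, 1.0, 1.1])  # illustrative eigenvalue magnitudes
for t in [1, 10, 50, 100]:
    print(t, eigvals ** t)
# 0.9**100 ~ 2.7e-5 (vanishing), 1.0**100 = 1, 1.1**100 ~ 1.4e4 (exploding)
```

For the exploding case, gradient clipping (e.g. torch.nn.utils.clip_grad_norm_ in PyTorch) caps the gradient norm before each update.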

  46. Long-term dependencies • Simple recurrence A_t = θ^t A_0 • Let us just make a matrix with eigenvalues = 1 • Allow the cell to maintain its “state”

  47. Vanishing gradient • A_t = θ^t A_0 • 1. From the weights • 2. From the activation functions (tanh)

  48. Vanishing gradient • A_t = θ^t A_0 • 1. From the weights • 2. From the activation functions (tanh), whose derivative is at most 1
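
The tanh contribution can be seen from its derivative, tanh'(x) = 1 - tanh(x)^2, which is at most 1; multiplying many such factors while backpropagating through time shrinks the gradient (a small illustrative check):

```python
import numpy as np

x = np.random.randn(50)        # 50 pre-activations along the unrolled chain
dtanh = 1.0 - np.tanh(x) ** 2  # each factor is at most 1
print(dtanh.max())             # <= 1.0
print(np.prod(dtanh))          # product over 50 steps: very close to 0
```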

  49. Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber 1997]
