Lecture 7 Recap
  1. Lecture 7 Recap (Prof. Leal-Taixé and Prof. Niessner)

  2. Beyond linear • A 1-layer network: $f = Wx$, mapping a 128×128 input $x$ to a vector $f$ of 10 class scores.
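
A minimal NumPy sketch of this 1-layer network, just to make the shapes concrete; the random input and weights are placeholders, not values from the lecture.

```python
import numpy as np

# Flattened 128x128 image -> vector of length 16384 (placeholder random input).
x = np.random.randn(128 * 128)

# One linear layer mapping 16384 inputs to 10 class scores.
W = np.random.randn(10, 128 * 128) * 0.01

f = W @ x          # shape (10,): one score per class
print(f.shape)     # (10,)
```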

  3. Neural network • Width and depth.

  4. Optimization

  5. Loss functions

  6. Neural networks • What is the shape of this function? The network maps the input to a prediction, which is scored by a loss (Softmax, Hinge).

  7. Sigmoid for binary predictions • $\sigma(x) = \frac{1}{1 + e^{-x}}$ maps any input to the range $(0, 1)$ and can be interpreted as a probability $p(y_i = 1 \mid x_i, \theta)$.

  8. Logistic regression • Binary classification: a linear combination of the inputs $x$ with parameters $\theta$, passed through a sigmoid to produce $\Pi_i$.

  9. Logistic regression • Loss function: $L(\Pi_i, y_i) = y_i \log \Pi_i + (1 - y_i) \log(1 - \Pi_i)$ • Cost function: $C(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \Pi_i + (1 - y_i) \log(1 - \Pi_i) \right]$, minimized with $\Pi_i = \sigma(x_i \theta)$.
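
A small NumPy sketch of this cost function, assuming a toy data matrix X (rows $x_i$), labels y in {0, 1}, and parameters theta chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, eps=1e-12):
    # Pi_i = sigmoid(x_i . theta): predicted probability of class 1.
    pi = sigmoid(X @ theta)
    # C(theta) = -1/n * sum_i [ y_i log Pi_i + (1 - y_i) log(1 - Pi_i) ]
    return -np.mean(y * np.log(pi + eps) + (1 - y) * np.log(1 - pi + eps))

# Toy example with placeholder data: 5 samples, 3 features.
X = np.random.randn(5, 3)
y = np.array([0, 1, 1, 0, 1])
theta = np.zeros(3)
print(logistic_cost(theta, X, y))   # log(2) ~ 0.693 for all-zero parameters
```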

  10. Softmax regression • Cost function for the binary case: $C(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \Pi_i + (1 - y_i) \log(1 - \Pi_i) \right]$, with the probability given by our sigmoid function. • Extension to multiple classes: $C(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{M} y_{i,c} \log p_{i,c}$, where $y_{i,c}$ is a binary indicator of whether $c$ is the label of image $i$.

  11. Softmax formulation • What if we have multiple classes? The network outputs $\Pi_1, \Pi_2, \Pi_3$ through a softmax layer.

  12. Softmax formulation • Three neurons in the output layer for three classes: $\Pi_1, \Pi_2, \Pi_3$.

  13. Softmax formulation • What if we have multiple classes? $C(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{M} y_{i,c} \log p_{i,c}$ • You can no longer use the raw output $\Pi_i$ as $p_{i,c}$, as in the binary case, because all outputs $\Pi_{i,c}$ need to sum to 1 over $c$.

  14. Softmax formulation • The score for the class cat is given by all the layers of the network: $p(\text{cat} \mid X_i) = \Pi_1 = \frac{e^{s_{\text{cat}}}}{\sum_c e^{s_c}}$ (the denominator is the normalization), and likewise $p(\text{dog} \mid X_i) = \Pi_2$, $p(\text{bird} \mid X_i) = \Pi_3$. • Softmax takes M inputs (scores) and outputs M probabilities (M is the number of classes).
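
A minimal sketch of the softmax normalization on a vector of class scores; the scores and class order (cat, dog, bird) are placeholders. Subtracting the maximum score is a standard numerical-stability trick, not something shown on the slide.

```python
import numpy as np

def softmax(scores):
    # Shifting by the max score does not change the result but avoids overflow.
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)   # e^{s_c} / sum_c e^{s_c}

scores = np.array([3.2, 5.1, -1.7])          # placeholder scores for cat, dog, bird
probs = softmax(scores)
print(probs, probs.sum())                    # M probabilities that sum to 1
```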

  15. Loss functions • Softmax loss function: $L_i = -\log\!\left(\frac{e^{s_{y_i}}}{\sum_k e^{s_k}}\right)$, which evaluates the ground-truth score for the image and comes from the maximum likelihood estimate. • Hinge loss (derived from the multiclass SVM loss): $L_i = \sum_{k \neq y_i} \max(0, s_k - s_{y_i} + 1)$.
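
Both losses sketched in NumPy for a single example; the scores and the ground-truth index are placeholders for illustration.

```python
import numpy as np

def softmax_loss(scores, y):
    # L_i = -log( e^{s_{y_i}} / sum_k e^{s_k} ), written in a numerically stable form.
    shifted = scores - np.max(scores)
    return -shifted[y] + np.log(np.sum(np.exp(shifted)))

def hinge_loss(scores, y, margin=1.0):
    # L_i = sum_{k != y_i} max(0, s_k - s_{y_i} + 1)
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0                   # the correct class does not contribute
    return np.sum(margins)

scores = np.array([3.2, 5.1, -1.7])    # placeholder class scores
y = 0                                  # placeholder ground-truth class index
print(softmax_loss(scores, y), hinge_loss(scores, y))
```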

  16. Loss functions • Softmax loss function: keeps optimizing until the loss is zero. • Hinge loss (derived from the multiclass SVM loss): saturates whenever it has learned a class "well enough".

  17. Activation functions

  18. Sigmoid • $\sigma(x) = \frac{1}{1 + e^{-x}}$ • In the forward pass, an input such as $x = 6$ lands in the flat region of the curve: saturated neurons kill the gradient flow, since the backward pass computes $\frac{\partial L}{\partial x} = \frac{\partial \sigma}{\partial x} \frac{\partial L}{\partial \sigma}$.
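
To see the saturation numerically, a tiny sketch evaluating the local gradient $\frac{\partial \sigma}{\partial x} = \sigma(x)(1 - \sigma(x))$ at the $x = 6$ example from the slide and at $x = 0$; whatever gradient flows in from above gets multiplied by this factor.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Local gradient: d(sigma)/dx = sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25: the largest the local gradient can get
print(sigmoid_grad(6.0))   # ~0.0025: saturated neuron, gradient flow nearly killed
```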

  19. Problem of the positive output • Since $\sigma(x) > 0$, the gradients on weights such as $w_1$ and $w_2$ all share the same sign, which constrains the update directions (more on zero-mean data later).

  20. tanh • Zero-centered, but still saturates on both sides (LeCun 1991).

  21. Rectified Linear Units (ReLU) • Large and consistent gradients, fast convergence, does not saturate. • But what happens if a ReLU outputs zero? Dead ReLU.

  22. Maxout units • Linear regimes, a generalization of ReLUs, do not die and do not saturate. • Downside: increase of the number of parameters.
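
A minimal sketch of a maxout unit with two linear pieces, i.e. taking the element-wise maximum of two affine maps of the input; the weights here are random placeholders. Note the doubled parameter count compared to a single linear layer followed by ReLU.

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # Element-wise maximum of two linear pieces: generalizes ReLU, never saturates,
    # and has no all-zero ("dead") regime, at the cost of twice the parameters.
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

# Placeholder shapes: 4 inputs, 3 maxout units.
x = np.random.randn(4)
W1, b1 = np.random.randn(3, 4), np.zeros(3)
W2, b2 = np.random.randn(3, 4), np.zeros(3)
print(maxout(x, W1, b1, W2, b2))
```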

  23. Data pre-processing • For images, subtract the mean image (AlexNet) or the per-channel mean (VGG-Net).
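
A sketch of the two mean-subtraction variants mentioned on the slide, applied to a placeholder batch of RGB images; the shapes are assumptions for illustration.

```python
import numpy as np

# Placeholder batch: 16 images of size 32x32 with 3 channels.
images = np.random.rand(16, 32, 32, 3)

# AlexNet-style: subtract the mean image (one mean per pixel and channel).
mean_image = images.mean(axis=0)              # shape (32, 32, 3)
centered_mean_image = images - mean_image

# VGG-style: subtract the per-channel mean (one mean per channel).
channel_mean = images.mean(axis=(0, 1, 2))    # shape (3,)
centered_per_channel = images - channel_mean

print(centered_mean_image.mean(), centered_per_channel.mean())   # both roughly zero
```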

  24. Weight initialization

  25. How do I start? • Forward pass through a stack of layers, each with its own weights $w$.

  26. Initialization is extremely important • Depending on the initialization, we are not guaranteed to reach the optimum.

  27. How do I start? • Forward pass: $f = \sum_i w_i x_i + b$ • What happens to the gradients if $w = 0$? No symmetry breaking.

  28. All weights to zero • To elaborate: the hidden units all compute the same function, and their gradients are going to be the same.
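
A tiny sketch of the symmetry problem: with all weights initialized to zero, every hidden unit produces the same output, so every unit would also receive the same gradient and they could never differentiate. The toy layer below is an assumption for illustration only.

```python
import numpy as np

x = np.random.randn(4)            # placeholder input

W1 = np.zeros((3, 4))             # all weights zero -> no symmetry breaking
b1 = np.zeros(3)

h = np.tanh(W1 @ x + b1)          # every hidden unit computes the same value
print(h)                          # [0. 0. 0.]: identical outputs, identical gradients
```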

  29. Small random numbers • Gaussian with zero mean and standard deviation 0.01 • Let us see what happens: a network with 10 layers of 500 neurons each, tanh as activation function, and unit-Gaussian input data.

  30. Small random numbers • In the forward pass, the activations become zero as we move from the input layer towards the last layer.

  31. Small random numbers • Forward pass: $f = \sum_i w_i x_i + b$ stays small.

  32. Small random numbers • Backward pass: 1. the gradient of the activation function is ok; 2. compute the gradients with respect to the weights of $f = \sum_i w_i x_i + b$.

  33. Small random numbers • 1. the gradient of the activation function is ok; 2. the gradients with respect to the weights vanish.

  34. Big random numbers • Gaussian with zero mean and standard deviation 1 • Let us see what happens: a network with 10 layers of 500 neurons each, tanh as activation function, and unit-Gaussian input data.

  35. Big random numbers • Everything is saturated.
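
A sketch that reproduces the two experiments above: a 10-layer, 500-neuron-per-layer tanh network on unit-Gaussian input, initialized first with standard deviation 0.01 (the activations collapse towards zero) and then with standard deviation 1 (the activations saturate near ±1). Layer count, width and sample count follow the lecture's setup; the exact numbers printed will vary with the random seed.

```python
import numpy as np

def forward_activation_stds(weight_std, n_layers=10, width=500, n_samples=1000):
    """Forward pass through a tanh network; return the activation std per layer."""
    x = np.random.randn(n_samples, width)          # unit-Gaussian input data
    stds = []
    for _ in range(n_layers):
        W = np.random.randn(width, width) * weight_std
        x = np.tanh(x @ W)
        stds.append(x.std())
    return stds

print("std 0.01:", [round(s, 3) for s in forward_activation_stds(0.01)])  # shrinks to ~0
print("std 1.00:", [round(s, 3) for s in forward_activation_stds(1.0)])   # saturates near 1
```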

  36. How to solve this? • Working on the initialization • Working on the output generated by each layer

  37. Xavier initialization • Gaussian with zero mean, but what standard deviation? $\operatorname{Var}(s) = \operatorname{Var}\!\left(\sum_i^n w_i x_i\right) = \sum_i^n \operatorname{Var}(w_i x_i)$ (Glorot 2010)

  38. Xavier initialization • Gaussian with zero mean, but what standard deviation? $\sum_i^n \operatorname{Var}(w_i x_i) = \sum_i^n \left[ [E(w_i)]^2 \operatorname{Var}(x_i) + [E(x_i)]^2 \operatorname{Var}(w_i) + \operatorname{Var}(x_i)\operatorname{Var}(w_i) \right]$, using that $w_i$ and $x_i$ are independent and zero-mean.

  39. Xavier initialization • $\sum_i^n \operatorname{Var}(w_i x_i) = \sum_i^n \operatorname{Var}(x_i)\operatorname{Var}(w_i) = \big(n \operatorname{Var}(w)\big) \operatorname{Var}(x)$, since the terms are identically distributed.

  40. Xavier initialization • $\operatorname{Var}(s) = \big(n \operatorname{Var}(w)\big) \operatorname{Var}(x)$: the variance gets multiplied by the number of inputs.

  41. Xavier initialization • How to ensure that the variance of the output is the same as that of the input? Require $n \operatorname{Var}(w) = 1$, i.e. $\operatorname{Var}(w) = \frac{1}{n}$.
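
The same 10-layer tanh experiment as above, but with Xavier initialization, $\operatorname{Var}(w) = 1/n$, i.e. standard deviation $1/\sqrt{n}$ for $n$ inputs per neuron. This is a sketch under the lecture's setup (10 layers, 500 units, unit-Gaussian input); the point is only that the activation scale no longer collapses or saturates.

```python
import numpy as np

n_layers, width = 10, 500
x = np.random.randn(1000, width)                   # unit-Gaussian input data

for _ in range(n_layers):
    # Xavier: Var(w) = 1/n  ->  std = 1/sqrt(n), with n inputs per neuron.
    W = np.random.randn(width, width) / np.sqrt(width)
    x = np.tanh(x @ W)

print(x.std())   # stays at a healthy scale instead of collapsing to 0 or saturating at 1
```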

  42. Xavier initialization • Mitigates the effect of activations going to zero.

  43. Xavier initialization with ReLU

  44. ReLU kills half of the data • Use $\operatorname{Var}(w) = \frac{2}{n}$ instead (He 2015).

  45. ReLU kills half of the data • $\operatorname{Var}(w) = \frac{2}{n}$: it makes a huge difference! (He 2015)
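
The corresponding sketch with ReLU activations: with Xavier's $1/n$ the activation scale shrinks layer by layer because ReLU zeroes half of the pre-activations, while He initialization with $\operatorname{Var}(w) = 2/n$ keeps the scale roughly constant. Network size again follows the lecture's example.

```python
import numpy as np

def relu_forward_std(var_scale, n_layers=10, width=500, n_samples=1000):
    x = np.random.randn(n_samples, width)          # unit-Gaussian input data
    for _ in range(n_layers):
        W = np.random.randn(width, width) * np.sqrt(var_scale / width)
        x = np.maximum(0.0, x @ W)                 # ReLU kills the negative half
    return x.std()

print("Xavier, Var(w) = 1/n:", relu_forward_std(1.0))   # scale shrinks layer by layer
print("He,     Var(w) = 2/n:", relu_forward_std(2.0))   # scale stays roughly constant
```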

  46. Tips and tricks • Use ReLU and Xavier/2 initialization.

  47. Batch normalization

  48. Our goal • All we want is that our activations do not die out.

  49. Batch normalization • Wish: unit-Gaussian activations (in our example) • Solution: let's do it. $\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\operatorname{Var}[x^{(k)}]}}$, where mean and variance are taken over the N mini-batch examples for each of the D features $k$ (Ioffe and Szegedy 2015).

  50. Batch normalization • In each dimension of the features you get a unit Gaussian (in our example): $\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\operatorname{Var}[x^{(k)}]}}$, with mean and variance over the N mini-batch examples per feature $k$, for D features (Ioffe and Szegedy 2015).
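
A sketch of this normalization step on a placeholder mini-batch of shape N×D: each feature $k$ is centered and scaled to unit variance over the batch. The learnable scale-and-shift parameters of the full batch-norm layer are not shown here, since the slides stop at the normalization itself.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # x has shape (N, D): N mini-batch examples, D features.
    mean = x.mean(axis=0)                    # E[x^(k)] over the mini-batch, per feature k
    var = x.var(axis=0)                      # Var[x^(k)] over the mini-batch, per feature k
    return (x - mean) / np.sqrt(var + eps)   # x_hat^(k), roughly unit Gaussian per feature

x = np.random.randn(32, 4) * 5.0 + 3.0       # placeholder activations, non-zero mean/scale
x_hat = batch_norm(x)
print(x_hat.mean(axis=0), x_hat.std(axis=0)) # per-feature mean ~0 and std ~1
```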
