Lecture 7 Recap
Beyond linear: a 1-layer network f = Wx, mapping a 128×128 input x through the weight matrix W to 10 output scores f.
Neural Network: width and depth
Optimization
Loss functions
Neural networks: what is the shape of this function? Prediction → Loss (Softmax, Hinge)
Sigmoid for binary predictions
$\sigma(x) = \frac{1}{1 + e^{-x}}$
The output of a neuron with inputs $x_0, x_1, x_2$ and weights $\theta_0, \theta_1, \theta_2$, squashed to the range (0, 1), can be interpreted as a probability $p(y_i = 1 \mid x_i, \theta)$.
Logistic regression
• Binary classification: inputs $x_0, x_1, x_2$ with weights $\theta_0, \theta_1, \theta_2$ produce a single output $\Pi_i$.
Logistic regression
• Loss function: $L(\Pi_i, y_i) = y_i \log \Pi_i + (1 - y_i) \log(1 - \Pi_i)$
• Cost function: $C(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \Pi_i + (1 - y_i) \log(1 - \Pi_i) \right]$, with $\Pi_i = \sigma(x_i \theta)$
• Training = minimization of the cost
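A minimal NumPy sketch of this cost, assuming $\Pi_i = \sigma(x_i \theta)$ as above (the toy data and function names are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, eps=1e-12):
    """Binary cross-entropy cost C(theta), averaged over the n examples."""
    pi = sigmoid(X @ theta)  # Pi_i = sigma(x_i theta)
    return -np.mean(y * np.log(pi + eps) + (1 - y) * np.log(1 - pi + eps))

# Toy data: n = 4 examples, 3 features (first column acts as the bias input x_0 = 1).
X = np.array([[1.0, 0.5, 1.2], [1.0, -1.0, 0.3], [1.0, 2.0, -0.7], [1.0, 0.1, 0.9]])
y = np.array([1.0, 0.0, 1.0, 1.0])
theta = np.zeros(3)
print(logistic_cost(theta, X, y))  # log(2) ≈ 0.693 for the uninformative zero model
```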
Softmax regression
• Cost function for the binary case: $C(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \Pi_i + (1 - y_i) \log(1 - \Pi_i) \right]$, where the probability $\Pi_i$ is given by our sigmoid function
• Extension to multiple classes: $C(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{M} y_{i,c} \log p_{i,c}$, where $y_{i,c}$ is a binary indicator of whether $c$ is the label for image $i$
Softmax formulation
• What if we have multiple classes? Inputs $x_0, x_1, x_2$ are fed through a softmax that outputs $\Pi_1, \Pi_2, \Pi_3$.
Softmax formulation
• Three neurons in the output layer for three classes: inputs $x_0, x_1, x_2$ map to outputs $\Pi_1, \Pi_2, \Pi_3$.
Softmax formulation
• What if we have multiple classes? $C(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{M} y_{i,c} \log p_{i,c}$
• You can no longer assign $\Pi_i$ to $p_{i,c}$ as in the binary case, because all outputs need to sum to 1: $\sum_c \Pi_{i,c} = 1$
Softmax formulation
$\Pi_1 = p(\text{cat} \mid X_i) = \frac{e^{s_\text{cat}}}{\sum_c e^{s_c}}$, where the score $s_\text{cat}$ for class cat is given by all the layers of the network and the denominator is the normalization; likewise $\Pi_2 = p(\text{dog} \mid X_i)$ and $\Pi_3 = p(\text{bird} \mid X_i)$.
• Softmax takes M inputs (scores) and outputs M probabilities (M is the number of classes)
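A minimal NumPy sketch of this normalization (the function name and example scores are illustrative):

```python
import numpy as np

def softmax(scores):
    """Turn a vector of M class scores into M probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability; result unchanged
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

s = np.array([3.2, 1.1, -0.5])  # scores for (cat, dog, bird)
print(softmax(s))               # ≈ [0.87, 0.11, 0.02], sums to 1
```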
Loss functions
• Softmax loss function: $L_i = -\log\!\left(\frac{e^{s_{y_i}}}{\sum_k e^{s_k}}\right)$, which evaluates the ground-truth score for the image and comes from the Maximum Likelihood Estimate
• Hinge loss (derived from the multiclass SVM loss): $L_i = \sum_{k \neq y_i} \max(0, s_k - s_{y_i} + 1)$
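For concreteness, a hedged NumPy sketch of both per-example losses (the names and the margin default are my own choices):

```python
import numpy as np

def softmax_loss(scores, y):
    """Softmax (cross-entropy) loss for one example: -log probability of the true class y."""
    shifted = scores - np.max(scores)                       # numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[y]

def hinge_loss(scores, y, margin=1.0):
    """Multiclass SVM (hinge) loss: penalizes classes scoring within `margin` of the true class."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0                                        # the true class itself does not contribute
    return np.sum(margins)

scores = np.array([3.2, 1.1, -0.5])                         # (cat, dog, bird), true class = cat
print(softmax_loss(scores, 0), hinge_loss(scores, 0))
```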
Loss functions
• Softmax loss function: keeps optimizing until the loss is zero (the probabilities are pushed ever closer to 1)
• Hinge loss (derived from the multiclass SVM loss): saturates whenever it has learned a class "well enough", i.e. once the margin is satisfied
Activation functions
Sigmoid
$\sigma(x) = \frac{1}{1 + e^{-x}}$
Forward pass with e.g. $x = 6$: the neuron saturates. Saturated neurons kill the gradient flow: by the chain rule $\frac{\partial L}{\partial x} = \frac{\partial \sigma}{\partial x} \cdot \frac{\partial L}{\partial \sigma}$, and $\frac{\partial \sigma}{\partial x} \approx 0$ in the saturated regime.
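A small NumPy sketch illustrating this saturation (the printout format is mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # local gradient d(sigma)/dx

for x in [0.0, 2.0, 6.0, 10.0]:
    print(f"x = {x:4.1f}  sigma = {sigmoid(x):.4f}  dsigma/dx = {sigmoid_grad(x):.6f}")
# At x = 6 the local gradient is already ~0.0025, so any upstream gradient
# is multiplied by nearly zero and the gradient flow effectively dies.
```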
Problem of all-positive outputs
Because the sigmoid output is always positive, the gradients on $w_1$ and $w_2$ all share the same sign, which restricts the allowed update directions and leads to zig-zag optimization paths. More on zero-mean data later.
tanh
• Zero-centered
• Still saturates (in both tails)
[LeCun 1991]
Rectified Linear Units (ReLU)
• Large and consistent gradients
• Fast convergence
• Does not saturate
• Dead ReLU: what happens if a ReLU outputs zero? (See the sketch below.)
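A minimal NumPy sketch of the ReLU and its gradient, illustrating the dead-ReLU case (names are mine):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Local gradient: 1 where the input is positive, 0 elsewhere.
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))  # [0.  0.  0.  1.  1. ]
# A neuron whose pre-activation stays <= 0 outputs zero AND receives zero gradient,
# so its weights stop being updated: a "dead" ReLU.
```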
Maxout units
• Generalization of ReLUs
• Linear regimes
• Does not die
• Does not saturate
• Increases the number of parameters
Data pre-processing
For images, subtract the mean image (AlexNet) or the per-channel mean (VGG-Net).
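A short NumPy sketch of both variants (the random toy data stands in for a real training set):

```python
import numpy as np

# Toy stand-in for a training set of N RGB images, shape (N, H, W, 3).
train_images = np.random.randint(0, 256, size=(1000, 32, 32, 3)).astype(np.float32)

# AlexNet-style: subtract the mean image (one mean per pixel position and channel).
mean_image = train_images.mean(axis=0)            # shape (32, 32, 3)
train_centered = train_images - mean_image

# VGG-style: subtract a single mean per channel.
channel_mean = train_images.mean(axis=(0, 1, 2))  # shape (3,)
train_centered_vgg = train_images - channel_mean
```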
Weight initialization
How do I start?
[Figure: forward pass through a chain of layers, each with weights w]
Initialization is extremely important
[Figure: loss surface with the initialization point and the optimum; we are not guaranteed to reach the optimum]
How do I start?
Forward pass: $f\!\left(\sum_i w_i x_i + b\right)$. What if all weights are set to $w = 0$? What happens to the gradients? There is no symmetry breaking.
All weights set to zero
• The hidden units all compute the same function, and their gradients are the same, so every unit receives the identical update and the symmetry is never broken.
Small random numbers
• Gaussian with zero mean and standard deviation 0.01
• Let us see what happens:
  – Network with 10 layers of 500 neurons each
  – tanh as the activation function
  – Unit Gaussian input data
Small random numbers
Forward pass: from the input layer to the last layer, the activations become zero.
Small random numbers
Forward pass: $f\!\left(\sum_i w_i x_i + b\right)$ with small weights $w_i$.
Small random numbers
Backward pass through $f\!\left(\sum_i w_i x_i + b\right)$:
1. The activation function gradient is OK.
2. Compute the gradients with respect to the weights.
Small random numbers
Backward pass through $f\!\left(\sum_i w_i x_i + b\right)$:
1. The activation function gradient is OK.
2. The gradients with respect to the weights vanish, because they are scaled by the near-zero activations $x_i$.
Big random numbers
• Gaussian with zero mean and standard deviation 1
• Let us see what happens:
  – Network with 10 layers of 500 neurons each
  – tanh as the activation function
  – Unit Gaussian input data
Big random numbers
Everything is saturated: the tanh activations pile up at −1 and +1, so the gradients vanish again.
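A hedged NumPy sketch reproducing the experiment from the previous slides (10 tanh layers of 500 neurons, unit Gaussian input; the printout format is my own):

```python
import numpy as np

def activation_stats(weight_std, n_layers=10, n_units=500, n_samples=1000):
    """Forward unit-Gaussian data through a tanh network and report the activation spread per layer."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal((n_samples, n_units))          # unit Gaussian input data
    for layer in range(1, n_layers + 1):
        W = weight_std * rng.standard_normal((n_units, n_units))
        x = np.tanh(x @ W)
        print(f"std={weight_std:<5} layer {layer:2d}: mean={x.mean():+.4f}, std={x.std():.4f}")

activation_stats(0.01)  # activations shrink toward zero layer by layer
activation_stats(1.0)   # activations saturate near -1/+1 (std stays close to 1)
```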
How to solve this?
• Work on the initialization
• Work on the output generated by each layer
Xavier initialization
• Gaussian with zero mean, but what standard deviation?
$\operatorname{Var}(s) = \operatorname{Var}\!\left(\sum_i^n w_i x_i\right) = \sum_i^n \operatorname{Var}(w_i x_i)$  (independent)
$= \sum_i^n \left[ [E(w_i)]^2 \operatorname{Var}(x_i) + [E(x_i)]^2 \operatorname{Var}(w_i) + \operatorname{Var}(x_i)\operatorname{Var}(w_i) \right]$
$= \sum_i^n \operatorname{Var}(x_i)\operatorname{Var}(w_i)$  (zero mean)
$= \left(n \operatorname{Var}(w)\right)\operatorname{Var}(x)$  (identically distributed)
• The variance of the output thus gets multiplied by the number of inputs $n$. [Glorot 2010]
Xavier initialization
• How do we ensure the variance of the output is the same as that of the input? Require $\left(n \operatorname{Var}(w)\right)\operatorname{Var}(x) = \operatorname{Var}(x)$, i.e. $n \operatorname{Var}(w) = 1$, which gives $\operatorname{Var}(w) = \frac{1}{n}$.
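A minimal NumPy sketch of Xavier initialization for a fully connected layer (the function name is mine):

```python
import numpy as np

def xavier_init(n_in, n_out):
    """Xavier/Glorot initialization: zero-mean Gaussian with Var(w) = 1 / n_in."""
    return np.sqrt(1.0 / n_in) * np.random.randn(n_in, n_out)

W = xavier_init(500, 500)
print(W.std())  # ≈ sqrt(1/500) ≈ 0.045
```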
Xavier initialization
Mitigates the effect of the activations going to zero.
Xavier initialization with ReLU
[Figure: per-layer activation distributions when Xavier initialization is combined with ReLU]
ReLU kills half of the data
Compensate by doubling the weight variance: $\operatorname{Var}(w) = \frac{2}{n}$ [He 2015]
ReLU kills half of the data
$\operatorname{Var}(w) = \frac{2}{n}$: it makes a huge difference! [He 2015]
Tips and tricks
• Use ReLU and Xavier/2 (He) initialization, as sketched below.
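A hedged sketch of this recipe, combining ReLU with Xavier/2 (He) initialization for a stack of fully connected layers (names and layer sizes are illustrative):

```python
import numpy as np

def he_init(n_in, n_out):
    """Xavier/2 (He) initialization for ReLU layers: Var(w) = 2 / n_in."""
    return np.sqrt(2.0 / n_in) * np.random.randn(n_in, n_out)

def relu(x):
    return np.maximum(0.0, x)

# With this initialization the activation spread stays roughly stable across ReLU layers.
x = np.random.randn(1000, 500)
for layer in range(1, 11):
    x = relu(x @ he_init(500, 500))
    print(f"layer {layer:2d}: activation std = {x.std():.3f}")
```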
Batch normalization
Our goal
• All we want is that our activations do not die out
Batch normalization
• Wish: unit Gaussian activations (in our example)
• Solution: let's just do it:
$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\operatorname{Var}[x^{(k)}]}}$
The mean and variance are computed over the mini-batch examples for each feature $k$ (N = mini-batch size, D = #features). [Ioffe and Szegedy 2015]
Batch normalization
• In each dimension of the features, you now have a unit Gaussian (in our example):
$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\operatorname{Var}[x^{(k)}]}}$
(N = mini-batch size, D = #features) [Ioffe and Szegedy 2015]
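A minimal NumPy sketch of this normalization step (the learnable scale and shift of the full batch-norm layer are omitted; names are mine):

```python
import numpy as np

def batchnorm_normalize(x, eps=1e-5):
    """Normalize each feature over the mini-batch to zero mean and unit variance.

    x: array of shape (N, D), N = mini-batch size, D = #features.
    """
    mean = x.mean(axis=0)                      # E[x^(k)] per feature k
    var = x.var(axis=0)                        # Var[x^(k)] per feature k
    return (x - mean) / np.sqrt(var + eps)

x = 5.0 + 3.0 * np.random.randn(32, 4)         # mini-batch of 32 examples, 4 features
x_hat = batchnorm_normalize(x)
print(x_hat.mean(axis=0), x_hat.std(axis=0))   # ≈ 0 and ≈ 1 per feature
```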