Lecture 7 Recap
Naïve Losses: L2 vs L1
• L2 Loss:
  – $L_2 = \sum_i \left(y_i - f(x_i)\right)^2$
  – Sum of squared differences (SSD)
  – Prone to outliers
  – Compute-efficient (optimization)
  – Optimum is the mean
• L1 Loss:
  – $L_1 = \sum_i \left|y_i - f(x_i)\right|$
  – Sum of absolute differences
  – Robust to outliers
  – Costly to compute
  – Optimum is the median
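A minimal NumPy sketch of the two losses above; the variable names (`y` for targets, `y_pred` for model outputs) are illustrative, not from the slides:

```python
import numpy as np

def l1_loss(y, y_pred):
    # Sum of absolute differences: robust to outliers, optimum is the median.
    return np.sum(np.abs(y - y_pred))

def l2_loss(y, y_pred):
    # Sum of squared differences (SSD): prone to outliers, optimum is the mean.
    return np.sum((y - y_pred) ** 2)

y = np.array([1.0, 2.0, 3.0, 100.0])      # the outlier (100) dominates L2
y_pred = np.array([1.1, 1.9, 3.2, 3.0])
print(l1_loss(y, y_pred), l2_loss(y, y_pred))
```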
Binary Classification: Sigmoid
$\sigma(\boldsymbol{x}, \boldsymbol{\theta}) = \frac{1}{1 + e^{-\sum_i \theta_i x_i}}$, i.e., $\sigma(s) = \frac{1}{1 + e^{-s}}$ for the score $s = \sum_i \theta_i x_i$.
The output can be interpreted as a probability: $p(y = 1 \mid \boldsymbol{x}, \boldsymbol{\theta})$.
Softmax Formulation
• What if we have multiple classes?
• Softmax maps the scores $s_k$ for each class to probabilities for each class:
  $p(y = k \mid \boldsymbol{x}, \boldsymbol{\theta}) = \frac{e^{s_k}}{\sum_j e^{s_j}}$
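A short sketch of this mapping from scores to probabilities; subtracting the maximum score before exponentiating is a common numerical-stability trick (an assumption here, not shown on the slide):

```python
import numpy as np

def softmax(s):
    # Shifting by the max does not change the result but avoids overflow in exp.
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

scores = np.array([2.0, 1.0, -1.0])   # one score per class
print(softmax(scores))                # probabilities, sum to 1
```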
Example: Hinge vs Cross-Entropy
Hinge loss: $L_i = \sum_{j \neq y_i} \max\!\left(0, s_j - s_{y_i} + 1\right)$
Cross-entropy loss: $L_i = -\log\!\left(\frac{e^{s_{y_i}}}{\sum_k e^{s_k}}\right)$
Given the following scores for $x_i$ with ground-truth class $y_i = 0$:
• Model 1: $s = [5, -3, 2]$
  – Hinge: $\max(0, -3 - 5 + 1) + \max(0, 2 - 5 + 1) = 0$
  – Cross-entropy: $-\ln\frac{e^5}{e^5 + e^{-3} + e^2} \approx 0.05$
• Model 2: $s = [5, 10, 10]$
  – Hinge: $\max(0, 10 - 5 + 1) + \max(0, 10 - 5 + 1) = 12$
  – Cross-entropy: $-\ln\frac{e^5}{e^5 + e^{10} + e^{10}} \approx 5.70$
• Model 3: $s = [5, -20, -20]$
  – Hinge: $\max(0, -20 - 5 + 1) + \max(0, -20 - 5 + 1) = 0$
  – Cross-entropy: $-\ln\frac{e^5}{e^5 + e^{-20} + e^{-20}} \approx 2 \cdot 10^{-11}$
– Cross-entropy *always* wants to improve! (loss never 0)
– Hinge loss saturates.
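The numbers in this worked example can be reproduced with a few lines of NumPy; this is a sketch assuming ground-truth class index 0 and a margin of 1, as above:

```python
import numpy as np

def hinge_loss(s, y):
    # Multiclass hinge (SVM) loss with margin 1; the true class is excluded from the sum.
    margins = np.maximum(0, s - s[y] + 1)
    margins[y] = 0
    return np.sum(margins)

def cross_entropy_loss(s, y):
    # Negative log of the softmax probability of the true class.
    e = np.exp(s - np.max(s))
    return -np.log(e[y] / np.sum(e))

for s in ([5, -3, 2], [5, 10, 10], [5, -20, -20]):
    s = np.array(s, dtype=float)
    print(hinge_loss(s, 0), cross_entropy_loss(s, 0))
# -> 0 and ~0.05;  12 and ~5.70;  0 and a tiny but nonzero cross-entropy
```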
Sigmoid Activation
Forward: $\sigma(s) = \frac{1}{1 + e^{-s}}$
Backward (chain rule): $\frac{\partial L}{\partial x} = \frac{\partial s}{\partial x} \frac{\partial L}{\partial s}$, with $\frac{\partial L}{\partial s} = \frac{\partial \sigma}{\partial s} \frac{\partial L}{\partial \sigma}$
Saturated neurons kill the gradient flow.
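A quick numerical sketch of why saturation kills the gradient: the local derivative is $\sigma'(s) = \sigma(s)(1 - \sigma(s))$, which is at most 0.25 and nearly zero for large $|s|$, so any upstream gradient gets multiplied by almost zero.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_grad(s):
    # Local gradient dsigma/ds = sigma(s) * (1 - sigma(s)).
    sig = sigmoid(s)
    return sig * (1.0 - sig)

for s in (0.0, 2.0, 10.0):
    print(s, sigmoid_grad(s))   # 0.25, ~0.105, ~4.5e-05 (saturated)
```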
TanH Activation
• Still saturates
• Zero-centered
[LeCun et al. 1991] Improving Generalization Performance in Character Recognition
Rectified Linear Units (ReLU)
• Large and consistent gradients
• Fast convergence
• Does not saturate
• Dead ReLU: what happens if a ReLU outputs zero?
[Krizhevsky et al., NeurIPS 2012] ImageNet Classification with Deep Convolutional Neural Networks
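A minimal sketch of ReLU and its local gradient: where the pre-activation is negative, the output is zero and the gradient is blocked, which is what the "dead ReLU" question above refers to.

```python
import numpy as np

def relu(s):
    return np.maximum(0.0, s)

def relu_grad(s, upstream):
    # Gradient passes through unchanged where s > 0 and is blocked (0) elsewhere.
    return upstream * (s > 0)

s = np.array([-3.0, -0.5, 0.7, 4.0])
print(relu(s), relu_grad(s, np.ones_like(s)))
```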
Quick Guide
• Sigmoid is not really used anymore.
• ReLU is the standard choice.
• Second choices are the variants of ReLU or Maxout.
• Recurrent nets will require TanH or similar.
Initialization is Extremely Important
$x^* = \arg\min_x f(x)$
(Figure: loss landscape with the optimum and an initialization point.)
• Depending on the initialization, we are not guaranteed to reach the optimum.
Xavier Initialization
• How do we ensure that the variance of the output is the same as the variance of the input?
• Answer: set $\operatorname{Var}(w) = \frac{1}{n}$, where $n$ is the number of inputs to the neuron.
ReLU Kills Half of the Data
$\operatorname{Var}(w) = \frac{2}{n}$
• It makes a huge difference!
[He et al., ICCV'15] He Initialization
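A hedged sketch of both initializations for a fully connected layer with `fan_in` inputs: Xavier scales the weight variance by 1/fan_in, He by 2/fan_in to compensate for ReLU zeroing half of the activations (function and variable names are illustrative):

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Var(w) = 1 / fan_in keeps the activation variance roughly constant (tanh/sigmoid).
    return np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)

def he_init(fan_in, fan_out):
    # Var(w) = 2 / fan_in compensates for ReLU killing half of the activations.
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

W = he_init(512, 256)
print(W.std())   # ~ sqrt(2/512) = 0.0625
```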
Lecture 8
Data Augmentation
Data Augmentation
• A classifier has to be invariant to a wide variety of transformations.
(Figure: examples of variation in pose, appearance, and illumination.)
Data Augmentation
• A classifier has to be invariant to a wide variety of transformations.
• Helping the classifier: synthesize data simulating plausible transformations.
Data Augmentation
(Figure: augmented ImageNet examples.)
[Krizhevsky et al., NIPS'12] ImageNet
Data Augmentation: Brightness
• Random brightness and contrast changes
[Krizhevsky et al., NIPS'12] ImageNet
Data Augmentation: Random Crops
• Training: random crops (see the sketch below)
  – Pick a random L in [256, 480]
  – Resize the training image so that its short side is L
  – Randomly sample crops of 224×224
• Testing: fixed set of crops
  – Resize the image at N scales
  – 10 fixed crops of 224×224: (4 corners + 1 center) × 2 flips
[Krizhevsky et al., NIPS'12] ImageNet
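A rough sketch of the training-time procedure above, operating on an H×W×3 NumPy array; the crop size and scale range come from the slide, while the function name and the nearest-neighbor resize are assumptions made to keep the example dependency-free:

```python
import numpy as np

def random_resized_crop(img, crop=224, scale_range=(256, 480)):
    # Pick a random short-side length L in [256, 480], resize, then take a random 224x224 crop.
    L = np.random.randint(scale_range[0], scale_range[1] + 1)
    h, w = img.shape[:2]
    s = L / min(h, w)
    new_h, new_w = int(round(h * s)), int(round(w * s))
    # Nearest-neighbor resize via index sampling.
    rows = (np.arange(new_h) / s).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / s).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    top = np.random.randint(0, new_h - crop + 1)
    left = np.random.randint(0, new_w - crop + 1)
    return resized[top:top + crop, left:left + crop]

img = np.random.rand(300, 400, 3)        # stand-in for a training image
print(random_resized_crop(img).shape)    # (224, 224, 3)
```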
Data Augmentation
• When comparing two networks, make sure to use the same data augmentation!
• Consider data augmentation a part of your network design.
Advanced Regularization
Weight Decay
• L2 regularization:
  $\theta^{k+1} = \theta^k - \varepsilon \, \nabla_\theta L(\theta^k, x, y) - \lambda \theta^k$
  (learning rate $\varepsilon$, gradient of the loss, gradient of the L2 regularizer $\lambda \theta^k$)
• Penalizes large weights
• Improves generalization
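A minimal sketch of one SGD step with L2 weight decay, matching the update rule above; `lr` and `lam` are illustrative names for the learning rate and the decay coefficient:

```python
import numpy as np

def sgd_step_with_weight_decay(theta, grad, lr=0.1, lam=1e-4):
    # theta_{k+1} = theta_k - lr * grad(L) - lam * theta_k
    # The extra -lam * theta_k term shrinks (decays) the weights at every step.
    return theta - lr * grad - lam * theta

theta = np.array([1.0, -2.0, 0.5])
grad = np.array([0.3, -0.1, 0.0])
print(sgd_step_with_weight_decay(theta, grad))
```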
Early Stopping
(Figure: error over the course of training; the overfitting region is marked.)
Early Stopping
• An easy form of regularization: follow the sequence of iterates $\theta^1, \theta^2, \dots$ and stop at $\theta^*$, before overfitting sets in.
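An illustrative early-stopping loop: keep the parameters with the best validation error and stop once it has not improved for `patience` epochs. All function and variable names here (`train_epoch`, `val_error`, `patience`) are assumptions, not from the slides.

```python
import copy

def train_with_early_stopping(model, train_epoch, val_error, max_epochs=100, patience=5):
    best_err, best_model, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_epoch(model)              # one pass over the training set
        err = val_error(model)          # error on the held-out validation set
        if err < best_err:
            best_err, best_model = err, copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                       # validation error stopped improving -> overfitting
    return best_model, best_err
```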
Bagging and Ensemble Methods
• Train multiple models and average their results.
• E.g., use a different algorithm for optimization or change the objective/loss function.
• If the errors are uncorrelated, the expected combined error will decrease linearly with the ensemble size.
Bagging and Ensemble Methods
• Bagging: uses k different datasets
(Figure: Training Set 1, Training Set 2, Training Set 3.)
Image source: [Srivastava et al., JMLR'14] Dropout
Dropout
Dropout
• Disable a random set of neurons (typically 50%) in the forward pass.
[Srivastava et al., JMLR'14] Dropout
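A minimal sketch of the dropout forward pass at training time: each activation is kept independently with probability p (here 0.5) and zeroed otherwise.

```python
import numpy as np

def dropout_forward_train(x, p=0.5):
    # Sample a binary mask; each neuron is kept with probability p.
    mask = (np.random.rand(*x.shape) < p).astype(x.dtype)
    return x * mask, mask   # the mask is reused in the backward pass

x = np.random.randn(4, 8)
out, mask = dropout_forward_train(x)
```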
Dropout: Intuition
• Using half the network = half capacity
  – Redundant representations
(Figure: redundant features such as "furry", "has two eyes", "has a tail", "has paws", "has two ears".)
[Srivastava et al., JMLR'14] Dropout
Dropout: Intuition
• Using half the network = half capacity
  – Redundant representations
  – Base your scores on more features
• Consider it as a model ensemble
[Srivastava et al., JMLR'14] Dropout
Dropout: Intuition
• Two models in one
(Figure: Model 1 and Model 2.)
[Srivastava et al., JMLR'14] Dropout
Dropout: Intuition
• Using half the network = half capacity
  – Redundant representations
  – Base your scores on more features
• Consider it as two models in one
  – Training a large ensemble of models, each on a different set of data (mini-batch) and with SHARED parameters
  – Reduces co-adaptation between neurons
[Srivastava et al., JMLR'14] Dropout
Dropout: Test Time
• All neurons are "turned on" – no dropout
• Conditions at train and test time are not the same
[Srivastava et al., JMLR'14] Dropout
Dropout: Test Time
Dropout probability $p = 0.5$
• Train (expectation over the four equally likely dropout masks):
  $E[z] = \frac{1}{4}\left(\theta_1 \cdot 0 + \theta_2 \cdot 0 \;+\; \theta_1 x_1 + \theta_2 \cdot 0 \;+\; \theta_1 \cdot 0 + \theta_2 x_2 \;+\; \theta_1 x_1 + \theta_2 x_2\right) = \frac{1}{2}(\theta_1 x_1 + \theta_2 x_2)$
• Test (all neurons on, so scale by $p$): $z = p \,(\theta_1 x_1 + \theta_2 x_2)$
• Weight scaling inference rule
[Srivastava et al., JMLR'14] Dropout
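The weight-scaling inference rule fits in a few lines: at test time all neurons are active, so activations are multiplied by p to match the training-time expectation. The "inverted dropout" variant shown alongside is a common alternative (an addition here, not from the slide): it divides by p at training time so that test time needs no change.

```python
import numpy as np

p = 0.5  # keep probability

# Classic dropout (as on the slide): no scaling at train time, scale by p at test time.
def classic_train(x):
    return x * (np.random.rand(*x.shape) < p)

def classic_test(x):
    return x * p   # weight scaling inference rule: matches the training-time expectation

# Inverted dropout (common in practice): scale by 1/p at train time, do nothing at test time.
def inverted_train(x):
    return x * (np.random.rand(*x.shape) < p) / p

def inverted_test(x):
    return x
```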
Dropout: Verdict
• Efficient bagging method with parameter sharing
• Try it!
• Dropout reduces the effective capacity of a model → larger models, more training time
[Srivastava et al., JMLR'14] Dropout
Batch Normalization
Our Goal
• All we want is that our activations do not die out.