Activation Functions: Scaled Exponential Linear Unit (SELU)
- Scaled version of ELU that works better for deep networks
- "Self-Normalizing" property; can train deep SELU networks without BatchNorm
- Derivation takes 91 pages of math in the appendix…
α = 1.6732632423543772848170429916717
λ = 1.0507009873554804934193349852946
Klambauer et al, "Self-Normalizing Neural Networks", NeurIPS 2017
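For reference, the full SELU definition from the Klambauer et al. paper, where the two constants above enter as the ELU parameter α and the scale λ:

selu(x) = λ·x               if x > 0
selu(x) = λ·α·(eˣ − 1)      if x ≤ 0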
Activation Functions: Gaussian Error Linear Unit (GELU)
- Idea: Multiply input by 0 or 1 at random; large values more likely to be multiplied by 1, small values more likely to be multiplied by 0 (data-dependent dropout)
- Take expectation over the randomness, with X ~ N(0, 1):
  gelu(x) = x·P(X ≤ x) = (x/2)·(1 + erf(x/√2)) ≈ x·σ(1.702x), where σ is the sigmoid function
- Very common in Transformers (BERT, GPT, GPT-2, GPT-3)
Hendrycks and Gimpel, "Gaussian Error Linear Units (GELUs)", 2016
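A minimal numerical sketch (not from the slides) checking the exact GELU against the sigmoid approximation quoted above:

```python
import math

def gelu_exact(x):
    # x * P(X <= x) for X ~ N(0, 1), via the error function
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid_approx(x):
    # x * sigmoid(1.702 * x), the approximation from Hendrycks & Gimpel
    return x / (1.0 + math.exp(-1.702 * x))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(x, gelu_exact(x), gelu_sigmoid_approx(x))
```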
Accuracy on CIFAR-10
[Bar chart comparing activation functions (ReLU, Leaky ReLU, Parametric ReLU, Softplus, ELU, SELU, GELU, Swish) on ResNet, Wide ResNet, and DenseNet; all accuracies fall in a narrow band of roughly 93–96%.]
Ramachandran et al, "Searching for activation functions", ICLR Workshop 2018
Activation Functions: Summary
- Don't think too hard. Just use ReLU
- Try out Leaky ReLU / ELU / SELU / GELU if you need to squeeze out that last 0.1%
- Don't use sigmoid or tanh
Data Preprocessing
Data Preprocessing
(Assume X [NxD] is data matrix, each example in a row)
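A minimal numpy sketch of the zero-centering and normalization steps the slide refers to (variable shapes are illustrative):

```python
import numpy as np

X = np.random.randn(100, 50)     # hypothetical data matrix, N x D

X -= np.mean(X, axis=0)          # zero-center: subtract per-feature mean
X /= np.std(X, axis=0)           # normalize: divide by per-feature std
```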
Remember: Consider what happens when the input to a neuron is always positive…
  h⁽ℓ⁾ = f(W⁽ℓ⁾ h⁽ℓ⁻¹⁾ + b⁽ℓ⁾)
Q: What can we say about the gradients on w?
A: They are always all positive or all negative, so the allowed gradient update directions cover only two quadrants; reaching a hypothetical optimal w vector requires a zig-zag path.
(This is also why you want zero-mean data!)
Data Preprocessing
In practice, you may also see PCA (data has diagonal covariance matrix) and Whitening (covariance matrix is the identity matrix) of the data
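A rough numpy sketch of PCA and whitening as described above (not from the slides; assumes X has already been zero-centered):

```python
import numpy as np

X = np.random.randn(100, 50)
X -= X.mean(axis=0)                      # zero-center first

cov = X.T @ X / X.shape[0]               # D x D covariance matrix
U, S, _ = np.linalg.svd(cov)             # eigenvectors / eigenvalues

X_pca = X @ U                            # decorrelate: diagonal covariance
X_white = X_pca / np.sqrt(S + 1e-5)      # whiten: ~identity covariance
```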
Data Preprocessing
Before normalization: classification loss very sensitive to changes in weight matrix; hard to optimize
After normalization: less sensitive to small changes in weights; easier to optimize
Data Preprocessing for Images
e.g. consider CIFAR-10 example with [32, 32, 3] images
- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
- Subtract per-channel mean and divide by per-channel std (e.g. ResNet) (mean and std along each channel = 3 numbers each)
Not common to do PCA or whitening
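A minimal sketch of the per-channel variant (ResNet-style), assuming a float image batch of shape [N, 32, 32, 3]:

```python
import numpy as np

imgs = np.random.rand(500, 32, 32, 3).astype(np.float32)   # hypothetical training images

channel_mean = imgs.mean(axis=(0, 1, 2))   # 3 numbers
channel_std = imgs.std(axis=(0, 1, 2))     # 3 numbers

imgs_normalized = (imgs - channel_mean) / channel_std
```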
Weight Initialization
Weight Initialization
Q: What happens if we initialize all W=0, b=0?
A: All outputs are 0, all gradients are the same! No "symmetry breaking"
Weight Initialization
Next idea: small random numbers (Gaussian with zero mean, std=0.01)
Works ~okay for small networks, but problems with deeper networks.
Weight Initialization: Activation Statistics
Forward pass for a 6-layer net with hidden size 4096 (tanh, weights initialized with std=0.01): all activations tend to zero for deeper network layers
Q: What do the gradients dL/dW look like?
A: All zero, no learning =(
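A sketch of the experiment behind these plots (my reconstruction in the spirit of the slide, not the exact code shown): a 6-layer tanh net with hidden size 4096 and weights drawn with std 0.01:

```python
import numpy as np

dims = [4096] * 7                           # input + 6 hidden layers
x = np.random.randn(16, dims[0])            # a small batch of random inputs
hs = []
for Din, Dout in zip(dims[:-1], dims[1:]):
    W = 0.01 * np.random.randn(Din, Dout)   # small random init
    x = np.tanh(x @ W)
    hs.append(x)

# With std=0.01, the std of the activations shrinks toward 0 layer by layer
for i, h in enumerate(hs):
    print(f"layer {i + 1}: mean {h.mean():.6f}, std {h.std():.6f}")
```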
Weight Initialization: Activation Statistics
Increase std of initial weights from 0.01 to 0.05: all activations saturate
Q: What do the gradients look like?
A: Local gradients all zero, no learning =(
Weight Initialization: Xavier Initialization
"Xavier" initialization: std = 1/sqrt(Din)
"Just right": Activations are nicely scaled for all layers!
For conv layers, Din is kernel_size² * input_channels
Glorot and Bengio, "Understanding the difficulty of training deep feedforward neural networks", AISTATS 2010
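A sketch of Xavier initialization for a fully-connected layer and, following the conv note above, for a conv layer (my reconstruction; function names are illustrative):

```python
import numpy as np

def xavier_fc(Din, Dout):
    # std = 1 / sqrt(Din)
    return np.random.randn(Din, Dout) / np.sqrt(Din)

def xavier_conv(kernel_size, in_channels, out_channels):
    # For conv layers, Din = kernel_size^2 * input_channels
    Din = kernel_size * kernel_size * in_channels
    w = np.random.randn(out_channels, in_channels, kernel_size, kernel_size)
    return w / np.sqrt(Din)
```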
Weight Initialization: Xavier Initialization
"Xavier" initialization: std = 1/sqrt(Din)
Derivation: Variance of output = Variance of input
y = Wx, so y_i = Σ_{j=1}^{Din} x_j w_j
Var(y_i) = Din · Var(x_i w_i)                           [assume x, w are iid]
         = Din · (E[x_i²] E[w_i²] − E[x_i]² E[w_i]²)    [assume x, w independent]
         = Din · Var(x_i) · Var(w_i)                    [assume x, w are zero-mean]
If Var(w_i) = 1/Din then Var(y_i) = Var(x_i)
Weight Initialization: What about ReLU?
Change from tanh to ReLU: Xavier assumes a zero-centered activation function
Activations collapse to zero again, no learning =(
Weight Initialization: Kaiming / MSRA Initialization
ReLU correction: std = sqrt(2 / Din)
"Just right": activations nicely scaled for all layers
He et al, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", ICCV 2015
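The corresponding sketch for Kaiming / MSRA initialization (same setup as the Xavier sketch above, with the factor of 2 correcting for ReLU zeroing half the inputs):

```python
import numpy as np

def kaiming_fc(Din, Dout):
    # std = sqrt(2 / Din)
    return np.random.randn(Din, Dout) * np.sqrt(2.0 / Din)
```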
Weight Initialization: Residual Networks
[Residual block: x → conv → relu → conv → F(x), output is relu(F(x) + x)]
If we initialize with MSRA, then Var(F(x)) = Var(x)
But then Var(F(x) + x) > Var(x): variance grows with each block!
Solution: Initialize the first conv with MSRA, initialize the second conv to zero. Then Var(x + F(x)) = Var(x)
Zhang et al, "Fixup Initialization: Residual Learning Without Normalization", ICLR 2019
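A minimal PyTorch-flavored sketch of that initialization scheme for one residual block (my reconstruction; Fixup itself adds further details such as rescaling and bias terms):

```python
import torch.nn as nn

def init_residual_block(conv1: nn.Conv2d, conv2: nn.Conv2d):
    # First conv: MSRA / Kaiming init, so Var(output) = Var(input) under ReLU
    nn.init.kaiming_normal_(conv1.weight, nonlinearity='relu')
    # Second conv: zeros, so the block starts as F(x) = 0 and Var(x + F(x)) = Var(x)
    nn.init.zeros_(conv2.weight)
    if conv1.bias is not None:
        nn.init.zeros_(conv1.bias)
    if conv2.bias is not None:
        nn.init.zeros_(conv2.bias)
```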
Proper initialization is an active area of research
- Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al, 2013
- Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al, 2015
- Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al, 2015
- All you need is a good init, Mishkin and Matas, 2015
- Fixup Initialization: Residual Learning Without Normalization, Zhang et al, 2019
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019
Now your model is training … but it overfits! Regularization
Regularization: Add term to the loss
In common use:
- L2 regularization (weight decay)
- L1 regularization
- Elastic net (L1 + L2)
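Written out, the regularized loss and the three regularizers listed above (standard forms; the λ and β symbols are the usual regularization-strength and mixing hyperparameters, my notation rather than the slide's):

L = (1/N) Σ_i L_i + λ·R(W)
L2:          R(W) = Σ_k Σ_l W_{k,l}²
L1:          R(W) = Σ_k Σ_l |W_{k,l}|
Elastic net: R(W) = Σ_k Σ_l (β·W_{k,l}² + |W_{k,l}|)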
Regularization: Dropout
In each forward pass, randomly set some neurons to zero
Probability of dropping is a hyperparameter; 0.5 is common
Srivastava et al, "Dropout: A simple way to prevent neural networks from overfitting", JMLR 2014
Regularization: Dropout
Example forward pass with a 3-layer network using dropout
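A sketch of that forward pass in numpy (my reconstruction in the spirit of the slide, not the exact code shown):

```python
import numpy as np

p = 0.5  # keep probability (drop probability = 1 - p); 0.5 is common

def train_step(X, W1, W2, W3):
    """Forward pass of a 3-layer net with dropout after the first two layers."""
    H1 = np.maximum(0, X @ W1)
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 *= U1                             # drop
    H2 = np.maximum(0, H1 @ W2)
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 *= U2                             # drop
    return H2 @ W3
```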
Regularization: Dropout
Forces the network to have a redundant representation; prevents co-adaptation of features
[Figure: a "cat score" computed from features like "has an ear", "has a tail", "is furry", "has claws", "mischievous look", with random subsets of features crossed out by dropout.]
Regularization: Dropout
Another interpretation: Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model.
An FC layer with 4096 units has 2^4096 ≈ 10^1233 possible masks!
Only ~10^82 atoms in the universe...
Dropout: Test Time
Dropout makes our output random: y = f_W(x, z), where x is the input (image), y is the output (label), and z is a random mask.
Want to "average out" the randomness at test time:
y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz
But this integral seems hard…
Dropout: Test Time
Want to approximate the integral y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz
Consider a single neuron a with inputs x, y and weights w1, w2.
At test time we have: E[a] = w1·x + w2·y
During training (dropping each input with probability 1/2) we have:
E[a] = ¼(w1·x + w2·y) + ¼(w1·x + 0·y) + ¼(0·x + 0·y) + ¼(0·x + w2·y) = ½(w1·x + w2·y)
At test time, drop nothing and multiply by the dropout probability.
Dropout: Test Time
At test time all neurons are always active => We must scale the activations so that for each neuron: output at test time = expected output at training time
Dropout Summary
Drop in forward pass, scale at test time
More common: "Inverted dropout"
Drop and scale during training; test time is unchanged!
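A sketch of inverted dropout (my reconstruction, not the exact slide code): the mask is scaled by 1/p during training, so the test-time forward pass needs no change:

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def train_forward(X, W1, W2):
    H1 = np.maximum(0, X @ W1)
    U1 = (np.random.rand(*H1.shape) < p) / p   # drop AND scale by 1/p
    H1 *= U1
    return H1 @ W2

def test_forward(X, W1, W2):
    H1 = np.maximum(0, X @ W1)                 # no dropout, no scaling needed
    return H1 @ W2
```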
Dropout architectures
Recall AlexNet and VGG have most of their parameters in fully-connected layers; usually dropout is applied there.
[Bar chart: AlexNet vs VGG-16 parameter counts per layer (conv1–conv5, fc6–fc8); the fc layers dominate, and that is where dropout is used.]
Later architectures (GoogLeNet, ResNet, etc) use global average pooling instead of fully-connected layers: they don't use dropout at all!
Regularization: A common pattern
Training: Add some kind of randomness: y = f_W(x, z)
Testing: Average out randomness (sometimes approximate): y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz
Example: Batch Normalization
Training: Normalize using stats from random minibatches
Testing: Use fixed stats to normalize
For ResNet and later, often L2 and Batch Normalization are the only regularizers!
Data Augmentation
Load image and label ("cat"); transform the image before feeding it to the CNN and computing the loss
This image by Nikita is licensed under CC-BY 2.0
Data Augmentation: Horizontal Flips
Data Augmentation: Random Crops and Scales
Training: sample random crops / scales. ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch
Testing: average a fixed set of crops. ResNet:
1. Resize image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, + flips
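A rough sketch of the training-time crop/scale sampling described above, using PIL (my reconstruction; the exact ResNet pipeline differs in details):

```python
import random
from PIL import Image

def random_resized_crop(img: Image.Image) -> Image.Image:
    # 1. Pick a random short-side length L in [256, 480]
    L = random.randint(256, 480)
    w, h = img.size
    scale = L / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    # 2. Sample a random 224 x 224 patch
    w, h = img.size
    left = random.randint(0, w - 224)
    top = random.randint(0, h - 224)
    return img.crop((left, top, left + 224, top + 224))
```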
Data Augmentation: Color Jitter
Simple: Randomize contrast and brightness
More complex (used in AlexNet, ResNet, etc):
1. Apply PCA to all [R, G, B] pixels in training set
2. Sample a "color offset" along principal component directions
3. Add offset to all pixels of a training image
Data Augmentation: Get creative for your problem!
Random mix / combinations of:
- translation
- rotation
- stretching
- shearing
- lens distortions, … (go crazy)
Regularization: A common pattern
Training: Add some randomness
Testing: Marginalize over randomness
Examples:
- Dropout
- Batch Normalization
- Data Augmentation
Regularization: DropConnect
Training: Drop random connections between neurons (set weight = 0)
Testing: Use all the connections
Examples:
- Dropout
- Batch Normalization
- Data Augmentation
- DropConnect
Wan et al, "Regularization of Neural Networks using DropConnect", ICML 2013
Regularization: Fractional Pooling
Training: Use randomized pooling regions
Testing: Average predictions over different samples
Examples:
- Dropout
- Batch Normalization
- Data Augmentation
- DropConnect
- Fractional Max Pooling
Graham, "Fractional Max Pooling", arXiv 2014
Regularization: Stochastic Depth
Training: Skip some residual blocks in ResNet
Testing: Use the whole network
Examples:
- Dropout
- Batch Normalization
- Data Augmentation
- DropConnect
- Fractional Max Pooling
- Stochastic Depth
Huang et al, "Deep Networks with Stochastic Depth", ECCV 2016
Regularization: Cutout
Training: Set random image regions to 0
Testing: Use the whole image
Works very well for small datasets like CIFAR, less common for large datasets like ImageNet
Examples:
- Dropout
- Batch Normalization
- Data Augmentation
- DropConnect
- Fractional Max Pooling
- Stochastic Depth
- Cutout
DeVries and Taylor, "Improved Regularization of Convolutional Neural Networks with Cutout", arXiv 2017
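A minimal numpy sketch of cutout as described above (the patch size is an illustrative hyperparameter, not from the slide):

```python
import numpy as np

def cutout(img, size=8):
    """Zero out a random square patch of a [H, W, C] image (CIFAR-style)."""
    H, W, _ = img.shape
    cy, cx = np.random.randint(H), np.random.randint(W)
    y0, y1 = max(0, cy - size // 2), min(H, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(W, cx + size // 2)
    out = img.copy()
    out[y0:y1, x0:x1, :] = 0
    return out
```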
Regularization: Mixup
Training: Train on random blends of images. Randomly blend the pixels of pairs of training images, e.g. 40% cat and 60% dog; the target label is blended the same way (cat: 0.4, dog: 0.6).
Sample the blend weight from a beta distribution Beta(a, b) with a = b ≈ 0, so blend weights are close to 0/1.
Testing: Use original images
Examples:
- Dropout
- Batch Normalization
- Data Augmentation
- DropConnect
- Fractional Max Pooling
- Stochastic Depth
- Cutout
- Mixup
Zhang et al, "mixup: Beyond Empirical Risk Minimization", ICLR 2018
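A minimal numpy sketch of a mixup batch as described above (my reconstruction; `images`, one-hot `labels`, and the beta parameter a=0.2 are illustrative assumptions):

```python
import numpy as np

def mixup_batch(images, labels, a=0.2):
    """Blend each example with a randomly chosen partner; labels blend the same way."""
    N = images.shape[0]
    lam = np.random.beta(a, a, size=N)        # near 0/1 when a is small
    perm = np.random.permutation(N)
    lam_img = lam.reshape(N, 1, 1, 1)         # broadcast over H, W, C
    mixed_images = lam_img * images + (1 - lam_img) * images[perm]
    lam_lab = lam.reshape(N, 1)
    mixed_labels = lam_lab * labels + (1 - lam_lab) * labels[perm]
    return mixed_images, mixed_labels
```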
Regularization: Summary
- Consider dropout for large fully-connected layers
- Batch normalization and data augmentation are almost always a good idea
- Try cutout and mixup, especially for small classification datasets
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup