Remember: Consider what happens when the input to a neuron is always positive. What can we say about the gradients on w? They are always all positive or all negative :( This restricts the allowed gradient update directions relative to the hypothetical optimal w vector. (This is also why you want zero-mean data!) Justin Johnson Lecture 10 - 36 October 7, 2019
Data Preprocessing (Assume X [NxD] is data matrix, each example in a row) Justin Johnson Lecture 10 - 37 October 7, 2019
Data Preprocessing In practice, you may also see PCA (data has diagonal covariance matrix) and Whitening of the data (covariance matrix is the identity matrix). Justin Johnson Lecture 10 - 38 October 7, 2019
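For reference, a minimal numpy sketch of these preprocessing steps, assuming X is the [NxD] data matrix from above (random stand-in data used here):

```python
import numpy as np

# X: [N x D] data matrix, one example per row (random stand-in data)
X = np.random.randn(1000, 50)

# Zero-center and normalize each feature
X = X - X.mean(axis=0)
X = X / (X.std(axis=0) + 1e-8)

# PCA: rotate data into the eigenbasis of its covariance matrix
cov = X.T @ X / X.shape[0]            # [D x D] covariance (X is zero-mean)
U, S, _ = np.linalg.svd(cov)
X_pca = X @ U                         # decorrelated: covariance is (nearly) diagonal

# Whitening: also scale each direction to unit variance
X_white = X_pca / np.sqrt(S + 1e-5)   # covariance is (nearly) the identity
```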
Data Preprocessing Before normalization: classification loss very sensitive to changes in weight matrix; hard to optimize. After normalization: less sensitive to small changes in weights; easier to optimize. Justin Johnson Lecture 10 - 39 October 7, 2019
Data Preprocessing for Images e.g. consider CIFAR-10 example with [32,32,3] images
- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
- Subtract per-channel mean and divide by per-channel std (e.g. ResNet) (mean and std along each channel = 3 numbers each)
Not common to do PCA or whitening.
Justin Johnson Lecture 10 - 40 October 7, 2019
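A minimal sketch of the per-channel scheme (ResNet-style), assuming a [N, 32, 32, 3] image array such as CIFAR-10 (random stand-in data used here):

```python
import numpy as np

# Stand-in for CIFAR-10 training images: [N, 32, 32, 3], values in [0, 255]
images = np.random.randint(0, 256, size=(50000, 32, 32, 3)).astype(np.float32)

mean = images.mean(axis=(0, 1, 2))    # per-channel mean: 3 numbers
std = images.std(axis=(0, 1, 2))      # per-channel std: 3 numbers

# Apply the same training-set statistics to both train and test images
images_normalized = (images - mean) / std
```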
Weight Initialization Justin Johnson Lecture 10 - 41 October 7, 2019
Weight Initialization Q : What happens if we initialize all W=0, b=0? Justin Johnson Lecture 10 - 42 October 7, 2019
Weight Initialization Q : What happens if we initialize all W=0, b=0? A : All outputs are 0, all gradients are the same! No “symmetry breaking” Justin Johnson Lecture 10 - 43 October 7, 2019
Weight Initialization Next idea: small random numbers (Gaussian with zero mean, std=0.01) Justin Johnson Lecture 10 - 44 October 7, 2019
Weight Initialization Next idea: small random numbers (Gaussian with zero mean, std=0.01) Works ~okay for small networks, but problems with deeper networks. Justin Johnson Lecture 10 - 45 October 7, 2019
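A sketch of this initialization for one fully-connected layer; the layer sizes Din and Dout are hypothetical:

```python
import numpy as np

Din, Dout = 4096, 4096                    # hypothetical layer sizes
W = 0.01 * np.random.randn(Din, Dout)     # zero-mean Gaussian, std = 0.01
b = np.zeros(Dout)
```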
Weight Initialization: Activation Statistics Forward pass for a 6-layer net with hidden size 4096 Justin Johnson Lecture 10 - 46 October 7, 2019
Weight Initialization: Activation Statistics Forward pass for a 6-layer net with hidden size 4096. All activations tend to zero for deeper network layers. Q: What do the gradients dL/dW look like? Justin Johnson Lecture 10 - 47 October 7, 2019
Weight Initialization: Activation Statistics Forward pass for a 6-layer net with hidden size 4096. All activations tend to zero for deeper network layers. Q: What do the gradients dL/dW look like? A: All zero, no learning =( Justin Johnson Lecture 10 - 48 October 7, 2019
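A minimal numpy sketch of the experiment behind these plots, under the stated setup (6 layers, hidden size 4096, tanh, std = 0.01); the plotting itself is omitted:

```python
import numpy as np

dims = [4096] * 7                          # input + 6 hidden layers, hidden size 4096
x = np.random.randn(16, dims[0])           # a small batch of random inputs
for Din, Dout in zip(dims[:-1], dims[1:]):
    W = 0.01 * np.random.randn(Din, Dout)  # small-Gaussian initialization
    x = np.tanh(x.dot(W))
    print('layer std: %.6f' % x.std())     # the std shrinks toward 0 layer by layer
```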
Weight Initialization: Activation Statistics Increase std of initial weights from 0.01 to 0.05 Justin Johnson Lecture 10 - 49 October 7, 2019
Weight Initialization: Activation Statistics Increase std of initial weights from 0.01 to 0.05. All activations saturate. Q: What do the gradients look like? Justin Johnson Lecture 10 - 50 October 7, 2019
Weight Initialization: Activation Statistics Increase std of initial weights from 0.01 to 0.05. All activations saturate. Q: What do the gradients look like? A: Local gradients all zero, no learning =( Justin Johnson Lecture 10 - 51 October 7, 2019
Weight Initialization: Xavier Initialization “Xavier” initialization: std = 1/sqrt(Din) Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010 Justin Johnson Lecture 10 - 52 October 7, 2019
Weight Initialization: Xavier Initialization “Xavier” initialization: std = 1/sqrt(Din). “Just right”: Activations are nicely scaled for all layers! Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010 Justin Johnson Lecture 10 - 53 October 7, 2019
Weight Initialization: Xavier Initialization “Xavier” initialization: std = 1/sqrt(Din). “Just right”: Activations are nicely scaled for all layers! For conv layers, Din is kernel_size^2 * input_channels. Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010 Justin Johnson Lecture 10 - 54 October 7, 2019
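A numpy sketch of Xavier initialization following the rule above; the layer shapes are hypothetical:

```python
import numpy as np

# Fully-connected layer
Din, Dout = 4096, 4096
W_fc = np.random.randn(Din, Dout) / np.sqrt(Din)          # std = 1/sqrt(Din)

# Conv layer: Din = kernel_size^2 * input_channels
kernel_size, in_channels, out_channels = 3, 64, 128
Din_conv = kernel_size ** 2 * in_channels
W_conv = np.random.randn(out_channels, in_channels,
                         kernel_size, kernel_size) / np.sqrt(Din_conv)
```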
Weight Initialization: Xavier Initialization “Xavier” initialization: std = 1/sqrt(Din)
Derivation: Variance of output = Variance of input.
y = Wx, so y_i = sum_{j=1..Din} x_j w_j
Var(y_i) = Din * Var(x_i w_i)                              [Assume x, w are iid]
         = Din * (E[x_i^2] E[w_i^2] - E[x_i]^2 E[w_i]^2)   [Assume x, w independent]
         = Din * Var(x_i) * Var(w_i)                       [Assume x, w are zero-mean]
If Var(w_i) = 1/Din then Var(y_i) = Var(x_i)
Justin Johnson Lecture 10 - 55 October 7, 2019
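A quick numerical check of this conclusion (not from the slides): with Var(w_i) = 1/Din, the output variance matches the input variance.

```python
import numpy as np

Din, Dout, N = 4096, 4096, 1000
x = np.random.randn(N, Din)                     # unit-variance inputs
W = np.random.randn(Din, Dout) / np.sqrt(Din)   # Var(w) = 1/Din
y = x @ W
print(x.var(), y.var())                         # both are close to 1
```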
Weight Initialization: What about ReLU? Change from tanh to ReLU Justin Johnson Lecture 10 - 60 October 7, 2019
Weight Initialization: What about ReLU? Change from tanh to ReLU. Xavier assumes a zero-centered activation function. Activations collapse to zero again, no learning =( Justin Johnson Lecture 10 - 61 October 7, 2019
Weight Initialization: Kaiming / MSRA Initialization ReLU correction: std = sqrt(2 / Din). “Just right”: activations nicely scaled for all layers. He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015 Justin Johnson Lecture 10 - 62 October 7, 2019
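In PyTorch this correction is available as kaiming_normal_; a sketch for a hypothetical fully-connected layer:

```python
import torch.nn as nn

layer = nn.Linear(4096, 4096)                   # hypothetical fully-connected layer
# Kaiming / MSRA: std = sqrt(2 / Din), derived for ReLU nonlinearities
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)
```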
Weight Initialization: Residual Networks [Residual block: x → conv → relu → conv gives F(x); output is relu(F(x) + x)] If we initialize with MSRA, then Var(F(x)) = Var(x). But then Var(F(x) + x) > Var(x): variance grows with each block! Justin Johnson Lecture 10 - 63 October 7, 2019
Weight Initialization: Residual Networks [Residual block: x → conv → relu → conv gives F(x); output is relu(F(x) + x)] If we initialize with MSRA, then Var(F(x)) = Var(x). But then Var(F(x) + x) > Var(x): variance grows with each block! Solution: Initialize first conv with MSRA, initialize second conv to zero. Then Var(x + F(x)) = Var(x). Zhang et al, “Fixup Initialization: Residual Learning Without Normalization”, ICLR 2019 Justin Johnson Lecture 10 - 64 October 7, 2019
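A sketch of this fix in PyTorch, using a simplified hypothetical residual block (no batch norm); zero-initializing the second conv makes F(x) = 0, so each block acts as the identity at initialization:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        # First conv: MSRA init; second conv: zero init, so F(x) = 0 at the start
        nn.init.kaiming_normal_(self.conv1.weight, nonlinearity='relu')
        nn.init.zeros_(self.conv1.bias)
        nn.init.zeros_(self.conv2.weight)
        nn.init.zeros_(self.conv2.bias)

    def forward(self, x):
        out = self.conv2(F.relu(self.conv1(x)))   # F(x)
        return F.relu(out + x)                    # Var(F(x) + x) = Var(x) at init
```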
Proper initialization is an active area of research:
- Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al, 2013
- Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al, 2015
- Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al, 2015
- All you need is a good init, Mishkin and Matas, 2015
- Fixup Initialization: Residual Learning Without Normalization, Zhang et al, 2019
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019
Justin Johnson Lecture 10 - 65 October 7, 2019
Now your model is training … but it overfits! Regularization Justin Johnson Lecture 10 - 66 October 7, 2019
Regularization: Add a term to the loss. In common use: L2 regularization (weight decay); L1 regularization; Elastic net (L1 + L2). Justin Johnson Lecture 10 - 67 October 7, 2019
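A rough numpy sketch of adding such a term to a scalar data loss; the regularization strengths lam1 and lam2 are hypothetical hyperparameters, and in PyTorch plain L2 is usually applied through the optimizer's weight_decay argument instead:

```python
import numpy as np

def total_loss(data_loss, W, lam1=0.0, lam2=1e-4):
    l1 = np.sum(np.abs(W))     # L1 regularization
    l2 = np.sum(W * W)         # L2 regularization (weight decay)
    # lam1 = 0: pure L2; lam2 = 0: pure L1; both > 0: elastic net
    return data_loss + lam1 * l1 + lam2 * l2
```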
Regularization: Dropout In each forward pass, randomly set some neurons to zero Probability of dropping is a hyperparameter; 0.5 is common Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014 Justin Johnson Lecture 10 - 68 October 7, 2019
Regularization: Dropout Example forward pass with a 3-layer network using dropout Justin Johnson Lecture 10 - 69 October 7, 2019
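The slide's code listing is not reproduced in this transcript; the following is a minimal numpy sketch in the same spirit, applying vanilla dropout (keep probability p = 0.5) to the two hidden layers of a hypothetical 3-layer network (W1..W3, b1..b3 are assumed parameters):

```python
import numpy as np

p = 0.5  # probability of keeping a unit

def train_step(x, W1, b1, W2, b2, W3, b3):
    H1 = np.maximum(0, x.dot(W1) + b1)
    U1 = (np.random.rand(*H1.shape) < p)   # first dropout mask
    H1 *= U1                               # drop!
    H2 = np.maximum(0, H1.dot(W2) + b2)
    U2 = (np.random.rand(*H2.shape) < p)   # second dropout mask
    H2 *= U2                               # drop!
    out = H2.dot(W3) + b3
    return out
```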
Regularization: Dropout Forces the network to have a redundant representation; prevents co-adaptation of features. [Figure: a cat score computed from features such as “has an ear”, “has a tail”, “is furry”, “has claws”, “mischievous look”, with some features randomly dropped (X)] Justin Johnson Lecture 10 - 70 October 7, 2019
Regularization: Dropout Another interpretation: Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model. An FC layer with 4096 units has 2^4096 ~ 10^1233 possible masks! Only ~10^82 atoms in the universe... Justin Johnson Lecture 10 - 71 October 7, 2019
Dropout: Test Time Dropout makes our output random: y = f_W(x, z), where x is the input (image), y is the output (label), and z is a random mask. Want to “average out” the randomness at test time: y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz. But this integral seems hard… Justin Johnson Lecture 10 - 72 October 7, 2019
Dropout: Test Time Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2. Justin Johnson Lecture 10 - 73 October 7, 2019
Dropout: Test Time Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2. At test time we have: E[a] = w1 x + w2 y. Justin Johnson Lecture 10 - 74 October 7, 2019
Dropout: Test Time Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2. At test time we have: E[a] = w1 x + w2 y. During training (dropout probability 1/2) we have: E[a] = 1/4 (w1 x + w2 y) + 1/4 (w1 x + 0·y) + 1/4 (0·x + w2 y) + 1/4 (0·x + 0·y) = 1/2 (w1 x + w2 y). Justin Johnson Lecture 10 - 75 October 7, 2019
Dropout: Test Time Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2. At test time we have: E[a] = w1 x + w2 y. During training (dropout probability 1/2) we have: E[a] = 1/2 (w1 x + w2 y). At test time, drop nothing and multiply by the dropout probability. Justin Johnson Lecture 10 - 76 October 7, 2019
Dropout: Test Time At test time all neurons are active always => We must scale the activations so that for each neuron: output at test time = expected output at training time Justin Johnson Lecture 10 - 77 October 7, 2019
Dropout Summary: drop in forward pass; scale at test time. Justin Johnson Lecture 10 - 78 October 7, 2019
More common: “Inverted dropout” Drop and scale during training; test time is unchanged! Justin Johnson Lecture 10 - 79 October 7, 2019
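A minimal numpy sketch of inverted dropout for one layer: dividing by p during training keeps the expected activation unchanged, so the test-time forward pass needs no modification:

```python
import numpy as np

p = 0.5  # probability of keeping a unit

def dropout_forward(h, train=True):
    if train:
        mask = (np.random.rand(*h.shape) < p) / p   # drop and scale during training
        return h * mask
    return h                                        # test time: unchanged
```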
Dropout architectures Recall AlexNet, VGG have most of their parameters in fully-connected layers; usually Dropout is applied there. [Chart: AlexNet vs VGG-16 parameters per layer (conv1-conv5, fc6-fc8); the fc layers dominate: “Dropout here!”] Justin Johnson Lecture 10 - 80 October 7, 2019
Dropout architectures Recall AlexNet, VGG have most of their parameters in fully-connected layers; usually Dropout is applied there. Later architectures (GoogLeNet, ResNet, etc) use global average pooling instead of fully-connected layers: they don’t use dropout at all! [Chart: AlexNet vs VGG-16 parameters per layer (conv1-conv5, fc6-fc8)] Justin Johnson Lecture 10 - 81 October 7, 2019
Regularization: A common pattern. Training: Add some kind of randomness. Testing: Average out randomness (sometimes approximate). Justin Johnson Lecture 10 - 82 October 7, 2019
Regularization: A common pattern. Training: Add some kind of randomness. Testing: Average out randomness (sometimes approximate). Example: Batch Normalization. Training: Normalize using stats from random minibatches. Testing: Use fixed stats to normalize. Justin Johnson Lecture 10 - 83 October 7, 2019
Regularization: A common pattern. Training: Add some kind of randomness. Testing: Average out randomness (sometimes approximate). Example: Batch Normalization. Training: Normalize using stats from random minibatches. Testing: Use fixed stats to normalize. For ResNet and later, often L2 and Batch Normalization are the only regularizers! Justin Johnson Lecture 10 - 84 October 7, 2019
Data Augmentation Load image and label (“cat”) → CNN → Compute loss. This image by Nikita is licensed under CC-BY 2.0. Justin Johnson Lecture 10 - 85 October 7, 2019
Data Augmentation Load image and label (“cat”) → Transform image → CNN → Compute loss. Justin Johnson Lecture 10 - 86 October 7, 2019
Data Augmentation: Horizontal Flips Justin Johnson Lecture 10 - 87 October 7, 2019
Data Augmentation: Random Crops and Scales Training: sample random crops / scales. ResNet: 1. Pick random L in range [256, 480] 2. Resize training image, short side = L 3. Sample random 224 x 224 patch. Justin Johnson Lecture 10 - 88 October 7, 2019
Data Augmentation: Random Crops and Scales Training: sample random crops / scales. ResNet: 1. Pick random L in range [256, 480] 2. Resize training image, short side = L 3. Sample random 224 x 224 patch. Testing: average a fixed set of crops. ResNet: 1. Resize image at 5 scales: {224, 256, 384, 480, 640} 2. For each size, use 10 224 x 224 crops: 4 corners + center, + flips. Justin Johnson Lecture 10 - 89 October 7, 2019
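A sketch of the training-time transform using torchvision; RandomScaleShortSide is a hypothetical helper written here for illustration, not a torchvision class:

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

class RandomScaleShortSide:
    """Resize so the shorter image side is a random length in [lo, hi] (ResNet-style)."""
    def __init__(self, lo=256, hi=480):
        self.lo, self.hi = lo, hi

    def __call__(self, img):
        return TF.resize(img, random.randint(self.lo, self.hi))

train_transform = T.Compose([
    RandomScaleShortSide(256, 480),   # steps 1-2: random scale of the short side
    T.RandomCrop(224),                # step 3: random 224 x 224 patch
    T.RandomHorizontalFlip(),         # plus horizontal flips (previous slide)
    T.ToTensor(),
])
```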
Data Augmentation: Color Jitter Simple: Randomize contrast and brightness. More complex (used in AlexNet, ResNet, etc): 1. Apply PCA to all [R, G, B] pixels in training set 2. Sample a “color offset” along principal component directions 3. Add offset to all pixels of a training image. Justin Johnson Lecture 10 - 90 October 7, 2019
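A sketch of the PCA-based color offset, roughly following the AlexNet recipe; the 0.1 jitter scale and the random stand-in pixels are assumptions for illustration:

```python
import numpy as np

# pixels: all training-set RGB values, shape [num_pixels, 3] (random stand-in here)
pixels = np.random.rand(100000, 3)
pixels = pixels - pixels.mean(axis=0)
cov = np.cov(pixels, rowvar=False)             # 3x3 covariance of RGB values
eigvals, eigvecs = np.linalg.eigh(cov)         # principal component directions

def color_jitter(image, scale=0.1):
    # Sample a color offset along the principal component directions
    alphas = np.random.normal(0.0, scale, size=3)
    offset = eigvecs @ (alphas * eigvals)
    return image + offset                      # add the same offset to every pixel
```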
Data Augmentation: Get creative for your problem! Random mix/combinations of: translation, rotation, stretching, shearing, lens distortions, … (go crazy) Justin Johnson Lecture 10 - 91 October 7, 2019
Regularization: A common pattern. Training: Add some randomness. Testing: Marginalize over randomness. Examples: Dropout, Batch Normalization, Data Augmentation. Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013 Justin Johnson Lecture 10 - 92 October 7, 2019
Regularization: DropConnect Training: Drop random connections between neurons (set weight = 0). Testing: Use all the connections. Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect. Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013 Justin Johnson Lecture 10 - 93 October 7, 2019
Regularization: Fractional Pooling Training: Use randomized pooling regions. Testing: Average predictions over different samples. Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling. Graham, “Fractional Max Pooling”, arXiv 2014 Justin Johnson Lecture 10 - 94 October 7, 2019
Regularization: Stochastic Depth Training: Skip some residual blocks in ResNet. Testing: Use the whole network. Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth. Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016 Justin Johnson Lecture 10 - 95 October 7, 2019
Regularization: Cutout Training: Set random image regions to 0. Testing: Use the whole image. Works very well for small datasets like CIFAR, less common for large datasets like ImageNet. Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout. DeVries and Taylor, “Improved Regularization of Convolutional Neural Networks with Cutout”, arXiv 2017 Justin Johnson Lecture 10 - 96 October 7, 2019
Regularization: Mixup Training: Train on random blends of images. Testing: Use original images. Randomly blend the pixels of pairs of training images, e.g. 40% cat, 60% dog; the target label is then cat: 0.4, dog: 0.6. Sample the blend weight from a beta distribution Beta(a, b) with a = b ≈ 0, so blend weights are close to 0/1. Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup. Zhang et al, “mixup: Beyond Empirical Risk Minimization”, ICLR 2018 Justin Johnson Lecture 10 - 97 October 7, 2019
Regularization: Mixup Training: Train on random blends of images. Testing: Use original images. Randomly blend the pixels of pairs of training images, e.g. 40% cat, 60% dog; the target label is then cat: 0.4, dog: 0.6. Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup. Zhang et al, “mixup: Beyond Empirical Risk Minimization”, ICLR 2018 Justin Johnson Lecture 10 - 98 October 7, 2019
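A minimal numpy sketch of mixup on one batch, assuming images of shape [N, H, W, C] and one-hot labels; a = b = 0.2 is just a placeholder (the slide suggests values near 0 so blends land close to 0/1):

```python
import numpy as np

def mixup_batch(images, labels_onehot, a=0.2, b=0.2):
    # Blend each image (and its label) with a randomly chosen partner
    lam = np.random.beta(a, b, size=(images.shape[0], 1, 1, 1))   # blend weights
    perm = np.random.permutation(images.shape[0])
    mixed_images = lam * images + (1 - lam) * images[perm]
    lam_flat = lam.reshape(-1, 1)
    mixed_labels = lam_flat * labels_onehot + (1 - lam_flat) * labels_onehot[perm]
    return mixed_images, mixed_labels
```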
Regularization: Mixup Training: Train on random blends of images. Testing: Use original images. Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup.
- Consider dropout for large fully-connected layers
- Batch normalization and data augmentation are almost always a good idea
- Try cutout and mixup, especially for small classification datasets
Zhang et al, “mixup: Beyond Empirical Risk Minimization”, ICLR 2018 Justin Johnson Lecture 10 - 99 October 7, 2019
Summary 1. One time setup (Today): Activation functions, data preprocessing, weight initialization, regularization. 2. Training dynamics (Next time): Learning rate schedules; large-batch training; hyperparameter optimization. 3. After training (Next time): Model ensembles, transfer learning. Justin Johnson Lecture 10 - 100 October 7, 2019