DATA ANALYTICS USING DEEP LEARNING
GT 8803 // FALL 2019 // JOY ARULRAJ
LECTURE #12: TRAINING NEURAL NETWORKS (PT 1)
ADMINISTRIVIA
• Reminders
  – Integration with Eva
  – Code reviews
  – Each team must send Pull Requests to Eva
Where we are now... Hardware + Software: PyTorch, TensorFlow
OVERVIEW
• One-time setup
  – Activation Functions, Preprocessing, Weight Initialization, Regularization, Gradient Checking
• Training dynamics
  – Babysitting the Learning Process, Parameter Updates, Hyperparameter Optimization
• Evaluation
  – Model ensembles, Test-time augmentation
TODAY'S AGENDA
• Training Neural Networks
  – Activation Functions
  – Data Preprocessing
  – Weight Initialization
  – Batch Normalization
ACTIVATION FUNCTIONS
Activation Functions: Sigmoid, tanh, ReLU, Leaky ReLU, ELU, Maxout
Activation Functions: Sigmoid, sigmoid(x) = 1 / (1 + exp(-x))
• Squashes numbers to range [0,1]
• Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron
Sigmoid: 3 problems
1. Saturated neurons "kill" the gradients
Consider a sigmoid gate inside a computational graph, with input x: what happens to the gradient when x = -10? When x = 0? When x = 10?
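A minimal sketch (plain NumPy, not from the slides) of what the sigmoid gate does to the gradient at these three inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [-10.0, 0.0, 10.0]:
    s = sigmoid(x)
    local_grad = s * (1 - s)   # d(sigmoid)/dx
    print(f"x={x:6.1f}  sigmoid={s:.6f}  local grad={local_grad:.6f}")
# At x = -10 and x = 10 the local gradient is ~4.5e-05, so the upstream
# gradient is essentially killed; at x = 0 it reaches its maximum of 0.25.
```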
Sigmoid problem 2: Sigmoid outputs are not zero-centered
Consider what happens when the input to a neuron is always positive: what can we say about the gradients on w?
Answer: the gradients on w are always all positive or all negative :( The allowed gradient update directions are restricted to two quadrants, so reaching a hypothetical optimal w vector requires an inefficient zig-zag path. (This holds for a single element; minibatches help.)
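A minimal sketch (plain NumPy, hypothetical numbers) of why all-positive inputs force same-sign gradients: for f = w·x + b, dL/dw = (dL/df)·x, so every component of dL/dw shares the sign of the scalar upstream gradient whenever all components of x are positive.

```python
import numpy as np

x = np.array([0.7, 2.1, 0.3])    # inputs to the neuron: all positive
upstream = -1.3                  # dL/df flowing back into the neuron

grad_w = upstream * x            # dL/dw for f = w.x + b
print(grad_w)                    # [-0.91 -2.73 -0.39] -> all negative
# Flip the sign of `upstream` and every component becomes positive:
# each update can only move w in an all-positive or all-negative direction.
```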
Sigmoid problem 3: exp() is a bit compute expensive
Activation Functions: tanh(x) [LeCun et al., 1991]
• Squashes numbers to range [-1,1]
• Zero-centered (nice)
• Still kills gradients when saturated :(
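A small companion sketch (plain NumPy) showing that tanh outputs are zero-centered but the gate still saturates:

```python
import numpy as np

for x in [-10.0, 0.0, 10.0]:
    t = np.tanh(x)
    local_grad = 1 - t ** 2            # d(tanh)/dx
    print(f"x={x:6.1f}  tanh={t:+.6f}  local grad={local_grad:.6f}")
# Outputs lie in [-1, 1] and are centered around 0, but for |x| = 10 the
# local gradient is ~8e-09: saturated tanh units still kill gradients.
```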
Activation Functions: ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012]
• Computes f(x) = max(0, x)
• Does not saturate (in + region)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g. 6x)
Problems with ReLU:
• Not zero-centered output
• An annoyance: what is the gradient when x < 0?
Consider a ReLU gate inside a computational graph, with input x: what happens to the gradient when x = -10? When x = 0? When x = 10?
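The same kind of sketch for the ReLU gate (plain NumPy; treating the gradient at exactly x = 0 as 0, which is a common convention):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

for x in [-10.0, 0.0, 10.0]:
    local_grad = 1.0 if x > 0 else 0.0   # subgradient; 0 is used at x = 0
    print(f"x={x:6.1f}  relu={relu(x):5.1f}  local grad={local_grad}")
# x = -10 (and x = 0): the gate passes no gradient at all;
# x = 10: the upstream gradient flows through unchanged.
```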
(Figure: data cloud with an active ReLU region and a dead ReLU region.) A dead ReLU will never activate => never update.
=> people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01), so that units are more likely to start out in the active region and keep updating.
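A minimal sketch of that heuristic (plain NumPy; the layer sizes and the 0.01 weight scale are placeholder assumptions, and this is a trick some people use rather than a guaranteed fix for dead ReLUs):

```python
import numpy as np

def init_relu_layer(fan_in, fan_out, bias_init=0.01):
    # Small random weights plus a slightly positive bias, so that at the start
    # of training most units see positive pre-activations and stay in the
    # active (gradient-passing) region of the ReLU.
    W = 0.01 * np.random.randn(fan_in, fan_out)
    b = np.full(fan_out, bias_init)
    return W, b

W, b = init_relu_layer(4096, 4096)
```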
Activation Functions: Leaky ReLU [Maas et al., 2013] [He et al., 2015]
• Does not saturate
• Computationally efficient
• Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
• Will not "die"
Parametric Rectifier (PReLU): f(x) = max(αx, x); backprop into α (a learned parameter).
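A sketch (plain NumPy) of Leaky ReLU and of a PReLU forward/backward pass where the negative-side slope α is learned; the function names and example values are my own, not from the slides:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def prelu_forward(x, alpha):
    return np.where(x > 0, x, alpha * x)

def prelu_backward(x, alpha, upstream):
    dx = upstream * np.where(x > 0, 1.0, alpha)
    dalpha = np.sum(upstream * np.where(x > 0, 0.0, x))  # "backprop into alpha"
    return dx, dalpha

x = np.array([-2.0, -0.5, 0.0, 1.5])
out = prelu_forward(x, alpha=0.25)                  # [-0.5, -0.125, 0.0, 1.5]
dx, dalpha = prelu_backward(x, 0.25, np.ones_like(x))
```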
Activation Functions: Exponential Linear Units (ELU) [Clevert et al., 2015]
• All benefits of ReLU
• Closer to zero-mean outputs
• Negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
• Computation requires exp()
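A short ELU sketch in plain NumPy (α = 1.0 is the common default assumed here):

```python
import numpy as np

def elu(x, alpha=1.0):
    # Linear for x > 0 (like ReLU); saturates smoothly toward -alpha for x << 0,
    # which keeps outputs closer to zero mean but costs an exp() per element.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(elu(np.array([-10.0, -1.0, 0.0, 3.0])))
# ~[-0.99995, -0.632, 0.0, 3.0]
```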
Maxout "Neuron" [Goodfellow et al., 2013]
• Does not have the basic form of dot product -> nonlinearity
• Generalizes ReLU and Leaky ReLU
• Linear regime! Does not saturate! Does not die!
• Problem: doubles the number of parameters per neuron :(
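A sketch (plain NumPy, hypothetical layer sizes) of a maxout unit with two linear pieces, which is exactly why the parameter count per neuron doubles:

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # Max over two affine pieces. ReLU is recovered when the second piece is
    # the zero function (W2 = 0, b2 = 0); Leaky ReLU when the second piece is
    # a scaled copy of the first.
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

x = np.random.randn(8, 100)                       # batch of 8, 100 features
W1, W2 = np.random.randn(100, 50), np.random.randn(100, 50)
b1, b2 = np.zeros(50), np.zeros(50)
out = maxout(x, W1, b1, W2, b2)                   # shape (8, 50)
```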
TLDR: In practice:
• Use ReLU. Be careful with your learning rates
• Try out Leaky ReLU / Maxout / ELU
• Try out tanh but don't expect much
• Don't use sigmoid
DATA PREPROCESSING
DATA PREPROCESSING (Assume X [N x D] is the data matrix, with each example in a row)
Remember: when the input to a neuron is always positive, the gradients on w are always all positive or all negative, forcing a zig-zag path toward the optimal w vector (this is also why you want zero-mean data!)
DATA PREPROCESSING: In practice, you may also see PCA and Whitening of the data (decorrelated data has a diagonal covariance matrix; whitened data has the identity covariance matrix)
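A sketch of these steps in plain NumPy (X is the N x D data matrix from the earlier slide; the data here is a random stand-in, and the small epsilon in the whitening step is an added safeguard against division by zero):

```python
import numpy as np

X = np.random.randn(1000, 50)          # stand-in data: N = 1000, D = 50

# Zero-center, then normalize each dimension.
X -= np.mean(X, axis=0)
X /= np.std(X, axis=0)

# PCA: rotate into the eigenbasis of the covariance (diagonal covariance).
cov = (X.T @ X) / X.shape[0]
U, S, Vt = np.linalg.svd(cov)
Xdecorr = X @ U

# Whitening: scale each dimension by 1/sqrt(eigenvalue) (identity covariance).
Xwhite = Xdecorr / np.sqrt(S + 1e-5)
```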
DATA PREPROCESSING: Before normalization, the classification loss is very sensitive to changes in the weight matrix; hard to optimize. After normalization, the loss is less sensitive to small changes in the weights; easier to optimize.
TLDR: In practice for images: center only. E.g. consider CIFAR-10 with [32,32,3] images:
• Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
• Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
• Subtract per-channel mean and divide by per-channel std (e.g. ResNet) (mean and std along each channel = 3 numbers each)
• Not common to do PCA or whitening
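A sketch of the three centering variants on CIFAR-10-shaped stand-in data (plain NumPy; in practice the statistics computed on the training set are reused at test time):

```python
import numpy as np

X_train = np.random.rand(50000, 32, 32, 3)   # stand-in for CIFAR-10 images

# (1) Subtract the mean image (AlexNet-style): mean has shape (32, 32, 3).
mean_image = X_train.mean(axis=0)
X1 = X_train - mean_image

# (2) Subtract the per-channel mean (VGGNet-style): 3 numbers.
channel_mean = X_train.mean(axis=(0, 1, 2))
X2 = X_train - channel_mean

# (3) Per-channel mean and std (ResNet-style): 3 + 3 numbers.
channel_std = X_train.std(axis=(0, 1, 2))
X3 = (X_train - channel_mean) / channel_std
```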
WEIGHT INITIALIZATION
Q: What happens when W = constant init is used?
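A small sketch (plain NumPy, hypothetical two-layer ReLU net) suggesting the answer: with a constant W, every hidden unit computes the same output and receives the same gradient, so the symmetry between units is never broken.

```python
import numpy as np

N, D, H = 4, 3, 5
x = np.random.randn(N, D)
W1 = np.full((D, H), 0.05)                 # constant init
W2 = np.full((H, 1), 0.05)

h = np.maximum(0, x @ W1)                  # every column of h is identical
out = h @ W2
dout = np.ones((N, 1))                     # pretend upstream gradient
dW2 = h.T @ dout                           # all entries equal
dh = (dout @ W2.T) * (h > 0)
dW1 = x.T @ dh                             # every column (one per hidden unit) identical
print(np.allclose(h, h[:, :1]), np.allclose(dW1, dW1[:, :1]))  # True True
```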
First idea: Small random numbers (Gaussian with zero mean and 1e-2 standard deviation). Works ~okay for small networks, but problems with deeper networks.
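For reference, a one-line sketch of this initialization in plain NumPy (layer sizes are placeholders):

```python
import numpy as np

def init_small_gaussian(fan_in, fan_out, std=0.01):
    # Zero-mean Gaussian with a small standard deviation.
    return std * np.random.randn(fan_in, fan_out)

W = init_small_gaussian(4096, 4096)
```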
Weight Initialization: Activation statistics. Forward pass for a 6-layer net with hidden size 4096.
All activations tend to zero for deeper network layers. Q: What do the gradients dL/dW look like? A: All zero, no learning =(
Increase the std of the initial weights from 0.01 to 0.05: all activations saturate. Q: What do the gradients look like?
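A hedged reconstruction (plain NumPy) of the experiment behind these slides: a 6-layer tanh net with hidden size 4096, with weights drawn from a zero-mean Gaussian of std 0.01 or 0.05. With std 0.01 the activation statistics collapse toward zero layer by layer; with std 0.05 the activations pile up near ±1, so the local tanh gradients are again close to zero.

```python
import numpy as np

def activation_stats(weight_scale, num_layers=6, hidden=4096, batch=16):
    x = np.random.randn(batch, hidden)
    for layer in range(num_layers):
        W = weight_scale * np.random.randn(hidden, hidden)
        x = np.tanh(x @ W)
        print(f"scale={weight_scale}  layer {layer + 1}: "
              f"mean={x.mean():+.4f}  std={x.std():.4f}")

activation_stats(0.01)   # stds shrink toward 0 layer by layer -> vanishing gradients
activation_stats(0.05)   # activations pile up near ±1 -> tanh saturates, gradients near zero
```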