  1. DATA ANALYTICS USING DEEP LEARNING (GT 8803 // FALL 2019 // JOY ARULRAJ) LECTURE #12: TRAINING NEURAL NETWORKS (PT 1)

  2. administrivia • Reminders – Integration with Eva – Code reviews – Each team must send Pull Requests to Eva

  3. Where we are now... Hardware + Software: PyTorch, TensorFlow

  4. OVERVIEW • One-time setup – Activation Functions, Preprocessing, Weight Initialization, Regularization, Gradient Checking • Training dynamics – Babysitting the Learning Process, Parameter updates, Hyperparameter Optimization • Evaluation – Model ensembles, Test-time augmentation

  5. TODAY'S AGENDA • Training Neural Networks – Activation Functions – Data Preprocessing – Weight Initialization – Batch Normalization

  6. ACTIVATION FUNCTIONS

  7. Activation Functions

  8. Activation Functions (figure: plots of Sigmoid, tanh, ReLU, Leaky ReLU, ELU, and Maxout)

  9. Activation Functions: Sigmoid, σ(x) = 1 / (1 + e^(-x)) • Squashes numbers to range [0,1] • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

  10. Activation Functions: Sigmoid • Squashes numbers to range [0,1] • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron • 3 problems: 1. Saturated neurons “kill” the gradients

  11. Consider a sigmoid gate in the computational graph with scalar input x. What happens to the gradient flowing back through it when x = -10? When x = 0? When x = 10?
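
A minimal NumPy sketch (mine, not the lecture's) of what happens at those three inputs; it uses the standard sigmoid derivative σ(x)(1 - σ(x)):

```python
import numpy as np

def sigmoid(x):
    # squashes x into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

for x in [-10.0, 0.0, 10.0]:
    s = sigmoid(x)
    local_grad = s * (1.0 - s)   # d(sigmoid)/dx
    print(f"x = {x:6.1f}   sigmoid = {s:.5f}   local gradient = {local_grad:.5f}")

# At x = -10 and x = 10 the local gradient is ~0.00005: the saturated neuron
# "kills" whatever gradient arrives from upstream. At x = 0 it is only 0.25.
```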

  12. Activation Functions: Sigmoid • Squashes numbers to range [0,1] • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron • 3 problems: 1. Saturated neurons “kill” the gradients 2. Sigmoid outputs are not zero-centered

  13. Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w?

  14. Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :( (figure: the allowed gradient update directions span only two quadrants, so reaching a hypothetical optimal w vector requires a zig-zag path)

  15. Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :( (For a single element! Minibatches help) (figure: the allowed gradient update directions span only two quadrants, forcing a zig-zag path toward a hypothetical optimal w vector)
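
A small sketch (not from the slides) of why this happens for a single neuron f = w^T x + b: the local gradient of f with respect to w is just x, so if every input component is positive, all components of dL/dw share the sign of the upstream gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(5) + 0.1        # inputs that are always positive (e.g. sigmoid outputs)
w = rng.standard_normal(5)

f = w @ x                      # neuron pre-activation (bias omitted for brevity)
upstream = -1.3                # some upstream gradient dL/df

dL_dw = upstream * x           # local gradient of f w.r.t. w is x itself
print(dL_dw)                   # every entry has the same sign as `upstream`
# So a single example can only push w in the all-positive or all-negative
# direction, producing the zig-zag path; averaging over a minibatch helps.
```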

  16. Activation Functions: Sigmoid • Squashes numbers to range [0,1] • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron • 3 problems: 1. Saturated neurons “kill” the gradients 2. Sigmoid outputs are not zero-centered 3. exp() is a bit compute expensive

  17. Activation Functions: tanh(x) [LeCun et al., 1991] • Squashes numbers to range [-1,1] • Zero-centered (nice) • Still kills gradients when saturated :(

  18. Activation Functions: ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012] • Computes f(x) = max(0, x) • Does not saturate (in + region) • Very computationally efficient • Converges much faster than sigmoid/tanh in practice (e.g. 6x)

  19. Activation Functions: ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012] • Computes f(x) = max(0, x) • Does not saturate (in + region) • Very computationally efficient • Converges much faster than sigmoid/tanh in practice (e.g. 6x) • Not zero-centered output

  20. Activation Functions: ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012] • Computes f(x) = max(0, x) • Does not saturate (in + region) • Very computationally efficient • Converges much faster than sigmoid/tanh in practice (e.g. 6x) • Not zero-centered output

  21. Activation Functions: ReLU (Rectified Linear Unit) • Computes f(x) = max(0, x) • Does not saturate (in + region) • Very computationally efficient • Converges much faster than sigmoid/tanh in practice (e.g. 6x) • Not zero-centered output • An annoyance (hint: what is the gradient when x < 0?)

  22. Consider a ReLU gate in the computational graph with scalar input x. What happens to the gradient flowing back through it when x = -10? When x = 0? When x = 10?
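
The same kind of sketch as before, this time for the ReLU gate (again mine, not the lecture's):

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # local gradient of max(0, x); the kink at x = 0 is conventionally given gradient 0
    return 1.0 if x > 0 else 0.0

for x in [-10.0, 0.0, 10.0]:
    print(f"x = {x:6.1f}   relu = {relu(x):5.1f}   local gradient = {relu_grad(x):.1f}")

# For any x < 0 the gate passes exactly zero gradient backward, which is the
# "annoyance" hinted at on slide 21 and the cause of the dead ReLUs on the next slides.
```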

  23. (figure: a data cloud with an active ReLU region and a dead ReLU region) A dead ReLU will never activate => never update

  24. (figure: a data cloud with an active ReLU region and a dead ReLU region) A dead ReLU will never activate => never update => people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
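
A one-line illustration of the initialization trick mentioned here (a sketch with my own variable names, not the course's code): give a ReLU layer a slightly positive bias so it starts out in the active region for typical inputs.

```python
import numpy as np

fan_in, fan_out = 512, 256
W = 0.01 * np.random.randn(fan_in, fan_out)   # small random weights
b = np.full(fan_out, 0.01)                    # slightly positive biases (e.g. 0.01)
# With b = 0.01, a roughly zero-mean input still yields a slightly positive
# pre-activation, making it less likely that a unit starts (and stays) dead.
```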

  25. Activation Functions: Leaky ReLU [Maas et al., 2013] [He et al., 2015] • f(x) = max(0.01x, x) • Does not saturate • Computationally efficient • Converges much faster than sigmoid/tanh in practice! (e.g. 6x) • Will not “die”.

  26. Activation Functions: Leaky ReLU [Maas et al., 2013] [He et al., 2015] • f(x) = max(0.01x, x) • Does not saturate • Computationally efficient • Converges much faster than sigmoid/tanh in practice! (e.g. 6x) • Will not “die”. • Parametric Rectifier (PReLU): f(x) = max(αx, x); backprop into α (parameter)
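
A NumPy sketch of both variants (helper names are mine): Leaky ReLU with a fixed slope, and a PReLU backward pass that also produces a gradient for the learnable α.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # f(x) = max(alpha * x, x); alpha = 0.01 gives the usual Leaky ReLU
    return np.where(x > 0, x, alpha * x)

def prelu_backward(x, alpha, upstream):
    # PReLU treats alpha as a parameter, so backprop yields gradients
    # for the input x AND for alpha.
    dx = upstream * np.where(x > 0, 1.0, alpha)
    dalpha = np.sum(upstream * np.where(x > 0, 0.0, x))
    return dx, dalpha

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))                                           # no zero-gradient region
print(prelu_backward(x, alpha=0.1, upstream=np.ones_like(x)))  # includes dL/dalpha
```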

  27. Activation Functions: Exponential Linear Units (ELU) [Clevert et al., 2015] • All benefits of ReLU • Closer to zero mean outputs • Negative saturation regime compared with Leaky ReLU adds some robustness to noise • Computation requires exp()
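
For reference, the ELU of Clevert et al. is f(x) = x for x > 0 and α(exp(x) - 1) otherwise; a small sketch showing the negative saturation and the exp() cost noted on the slide:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x > 0: identity (all the benefits of ReLU)
    # x <= 0: alpha * (exp(x) - 1), which saturates at -alpha instead of growing linearly
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(x))   # negative inputs saturate toward -1.0; note this needs exp()
```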

  28. Maxout “Neuron” [Goodfellow et al., 2013] • Does not have the basic form of dot product -> nonlinearity; computes max(w1^T x + b1, w2^T x + b2) • Generalizes ReLU and Leaky ReLU • Linear regime! Does not saturate! Does not die! • Problem: doubles the number of parameters/neuron :(
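
A sketch of one maxout layer (sizes and names are mine): it keeps two full sets of weights and biases and takes an elementwise max of the two linear maps, which is why the parameter count per neuron doubles.

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 8, 16                                   # input dim, number of maxout "neurons"

# Two full sets of weights/biases -> 2x the parameters of a ReLU layer of the same width.
W1, b1 = rng.standard_normal((D, H)), np.zeros(H)
W2, b2 = rng.standard_normal((D, H)), np.zeros(H)

def maxout(x):
    # elementwise max of two linear functions; with W2 = 0 and b2 = 0 this reduces to ReLU
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

x = rng.standard_normal((4, D))                # minibatch of 4 examples
print(maxout(x).shape)                         # (4, 16); piecewise linear, never saturates
```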

  29. TLDR: In practice • Use ReLU. Be careful with your learning rates • Try out Leaky ReLU / Maxout / ELU • Try out tanh but don’t expect much • Don’t use sigmoid

  30. DATA PREPROCESSING

  31. DATA PREPROCESSING (Assume X [N x D] is the data matrix, each example in a row)
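
The figures on this slide show the data being zero-centered and then normalized per feature; a minimal NumPy sketch of those two steps on an X of shape [N x D]:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 2))    # toy data matrix, one example per row

X_zero_centered = X - X.mean(axis=0)                  # subtract the per-feature mean
X_normalized = X_zero_centered / X_zero_centered.std(axis=0)   # divide by per-feature std

print(X_normalized.mean(axis=0))   # ~[0, 0]
print(X_normalized.std(axis=0))    # ~[1, 1]
```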

  32. Remember: Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative :( (this is also why you want zero-mean data!) (figure: the allowed gradient update directions span only two quadrants, forcing a zig-zag path toward a hypothetical optimal w vector)

  33. DATA PREPROCESSING (Assume X [N x D] is the data matrix, each example in a row)

  34. DATA PREPROCESSING • In practice, you may also see PCA and Whitening of the data • Decorrelated data: data has a diagonal covariance matrix • Whitened data: covariance matrix is the identity matrix
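
One common way to implement the decorrelation and whitening mentioned here, via an eigendecomposition of the covariance matrix (a sketch; the slides do not prescribe a particular implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[2.0, -1.0], cov=[[3.0, 1.2], [1.2, 1.0]], size=2000)

Xc = X - X.mean(axis=0)                   # zero-center first
cov = Xc.T @ Xc / Xc.shape[0]             # D x D covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigendecomposition (cov is symmetric)

X_decorrelated = Xc @ eigvecs                            # PCA: diagonal covariance
X_whitened = X_decorrelated / np.sqrt(eigvals + 1e-5)    # whitening: ~identity covariance

print(np.round(np.cov(X_decorrelated, rowvar=False), 2))   # ~diagonal
print(np.round(np.cov(X_whitened, rowvar=False), 2))       # ~identity
```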

  35. DATA PREPROCESSING • Before normalization: classification loss very sensitive to changes in weight matrix; hard to optimize • After normalization: less sensitive to small changes in weights; easier to optimize

  36. TLDR: In practice for images: center only. E.g. consider CIFAR-10 with [32,32,3] images • Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array) • Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers) • Subtract per-channel mean and divide by per-channel std (e.g. ResNet) (mean along each channel = 3 numbers) • Not common to do PCA or whitening
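
A sketch of the three options on a CIFAR-10-shaped array (random data standing in for the real images):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(500, 32, 32, 3)).astype(np.float32)  # N x 32 x 32 x 3

# 1) Subtract the mean image (e.g. AlexNet); the mean image is a [32, 32, 3] array
mean_image = X_train.mean(axis=0)
X1 = X_train - mean_image

# 2) Subtract the per-channel mean (e.g. VGGNet); 3 numbers
channel_mean = X_train.mean(axis=(0, 1, 2))
X2 = X_train - channel_mean

# 3) Subtract per-channel mean and divide by per-channel std (e.g. ResNet); 3 + 3 numbers
channel_std = X_train.std(axis=(0, 1, 2))
X3 = (X_train - channel_mean) / channel_std

print(mean_image.shape, channel_mean.shape, float(X3.mean()), float(X3.std()))
```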

  37. WEIGHT INITIALIZATION

  38. Q: What happens when W = constant init is used?
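
The slide leaves this as a question; the usual answer is that a constant W never breaks symmetry: every hidden unit computes the same value and receives the same gradient. A toy two-layer sketch (all names and sizes are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))     # minibatch of 4 examples, 10 features
W1 = np.full((10, 6), 0.05)          # constant init: every weight identical
W2 = np.full((6, 1), 0.05)

h = np.maximum(0.0, x @ W1)          # hidden layer (ReLU)
print(np.allclose(h, h[:, [0]]))     # True: all 6 hidden units compute the same thing

# One backprop step for a dummy loss L = sum(h @ W2):
dh = np.ones((4, 1)) @ W2.T          # dL/dh, identical across hidden units
dW1 = x.T @ (dh * (h > 0))           # dL/dW1
print(np.allclose(dW1, dW1[:, [0]])) # True: identical gradients, so the units never diverge
```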

  39. First idea: Small random numbers (Gaussian with zero mean and 1e-2 standard deviation)

  40. First idea: Small random numbers (Gaussian with zero mean and 1e-2 standard deviation). Works ~okay for small networks, but problems with deeper networks.
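
In code, this first idea is a single line (Din/Dout are hypothetical layer sizes, not from the text):

```python
import numpy as np

Din, Dout = 4096, 4096                   # fan-in and fan-out of one layer (hypothetical)
W = 0.01 * np.random.randn(Din, Dout)    # Gaussian, zero mean, std = 1e-2
b = np.zeros(Dout)
```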

  41. Weight Initialization: Activation statistics • Forward pass for a 6-layer net with hidden size 4096

  42. Weight Initialization: Activation statistics • Forward pass for a 6-layer net with hidden size 4096 • All activations tend to zero for deeper network layers • Q: What do the gradients dL/dW look like?

  43. Weight Initialization: Activation statistics • Forward pass for a 6-layer net with hidden size 4096 • All activations tend to zero for deeper network layers • Q: What do the gradients dL/dW look like? • A: All zero, no learning =(
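
A sketch of the kind of experiment behind these plots, assuming tanh nonlinearities and the 0.01-std Gaussian init above (the exact script used in the lecture is not reproduced in this text):

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [4096] * 7                         # input plus 6 layers, hidden size 4096
x = rng.standard_normal((16, dims[0]))    # small batch of inputs

for layer, (din, dout) in enumerate(zip(dims[:-1], dims[1:]), start=1):
    W = 0.01 * rng.standard_normal((din, dout))   # small random init
    x = np.tanh(x @ W)                            # forward through one layer
    print(f"layer {layer}: mean {x.mean():+.6f}   std {x.std():.6f}")

# The std of the activations collapses toward zero with depth; since dL/dW for a layer
# is proportional to that layer's input, those gradients collapse toward zero as well.
```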

  44. Weight Initialization: Activation statistics • Increase std of initial weights from 0.01 to 0.05

  45. Weight Initialization: Activation statistics • Increase std of initial weights from 0.01 to 0.05 • All activations saturate • Q: What do the gradients look like?
