
Lecture 7 - April 22, 2019



  1. Fei-Fei Li & Justin Johnson & Serena Yeung - Lecture 7 - April 22, 2019

  2. Administrative: Project Proposal. Due tomorrow, 4/24, on GradeScope. One person per group needs to submit, but tag all group members.

  3. Administrative: Alternate Midterm. See Piazza for the form to request an alternate midterm time or other midterm accommodations. Alternate midterm requests are due Thursday!

  4. Administrative: A2. A2 is out, due Wednesday 5/1. We recommend using Google Cloud for the assignment, especially if your local machine uses Windows.

  5. Where we are now... Computational graphs: inputs x and W feed a multiply node that produces the scores s, the scores feed a hinge-loss node, and the hinge loss is added to a regularization term R(W) to give the total loss L.
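
As a rough sketch of that graph (the sizes, the class index, and the L2 form of the regularizer below are our own illustrative choices, not values from the slide):

```python
import numpy as np

# Sketch of the computational graph on the slide: scores s = W x, a multiclass
# hinge (SVM) loss on the scores, and a regularization term R(W), summed into
# the total loss L. Sizes and the L2 form of R are assumptions for illustration.
W = 0.01 * np.random.randn(10, 3072)        # weights
x = np.random.randn(3072)                   # one input example
y = 3                                       # correct class index (illustrative)

s = W @ x                                   # scores
margins = np.maximum(0.0, s - s[y] + 1.0)   # hinge losses per class (margin 1)
margins[y] = 0.0                            # no loss for the correct class
data_loss = margins.sum()
reg_loss = 0.5 * np.sum(W * W)              # R(W): L2 regularization
L = data_loss + reg_loss                    # total loss
print(L)
```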

  6. Where we are now... Neural Networks. Linear score function: f = W x. 2-layer Neural Network: f = W2 max(0, W1 x), where the input x has 3072 dimensions, the hidden layer h has 100 units, and the scores s have 10 dimensions.
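
A minimal NumPy sketch of that 2-layer network's forward pass, using the sizes from the slide (the random weights are placeholders, not trained values):

```python
import numpy as np

# 2-layer network forward pass with the slide's sizes: 3072-dim input,
# 100 hidden units, 10 class scores.
x = np.random.randn(3072)          # input image, flattened (e.g. 32*32*3)
W1 = np.random.randn(100, 3072)    # first-layer weights
W2 = np.random.randn(10, 100)      # second-layer weights

h = np.maximum(0.0, W1 @ x)        # hidden layer: max(0, W1 x)
s = W2 @ h                         # class scores: W2 h
print(s.shape)                     # (10,)
```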

  7. Where we are now... Convolutional Neural Networks. Illustration of LeCun et al. 1998, from CS231n 2017 Lecture 1.

  8. Where we are now... Convolutional Layer: a 5x5x3 filter is convolved (slid) over all spatial locations of a 32x32x3 image, producing a 28x28x1 activation map.

  9. Where we are now... Convolutional Layer: for example, if we had 6 separate 5x5 filters, we'd get 6 separate activation maps. We stack these up to get a "new image" of size 28x28x6!
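
A quick PyTorch sketch (the framework choice is ours, not the slide's) confirming the shape arithmetic:

```python
import torch
import torch.nn as nn

# 6 filters of size 5x5 over a 3-channel 32x32 input give a 28x28x6 output:
# output side = 32 - 5 + 1 = 28 with stride 1 and no padding.
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
x = torch.randn(1, 3, 32, 32)      # one 32x32x3 image (NCHW layout)
out = conv(x)
print(out.shape)                   # torch.Size([1, 6, 28, 28])
```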

  10. Where we are now... Learning network parameters through optimization. (Landscape image and walking-man image are CC0 1.0 public domain.)

  11. Where we are now... Mini-batch SGD. Loop: 1. Sample a batch of data; 2. Forward prop it through the graph (network) to get the loss; 3. Backprop to calculate the gradients; 4. Update the parameters using the gradient.
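
As a sketch, the same loop written in PyTorch with a toy model and random stand-in data (all names, sizes, and the learning rate here are illustrative):

```python
import torch
import torch.nn as nn

# Minimal mini-batch SGD loop following the four steps on the slide.
model = nn.Sequential(nn.Linear(3072, 100), nn.ReLU(), nn.Linear(100, 10))
loss_fn = nn.CrossEntropyLoss()
lr = 1e-3

for step in range(100):
    # 1. Sample a batch of data (random stand-in for a real dataset)
    x = torch.randn(64, 3072)
    y = torch.randint(0, 10, (64,))
    # 2. Forward prop through the network, get loss
    loss = loss_fn(model(x), y)
    # 3. Backprop to calculate the gradients
    model.zero_grad()
    loss.backward()
    # 4. Update the parameters using the gradient
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad
```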

  12. Where we are now... Hardware + Software: PyTorch, TensorFlow.

  13. Next: Training Neural Networks

  14. Overview. 1. One-time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking. 2. Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization. 3. Evaluation: model ensembles, test-time augmentation.

  15. Part 1: Activation Functions, Data Preprocessing, Weight Initialization, Batch Normalization, Babysitting the Learning Process, Hyperparameter Optimization.

  16. Activation Functions

  17. Activation Functions

  18. Activation Functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU.
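
For reference, a NumPy sketch of these activations; the slide only shows the plots, so the slope/scale parameters and the two-piece Maxout form below are the usual conventions, not values taken from the slide:

```python
import numpy as np

# Common definitions of the activation functions listed on the slide (a sketch).
def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def tanh(x):               return np.tanh(x)
def relu(x):               return np.maximum(0.0, x)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1.0))
def maxout(x, w1, b1, w2, b2):
    # Maxout as the max of two linear functions of x (elementwise illustration)
    return np.maximum(w1 * x + b1, w2 * x + b2)
```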

  19. Activation Functions: Sigmoid, σ(x) = 1 / (1 + e^(-x)). Squashes numbers to range [0,1]. Historically popular since it has a nice interpretation as a saturating "firing rate" of a neuron.

  20. Activation Functions: Sigmoid. Squashes numbers to range [0,1]. Historically popular since it has a nice interpretation as a saturating "firing rate" of a neuron. 3 problems: 1. Saturated neurons "kill" the gradients.

  21. Consider a sigmoid gate in a computational graph, receiving input x. What happens when x = -10? What happens when x = 0? What happens when x = 10? (See the numeric check below.)
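
A quick numeric check of the sigmoid gate's local gradient at those three inputs, using the standard gradient formula σ'(x) = σ(x)(1 - σ(x)):

```python
import numpy as np

# At x = -10 and x = +10 the sigmoid is saturated and its local gradient is ~0,
# so almost no gradient flows back; at x = 0 the gradient is at its maximum, 0.25.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-10.0, 0.0, 10.0):
    grad = sigmoid(x) * (1.0 - sigmoid(x))
    print(f"x = {x:+.0f}: local gradient = {grad:.6f}")
# x = -10: local gradient = 0.000045
# x = +0: local gradient = 0.250000
# x = +10: local gradient = 0.000045
```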

  22. Activation Functions: Sigmoid. Squashes numbers to range [0,1]. Historically popular since it has a nice interpretation as a saturating "firing rate" of a neuron. 3 problems: 1. Saturated neurons "kill" the gradients. 2. Sigmoid outputs are not zero-centered.

  23. Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w?

  24. Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? They are always all positive or all negative :( (The figure shows the two allowed gradient update directions, a hypothetical optimal w vector, and the resulting zig-zag path toward it.)

  25. Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? They are always all positive or all negative :( (For a single element! Minibatches help.)
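
A tiny numeric illustration of why this happens (our own example, not from the slide): for f = w·x + b, the gradient on each w_i is the upstream gradient times x_i, so if every x_i > 0 all components of the update share one sign:

```python
import numpy as np

# dL/dw_i = (dL/df) * x_i. With all x_i > 0 (e.g. sigmoid outputs from the
# previous layer), every component of dL/dw has the sign of dL/df, so updates
# to w are forced to be all-positive or all-negative.
x = np.array([0.5, 1.2, 3.0])        # always-positive inputs
for dL_df in (+2.0, -2.0):           # upstream gradient of either sign
    dL_dw = dL_df * x
    print(dL_df, dL_dw)              # every component has the same sign as dL/df
```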

  26. Activation Functions: Sigmoid. Squashes numbers to range [0,1]. Historically popular since it has a nice interpretation as a saturating "firing rate" of a neuron. 3 problems: 1. Saturated neurons "kill" the gradients. 2. Sigmoid outputs are not zero-centered. 3. exp() is a bit compute expensive.

  27. Activation Functions: tanh(x) [LeCun et al., 1991]. Squashes numbers to range [-1,1]. Zero-centered (nice). Still kills gradients when saturated :(

  28. Activation Functions: ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012]. Computes f(x) = max(0, x). Does not saturate (in the + region). Very computationally efficient. Converges much faster than sigmoid/tanh in practice (e.g. 6x).

  29. Activation Functions: ReLU (Rectified Linear Unit). Computes f(x) = max(0, x). Does not saturate (in the + region). Very computationally efficient. Converges much faster than sigmoid/tanh in practice (e.g. 6x). Not zero-centered output.

  30. Activation Functions: ReLU (Rectified Linear Unit). Computes f(x) = max(0, x). Does not saturate (in the + region). Very computationally efficient. Converges much faster than sigmoid/tanh in practice (e.g. 6x). Not zero-centered output. An annoyance: hint, what is the gradient when x < 0?
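
A quick check of that hint (a minimal sketch): the gradient of max(0, x) is exactly 0 for x < 0, so a neuron that only ever receives negative inputs gets no gradient and never updates:

```python
# The local gradient of f(x) = max(0, x) is 1 for x > 0 and 0 for x < 0.
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0   # local gradient of max(0, x)

for x in (-10.0, -0.5, 0.5, 10.0):
    print(f"x = {x:+5.1f}: relu = {relu(x):5.1f}, gradient = {relu_grad(x):.0f}")
```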
