Lecture 6: Training Neural Networks, Part 2
Fei-Fei Li & Andrej Karpathy & Justin Johnson
25 Jan 2016
Administrative
A2 is out. It’s meaty. It’s due Feb 5 (next Friday). You’ll implement:
- Neural Nets (with a Layer forward/backward API)
- Batch Normalization
- Dropout
- ConvNets
Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get the loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
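A minimal numpy-style sketch of this loop. The helpers `sample_batch`, `forward`, and `backward`, and the `model` object, are hypothetical stand-ins for the data loader and the layer forward/backward API, not code from the assignment.

```python
# Hypothetical helpers: sample_batch, forward, backward, and model stand in
# for the data loader and the layer forward/backward API (assumed names).
learning_rate = 1e-3

while True:
    X_batch, y_batch = sample_batch(train_data, batch_size=256)   # 1. sample a batch
    loss, cache = forward(model, X_batch, y_batch)                 # 2. forward prop, get loss
    grads = backward(model, cache)                                 # 3. backprop the gradients
    for name in model.params:                                      # 4. gradient descent update
        model.params[name] -= learning_rate * grads[name]
```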
Activation Functions
- Sigmoid
- tanh: tanh(x)
- ReLU: max(0, x)
- Leaky ReLU: max(0.1x, x)
- Maxout
- ELU
Data Preprocessing
Weight Initialization
“Xavier initialization” [Glorot et al., 2010]: a reasonable initialization. (The mathematical derivation assumes linear activations.)
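A small numpy sketch of this initialization for one fully-connected layer; the layer sizes below are illustrative.

```python
import numpy as np

fan_in, fan_out = 512, 256   # illustrative layer sizes (assumed)
# Xavier initialization: scale by 1/sqrt(fan_in) so the variance of the
# activations is roughly preserved from layer to layer (the derivation
# assumes linear activations).
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
b = np.zeros(fan_out)
```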
Batch Normalization [Ioffe and Szegedy, 2015]
Normalize using the mini-batch statistics: x_hat = (x - E[x]) / sqrt(Var[x])
And then allow the network to squash the range if it wants to: y = gamma * x_hat + beta
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
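A minimal numpy sketch of the training-time forward pass (per-feature statistics over the mini-batch); running averages for test time are omitted.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) mini-batch. Training-time forward pass only."""
    mu = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                     # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # learnable scale/shift can undo the squash
```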
Babysitting the learning process; cross-validation.
Loss barely changing: the learning rate is probably too low.
TODO
- Parameter update schemes
- Learning rate schedules
- Dropout
- Gradient checking
- Model ensembles
Parameter Updates
Training a neural network, main loop: so far we have used the simple gradient descent update. Now let’s complicate it.
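The simple gradient descent update in numpy, on a toy parameter vector (the values here are illustrative).

```python
import numpy as np

x = np.zeros(10)              # toy parameter vector
dx = np.random.randn(10)      # its gradient, as computed by backprop (illustrative)
learning_rate = 1e-2

x += -learning_rate * dx      # simple (vanilla) gradient descent update
```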
Image credits: Alec Radford
Suppose the loss function is steep vertically but shallow horizontally.
Q: What is the trajectory along which we converge towards the minimum with SGD?
A: Very slow progress along the flat direction, jitter along the steep one.
Momentum update
- Physical interpretation: a ball rolling down the loss function, plus friction (the mu coefficient).
- mu is usually ~0.5, 0.9, or 0.99 (sometimes annealed over time, e.g. from 0.5 to 0.99).
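The momentum update rule, written as a small numpy function (the function name and default hyperparameters are illustrative).

```python
import numpy as np

def momentum_step(x, dx, v, learning_rate=1e-2, mu=0.9):
    """One momentum update. v is the velocity (initialized to zeros like x);
    mu acts like friction, decaying the accumulated velocity each step."""
    v = mu * v - learning_rate * dx   # integrate the gradient into the velocity
    x = x + v                         # step along the velocity, not the raw gradient
    return x, v
```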
Momentum update
- Allows a velocity to “build up” along shallow directions.
- The velocity is damped in steep directions because the gradient quickly changes sign.
SGD vs. Momentum: notice momentum overshooting the target, but overall reaching the minimum much faster.
Nesterov Momentum update
Ordinary momentum update: momentum step + gradient step (evaluated at the current position) = actual step.
Nesterov momentum update: momentum step + “lookahead” gradient step (evaluated at the position after the momentum step; a bit different from the original) = actual step.
The only difference from ordinary momentum is where the gradient is evaluated:
v_t = mu * v_{t-1} - epsilon * grad f(theta_{t-1} + mu * v_{t-1})
theta_t = theta_{t-1} + v_t
Nesterov Momentum update
Slightly inconvenient: usually we only have the gradient at theta_{t-1}, not at the lookahead point theta_{t-1} + mu * v_{t-1}.
Variable transform and rearranging saves the day: replace all thetas with phi = theta + mu * v, rearrange, and obtain an update written purely in terms of phi and the gradient at phi:
v_t = mu * v_{t-1} - epsilon * grad f(phi_{t-1})
phi_t = phi_{t-1} - mu * v_{t-1} + (1 + mu) * v_t
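In code, the transformed update looks like the sketch below (function name and defaults are illustrative; x here plays the role of the lookahead variable phi).

```python
import numpy as np

def nesterov_step(x, dx, v, learning_rate=1e-2, mu=0.9):
    """Nesterov momentum in the variable-transformed ("lookahead") form,
    so only dx evaluated at the current x is needed."""
    v_prev = v
    v = mu * v - learning_rate * dx          # velocity update, gradient at current x
    x = x + (-mu * v_prev + (1 + mu) * v)    # rearranged parameter update
    return x, v
```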
nag = Nesterov Accelerated Gradient
AdaGrad update [Duchi et al., 2011]
Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension.
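The AdaGrad rule as a small numpy function (name and defaults illustrative); the cache is the per-dimension running sum of squared gradients, and the small eps avoids division by zero.

```python
import numpy as np

def adagrad_step(x, dx, cache, learning_rate=1e-2, eps=1e-7):
    """AdaGrad: scale each dimension's step by the root of its historical
    sum of squared gradients. cache is initialized to zeros like x."""
    cache = cache + dx**2
    x = x - learning_rate * dx / (np.sqrt(cache) + eps)
    return x, cache
```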
AdaGrad update
Q: What happens with AdaGrad?
AdaGrad update
Q2: What happens to the step size over a long time?
RMSProp update [Tieleman and Hinton, 2012]
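RMSProp replaces AdaGrad’s ever-growing sum with a leaky (exponentially decaying) average of squared gradients, so the effective step size no longer shrinks towards zero. A sketch in the same style (name and defaults illustrative):

```python
import numpy as np

def rmsprop_step(x, dx, cache, learning_rate=1e-2, decay_rate=0.99, eps=1e-7):
    """RMSProp: like AdaGrad, but cache is a leaky average of squared
    gradients rather than a running sum."""
    cache = decay_rate * cache + (1 - decay_rate) * dx**2
    x = x - learning_rate * dx / (np.sqrt(cache) + eps)
    return x, cache
```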
Introduced in a slide in Geoff Hinton’s Coursera class, lecture 6, and cited by several papers in that form.
[Figure: AdaGrad vs. RMSProp]