Lecture 6: Training Neural Networks, Part 2
Fei-Fei Li & Andrej Karpathy & Justin Johnson
25 Jan 2016
Administrative
A2 is out. It’s meaty. It’s due Feb 5 (next Friday). You’ll implement:
- Neural Nets (with a Layer forward/backward API)
- Batch Normalization
- Dropout
- ConvNets
Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get the loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
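A minimal numpy-style sketch of this loop. The helpers `sample_batch`, `forward`, and `backward`, and the `model` object, are hypothetical stand-ins for the data loader and the layer forward/backward API, not code from the assignment.

```python
# Hypothetical helpers: sample_batch, forward, backward, and model stand in
# for the data loader and the layer forward/backward API (assumed names).
learning_rate = 1e-3

while True:
    X_batch, y_batch = sample_batch(train_data, batch_size=256)   # 1. sample a batch
    loss, cache = forward(model, X_batch, y_batch)                 # 2. forward prop, get loss
    grads = backward(model, cache)                                 # 3. backprop the gradients
    for name in model.params:                                      # 4. gradient descent update
        model.params[name] -= learning_rate * grads[name]
```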
Activation Functions
- Sigmoid
- tanh: tanh(x)
- ReLU: max(0, x)
- Leaky ReLU: max(0.1x, x)
- Maxout
- ELU
Data Preprocessing
Weight Initialization
“Xavier initialization” [Glorot et al., 2010]: a reasonable initialization. (The mathematical derivation assumes linear activations.)
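A small numpy sketch of this initialization for one fully-connected layer; the layer sizes below are illustrative.

```python
import numpy as np

fan_in, fan_out = 512, 256   # illustrative layer sizes (assumed)
# Xavier initialization: scale by 1/sqrt(fan_in) so the variance of the
# activations is roughly preserved from layer to layer (the derivation
# assumes linear activations).
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
b = np.zeros(fan_out)
```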
Batch Normalization [Ioffe and Szegedy, 2015]
Normalize using the mini-batch statistics: x_hat = (x - E[x]) / sqrt(Var[x])
And then allow the network to squash the range if it wants to: y = gamma * x_hat + beta
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
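A minimal numpy sketch of the training-time forward pass (per-feature statistics over the mini-batch); running averages for test time are omitted.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) mini-batch. Training-time forward pass only."""
    mu = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                     # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # learnable scale/shift can undo the squash
```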
Babysitting the learning process; cross-validation.
Loss barely changing: the learning rate is probably too low.
TODO
- Parameter update schemes
- Learning rate schedules
- Dropout
- Gradient checking
- Model ensembles
Parameter Updates
Training a neural network, main loop: so far we have used the simple gradient descent update. Now let’s complicate it.
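The simple gradient descent update in numpy, on a toy parameter vector (the values here are illustrative).

```python
import numpy as np

x = np.zeros(10)              # toy parameter vector
dx = np.random.randn(10)      # its gradient, as computed by backprop (illustrative)
learning_rate = 1e-2

x += -learning_rate * dx      # simple (vanilla) gradient descent update
```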
Image credits: Alec Radford
Suppose the loss function is steep vertically but shallow horizontally.
Q: What is the trajectory along which we converge towards the minimum with SGD?
A: Very slow progress along the flat direction, jitter along the steep one.
Momentum update
- Physical interpretation: a ball rolling down the loss function, plus friction (the mu coefficient).
- mu is usually ~0.5, 0.9, or 0.99 (sometimes annealed over time, e.g. from 0.5 to 0.99).
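The momentum update rule, written as a small numpy function (the function name and default hyperparameters are illustrative).

```python
import numpy as np

def momentum_step(x, dx, v, learning_rate=1e-2, mu=0.9):
    """One momentum update. v is the velocity (initialized to zeros like x);
    mu acts like friction, decaying the accumulated velocity each step."""
    v = mu * v - learning_rate * dx   # integrate the gradient into the velocity
    x = x + v                         # step along the velocity, not the raw gradient
    return x, v
```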
Momentum update
- Allows a velocity to “build up” along shallow directions.
- The velocity is damped in steep directions because the gradient quickly changes sign.
SGD vs. Momentum: notice momentum overshooting the target, but overall reaching the minimum much faster.
Nesterov Momentum update
Ordinary momentum update: momentum step + gradient step (evaluated at the current position) = actual step.
Nesterov momentum update: momentum step + “lookahead” gradient step (evaluated at the position after the momentum step; a bit different from the original) = actual step.
The only difference from ordinary momentum is where the gradient is evaluated:
v_t = mu * v_{t-1} - epsilon * grad f(theta_{t-1} + mu * v_{t-1})
theta_t = theta_{t-1} + v_t
Nesterov Momentum update
Slightly inconvenient: usually we only have the gradient at theta_{t-1}, not at the lookahead point theta_{t-1} + mu * v_{t-1}.
Variable transform and rearranging saves the day: replace all thetas with phi = theta + mu * v, rearrange, and obtain an update written purely in terms of phi and the gradient at phi:
v_t = mu * v_{t-1} - epsilon * grad f(phi_{t-1})
phi_t = phi_{t-1} - mu * v_{t-1} + (1 + mu) * v_t
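In code, the transformed update looks like the sketch below (function name and defaults are illustrative; x here plays the role of the lookahead variable phi).

```python
import numpy as np

def nesterov_step(x, dx, v, learning_rate=1e-2, mu=0.9):
    """Nesterov momentum in the variable-transformed ("lookahead") form,
    so only dx evaluated at the current x is needed."""
    v_prev = v
    v = mu * v - learning_rate * dx          # velocity update, gradient at current x
    x = x + (-mu * v_prev + (1 + mu) * v)    # rearranged parameter update
    return x, v
```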
nag = Nesterov Accelerated Gradient
AdaGrad update [Duchi et al., 2011]
Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension.
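The AdaGrad rule as a small numpy function (name and defaults illustrative); the cache is the per-dimension running sum of squared gradients, and the small eps avoids division by zero.

```python
import numpy as np

def adagrad_step(x, dx, cache, learning_rate=1e-2, eps=1e-7):
    """AdaGrad: scale each dimension's step by the root of its historical
    sum of squared gradients. cache is initialized to zeros like x."""
    cache = cache + dx**2
    x = x - learning_rate * dx / (np.sqrt(cache) + eps)
    return x, cache
```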
AdaGrad update
Q: What happens with AdaGrad?
AdaGrad update
Q2: What happens to the step size over a long time?
RMSProp update [Tieleman and Hinton, 2012]
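RMSProp replaces AdaGrad’s ever-growing sum with a leaky (exponentially decaying) average of squared gradients, so the effective step size no longer shrinks towards zero. A sketch in the same style (name and defaults illustrative):

```python
import numpy as np

def rmsprop_step(x, dx, cache, learning_rate=1e-2, decay_rate=0.99, eps=1e-7):
    """RMSProp: like AdaGrad, but cache is a leaky average of squared
    gradients rather than a running sum."""
    cache = decay_rate * cache + (1 - decay_rate) * dx**2
    x = x - learning_rate * dx / (np.sqrt(cache) + eps)
    return x, cache
```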
Introduced in a slide in Geoff Hinton’s Coursera class, lecture 6, and cited by several papers in that form.
[Figure: AdaGrad vs. RMSProp]