Training Neural Networks
I2DL: Prof. Niessner, Prof. Leal-Taixé
Lecture 5 Recap
Gradient Descent for Neural Networks
[Figure: two-layer network with inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, and outputs $\hat{y}_0, \hat{y}_1$]
Loss function: $L_i = \sum_k (\hat{y}_k - y_k)^2$
Gradient w.r.t. all weights and biases: $\nabla_{W,b} f_{x,y}(W) = \left( \frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{l,m,n}}, \dots, \frac{\partial f}{\partial b_{i,j}}, \dots \right)^\top$
Forward pass: $\hat{y}_i = A\left(b_{1,i} + \sum_j h_j\, w_{1,j,i}\right)$, where simply $h_j = A\left(b_{0,j} + \sum_k x_k\, w_{0,k,j}\right)$ and $A(x) = \max(0, x)$ (ReLU)
Stochastic Gradient Descent (SGD)
$\theta^{k+1} = \theta^k - \alpha \nabla_\theta L(\theta^k, x^{\{1..m\}}, y^{\{1..m\}})$
$k$ now refers to the $k$-th iteration.
Gradient for the $k$-th minibatch: $\nabla_\theta L = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L_i$, where $m$ is the number of training samples in the current minibatch.
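A minimal NumPy sketch of this update; the function names, the stacked per-sample gradient array, and the default learning rate are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def minibatch_gradient(per_sample_grads):
    # Average the per-sample gradients of the current minibatch (shape: [m, n_params]).
    return per_sample_grads.mean(axis=0)

def sgd_step(theta, grad, lr=0.01):
    # theta^{k+1} = theta^k - alpha * grad
    return theta - lr * grad
```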
Gradient Descent with Momentum
$v^{k+1} = \beta \cdot v^k + \nabla_\theta L(\theta^k)$
– $v$: velocity, an exponentially-weighted average of the gradients
– $\beta$: accumulation rate ("friction", momentum)
– $\nabla_\theta L(\theta^k)$: gradient of the current minibatch
$\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$
– $\theta$: model parameters, $\alpha$: learning rate, $v$: velocity
Important: the velocity $v^k$ is vector-valued!
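A sketch of the momentum update, under the same illustrative assumptions as the SGD snippet above.

```python
def momentum_step(theta, velocity, grad, lr=0.01, beta=0.9):
    # Velocity: exponentially weighted average of gradients (same shape as theta).
    velocity = beta * velocity + grad
    # The parameter update uses the velocity, not the raw gradient.
    theta = theta - lr * velocity
    return theta, velocity
```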
RMSProp
[Figure (Source: A. Ng): loss contours with large gradients in the y-direction and small gradients in the x-direction]
Second momentum, the (uncentered) variance of the gradients:
$s^{k+1} = \beta \cdot s^k + (1 - \beta)\,[\nabla_\theta L \circ \nabla_\theta L]$
$\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$
We are dividing by the (root of the) squared gradients:
– the division in the y-direction will be large
– the division in the x-direction will be small
→ Can increase the learning rate!
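A sketch of the RMSProp step; names and default hyperparameters are assumptions for illustration.

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.001, beta=0.9, eps=1e-8):
    # Running (uncentered) second moment of the gradient, element-wise.
    s = beta * s + (1 - beta) * grad**2
    # Scale each coordinate's step by the inverse root of its accumulated magnitude.
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s
```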
Adam
• Combines Momentum and RMSProp:
$m^{k+1} = \beta_1 \cdot m^k + (1 - \beta_1)\,\nabla_\theta L(\theta^k)$
$v^{k+1} = \beta_2 \cdot v^k + (1 - \beta_2)\,[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]$
• $m^{k+1}$ and $v^{k+1}$ are initialized with zero → bias towards zero
→ Typically, bias-corrected moment updates:
$\hat{m}^{k+1} = \frac{m^{k+1}}{1 - \beta_1^{k+1}}$, $\quad \hat{v}^{k+1} = \frac{v^{k+1}}{1 - \beta_2^{k+1}}$, $\quad \theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}$
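A sketch combining the two moments with bias correction; function name, argument layout, and defaults are illustrative assumptions.

```python
import numpy as np

def adam_step(theta, m, v, grad, k, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment (momentum) and second moment (RMSProp-style) estimates.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction counteracts the zero initialization (k is the 1-based iteration).
    m_hat = m / (1 - beta1**k)
    v_hat = v / (1 - beta2**k)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```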
Training Neural Nets
Learning Rate: Implications
• What if it is too high?
• What if it is too low?
Source: http://cs231n.github.io/neural-networks-3/
Learning Rate
• Need a high learning rate when far away from the optimum
• Need a low learning rate when close to the optimum
Learning Rate Decay
• $\alpha = \frac{1}{1 + decay\_rate \cdot epoch} \cdot \alpha_0$
– E.g., $\alpha_0 = 0.1$, $decay\_rate = 1.0$:
– Epoch 0: 0.1
– Epoch 1: 0.05
– Epoch 2: 0.033
– Epoch 3: 0.025
– ...
[Plot: learning rate over epochs]
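A tiny sketch of this schedule (the function name is an illustrative assumption):

```python
def lr_inverse_decay(alpha0, decay_rate, epoch):
    # alpha = alpha0 / (1 + decay_rate * epoch)
    return alpha0 / (1 + decay_rate * epoch)

# alpha0 = 0.1, decay_rate = 1.0  ->  0.1, 0.05, 0.033, 0.025, ...
print([round(lr_inverse_decay(0.1, 1.0, e), 3) for e in range(4)])
```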
Learning Rate Decay
Many options:
• Step decay: $\alpha = \alpha - t \cdot \alpha$ (only every $n$ steps)
– $t$ is the decay rate (often 0.5)
• Exponential decay: $\alpha = t^{epoch} \cdot \alpha_0$
– $t$ is the decay rate ($t < 1.0$)
• $\alpha = \frac{t}{\sqrt{epoch}} \cdot \alpha_0$
– $t$ is the decay rate
• Etc.
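Sketches of the other two schedules, with the same caveat that names and parameterization are assumptions:

```python
import math

def lr_exponential_decay(alpha0, t, epoch):
    # alpha = t^epoch * alpha0, with decay rate t < 1.0
    return (t ** epoch) * alpha0

def lr_inverse_sqrt_decay(alpha0, t, epoch):
    # alpha = t / sqrt(epoch) * alpha0, for epoch >= 1
    return t / math.sqrt(epoch) * alpha0
```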
Training Schedule
Manually specify the learning rate for the entire training process:
• Manually set the learning rate every n epochs
• How?
– Trial and error (the hard way)
– Some experience (only generalizes to some degree)
Consider: #epochs, training set size, network size, etc.
Basic Recipe for Training
• Given a dataset with ground truth labels $\{x_i, y_i\}$:
– $x_i$ is the $i$-th training image, with label $y_i$
– Often $\dim(x) \gg \dim(y)$ (e.g., for classification)
– The number of samples is often in the hundred-thousands or millions
• Take a network $f$ and its parameters $W, b$
• Use SGD (or a variant) to find the optimal parameters $W, b$
– Gradients come from backpropagation
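A minimal training-loop sketch of this recipe. `grad_fn` stands in for backpropagation, and the dict-of-arrays parameter layout is an assumption for illustration, not the lecture's API.

```python
import numpy as np

def train(params, grad_fn, x, y, epochs=10, lr=0.01, batch_size=32):
    # params: dict of parameter arrays; grad_fn(params, xb, yb) returns gradients
    # with matching shapes (i.e., backpropagation).
    n = len(x)
    for _ in range(epochs):
        perm = np.random.permutation(n)          # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            grads = grad_fn(params, x[idx], y[idx])
            params = {name: w - lr * grads[name] for name, w in params.items()}
    return params
```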
Gradient Descent on Train Set
• Given a large train set with $n$ training samples $\{x_i, y_i\}$
– Let's say 1 million labeled images
– Let's say our network has 500k parameters
• The gradient has 500k dimensions
• $n = 1$ million
• Extremely expensive to compute!
Learning
• Learning means generalization to an unknown dataset
– (So far no "real" learning)
– I.e., train on a known dataset → test with the optimized parameters on an unknown dataset
• Basically, we hope that, based on the train set, the optimized parameters will give similar results on different data (i.e., test data)
Learning
• Training set ("train"):
– Use for training your neural network
• Validation set ("val"):
– Hyperparameter optimization
– Check generalization progress
• Test set ("test"):
– Only for the very end
– NEVER TOUCH DURING DEVELOPMENT OR TRAINING
Learning
• Typical splits:
– Train (60%), Val (20%), Test (20%)
– Train (80%), Val (10%), Test (10%)
• During training:
– The train error comes from the average minibatch error
– Typically evaluate on a subset of the validation set every n iterations
Basic Recipe for Machine Learning
• Split your data: 60% train, 20% validation, 20% test
• Find your hyperparameters on the validation set
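A sketch of such a split; the function name, the fixed seed, and the purely random (rather than stratified) split are illustrative assumptions.

```python
import numpy as np

def split_data(x, y, val_frac=0.2, test_frac=0.2, seed=0):
    # Shuffle once, then carve out test and validation portions (60/20/20 by default).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(len(x) * test_frac)
    n_val = int(len(x) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx]), (x[test_idx], y[test_idx])
```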
Basic Recipe for Machine Learning
• Split your data: 60% train, 20% validation, 20% test
• Example scenario:
– Ground truth error: 1%
– Training set error: 5% → the gap to the ground truth error indicates bias (underfitting)
– Val/test set error: 8% → the gap to the training error indicates variance (overfitting)
Basic Recipe for Machine Learning
[Flowchart] Credits: A. Ng
Over- and Underfitting
Underfitted – Appropriate – Overfitted
Source: Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017
Over- and Underfitting
Source: https://srdas.github.io/DLBook/ImprovingModelGeneralization.html
Learning Curves
• Training graphs:
– Accuracy
– Loss
Learning Curves
[Plot: training and validation loss over epochs]
Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
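A small matplotlib sketch for producing such curves from recorded per-epoch losses (the function name and argument layout are assumptions):

```python
import matplotlib.pyplot as plt

def plot_learning_curves(train_losses, val_losses):
    # One loss value per epoch for each curve.
    epochs = range(1, len(train_losses) + 1)
    plt.plot(epochs, train_losses, label="train loss")
    plt.plot(epochs, val_losses, label="val loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()
```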
Overfitting Curves
[Plot: training vs. validation loss in the overfitting regime]
Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
Other Curves
• Validation set is easier than the training set
• Underfitting (loss still decreasing)
Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
To Summarize
• Underfitting
– Training and validation losses decrease even at the end of training
• Overfitting
– Training loss decreases and validation loss increases
• Ideal training
– Small gap between training and validation loss, and both go down at the same rate (stable, without fluctuations)
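A rough heuristic that encodes these three cases from recorded loss curves; the thresholds and the function name are arbitrary assumptions, not a rule from the lecture.

```python
def diagnose(train_losses, val_losses, gap_tol=0.1):
    # Compare the end of the curves: gap and trend of train vs. validation loss.
    gap = val_losses[-1] - train_losses[-1]
    val_rising = val_losses[-1] > min(val_losses) + 1e-6
    both_decreasing = (train_losses[-1] < train_losses[-2]
                       and val_losses[-1] < val_losses[-2])
    if val_rising and gap > gap_tol:
        return "overfitting: validation loss increases while training loss keeps dropping"
    if both_decreasing:
        return "underfitting: both losses still decreasing at the end of training"
    return "looks reasonable: small gap, both curves have flattened out"
```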
To Summarize
• Bad signs:
– Training error not going down
– Validation error not going down
– Performance on validation better than on the training set
– Tests on the train set different than during training
• Bad practice (never touch the test set during development or training):
– Training set contains test data
– Debugging the algorithm on test data
Hyperparameters
• Network architecture (e.g., number of layers, #weights)
• Number of iterations
• Learning rate(s) (i.e., solver parameters, decay, etc.)
• Regularization (more in the next lecture)
• Batch size
• ...
• Overall: learning setup + optimization = hyperparameters
Hyperparameter Tuning
• Methods:
– Manual search: most common
– Grid search (structured, for "real" applications):
• Define ranges for all parameter spaces and select points, usually pseudo-uniformly distributed
• Iterate over all possible configurations
– Random search: like grid search, but points are picked at random within the predefined ranges
[Plots: grid search vs. random search sample points over two parameters]
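A sketch of generating candidate configurations both ways; the names, ranges, and uniform (rather than log-scale) sampling are illustrative assumptions.

```python
import itertools
import numpy as np

def grid_search_configs(param_ranges, n_points=5):
    # Evenly spaced points per parameter, then all combinations.
    axes = [np.linspace(lo, hi, n_points) for lo, hi in param_ranges.values()]
    return [dict(zip(param_ranges, combo)) for combo in itertools.product(*axes)]

def random_search_configs(param_ranges, n_trials=25, seed=0):
    # Each parameter sampled uniformly at random within its range.
    rng = np.random.default_rng(seed)
    return [{name: rng.uniform(lo, hi) for name, (lo, hi) in param_ranges.items()}
            for _ in range(n_trials)]

# Example: candidate configurations for a learning rate and a momentum term.
configs = random_search_configs({"lr": (1e-4, 1e-1), "beta": (0.8, 0.99)})
```

In practice, a learning rate would usually be sampled on a log scale rather than uniformly within its range.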