
Training Neural Networks, I2DL: Prof. Niessner, Prof. Leal-Taixé - PowerPoint PPT Presentation



  1. Training Neural Networks - I2DL: Prof. Niessner, Prof. Leal-Taixé

  2. Lecture 5 Recap

  3. Gradient Descent for Neural Networks. [Figure: small two-layer network with inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, and outputs $\hat{y}_0, \hat{y}_1$.] Loss function $L_j = (\hat{y}_j - y_j)^2$; gradient $\nabla_{W,b} f_{x,y}(W) = \left[ \frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{m,n,o}}, \frac{\partial f}{\partial b_{m,n}} \right]$. The network itself is just simple: $\hat{y}_j = A\big(b_{1,j} + \sum_k h_k\, w_{1,j,k}\big)$, $h_k = A\big(b_{0,k} + \sum_l x_l\, w_{0,k,l}\big)$, with $A(x) = \max(0, x)$.
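
A minimal NumPy sketch of the two-layer ReLU network on this slide. The layer sizes (3 inputs, 4 hidden units, 2 outputs), random weights, and variable names are illustrative assumptions, not taken from the slide.

```python
import numpy as np

def relu(x):
    # A(x) = max(0, x)
    return np.maximum(0, x)

# Illustrative shapes: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
W0, b0 = rng.standard_normal((4, 3)), np.zeros(4)   # first layer
W1, b1 = rng.standard_normal((2, 4)), np.zeros(2)   # second layer

def forward(x):
    h = relu(b0 + W0 @ x)       # h_k = A(b_{0,k} + sum_l x_l w_{0,k,l})
    y_hat = relu(b1 + W1 @ h)   # y_hat_j = A(b_{1,j} + sum_k h_k w_{1,j,k})
    return y_hat

x = rng.standard_normal(3)
y = np.array([1.0, 0.0])
loss = np.sum((forward(x) - y) ** 2)  # sum_j (y_hat_j - y_j)^2
```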

  4. Stochastic Gradient Descent (SGD): $\theta^{k+1} = \theta^k - \alpha \nabla_\theta L\big(\theta^k, x^{\{1..m\}}, y^{\{1..m\}}\big)$, where $k$ now refers to the $k$-th iteration. Gradient for the $k$-th minibatch: $\nabla_\theta L = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L_i$, with $m$ training samples in the current minibatch.
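
A hedged sketch of a single SGD step on one minibatch. `grad_fn` is a hypothetical helper assumed to return the minibatch-averaged gradient from backpropagation; it is not defined on the slide.

```python
def sgd_step(theta, x_batch, y_batch, grad_fn, lr=1e-2):
    # grad_fn is assumed to return (1/m) * sum_i grad_theta(L_i)
    grad = grad_fn(theta, x_batch, y_batch)
    # theta^{k+1} = theta^k - alpha * grad_theta L(theta^k, x^{1..m}, y^{1..m})
    return theta - lr * grad
```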

  5. Gradient Descent with Momentum: $v^{k+1} = \beta \cdot v^k + \nabla_\theta L(\theta^k)$, i.e., the velocity ('friction', momentum) accumulates the gradient of the current minibatch with accumulation rate $\beta$; the model update is $\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$ with learning rate $\alpha$. This is an exponentially-weighted average of the gradients. Important: the velocity $v^k$ is vector-valued!
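
A small sketch of the momentum update as written on the slide; the gradient is assumed to be precomputed, and the default values for `lr` and `beta` are illustrative.

```python
def momentum_step(theta, v, grad, lr=1e-2, beta=0.9):
    # v^{k+1} = beta * v^k + grad_theta L(theta^k)   (vector-valued velocity)
    v = beta * v + grad
    # theta^{k+1} = theta^k - alpha * v^{k+1}
    theta = theta - lr * v
    return theta, v
```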

  6. RMSProp. [Figure: elongated loss surface with large gradients in the y-direction and small gradients in the x-direction. Source: A. Ng] (Uncentered) variance of the gradients, i.e., the second momentum: $s^{k+1} = \beta \cdot s^k + (1-\beta)\,[\nabla_\theta L \circ \nabla_\theta L]$. Update: $\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$. We are dividing by the (root of the) squared gradients: the division in the y-direction will be large, the division in the x-direction will be small, so we can increase the learning rate!
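
A sketch of one RMSProp step following the formulas above; the default hyperparameter values are common choices, not taken from the slide.

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=1e-3, beta=0.9, eps=1e-8):
    # s^{k+1} = beta * s^k + (1 - beta) * (grad ∘ grad)  -- second momentum
    s = beta * s + (1.0 - beta) * grad * grad
    # theta^{k+1} = theta^k - alpha * grad / (sqrt(s^{k+1}) + eps)
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s
```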

  7. Adam • Combines Momentum and RMSProp: $m^{k+1} = \beta_1 \cdot m^k + (1-\beta_1)\,\nabla_\theta L(\theta^k)$, $v^{k+1} = \beta_2 \cdot v^k + (1-\beta_2)\,[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]$ • $m^{k+1}$ and $v^{k+1}$ are initialized with zero → bias towards zero → typically, bias-corrected moment updates: $\hat{m}^{k+1} = \frac{m^{k+1}}{1-\beta_1^{k+1}}$, $\hat{v}^{k+1} = \frac{v^{k+1}}{1-\beta_2^{k+1}}$, $\theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}$
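
A sketch of one Adam step with bias correction as on the slide; the default hyperparameters are the commonly used values, assumed here rather than stated on the slide.

```python
import numpy as np

def adam_step(theta, m, v, grad, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # first and second moment estimates (initialized with zero before the loop)
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # bias-corrected moments; k is the 1-based iteration count
    m_hat = m / (1.0 - beta1 ** k)
    v_hat = v / (1.0 - beta2 ** k)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```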

  8. Training Neural Nets

  9. Learning Rate: Implications • What if too high? • What if too low? Source: http://cs231n.github.io/neural-networks-3/

  10. Learning Rate • Need a high learning rate when far away • Need a low learning rate when close

  11. Learning Rate Decay • $\alpha = \frac{1}{1 + decay\_rate \cdot epoch} \cdot \alpha_0$ – E.g., $\alpha_0 = 0.1$, $decay\_rate = 1.0$ → Epoch 0: 0.1 → Epoch 1: 0.05 → Epoch 2: 0.033 → Epoch 3: 0.025 ... [Figure: learning rate over epochs, decaying from 0.1 towards 0.]
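
The schedule from this slide as a short sketch; it reproduces the values listed above (0.1, 0.05, 0.033, 0.025, ...).

```python
def lr_at_epoch(epoch, lr0=0.1, decay_rate=1.0):
    # alpha = 1 / (1 + decay_rate * epoch) * alpha_0
    return lr0 / (1.0 + decay_rate * epoch)

for epoch in range(4):
    print(epoch, round(lr_at_epoch(epoch), 3))  # 0.1, 0.05, 0.033, 0.025
```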

  12. Learning Rate Decay. Many options: • Step decay $\alpha = \alpha - t \cdot \alpha$ (only every n steps) – $t$ is the decay rate (often 0.5) • Exponential decay $\alpha = t^{\,epoch} \cdot \alpha_0$ – $t$ is the decay rate ($t < 1.0$) • $\alpha = \frac{t}{\sqrt{epoch}} \cdot \alpha_0$ – $t$ is the decay rate • Etc.
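
A sketch of the three decay variants listed above; apart from the step-decay rate of 0.5 mentioned on the slide, the default values are illustrative assumptions.

```python
import math

def step_decay(lr, t=0.5):
    # alpha <- alpha - t * alpha   (applied only every n steps)
    return lr - t * lr

def exponential_decay(epoch, lr0=0.1, t=0.95):
    # alpha = t^epoch * alpha_0, with t < 1.0
    return (t ** epoch) * lr0

def inv_sqrt_decay(epoch, lr0=0.1, t=1.0):
    # alpha = t / sqrt(epoch) * alpha_0   (for epoch >= 1)
    return t / math.sqrt(epoch) * lr0
```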

  13. Training Schedule. Manually specify the learning rate for the entire training process • Manually set the learning rate every n epochs • How? – Trial and error (the hard way) – Some experience (only generalizes to some degree). Consider: #epochs, training set size, network size, etc.

  14. Basic Recipe for Training • Given a dataset with ground truth labels – $\{x_i, y_i\}$ • $x_i$ is the $i$-th training image, with label $y_i$ • Often $\dim(x) \gg \dim(y)$ (e.g., for classification) • $i$ is often in the hundred-thousands or millions – Take the network $f$ and its parameters $w, b$ – Use SGD (or a variation) to find the optimal parameters $w, b$ • Gradients from backpropagation
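
A hedged sketch of this recipe as a minibatch SGD training loop. `model_grad` is a hypothetical placeholder for the backpropagation step, and the dataset arrays, batch size, and learning rate are assumptions for illustration.

```python
import numpy as np

def train(theta, X, Y, model_grad, lr=1e-2, batch_size=64, epochs=10):
    n = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(n)          # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            # averaged minibatch gradient from backpropagation (placeholder)
            grad = model_grad(theta, X[idx], Y[idx])
            theta = theta - lr * grad            # SGD update
    return theta
```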

  15. Gradient Descent on the Train Set • Given a large train set with $n$ training samples $\{x_i, y_i\}$ – Let's say 1 million labeled images – Let's say our network has 500k parameters • The gradient has 500k dimensions • $n$ = 1 million • Extremely expensive to compute

  16. Learning • Learning means generalization to an unknown dataset – (So far, no 'real' learning) – I.e., train on a known dataset → test with the optimized parameters on an unknown dataset • Basically, we hope that, based on the train set, the optimized parameters will give similar results on different data (i.e., test data)

  17. Learning • Training set ('train'): – Use for training your neural network • Validation set ('val'): – Hyperparameter optimization – Check generalization progress • Test set ('test'): – Only for the very end – NEVER TOUCH DURING DEVELOPMENT OR TRAINING

  18. Learning • Typical splits – Train (60%), Val (20%), Test (20%) – Train (80%), Val (10%), Test (10%) • During training: – The train error comes from the average minibatch error – Typically evaluate on a subset of the validation set every n iterations

  19. Basic Recipe for Machine Learning • Split your data: 60% train, 20% validation, 20% test • Find your hyperparameters
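
A small sketch of the 60/20/20 split; the function name, shuffling strategy, and seed are illustrative assumptions.

```python
import numpy as np

def split_dataset(X, Y, val_frac=0.2, test_frac=0.2, seed=0):
    # 60% train / 20% validation / 20% test, shuffled once up front
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    n_test, n_val = int(n * test_frac), int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return (X[train], Y[train]), (X[val], Y[val]), (X[test], Y[test])
```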

  20. Basic Recipe for Machine Learning • Split your data: 60% train, 20% validation, 20% test • Example scenario: ground truth error 1%; training set error 5% → bias (underfitting); val/test set error 8% → variance (overfitting)

  21. Basic Recipe for Machine Learning. [Figure: bias/variance decision flowchart ending in 'Done'. Credits: A. Ng]

  22. Over- and Underfitting: Underfitted, Appropriate, Overfitted. Source: Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017

  23. Over- and Underfitting. Source: https://srdas.github.io/DLBook/ImprovingModelGeneralization.html

  24. Learning Curves • Training graphs – Accuracy – Loss

  25. Learning Curves. [Figure: training vs. validation loss curves.] Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

  26. Overfitting Curves. [Figure: validation loss rising away from the training loss.] Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

  27. Other Curves • Validation set is easier than the training set • Underfitting (loss still decreasing). Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

  28. To Summarize • Underfitting – Training and validation losses are still decreasing even at the end of training • Overfitting – Training loss decreases while validation loss increases • Ideal training – Small gap between training and validation loss, and both go down at the same rate (stable, without fluctuations)

  29. To Summarize • Bad signs – Training error not going down – Validation error not going down – Performance on validation better than on the training set – Tests on the train set different than during training • Bad practice – Training set contains test data – Debugging the algorithm on test data → Never touch the test set during development or training

  30. Hyperparameters • Network architecture (e.g., number of layers, #weights) • Number of iterations • Learning rate(s) (i.e., solver parameters, decay, etc.) • Regularization (more in the next lecture) • Batch size • … • Overall: learning setup + optimization = hyperparameters

  31. Hyperparameter Tuning • Methods: – Manual search: most common – Grid search (structured, for 'real' applications): define ranges for all parameter spaces and select points, usually pseudo-uniformly distributed → iterate over all possible configurations – Random search: like grid search, but one picks points at random within the predefined ranges. [Figures: grid search vs. random search sample placement over two parameters (first parameter vs. second parameter, both in [0, 1]).]
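
A small sketch contrasting the two strategies over two hypothetical hyperparameters (learning rate and batch size); the ranges and number of random samples are illustrative assumptions.

```python
import itertools
import numpy as np

# Grid search: pseudo-uniform points over predefined ranges, all combinations
learning_rates = np.logspace(-4, -1, 4)   # illustrative range
batch_sizes = [32, 64, 128]
grid_configs = list(itertools.product(learning_rates, batch_sizes))

# Random search: pick points at random within the same ranges
rng = np.random.default_rng(0)
random_configs = [(10 ** rng.uniform(-4, -1), int(rng.choice(batch_sizes)))
                  for _ in range(12)]
```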
