  1. Lecture 5 recap

  2. Neural Network (figure: width vs. depth of a network)

  3. Gradient Descent for Neural Networks (figure: a two-layer network mapping inputs $x$ through hidden units $h$ to outputs $y$, compared against targets $t$; the annotations show the partial derivatives $\partial L/\partial w$ and $\partial L/\partial b$ that gradient descent needs). Per-sample loss: $L_i = (y_i - t_i)^2$ with $y_i = A(b_{1,i} + \sum_k h_k\, w_{1,i,k})$ and $h_k = A(b_{0,k} + \sum_l x_l\, w_{0,k,l})$. Just simple: $A(x) = \max(0, x)$.
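
As a concrete illustration of the forward pass and loss on this slide, here is a minimal NumPy sketch (not from the lecture; the layer sizes, random weights, and helper names `relu`, `forward`, `loss` are hypothetical):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, W0, b0, W1, b1):
    """Two-layer network as on the slide: h = A(b0 + W0 x), y = A(b1 + W1 h)."""
    h = relu(b0 + W0 @ x)
    y = relu(b1 + W1 @ h)
    return y

def loss(y, t):
    """Per-sample squared error L_i = (y_i - t_i)^2, summed over outputs."""
    return np.sum((y - t) ** 2)

# Hypothetical shapes: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
x, t = rng.standard_normal(3), rng.standard_normal(2)
W0, b0 = rng.standard_normal((4, 3)), np.zeros(4)
W1, b1 = rng.standard_normal((2, 4)), np.zeros(2)
print(loss(forward(x, W0, b0, W1, b1), t))
```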

  4. Stochastic Gradient Descent (SGD): $\theta^{k+1} = \theta^k - \alpha\, \nabla_\theta L(\theta^k, x_{\{1..m\}}, y_{\{1..m\}})$, where $\nabla_\theta L = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta L_i$ is the gradient for the $k$-th minibatch. Here $k$ now refers to the $k$-th iteration and $m$ is the number of training samples in the current batch. Note the terminology: iteration vs. epoch.
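
A minimal sketch of one SGD iteration, assuming the minibatch gradient is the mean of per-sample gradients as on the slide (the values and the helper name `sgd_step` are hypothetical):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    """One SGD update: theta <- theta - alpha * minibatch gradient."""
    return theta - lr * grad

# Minibatch gradient as the mean of per-sample gradients (made-up values).
per_sample_grads = np.array([[0.2, -0.1], [0.4, 0.0], [0.0, -0.3]])
g = per_sample_grads.mean(axis=0)
theta = np.array([1.0, 1.0])
theta = sgd_step(theta, g, lr=0.1)
print(theta)
```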

  5. Gradient Descent with Momentum: $v^{k+1} = \beta \cdot v^k + \nabla_\theta L(\theta^k)$, i.e. the velocity accumulates the gradient of the current minibatch with accumulation rate $\beta$ ('friction', momentum); the model is then updated as $\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$ with learning rate $\alpha$. The velocity is an exponentially-weighted average of the gradient. Important: the velocity $v^k$ is vector-valued!
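
A sketch of the momentum update above (the gradient values are made up; `momentum_step` is not a library function):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    """Momentum update: accumulate a velocity, then step along it."""
    v = beta * v + grad          # exponentially-weighted average of gradients
    theta = theta - lr * v       # model update with learning rate
    return theta, v

theta, v = np.array([1.0, 1.0]), np.zeros(2)
grad = np.array([0.2, -0.1])     # hypothetical minibatch gradient
theta, v = momentum_step(theta, v, grad)
print(theta)
```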

  6. Gradient Descent with Momentum: the step will be largest when a sequence of gradients all point in the same direction. Hyperparameters are $\alpha$ and $\beta$; $\beta$ is often set to 0.9. $\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$. Fig. credit: I. Goodfellow

  7. RMSProp: $s^{k+1} = \beta \cdot s^k + (1-\beta)\,[\nabla_\theta L \circ \nabla_\theta L]$ ($\circ$ denotes element-wise multiplication), $\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$. Hyperparameters: $\alpha$ (needs tuning!), $\beta$ (often 0.9), $\epsilon$ (typically $10^{-8}$).
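
A sketch of the RMSProp update, keeping the running average $s$ of squared gradients (values and helper name are hypothetical):

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.001, beta=0.9, eps=1e-8):
    """RMSProp: running average of squared gradients, step divided by its square root."""
    s = beta * s + (1.0 - beta) * grad * grad       # element-wise square of the gradient
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s

theta, s = np.array([1.0, 1.0]), np.zeros(2)
grad = np.array([0.2, -0.1])                        # hypothetical minibatch gradient
theta, s = rmsprop_step(theta, s, grad)
print(theta)
```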

  8. RMSProp (figure: a loss surface with large gradients in the y-direction and small gradients in the x-direction): $s^{k+1} = \beta \cdot s^k + (1-\beta)\,[\nabla_\theta L \circ \nabla_\theta L]$ is the (uncentered) variance of the gradients, i.e. a second momentum. In the update $\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$ we are dividing by the squared gradients: the division in the y-direction will be large and the division in the x-direction will be small, so we can increase the learning rate! Fig. credit: A. Ng

  9. Adaptive Moment Estimation (Adam) combines Momentum and RMSProp. First momentum (mean of gradients): $m^{k+1} = \beta_1 \cdot m^k + (1-\beta_1)\,\nabla_\theta L(\theta^k)$. Second momentum (variance of gradients): $v^{k+1} = \beta_2 \cdot v^k + (1-\beta_2)\,[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]$. Update: $\theta^{k+1} = \theta^k - \alpha \cdot \frac{m^{k+1}}{\sqrt{v^{k+1}} + \epsilon}$.

  10. Adam combines Momentum and RMSProp. Since $m^{k+1} = \beta_1 \cdot m^k + (1-\beta_1)\,\nabla_\theta L(\theta^k)$ and $v^{k+1} = \beta_2 \cdot v^k + (1-\beta_2)\,[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]$ are initialized with zero, they are biased towards zero. Typically, bias-corrected moment updates are used: $\hat{m}^{k+1} = \frac{m^{k+1}}{1-\beta_1^{k+1}}$, $\hat{v}^{k+1} = \frac{v^{k+1}}{1-\beta_2^{k+1}}$, and $\theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}$.
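
A sketch of the full Adam update with the bias correction described above (the gradient and hyperparameter values are made up; `adam_step` is not a library function):

```python
import numpy as np

def adam_step(theta, m, v, grad, k, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style first moment + RMSProp-style second moment, bias-corrected."""
    m = beta1 * m + (1.0 - beta1) * grad          # mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # uncentered variance of gradients
    m_hat = m / (1.0 - beta1 ** (k + 1))          # bias correction (moments start at zero)
    v_hat = v / (1.0 - beta2 ** (k + 1))
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for k in range(3):
    grad = np.array([0.2, -0.1])                  # hypothetical minibatch gradient
    theta, m, v = adam_step(theta, m, v, grad, k)
print(theta)
```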

  11. Convergence (figure)

  12. Convergence (figure)

  13. Importance of Learning Rate (figure)

  14. Jacobian and Hessian • Derivative • Gradient • Jacobian • Hessian (second derivative)

  15. Newton's method • Approximate our function by a second-order Taylor series expansion: $L(\theta) \approx L(\theta_0) + \nabla L(\theta_0)^T (\theta - \theta_0) + \frac{1}{2} (\theta - \theta_0)^T H(\theta_0)\, (\theta - \theta_0)$, using the first derivative (gradient) and the second derivative (curvature, the Hessian). https://en.wikipedia.org/wiki/Taylor_series

  16. Newton's method • Differentiate and equate to zero to get the update step $\theta^{k+1} = \theta^k - H^{-1}\, \nabla_\theta L(\theta^k)$. We got rid of the learning rate! (Compare to SGD: $\theta^{k+1} = \theta^k - \alpha\, \nabla_\theta L(\theta^k)$.)
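
A minimal sketch of one Newton step on a toy quadratic (the matrix `A` and vector `b` are made up); note that it solves the linear system rather than forming $H^{-1}$ explicitly:

```python
import numpy as np

def newton_step(theta, grad, hess):
    """Newton update: solve H * delta = grad instead of inverting H."""
    delta = np.linalg.solve(hess, grad)
    return theta - delta

# Toy quadratic L(theta) = 0.5 * theta^T A theta - b^T theta (hypothetical A, b).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
theta = np.zeros(2)
grad = A @ theta - b                   # gradient of the quadratic, Hessian is A
theta = newton_step(theta, grad, A)    # one step lands on the minimizer A^{-1} b
print(theta, np.linalg.solve(A, b))
```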

  17. Newton's method • Differentiate and equate to zero to get the update step. The catch: the parameters of a network number in the millions, so the number of elements in the Hessian and the computational complexity of its 'inversion' per iteration become prohibitive.
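
A rough, illustrative count (numbers assumed, not from the slides) of why forming and inverting the Hessian is infeasible for networks:

```latex
% N parameters -> N^2 Hessian entries -> roughly O(N^3) cost for a dense solve.
N = 10^{6} \;\Rightarrow\; N^{2} = 10^{12} \text{ Hessian entries}
\;\Rightarrow\; \mathcal{O}(N^{3}) \approx 10^{18} \text{ operations per `inversion' per iteration}
```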

  18. Newton's method • SGD (green) • Newton's method exploits the curvature to take a more direct route. Image from Wikipedia

  19. Newton's method: Can you apply Newton's method for linear regression? What do you get as a result?

  20. BFGS and L-BFGS • Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm • Belongs to the family of quasi-Newton methods • These maintain an approximation of the inverse of the Hessian • Limited-memory variant: L-BFGS
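
For completeness, a small usage sketch of an off-the-shelf L-BFGS implementation (SciPy's `L-BFGS-B` method); the toy objective and starting point are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock-style toy objective with an analytic gradient (illustrative only).
def f(x):
    return (1.0 - x[0]) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2

def grad_f(x):
    return np.array([
        -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0] ** 2),
        200.0 * (x[1] - x[0] ** 2),
    ])

# L-BFGS builds a limited-memory approximation of the inverse Hessian from recent steps.
res = minimize(f, x0=np.array([-1.0, 2.0]), jac=grad_f, method="L-BFGS-B")
print(res.x)   # should approach the minimizer (1, 1)
```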

  21. Gauss-Newton • Newton: $x^{k+1} = x^k - H_F(x^k)^{-1}\, \nabla F(x^k)$ – 'true' 2nd derivatives are often hard to obtain (e.g., numerics) – $H_F \approx 2\, J_F^T J_F$ • Gauss-Newton (GN): $x^{k+1} = x^k - [2\, J_F(x^k)^T J_F(x^k)]^{-1}\, \nabla F(x^k)$ • Solve a linear system instead (again, inverting a matrix is unstable): $2\, J_F(x^k)^T J_F(x^k)\, (x^k - x^{k+1}) = \nabla F(x^k)$ – solve for the delta vector
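
A minimal sketch of Gauss-Newton for a least-squares fit $F(x) = \lVert r(x) \rVert^2$, solving the normal equations for the delta vector instead of inverting; the exponential-fit residual and data are made up for illustration:

```python
import numpy as np

def gauss_newton_step(x, residual, jacobian):
    """One GN step for F(x) = ||r(x)||^2: solve 2 J^T J * delta = grad, with grad = 2 J^T r."""
    r, J = residual(x), jacobian(x)
    grad = 2.0 * J.T @ r
    delta = np.linalg.solve(2.0 * J.T @ J, grad)   # delta = x^k - x^{k+1}
    return x - delta

# Toy curve fit: residuals r_i = a*exp(b*t_i) - y_i for hypothetical data (t_i, y_i).
t = np.linspace(0.0, 1.0, 5)
y = 2.0 * np.exp(0.5 * t)

def residual(p):
    a, b = p
    return a * np.exp(b * t) - y

def jacobian(p):
    a, b = p
    return np.stack([np.exp(b * t), a * t * np.exp(b * t)], axis=1)

x = np.array([1.0, 0.0])
for _ in range(10):
    x = gauss_newton_step(x, residual, jacobian)
print(x)   # should approach (2.0, 0.5)
```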

  22. Levenberg • 'Damped' version of Gauss-Newton (Tikhonov regularization): $\left[J_F(x^k)^T J_F(x^k) + \lambda \cdot I\right] (x^k - x^{k+1}) = \nabla F(x^k)$ – The damping factor $\lambda$ is adjusted in each iteration, ensuring $F(x^k) > F(x^{k+1})$; if the inequality is not fulfilled, increase $\lambda$ • → Trust region • → 'Interpolation' between Gauss-Newton (small $\lambda$) and Gradient Descent (large $\lambda$)

  23. Levenberg-Marquardt • Levenberg-Marquardt (LM): $\left[J_F(x^k)^T J_F(x^k) + \lambda \cdot \mathrm{diag}\!\left(J_F(x^k)^T J_F(x^k)\right)\right] (x^k - x^{k+1}) = \nabla F(x^k)$ – Instead of a plain Gradient Descent for large $\lambda$, scale each component of the gradient according to the curvature • Avoids slow convergence in components with a small gradient
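
A sketch of the corresponding damped (Levenberg-Marquardt) step, written against the same `residual`/`jacobian` helpers as the Gauss-Newton sketch above; constant factors of 2 are absorbed and the damping schedule is only described in the comments:

```python
import numpy as np

def lm_step(x, residual, jacobian, lam=1e-2):
    """One Levenberg-Marquardt step for F(x) = ||r(x)||^2 (constant factors absorbed)."""
    r, J = residual(x), jacobian(x)
    JTJ = J.T @ J
    # Damped normal equations: [J^T J + lambda * diag(J^T J)] * delta = J^T r
    delta = np.linalg.solve(JTJ + lam * np.diag(np.diag(JTJ)), J.T @ r)
    return x - delta

# A full solver accepts the step only if F decreases, shrinking lambda on success
# (towards Gauss-Newton) and growing it on failure (towards gradient descent).
```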

  24. Which, what and when? • Standard: Adam • Fallback option: SGD with momentum • Newton, L-BFGS, GN, LM only if you can do full-batch updates (they don't work well for minibatches!) – This practically never happens for DL – Theoretically, it would be nice though, due to fast convergence

  25. General Optimization • Linear systems (Ax = b) – LU, QR, Cholesky, Jacobi, Gauss-Seidel, CG, PCG, etc. • Non-linear (gradient-based) – Newton, Gauss-Newton, LM, (L-)BFGS ← second order – Gradient Descent, SGD ← first order • Others – Genetic algorithms, MCMC, Metropolis-Hastings, etc. – Constrained and convex solvers (Lagrange, ADMM, Primal-Dual, etc.)
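
As a concrete contrast to the reminder on the next slide, a small sketch of solving a linear system $Ax = b$ directly with a Cholesky factorization (SciPy; the matrix and right-hand side are made up):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Small symmetric positive-definite system with made-up values.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = cho_solve(cho_factor(A), b)   # direct Cholesky solve, no gradient descent needed
print(np.allclose(A @ x, b))      # True
```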

  26. Please remember! • Think about your problem and the optimization at hand • SGD is specifically designed for minibatches • When you can, use a 2nd-order method → it's just faster • GD or SGD is not a way to solve a linear system!

  27. Importance of Learning Rate (figure)

  28. Learning Rate: need a high learning rate when far away, need a low learning rate when close.

  29. Learning Rate Decay • $\alpha = \frac{1}{1 + \mathrm{decay\_rate} \cdot \mathrm{epoch}} \cdot \alpha_0$ – E.g., $\alpha_0 = 0.1$, $\mathrm{decay\_rate} = 1.0$ → epoch 0: 0.1, epoch 1: 0.05, epoch 2: 0.033, epoch 3: 0.025, ... (figure: learning rate over epochs)
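
A tiny sketch that reproduces the schedule above (the helper name `inverse_time_decay` is assumed, not standard):

```python
def inverse_time_decay(alpha0, decay_rate, epoch):
    """1/t-style decay from the slide: alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1.0 + decay_rate * epoch)

# Reproduces the slide's numbers: 0.1, 0.05, 0.033..., 0.025
print([round(inverse_time_decay(0.1, 1.0, e), 3) for e in range(4)])
```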

  30. Learning Rate Decay, many options: • Step decay: $\alpha = \alpha - t \cdot \alpha$ (only every n steps) – $t$ is the decay rate (often 0.5) • Exponential decay: $\alpha = t^{\,\mathrm{epoch}} \cdot \alpha_0$ – $t$ is the decay rate ($t < 1.0$) • $\alpha = \frac{t}{\sqrt{\mathrm{epoch}}} \cdot \alpha_0$ – $t$ is the decay rate • Etc.
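
A sketch of these decay variants (the decay rates, the every-2-epochs step schedule, and the helper names are assumptions for illustration):

```python
def step_decay(alpha, t=0.5):
    """Step decay from the slide: alpha <- alpha - t * alpha (apply only every n epochs)."""
    return alpha - t * alpha

def exponential_decay(alpha0, t, epoch):
    """Exponential decay: alpha = t**epoch * alpha0, with decay rate t < 1.0."""
    return (t ** epoch) * alpha0

def sqrt_decay(alpha0, t, epoch):
    """alpha = t / sqrt(epoch) * alpha0 (for epoch >= 1)."""
    return t / (epoch ** 0.5) * alpha0

alpha = 0.1
for epoch in range(1, 6):
    if epoch % 2 == 0:        # e.g. apply the step decay every n = 2 epochs
        alpha = step_decay(alpha)
    print(epoch, round(alpha, 4),
          round(exponential_decay(0.1, 0.9, epoch), 4),
          round(sqrt_decay(0.1, 0.9, epoch), 4))
```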

  31. Training Schedule • Manually specify the learning rate for the entire training process – Manually set the learning rate every n epochs • How? – Trial and error (the hard way) – Some experience (only generalizes to some degree) • Consider: #epochs, training set size, network size, etc.

  32. Learning Rate: Implications • What if it is too high? • What if it is too low?
