Lecture 5 recap
Prof. Leal-Taixé and Prof. Niessner
Neural Network: Width vs. Depth
Gradient Descent for Neural Networks

Two-layer network with hidden units $h_k$ and outputs $\hat{y}_j$:

$h_k = A\big(b_{0,k} + \sum_l x_l\, w_{0,k,l}\big)$, $\quad \hat{y}_j = A\big(b_{1,j} + \sum_k h_k\, w_{1,j,k}\big)$

Per-sample loss: $L_j = (\hat{y}_j - y_j)^2$

Gradient with respect to all weights and biases:
$\nabla_{w,b} f_{w,b}(x) = \big(\tfrac{\partial f}{\partial w_{0,0,0}}, \dots, \tfrac{\partial f}{\partial w_{m,n,o}}, \tfrac{\partial f}{\partial b_{0,0}}, \dots, \tfrac{\partial f}{\partial b_{m,n}}\big)$

Just a simple activation function: $A(x) = \max(0, x)$ (ReLU)
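The forward pass above can be sketched in a few lines of NumPy. This is a minimal illustration; the layer sizes and the random weights are made up for the example.

```python
import numpy as np

def relu(x):
    # A(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def forward(x, W0, b0, W1, b1):
    """Two-layer network from the slide: h = A(b0 + W0 x), y_hat = A(b1 + W1 h)."""
    h = relu(b0 + W0 @ x)
    return relu(b1 + W1 @ h)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                     # one input sample
W0 = rng.normal(size=(4, 3)); b0 = rng.normal(size=4)
W1 = rng.normal(size=(2, 4)); b1 = rng.normal(size=2)
y = np.array([1.0, 0.0])                   # target

y_hat = forward(x, W0, b0, W1, b1)
loss = np.sum((y_hat - y) ** 2)            # L = sum_j (y_hat_j - y_j)^2
```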
Stochastic Gradient Descent (SGD)

$\theta^{k+1} = \theta^k - \alpha\, \nabla_\theta L(\theta^k, x_{\{1..n\}}, y_{\{1..n\}})$

$\nabla_\theta L = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i$

$k$ now refers to the $k$-th iteration; $n$ is the number of training samples in the current minibatch; $\nabla_\theta L$ is the gradient for the $k$-th minibatch.

Note the terminology: iteration vs. epoch
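The SGD update can be sketched as follows. A minimal NumPy illustration: the toy loss $L(\theta) = \theta^2$ and all hyperparameter values are made up for the example.

```python
import numpy as np

def sgd_step(theta, grads, lr=0.01):
    """One SGD iteration: average the per-sample gradients of the
    current minibatch, then step against that direction."""
    batch_grad = np.mean(grads, axis=0)   # (1/n) * sum_i grad L_i
    return theta - lr * batch_grad

# Toy problem: minimize L(theta) = theta^2, whose gradient is 2*theta,
# with a "minibatch" of 4 identical samples.
theta = np.array([1.0])
for _ in range(100):
    grads = np.stack([2.0 * theta] * 4)   # per-sample gradients
    theta = sgd_step(theta, grads, lr=0.1)
```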
Gradient Descent with Momentum

$v^{k+1} = \beta \cdot v^k + \nabla_\theta L(\theta^k)$

velocity = accumulation rate ('friction', momentum) times old velocity, plus the gradient of the current minibatch

$\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$

model update with learning rate $\alpha$ along the velocity

This is an exponentially-weighted average of the gradients. Important: the velocity $v^k$ is vector-valued!
Gradient Descent with Momentum (cont.)

The step is largest when a sequence of gradients all point in the same direction.

Hyperparameters: $\alpha$, $\beta$; $\beta$ is often set to 0.9.

$\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$

Fig. credit: I. Goodfellow
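The momentum update from the slide can be sketched like this. A minimal NumPy illustration; the toy quadratic loss and the hyperparameter values are made up.

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.05, beta=0.9):
    """Heavy-ball momentum: accumulate an exponentially-weighted
    average of gradients (the velocity), then step along it.
    Note: v and grad are vector-valued, like theta."""
    v = beta * v + grad        # old velocity decays by beta, new gradient added
    theta = theta - lr * v     # move along the accumulated direction
    return theta, v

# Toy problem: L(theta) = theta^2, gradient 2*theta.
theta = np.array([5.0])
v = np.zeros_like(theta)       # velocity starts at zero
for _ in range(200):
    grad = 2.0 * theta
    theta, v = momentum_step(theta, v, grad, lr=0.05, beta=0.9)
```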
RMSProp

$s^{k+1} = \beta \cdot s^k + (1-\beta)\,[\nabla_\theta L \circ \nabla_\theta L]$  ($\circ$: element-wise multiplication)

$\theta^{k+1} = \theta^k - \alpha \cdot \dfrac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$

Hyperparameters: $\alpha$ (needs tuning!), $\beta$ (often 0.9), $\epsilon$ (typically $10^{-8}$)
RMSProp (cont.)

Picture an elongated loss bowl: large gradients in the y-direction, small gradients in the x-direction.

$s^{k+1} = \beta \cdot s^k + (1-\beta)\,[\nabla_\theta L \circ \nabla_\theta L]$ tracks the (uncentered) variance of the gradients: the second moment.

$\theta^{k+1} = \theta^k - \alpha \cdot \dfrac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$

We are dividing by the root of the squared gradients:
- the division in the y-direction will be large (damping the steep direction)
- the division in the x-direction will be small (preserving progress in the shallow direction)
As a result, we can increase the learning rate!

Fig. credit: A. Ng
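A sketch of the RMSProp update on exactly this kind of elongated bowl (steep in y, shallow in x). The toy loss and hyperparameter values are made up for the illustration.

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.05, beta=0.9, eps=1e-8):
    """Track a running (uncentered) variance of each gradient component
    and divide the step by its square root: steep directions are damped,
    shallow directions keep making progress."""
    s = beta * s + (1.0 - beta) * grad * grad      # element-wise square
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s

# Elongated bowl: tiny gradients in x, huge gradients in y.
theta = np.array([5.0, 5.0])
s = np.zeros_like(theta)
for _ in range(500):
    grad = np.array([0.02 * theta[0], 20.0 * theta[1]])
    theta, s = rmsprop_step(theta, s, grad, lr=0.05)
```

Despite the 1000x difference in gradient scale between the two directions, both coordinates are driven towards the minimum, which is the point of the per-component normalization.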
Adaptive Moment Estimation (Adam)

Combines Momentum and RMSProp.

First moment (mean of the gradients):
$m^{k+1} = \beta_1 \cdot m^k + (1-\beta_1)\,\nabla_\theta L(\theta^k)$

Second moment (variance of the gradients):
$v^{k+1} = \beta_2 \cdot v^k + (1-\beta_2)\,[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]$

Update:
$\theta^{k+1} = \theta^k - \alpha \cdot \dfrac{m^{k+1}}{\sqrt{v^{k+1}} + \epsilon}$
Adam (cont.)

Combines Momentum and RMSProp. $m^{k+1}$ and $v^{k+1}$ are initialized with zero, which biases both estimates towards zero early in training.

$m^{k+1} = \beta_1 \cdot m^k + (1-\beta_1)\,\nabla_\theta L(\theta^k)$
$v^{k+1} = \beta_2 \cdot v^k + (1-\beta_2)\,[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]$

Typically, one therefore uses bias-corrected moment estimates:

$\hat{m}^{k+1} = \dfrac{m^{k+1}}{1 - \beta_1^{k+1}}$, $\quad \hat{v}^{k+1} = \dfrac{v^{k+1}}{1 - \beta_2^{k+1}}$

$\theta^{k+1} = \theta^k - \alpha \cdot \dfrac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}$
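Putting both moments and the bias correction together, one Adam iteration can be sketched as follows. A minimal NumPy illustration; the toy loss and step counts are made up.

```python
import numpy as np

def adam_step(theta, m, v, grad, k, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; k is the 1-based iteration index."""
    m = beta1 * m + (1.0 - beta1) * grad            # first moment (mean)
    v = beta2 * v + (1.0 - beta2) * grad * grad     # second moment (variance)
    m_hat = m / (1.0 - beta1 ** k)                  # undo the zero-initialization bias
    v_hat = v / (1.0 - beta2 ** k)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: L(theta) = theta^2, gradient 2*theta.
theta = np.array([2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for k in range(1, 3001):
    grad = 2.0 * theta
    theta, m, v = adam_step(theta, m, v, grad, k, lr=0.01)
```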
Convergence
Importance of Learning Rate
Jacobian and Hessian

• Derivative
• Gradient
• Jacobian
• Hessian (the second derivative)
Newton's method

• Approximate our function by its second-order Taylor series expansion around the current point $\theta^k$:

$L(\theta) \approx L(\theta^k) + \nabla L(\theta^k)^T (\theta - \theta^k) + \tfrac{1}{2} (\theta - \theta^k)^T H(\theta^k)\, (\theta - \theta^k)$

using the first derivative (slope) and the second derivative (curvature).

https://en.wikipedia.org/wiki/Taylor_series
Newton's method (cont.)

• Differentiate the quadratic approximation and set it to zero:

$\nabla L(\theta^k) + H(\theta^k)\,(\theta - \theta^k) = 0$

• Update step: $\theta^{k+1} = \theta^k - H(\theta^k)^{-1}\,\nabla L(\theta^k)$

We got rid of the learning rate! Compare with SGD: $\theta^{k+1} = \theta^k - \alpha\,\nabla_\theta L(\theta^k)$
Newton's method (cont.)

• Update step: $\theta^{k+1} = \theta^k - H(\theta^k)^{-1}\,\nabla L(\theta^k)$

• A network has millions of parameters, so the Hessian has millions-squared elements, and the 'inversion' must be performed in every iteration: the computational complexity per iteration is prohibitive.
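For small problems, though, the update is simple. A minimal NumPy sketch (the quadratic toy loss is made up): instead of explicitly inverting $H$, we solve the linear system $H d = \nabla L$, which is cheaper and numerically more stable.

```python
import numpy as np

def newton_step(theta, grad_fn, hess_fn):
    """theta_{k+1} = theta_k - H^{-1} grad; note there is no learning rate."""
    g = grad_fn(theta)
    H = hess_fn(theta)
    d = np.linalg.solve(H, g)   # solve H d = g instead of inverting H
    return theta - d

# Quadratic loss L(theta) = theta^T A theta: Newton converges in one step.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad_fn = lambda t: 2.0 * A @ t   # gradient of theta^T A theta (A symmetric)
hess_fn = lambda t: 2.0 * A       # constant Hessian

theta = np.array([4.0, -3.0])
theta = newton_step(theta, grad_fn, hess_fn)
```

For any quadratic loss the second-order Taylor expansion is exact, so a single Newton step lands on the minimizer.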
Newton's method (cont.)

• SGD (green) follows the gradient.
• Newton's method exploits the curvature to take a more direct route.

Image from Wikipedia
Newton's method (cont.)

Can you apply Newton's method to linear regression? What do you get as a result?
BFGS and L-BFGS

• Broyden-Fletcher-Goldfarb-Shanno algorithm
• Belongs to the family of quasi-Newton methods
• Maintains an approximation of the inverse of the Hessian
  - full history: BFGS
  - limited memory: L-BFGS
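In practice one rarely implements L-BFGS by hand; a short sketch using SciPy's optimizer (this assumes SciPy is installed, and the Rosenbrock-style toy loss is made up). L-BFGS only needs the function value and the gradient; it builds its own low-rank approximation of the inverse Hessian from recent iterates.

```python
import numpy as np
from scipy.optimize import minimize  # assumes SciPy is available

def loss(t):
    # Rosenbrock function: a classic non-convex test problem, minimum at (1, 1)
    return (1.0 - t[0]) ** 2 + 100.0 * (t[1] - t[0] ** 2) ** 2

def grad(t):
    return np.array([
        -2.0 * (1.0 - t[0]) - 400.0 * t[0] * (t[1] - t[0] ** 2),
        200.0 * (t[1] - t[0] ** 2),
    ])

res = minimize(loss, x0=np.array([-1.5, 2.0]), jac=grad, method="L-BFGS-B")
```

Note that this is a full-batch method: `loss` and `grad` are evaluated on the whole objective, which is exactly why it does not transfer to minibatch training.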
Gauss-Newton

• Newton: $x^{k+1} = x^k - H_F(x^k)^{-1}\,\nabla F(x^k)$
  - 'true' 2nd derivatives are often hard to obtain (e.g., numerics)
  - for least-squares problems: $H_F \approx 2\,J_F^T J_F$
• Gauss-Newton (GN): $x^{k+1} = x^k - [2\,J_F(x^k)^T J_F(x^k)]^{-1}\,\nabla F(x^k)$
• Solve a linear system instead (again, inverting a matrix is unstable):

$2\,J_F(x^k)^T J_F(x^k)\,(x^k - x^{k+1}) = \nabla F(x^k)$

and solve for the delta vector.
Levenberg

• 'Damped' version of Gauss-Newton (Tikhonov regularization):

$[J_F(x^k)^T J_F(x^k) + \lambda \cdot I]\,(x^k - x^{k+1}) = \nabla F(x^k)$

• The damping factor $\lambda$ is adjusted in each iteration, ensuring $F(x^k) > F(x^{k+1})$
  - if the inequality is not fulfilled, increase $\lambda$ (trust region)
• 'Interpolation' between Gauss-Newton (small $\lambda$) and Gradient Descent (large $\lambda$)
Levenberg-Marquardt

• Levenberg-Marquardt (LM):

$[J_F(x^k)^T J_F(x^k) + \lambda \cdot \mathrm{diag}(J_F(x^k)^T J_F(x^k))]\,(x^k - x^{k+1}) = \nabla F(x^k)$

• Instead of a plain Gradient Descent for large $\lambda$, scale each component of the gradient according to the curvature.
  - Avoids slow convergence in components with a small gradient
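One LM iteration plus the damping schedule can be sketched as follows. A minimal NumPy illustration on a made-up linear least-squares fit; the initial $\lambda$ and the halving/doubling factors are example choices, not prescribed values.

```python
import numpy as np

def lm_step(theta, residual_fn, jac_fn, lam):
    """One Levenberg-Marquardt step for F(theta) = sum_i r_i(theta)^2:
    solve (J^T J + lam * diag(J^T J)) d = J^T r, then theta - d."""
    r = residual_fn(theta)
    J = jac_fn(theta)
    JTJ = J.T @ J
    lhs = JTJ + lam * np.diag(np.diag(JTJ))   # curvature-scaled damping
    d = np.linalg.solve(lhs, J.T @ r)         # J^T r is proportional to grad F
    return theta - d

# Toy least squares: fit y = a*x + b to noiseless data from a = 2, b = 1.
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + 1.0
residual_fn = lambda t: t[0] * x + t[1] - y
jac_fn = lambda t: np.stack([x, np.ones_like(x)], axis=1)

theta = np.array([0.0, 0.0])
lam = 1e-3
for _ in range(20):
    new_theta = lm_step(theta, residual_fn, jac_fn, lam)
    # damping schedule: accept the step and decrease lam if the loss went down,
    # otherwise reject it and increase lam (move towards gradient descent)
    if np.sum(residual_fn(new_theta) ** 2) < np.sum(residual_fn(theta) ** 2):
        theta, lam = new_theta, lam * 0.5
    else:
        lam *= 2.0
```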
Which, What, and When?

• Standard: Adam
• Fallback option: SGD with momentum
• Newton, L-BFGS, GN, LM: only if you can do full-batch updates (they do not work well with minibatches!)
  - This practically never happens in DL
  - Theoretically it would be nice, though, due to the fast convergence
General Optimization

• Linear systems (Ax = b):
  - LU, QR, Cholesky, Jacobi, Gauss-Seidel, CG, PCG, etc.
• Non-linear (gradient-based):
  - Newton, Gauss-Newton, LM, (L-)BFGS: second order
  - Gradient Descent, SGD: first order
• Others:
  - Genetic algorithms, MCMC, Metropolis-Hastings, etc.
  - Constrained and convex solvers (Lagrange, ADMM, primal-dual, etc.)
Please Remember!

• Think about your problem and the optimization method at hand
• SGD is specifically designed for minibatches
• When you can, use a 2nd-order method: it's just faster
• GD or SGD is not a way to solve a linear system!
Importance of Learning Rate
Learning Rate

• Need a high learning rate when far away from the optimum
• Need a low learning rate when close to it
Learning Rate Decay

• $\alpha = \dfrac{1}{1 + \text{decay\_rate} \cdot \text{epoch}} \cdot \alpha_0$
  - E.g., $\alpha_0 = 0.1$, decay_rate $= 1.0$:
    - Epoch 0: 0.1
    - Epoch 1: 0.05
    - Epoch 2: 0.033
    - Epoch 3: 0.025
    - ...

(Figure: learning rate over epochs, decaying from 0.1 towards 0)
Learning Rate Decay (cont.)

Many options:
• Step decay: $\alpha = \alpha - t \cdot \alpha$ (only every n steps)
  - t is the decay rate (often 0.5)
• Exponential decay: $\alpha = t^{\,\text{epoch}} \cdot \alpha_0$
  - t is the decay rate ($t < 1.0$)
• $\alpha = \dfrac{t}{\sqrt{\text{epoch}}} \cdot \alpha_0$
  - t is the decay rate
• Etc.
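The schedules above can be written out directly. A small sketch; the default parameter values are example choices, not prescriptions.

```python
import math

def inverse_decay(lr0, decay_rate, epoch):
    """alpha = alpha_0 / (1 + decay_rate * epoch), the 1/t schedule."""
    return lr0 / (1.0 + decay_rate * epoch)

def step_decay(lr0, drop=0.5, every_n=10, epoch=0):
    """Multiply the rate by `drop` once every `every_n` epochs."""
    return lr0 * drop ** (epoch // every_n)

def exp_decay(lr0, t=0.95, epoch=0):
    """alpha = t^epoch * alpha_0, with decay rate t < 1."""
    return lr0 * t ** epoch

def sqrt_decay(lr0, t=1.0, epoch=0):
    """alpha = (t / sqrt(epoch + 1)) * alpha_0 (shifted by 1 to avoid
    dividing by zero at epoch 0)."""
    return lr0 * t / math.sqrt(epoch + 1.0)

# Reproduces the 1/t example from the previous slide
# (alpha_0 = 0.1, decay_rate = 1.0): 0.1, 0.05, 0.0333..., 0.025
schedule = [inverse_decay(0.1, 1.0, e) for e in range(4)]
```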
Training Schedule

• Manually specify the learning rate for the entire training process
• Manually set the learning rate every n epochs
• How?
  - Trial and error (the hard way)
  - Some experience (only generalizes to some degree)
  - Consider: number of epochs, training set size, network size, etc.
Learning Rate: Implications

• What if it is too high?
• What if it is too low?