Lecture 5 recap
Prof. Leal-Taixé and Prof. Niessner
Neural Network: Width vs. Depth
Gradient Descent for Neural Networks

Two-layer network with hidden units $h_k$ and outputs $\hat{y}_j$:

$h_k = A\big(b_{0,k} + \sum_l x_l\, w_{0,k,l}\big)$, $\quad \hat{y}_j = A\big(b_{1,j} + \sum_k h_k\, w_{1,j,k}\big)$

Per-sample loss: $L_j = (\hat{y}_j - y_j)^2$

Gradient with respect to all weights and biases:
$\nabla_{w,b} f_{w,b}(x) = \big(\tfrac{\partial f}{\partial w_{0,0,0}}, \dots, \tfrac{\partial f}{\partial w_{m,n,o}}, \tfrac{\partial f}{\partial b_{0,0}}, \dots, \tfrac{\partial f}{\partial b_{m,n}}\big)$

Just a simple activation function: $A(x) = \max(0, x)$ (ReLU)
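The forward pass above can be sketched in a few lines of NumPy. This is a minimal illustration; the layer sizes and the random weights are made up for the example.

```python
import numpy as np

def relu(x):
    # A(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def forward(x, W0, b0, W1, b1):
    """Two-layer network from the slide: h = A(b0 + W0 x), y_hat = A(b1 + W1 h)."""
    h = relu(b0 + W0 @ x)
    return relu(b1 + W1 @ h)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                     # one input sample
W0 = rng.normal(size=(4, 3)); b0 = rng.normal(size=4)
W1 = rng.normal(size=(2, 4)); b1 = rng.normal(size=2)
y = np.array([1.0, 0.0])                   # target

y_hat = forward(x, W0, b0, W1, b1)
loss = np.sum((y_hat - y) ** 2)            # L = sum_j (y_hat_j - y_j)^2
```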
Stochastic Gradient Descent (SGD)

$\theta^{k+1} = \theta^k - \alpha\, \nabla_\theta L(\theta^k, x_{\{1..n\}}, y_{\{1..n\}})$

$\nabla_\theta L = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i$

$k$ now refers to the $k$-th iteration; $n$ is the number of training samples in the current minibatch; $\nabla_\theta L$ is the gradient for the $k$-th minibatch.

Note the terminology: iteration vs. epoch
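The SGD update can be sketched as follows. A minimal NumPy illustration: the toy loss $L(\theta) = \theta^2$ and all hyperparameter values are made up for the example.

```python
import numpy as np

def sgd_step(theta, grads, lr=0.01):
    """One SGD iteration: average the per-sample gradients of the
    current minibatch, then step against that direction."""
    batch_grad = np.mean(grads, axis=0)   # (1/n) * sum_i grad L_i
    return theta - lr * batch_grad

# Toy problem: minimize L(theta) = theta^2, whose gradient is 2*theta,
# with a "minibatch" of 4 identical samples.
theta = np.array([1.0])
for _ in range(100):
    grads = np.stack([2.0 * theta] * 4)   # per-sample gradients
    theta = sgd_step(theta, grads, lr=0.1)
```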
Gradient Descent with Momentum

$v^{k+1} = \beta \cdot v^k + \nabla_\theta L(\theta^k)$

velocity = accumulation rate ('friction', momentum) times old velocity, plus the gradient of the current minibatch

$\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$

model update with learning rate $\alpha$ along the velocity

This is an exponentially-weighted average of the gradients. Important: the velocity $v^k$ is vector-valued!
Gradient Descent with Momentum (cont.)

The step is largest when a sequence of gradients all point in the same direction.

Hyperparameters: $\alpha$, $\beta$; $\beta$ is often set to 0.9.

$\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$

Fig. credit: I. Goodfellow
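The momentum update from the slide can be sketched like this. A minimal NumPy illustration; the toy quadratic loss and the hyperparameter values are made up.

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.05, beta=0.9):
    """Heavy-ball momentum: accumulate an exponentially-weighted
    average of gradients (the velocity), then step along it.
    Note: v and grad are vector-valued, like theta."""
    v = beta * v + grad        # old velocity decays by beta, new gradient added
    theta = theta - lr * v     # move along the accumulated direction
    return theta, v

# Toy problem: L(theta) = theta^2, gradient 2*theta.
theta = np.array([5.0])
v = np.zeros_like(theta)       # velocity starts at zero
for _ in range(200):
    grad = 2.0 * theta
    theta, v = momentum_step(theta, v, grad, lr=0.05, beta=0.9)
```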
RMSProp

$s^{k+1} = \beta \cdot s^k + (1-\beta)\,[\nabla_\theta L \circ \nabla_\theta L]$  ($\circ$: element-wise multiplication)

$\theta^{k+1} = \theta^k - \alpha \cdot \dfrac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$

Hyperparameters: $\alpha$ (needs tuning!), $\beta$ (often 0.9), $\epsilon$ (typically $10^{-8}$)
RMSProp (cont.)

Picture an elongated loss bowl: large gradients in the y-direction, small gradients in the x-direction.

$s^{k+1} = \beta \cdot s^k + (1-\beta)\,[\nabla_\theta L \circ \nabla_\theta L]$ tracks the (uncentered) variance of the gradients: the second moment.

$\theta^{k+1} = \theta^k - \alpha \cdot \dfrac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$

We are dividing by the root of the squared gradients:
- the division in the y-direction will be large (damping the steep direction)
- the division in the x-direction will be small (preserving progress in the shallow direction)
As a result, we can increase the learning rate!

Fig. credit: A. Ng
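A sketch of the RMSProp update on exactly this kind of elongated bowl (steep in y, shallow in x). The toy loss and hyperparameter values are made up for the illustration.

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.05, beta=0.9, eps=1e-8):
    """Track a running (uncentered) variance of each gradient component
    and divide the step by its square root: steep directions are damped,
    shallow directions keep making progress."""
    s = beta * s + (1.0 - beta) * grad * grad      # element-wise square
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s

# Elongated bowl: tiny gradients in x, huge gradients in y.
theta = np.array([5.0, 5.0])
s = np.zeros_like(theta)
for _ in range(500):
    grad = np.array([0.02 * theta[0], 20.0 * theta[1]])
    theta, s = rmsprop_step(theta, s, grad, lr=0.05)
```

Despite the 1000x difference in gradient scale between the two directions, both coordinates are driven towards the minimum, which is the point of the per-component normalization.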
Adaptive Moment Estimation (Adam)

Combines Momentum and RMSProp.

First moment (mean of the gradients):
$m^{k+1} = \beta_1 \cdot m^k + (1-\beta_1)\,\nabla_\theta L(\theta^k)$

Second moment (variance of the gradients):
$v^{k+1} = \beta_2 \cdot v^k + (1-\beta_2)\,[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]$

Update:
$\theta^{k+1} = \theta^k - \alpha \cdot \dfrac{m^{k+1}}{\sqrt{v^{k+1}} + \epsilon}$
Adam (cont.)

Combines Momentum and RMSProp. $m^{k+1}$ and $v^{k+1}$ are initialized with zero, which biases both estimates towards zero early in training.

$m^{k+1} = \beta_1 \cdot m^k + (1-\beta_1)\,\nabla_\theta L(\theta^k)$
$v^{k+1} = \beta_2 \cdot v^k + (1-\beta_2)\,[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]$

Typically, one therefore uses bias-corrected moment estimates:

$\hat{m}^{k+1} = \dfrac{m^{k+1}}{1 - \beta_1^{k+1}}$, $\quad \hat{v}^{k+1} = \dfrac{v^{k+1}}{1 - \beta_2^{k+1}}$

$\theta^{k+1} = \theta^k - \alpha \cdot \dfrac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}$
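Putting both moments and the bias correction together, one Adam iteration can be sketched as follows. A minimal NumPy illustration; the toy loss and step counts are made up.

```python
import numpy as np

def adam_step(theta, m, v, grad, k, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; k is the 1-based iteration index."""
    m = beta1 * m + (1.0 - beta1) * grad            # first moment (mean)
    v = beta2 * v + (1.0 - beta2) * grad * grad     # second moment (variance)
    m_hat = m / (1.0 - beta1 ** k)                  # undo the zero-initialization bias
    v_hat = v / (1.0 - beta2 ** k)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: L(theta) = theta^2, gradient 2*theta.
theta = np.array([2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for k in range(1, 3001):
    grad = 2.0 * theta
    theta, m, v = adam_step(theta, m, v, grad, k, lr=0.01)
```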
Convergence
Importance of Learning Rate
Jacobian and Hessian

• Derivative
• Gradient
• Jacobian
• Hessian (the second derivative)
Newton's method

• Approximate our function by its second-order Taylor series expansion around the current point $\theta^k$:

$L(\theta) \approx L(\theta^k) + \nabla L(\theta^k)^T (\theta - \theta^k) + \tfrac{1}{2} (\theta - \theta^k)^T H(\theta^k)\, (\theta - \theta^k)$

using the first derivative (slope) and the second derivative (curvature).

https://en.wikipedia.org/wiki/Taylor_series
Newton's method (cont.)

• Differentiate the quadratic approximation and set it to zero:

$\nabla L(\theta^k) + H(\theta^k)\,(\theta - \theta^k) = 0$

• Update step: $\theta^{k+1} = \theta^k - H(\theta^k)^{-1}\,\nabla L(\theta^k)$

We got rid of the learning rate! Compare with SGD: $\theta^{k+1} = \theta^k - \alpha\,\nabla_\theta L(\theta^k)$
Newton's method (cont.)

• Update step: $\theta^{k+1} = \theta^k - H(\theta^k)^{-1}\,\nabla L(\theta^k)$

• A network has millions of parameters, so the Hessian has millions-squared elements, and the 'inversion' must be performed in every iteration: the computational complexity per iteration is prohibitive.
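For small problems, though, the update is simple. A minimal NumPy sketch (the quadratic toy loss is made up): instead of explicitly inverting $H$, we solve the linear system $H d = \nabla L$, which is cheaper and numerically more stable.

```python
import numpy as np

def newton_step(theta, grad_fn, hess_fn):
    """theta_{k+1} = theta_k - H^{-1} grad; note there is no learning rate."""
    g = grad_fn(theta)
    H = hess_fn(theta)
    d = np.linalg.solve(H, g)   # solve H d = g instead of inverting H
    return theta - d

# Quadratic loss L(theta) = theta^T A theta: Newton converges in one step.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad_fn = lambda t: 2.0 * A @ t   # gradient of theta^T A theta (A symmetric)
hess_fn = lambda t: 2.0 * A       # constant Hessian

theta = np.array([4.0, -3.0])
theta = newton_step(theta, grad_fn, hess_fn)
```

For any quadratic loss the second-order Taylor expansion is exact, so a single Newton step lands on the minimizer.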
Newton's method (cont.)

• SGD (green) follows the gradient.
• Newton's method exploits the curvature to take a more direct route.

Image from Wikipedia
Newton's method (cont.)

Can you apply Newton's method to linear regression? What do you get as a result?
BFGS and L-BFGS

• Broyden-Fletcher-Goldfarb-Shanno algorithm
• Belongs to the family of quasi-Newton methods
• Maintains an approximation of the inverse of the Hessian
  - full history: BFGS
  - limited memory: L-BFGS
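In practice one rarely implements L-BFGS by hand; a short sketch using SciPy's optimizer (this assumes SciPy is installed, and the Rosenbrock-style toy loss is made up). L-BFGS only needs the function value and the gradient; it builds its own low-rank approximation of the inverse Hessian from recent iterates.

```python
import numpy as np
from scipy.optimize import minimize  # assumes SciPy is available

def loss(t):
    # Rosenbrock function: a classic non-convex test problem, minimum at (1, 1)
    return (1.0 - t[0]) ** 2 + 100.0 * (t[1] - t[0] ** 2) ** 2

def grad(t):
    return np.array([
        -2.0 * (1.0 - t[0]) - 400.0 * t[0] * (t[1] - t[0] ** 2),
        200.0 * (t[1] - t[0] ** 2),
    ])

res = minimize(loss, x0=np.array([-1.5, 2.0]), jac=grad, method="L-BFGS-B")
```

Note that this is a full-batch method: `loss` and `grad` are evaluated on the whole objective, which is exactly why it does not transfer to minibatch training.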
Gauss-Newton

• Newton: $x^{k+1} = x^k - H_F(x^k)^{-1}\,\nabla F(x^k)$
  - 'true' 2nd derivatives are often hard to obtain (e.g., numerics)
  - for least-squares problems: $H_F \approx 2\,J_F^T J_F$
• Gauss-Newton (GN): $x^{k+1} = x^k - [2\,J_F(x^k)^T J_F(x^k)]^{-1}\,\nabla F(x^k)$
• Solve a linear system instead (again, inverting a matrix is unstable):

$2\,J_F(x^k)^T J_F(x^k)\,(x^k - x^{k+1}) = \nabla F(x^k)$

and solve for the delta vector.
Levenberg

• 'Damped' version of Gauss-Newton (Tikhonov regularization):

$[J_F(x^k)^T J_F(x^k) + \lambda \cdot I]\,(x^k - x^{k+1}) = \nabla F(x^k)$

• The damping factor $\lambda$ is adjusted in each iteration, ensuring $F(x^k) > F(x^{k+1})$
  - if the inequality is not fulfilled, increase $\lambda$ (trust region)
• 'Interpolation' between Gauss-Newton (small $\lambda$) and Gradient Descent (large $\lambda$)
Levenberg-Marquardt

• Levenberg-Marquardt (LM):

$[J_F(x^k)^T J_F(x^k) + \lambda \cdot \mathrm{diag}(J_F(x^k)^T J_F(x^k))]\,(x^k - x^{k+1}) = \nabla F(x^k)$

• Instead of a plain Gradient Descent for large $\lambda$, scale each component of the gradient according to the curvature.
  - Avoids slow convergence in components with a small gradient
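One LM iteration plus the damping schedule can be sketched as follows. A minimal NumPy illustration on a made-up linear least-squares fit; the initial $\lambda$ and the halving/doubling factors are example choices, not prescribed values.

```python
import numpy as np

def lm_step(theta, residual_fn, jac_fn, lam):
    """One Levenberg-Marquardt step for F(theta) = sum_i r_i(theta)^2:
    solve (J^T J + lam * diag(J^T J)) d = J^T r, then theta - d."""
    r = residual_fn(theta)
    J = jac_fn(theta)
    JTJ = J.T @ J
    lhs = JTJ + lam * np.diag(np.diag(JTJ))   # curvature-scaled damping
    d = np.linalg.solve(lhs, J.T @ r)         # J^T r is proportional to grad F
    return theta - d

# Toy least squares: fit y = a*x + b to noiseless data from a = 2, b = 1.
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + 1.0
residual_fn = lambda t: t[0] * x + t[1] - y
jac_fn = lambda t: np.stack([x, np.ones_like(x)], axis=1)

theta = np.array([0.0, 0.0])
lam = 1e-3
for _ in range(20):
    new_theta = lm_step(theta, residual_fn, jac_fn, lam)
    # damping schedule: accept the step and decrease lam if the loss went down,
    # otherwise reject it and increase lam (move towards gradient descent)
    if np.sum(residual_fn(new_theta) ** 2) < np.sum(residual_fn(theta) ** 2):
        theta, lam = new_theta, lam * 0.5
    else:
        lam *= 2.0
```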
Which, What, and When?

• Standard: Adam
• Fallback option: SGD with momentum
• Newton, L-BFGS, GN, LM: only if you can do full-batch updates (they do not work well with minibatches!)
  - This practically never happens in DL
  - Theoretically it would be nice, though, due to the fast convergence
General Optimization

• Linear systems (Ax = b):
  - LU, QR, Cholesky, Jacobi, Gauss-Seidel, CG, PCG, etc.
• Non-linear (gradient-based):
  - Newton, Gauss-Newton, LM, (L-)BFGS: second order
  - Gradient Descent, SGD: first order
• Others:
  - Genetic algorithms, MCMC, Metropolis-Hastings, etc.
  - Constrained and convex solvers (Lagrange, ADMM, primal-dual, etc.)
Please Remember!

• Think about your problem and the optimization method at hand
• SGD is specifically designed for minibatches
• When you can, use a 2nd-order method: it's just faster
• GD or SGD is not a way to solve a linear system!
Importance of Learning Rate
Learning Rate

• Need a high learning rate when far away from the optimum
• Need a low learning rate when close to it
Learning Rate Decay

• $\alpha = \dfrac{1}{1 + \text{decay\_rate} \cdot \text{epoch}} \cdot \alpha_0$
  - E.g., $\alpha_0 = 0.1$, decay_rate $= 1.0$:
    - Epoch 0: 0.1
    - Epoch 1: 0.05
    - Epoch 2: 0.033
    - Epoch 3: 0.025
    - ...

(Figure: learning rate over epochs, decaying from 0.1 towards 0)
Learning Rate Decay (cont.)

Many options:
• Step decay: $\alpha = \alpha - t \cdot \alpha$ (only every n steps)
  - t is the decay rate (often 0.5)
• Exponential decay: $\alpha = t^{\,\text{epoch}} \cdot \alpha_0$
  - t is the decay rate ($t < 1.0$)
• $\alpha = \dfrac{t}{\sqrt{\text{epoch}}} \cdot \alpha_0$
  - t is the decay rate
• Etc.
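The schedules above can be written out directly. A small sketch; the default parameter values are example choices, not prescriptions.

```python
import math

def inverse_decay(lr0, decay_rate, epoch):
    """alpha = alpha_0 / (1 + decay_rate * epoch), the 1/t schedule."""
    return lr0 / (1.0 + decay_rate * epoch)

def step_decay(lr0, drop=0.5, every_n=10, epoch=0):
    """Multiply the rate by `drop` once every `every_n` epochs."""
    return lr0 * drop ** (epoch // every_n)

def exp_decay(lr0, t=0.95, epoch=0):
    """alpha = t^epoch * alpha_0, with decay rate t < 1."""
    return lr0 * t ** epoch

def sqrt_decay(lr0, t=1.0, epoch=0):
    """alpha = (t / sqrt(epoch + 1)) * alpha_0 (shifted by 1 to avoid
    dividing by zero at epoch 0)."""
    return lr0 * t / math.sqrt(epoch + 1.0)

# Reproduces the 1/t example from the previous slide
# (alpha_0 = 0.1, decay_rate = 1.0): 0.1, 0.05, 0.0333..., 0.025
schedule = [inverse_decay(0.1, 1.0, e) for e in range(4)]
```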
Training Schedule

• Manually specify the learning rate for the entire training process
• Manually set the learning rate every n epochs
• How?
  - Trial and error (the hard way)
  - Some experience (only generalizes to some degree)
  - Consider: number of epochs, training set size, network size, etc.
Learning Rate: Implications

• What if it is too high?
• What if it is too low?