  1. Advanced Machine Learning: Gradient Descent for Non-Convex Functions. Amit Sethi, Electrical Engineering, IIT Bombay

  2. Learning outcomes for the lecture • Characterize non-convex loss surfaces with Hessian • List issues with non-convex surfaces • Explain how certain optimization techniques help solve these issues

  3. Contents • Characterizing non-convex loss surfaces • Issues with gradient descent • Issues with Newton's method • Stochastic gradient descent to the rescue • Momentum and its variants • Saddle-free Newton

  4. Why do we not get stuck in bad local minima? • Local minima are close to global minima in terms of errors • Saddle points are much more likely at higher portions of the error surface (in high-dimensional weight space) • SGD (and other techniques) allow you to escape the saddle points

  5. Error surfaces and saddle points (figures: http://math.etsu.edu/multicalc/prealpha/Chap2/Chap2-8/10-6-53.gif, http://pundit.pratt.duke.edu/piki/images/thumb/0/0a/SurfExp04.png/400px-SurfExp04.png)

  6. Eigenvalues of the Hessian at critical points: local minimum (all eigenvalues positive), long furrow (some eigenvalues near zero, the rest positive), plateau (eigenvalues near zero), saddle point (eigenvalues of mixed sign). Figure: http://i.stack.imgur.com/NsI2J.png
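
As a small illustration of this classification (my own addition, not from the slides), the sketch below estimates the Hessian of a toy two-variable function by finite differences and inspects its eigenvalues; the function f(x, y) = x^2 - y^2 and the numpy helper are assumptions made for the example.

    import numpy as np

    def hessian_eigenvalues(f, x, eps=1e-4):
        """Estimate the Hessian of f at x by central finite differences
        and return its eigenvalues."""
        d = len(x)
        H = np.zeros((d, d))
        for i in range(d):
            for j in range(d):
                ei = np.zeros(d); ei[i] = eps
                ej = np.zeros(d); ej[j] = eps
                H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                           - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
        return np.linalg.eigvalsh((H + H.T) / 2)   # symmetrize before the eigendecomposition

    # f(x, y) = x^2 - y^2 has a saddle point at the origin
    f = lambda w: w[0] ** 2 - w[1] ** 2
    print(hessian_eigenvalues(f, np.array([0.0, 0.0])))   # approx [-2, 2]: mixed signs, so a saddle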

  7. A realistic picture of a loss landscape: global minima, local minima, local maxima, and saddle points. Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/

  8. Is achieving the global minimum important? • The global minimum for the training data may not be the global minimum for the validation or test data • Local minima are often good enough. “The Loss Surfaces of Multilayer Networks”, Choromanska et al., JMLR’15

  9. Under certain assumptions, local minima are theoretically also of high quality • Results: – The lowest critical values of the random loss form a band – The probability of finding a minimum outside that band diminishes exponentially with the size of the network – Empirical verification • Assumptions: – Fully-connected feed-forward neural network – Variable independence – Redundancy in network parametrization – Uniformity. “The Loss Surfaces of Multilayer Networks”, Choromanska et al., JMLR’15

  10. Empirically, most minima are of high quality. “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”, Dauphin et al., NIPS’14

  11. GD vs. Newton's method • Gradient descent is based on a first-order approximation: f(θ* + Δθ) ≈ f(θ*) + ∇f(θ*)^T Δθ, giving the update Δθ = −η ∇f(θ*) • Newton's method is based on a second-order approximation: f(θ* + Δθ) ≈ f(θ*) + ∇f(θ*)^T Δθ + (1/2) Δθ^T H Δθ, giving the update Δθ = −H^{-1} ∇f(θ*) • In the eigenbasis of H, near a critical point, f(θ* + Δθ) = f(θ*) + (1/2) Σ_{i=1..n} λ_i Δv_i^2, where λ_i are the eigenvalues of H and Δv_i the components of Δθ along its eigenvectors. “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”, Dauphin et al., NIPS’14
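
To make the contrast concrete, here is a minimal sketch (my own, not from the lecture) on a toy quadratic with a saddle at the origin; the Hessian diag(2, -0.5), the learning rate 0.1, and the 20 iterations are arbitrary illustrative choices. It previews the point the next slide raises: the Newton step jumps straight to the critical point even when it is a saddle, while gradient descent drifts away along the negative-curvature direction.

    import numpy as np

    # Toy quadratic f(w) = 0.5 * w^T H w with a saddle at the origin
    H = np.diag([2.0, -0.5])            # one positive, one negative eigenvalue
    grad = lambda w: H @ w

    eta = 0.1
    w_gd = np.array([1.0, 0.1])
    w_newton = w_gd.copy()
    for _ in range(20):
        w_gd = w_gd - eta * grad(w_gd)                              # first-order (GD) step
        w_newton = w_newton - np.linalg.solve(H, grad(w_newton))    # second-order (Newton) step

    print("GD:    ", w_gd)       # escapes along the negative-curvature direction
    print("Newton:", w_newton)   # lands exactly on the saddle point at the origin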

  12. Disadvantages of 2nd-order methods • Each update requires O(d^3), or at least O(d^2), computation for d parameters • May not work well on non-convex surfaces • Get attracted to saddle points (how?) • Not very good for mini-batch updates
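
A quick back-of-the-envelope check of the O(d^2) storage cost alone (my own illustration, with an arbitrary parameter count):

    # Memory for a dense d-by-d Hessian in float64 (8 bytes per entry)
    d = 1_000_000                      # a modest one-million-parameter model
    bytes_needed = d * d * 8
    print(bytes_needed / 1e12, "TB")   # 8.0 TB, versus about 8 MB for the gradient alone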

  13. GD vs. SGD • GD: w_{t+1} = w_t − η g_all(w_t), where g_all is the gradient over all training samples • SGD: w_{t+1} = w_t − η g_batch(w_t), where g_batch is the gradient over a random subset (mini-batch) of the samples
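
A minimal side-by-side sketch of the two updates on a least-squares problem (my own example; the synthetic data, learning rate 0.1, and batch size 32 are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                  # synthetic data
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.1 * rng.normal(size=1000)

    def grad(w, Xb, yb):
        # Gradient of the mean squared error on the batch (Xb, yb)
        return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

    eta, batch_size = 0.1, 32
    w_gd, w_sgd = np.zeros(5), np.zeros(5)
    for _ in range(200):
        w_gd = w_gd - eta * grad(w_gd, X, y)                   # GD: gradient over all samples
        idx = rng.choice(len(X), size=batch_size, replace=False)
        w_sgd = w_sgd - eta * grad(w_sgd, X[idx], y[idx])      # SGD: gradient over a random subset

    print(np.linalg.norm(w_gd - w_true), np.linalg.norm(w_sgd - w_true))   # SGD also gets close, but is noisier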

  14. Compare GD with SGD • GD requires more computation per update • SGD updates are noisier

  15. SGD helps by changing the loss surface • Different mini-batches (or individual samples) have their own loss surfaces • The loss surface of the entire training set (dotted in the slide figure) may be different • A local minimum of one loss surface may not be a local minimum of another • This helps us escape local minima when using stochastic or mini-batch gradient descent • The mini-batch size depends on computational resource utilization

  16. Noise can be added in other ways to escape saddle points • Random mini-batches (SGD) • Add noise to the gradient or the update • Add noise to the input
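
A toy sketch of the second idea, adding noise to the gradient (my own example; the saddle-shaped objective, the noise scale 0.01, and the step size are illustrative assumptions). Started exactly on the saddle's attracting axis, plain GD converges to the saddle, while the noisy version escapes.

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy objective f(w) = w0^2 - 0.5 * w1^2 with a saddle point at the origin
    grad = lambda w: np.array([2.0 * w[0], -1.0 * w[1]])

    w_plain = np.array([1.0, 0.0])   # exactly on the saddle's attracting axis
    w_noisy = np.array([1.0, 0.0])
    eta = 0.05
    for _ in range(200):
        w_plain = w_plain - eta * grad(w_plain)                                       # plain GD
        w_noisy = w_noisy - eta * (grad(w_noisy) + rng.normal(scale=0.01, size=2))    # GD with gradient noise

    print(w_plain)   # second coordinate stays exactly 0: plain GD heads to the saddle
    print(w_noisy)   # noise kicks it off the axis and the negative curvature takes over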

  17. Learning rate scheduling • High learning rates explore faster early on – But they can lead to divergence or a high final loss • Low learning rates fine-tune better later – But they can be very slow to converge • LR scheduling combines the advantages of both – Many schedules are possible: linear, exponential, square-root, step-wise, cosine (figure: training loss vs. training iterations for different schedules)
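
The schedules in the last bullet can be written as simple functions of the iteration number; the sketch below (my own, with arbitrary example hyperparameters) gives step-wise, exponential, and cosine versions.

    import math

    def step_lr(t, lr0=0.1, drop=0.5, every=30):
        # Step-wise: multiply the learning rate by `drop` every `every` iterations
        return lr0 * drop ** (t // every)

    def exp_lr(t, lr0=0.1, gamma=0.98):
        # Exponential decay
        return lr0 * gamma ** t

    def cosine_lr(t, lr0=0.1, T=100):
        # Cosine annealing from lr0 down to 0 over T iterations
        return 0.5 * lr0 * (1 + math.cos(math.pi * min(t, T) / T))

    for t in (0, 30, 60, 90):
        print(t, step_lr(t), exp_lr(t), round(cosine_lr(t), 4))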

  18. Classical and Nesterov momentum • GD: w_{t+1} = w_t − η g(w_t) • Classical momentum: v_{t+1} = α v_t − η g(w_t); w_{t+1} = w_t + v_{t+1} • Nesterov momentum: v_{t+1} = α v_t − η g(w_t + α v_t); w_{t+1} = w_t + v_{t+1} • The look-ahead gradient gives better course-correction for a bad velocity. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983; Sutskever et al., “On the importance of initialization and momentum in deep learning”, ICML 2013
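
The two update rules, written as code on a toy ill-conditioned quadratic (my own example; the objective f(w) = w0^2 + 10 w1^2 and the hyperparameters η = 0.02, α = 0.9 are illustrative assumptions):

    import numpy as np

    # Ill-conditioned quadratic f(w) = w0^2 + 10 * w1^2, minimum at the origin
    grad = lambda w: np.array([2.0 * w[0], 20.0 * w[1]])

    eta, alpha = 0.02, 0.9
    w_cm, v_cm = np.array([1.0, 1.0]), np.zeros(2)    # classical momentum state
    w_nag, v_nag = np.array([1.0, 1.0]), np.zeros(2)  # Nesterov momentum state

    for _ in range(100):
        # Classical momentum: gradient evaluated at the current point
        v_cm = alpha * v_cm - eta * grad(w_cm)
        w_cm = w_cm + v_cm
        # Nesterov momentum: gradient evaluated at the look-ahead point w + alpha * v
        v_nag = alpha * v_nag - eta * grad(w_nag + alpha * v_nag)
        w_nag = w_nag + v_nag

    print(w_cm, w_nag)   # both approach the minimum at the origin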

  19. AdaGrad, RMSProp, AdaDelta • AdaGrad scales the gradient by a running norm of all previous gradients • Per dimension: x_{t+1} = x_t − η g(x_t) / sqrt(Σ_{i=1..t} g(x_i)^2 + ε) • This automatically reduces the learning rate with t • Parameters with small gradients get relatively larger updates, so they speed up • RMSProp and AdaDelta use a forgetting factor in the squared-gradient accumulator so that the updates do not become too small
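
A minimal sketch of the AdaGrad and RMSProp per-dimension scaling (my own; note that I place ε outside the square root, a common variant, and the toy gradient with very different per-dimension scales is an assumption made for the example):

    import numpy as np

    def adagrad(grad, w, eta=0.5, eps=1e-8, steps=200):
        G = np.zeros_like(w)                      # running sum of squared gradients, per dimension
        for _ in range(steps):
            g = grad(w)
            G = G + g ** 2
            w = w - eta * g / (np.sqrt(G) + eps)
        return w

    def rmsprop(grad, w, eta=0.05, rho=0.9, eps=1e-8, steps=200):
        G = np.zeros_like(w)                      # exponential moving average of squared gradients
        for _ in range(steps):
            g = grad(w)
            G = rho * G + (1 - rho) * g ** 2      # forgetting factor rho
            w = w - eta * g / (np.sqrt(G) + eps)
        return w

    # Toy gradient with very different scales per dimension (f = w0^2 + 0.01 * w1^2)
    grad = lambda w: np.array([2.0 * w[0], 0.02 * w[1]])
    print(adagrad(grad, np.array([1.0, 1.0])))   # both dimensions make progress despite the scale gap
    print(rmsprop(grad, np.array([1.0, 1.0])))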

  20. Adam optimizer combines AdaGrad and momentum • Initialize m_0 = 0, v_0 = 0 • Loop over t: – g_t = ∇_x f_t(x_{t−1}) (get gradient) – m_t = β_1 m_{t−1} + (1 − β_1) g_t (update first moment, biased) – v_t = β_2 v_{t−1} + (1 − β_2) g_t^2 (update second moment, biased) – m̂_t = m_t / (1 − β_1^t) (correct bias in first moment) – v̂_t = v_t / (1 − β_2^t) (correct bias in second moment) – x_t = x_{t−1} − α m̂_t / (sqrt(v̂_t) + ε) (update parameters) “Adam: A method for stochastic optimization”, Kingma and Ba, ICLR’15
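
The loop above, written out as a small numpy function (a sketch of the standard Adam update with the paper's default β values; the toy gradient and step count are my own choices):

    import numpy as np

    def adam(grad, x, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
        m = np.zeros_like(x)                          # first moment estimate
        v = np.zeros_like(x)                          # second moment estimate
        for t in range(1, steps + 1):
            g = grad(x)                               # get gradient
            m = beta1 * m + (1 - beta1) * g           # update first moment (biased)
            v = beta2 * v + (1 - beta2) * g ** 2      # update second moment (biased)
            m_hat = m / (1 - beta1 ** t)              # correct bias in first moment
            v_hat = v / (1 - beta2 ** t)              # correct bias in second moment
            x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)   # update parameters
        return x

    # Same toy gradient with very different per-dimension scales as above
    grad = lambda x: np.array([2.0 * x[0], 0.02 * x[1]])
    print(adam(grad, np.array([1.0, 1.0])))   # moves both coordinates toward the minimum at the origin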

  21. Visualizing optimizers Source: http://ruder.io/optimizing-gradient-descent/index.html
