Adaptive Gradient Methods and Beyond
Liangchen Luo
Peking University, Beijing
luolc.witty@gmail.com
March 2019
From SGD to Adam
● SGD (Robbins & Monro, 1951)
  ○ + Momentum (Qian, 1999)
  ○ + Nesterov (Nesterov, 1983)
● AdaGrad (Duchi et al., 2011)
● RMSprop (Tieleman & Hinton, 2012)
● Adam (Kingma & Lei Ba, 2015)
Stochastic Gradient Descent (Robbins & Monro, 1951)
H. Robbins, S. Monro. A stochastic approximation method. Ann. Math. Stat., 1951.
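The update rule itself did not survive extraction; the standard SGD update being referenced here, with step size α_t and stochastic gradient g_t, is:

\theta_{t+1} = \theta_t - \alpha_t \, g_t, \qquad g_t = \nabla_\theta f_t(\theta_t)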
SGD with Momentum (Qian, 1999)
The update in the original paper differs from the actual implementation in PyTorch (a sketch of both follows below).
Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 1999.
Figure source: https://www.willamette.edu/~gorr/classes/cs449/momrate.html
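Both update formulas were lost in extraction. A minimal NumPy sketch of the difference, with hyperparameter defaults and function names of my own choosing:

```python
import numpy as np

def paper_momentum_step(theta, v, grad, lr=0.01, mu=0.9):
    """Qian (1999): the learning rate scales only the gradient term,
    so the velocity buffer already carries the lr."""
    v = mu * v - lr * grad
    return theta + v, v

def pytorch_style_momentum_step(theta, buf, grad, lr=0.01, mu=0.9):
    """PyTorch-style SGD momentum: the buffer accumulates raw gradients
    and lr scales the entire step, momentum part included."""
    buf = mu * buf + grad
    return theta - lr * buf, buf

# Usage on a toy parameter vector.
theta = np.zeros(3)
v = np.zeros(3)
grad = np.ones(3)
theta, v = paper_momentum_step(theta, v, grad)
```

With a fixed lr the two are equivalent (v = -lr * buf); they diverge once the learning rate is changed during training.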
Nesterov Accelerated Gradient (Nesterov, 1983)
Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN USSR, 1983.
Figure source: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
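The update formula is missing from the extracted slide; a common statement of NAG, with momentum μ and step size α, evaluates the gradient at the look-ahead point:

v_t = \mu v_{t-1} - \alpha \nabla f(\theta_{t-1} + \mu v_{t-1}), \qquad \theta_t = \theta_{t-1} + v_t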
AdaGrad (Duchi et al., 2011)
G_t is a diagonal matrix where each diagonal element G_{t,ii} is the sum of the squares of the gradients w.r.t. θ_i up to time step t.
J. Duchi, E. Hazan, Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2011.
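The update itself was dropped in extraction; in this notation the standard per-coordinate AdaGrad step is:

G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2, \qquad \theta_{t+1,i} = \theta_{t,i} - \frac{\alpha}{\sqrt{G_{t,ii}} + \epsilon}\, g_{t,i}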
RMSprop (Tieleman & Hinton, 2012)
Use an exponential moving average instead of the sum used in AdaGrad.
Source: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
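The equations are missing from the extracted slide; up to notation, the RMSprop update from Hinton's lecture slides is:

E[g^2]_t = \beta\, E[g^2]_{t-1} + (1-\beta)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t} + \epsilon}\, g_t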
Adam (Kingma & Lei Ba, 2015)
The hat terms apply bias correction to the first- and second-moment estimates (see the equations below).
D. P. Kingma, J. Lei Ba. Adam: A method for stochastic optimization. ICLR, 2015.
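The update equations were lost in extraction; the standard Adam update is:

m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad \text{(bias correction)}
\theta_{t+1} = \theta_t - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}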
Adaptive Methods: Pros
● Faster training speed
● Smoother learning curve
● Easier to choose hyperparameters (Kingma & Lei Ba, 2015)
● Better when data are very sparse (Dean et al., 2012)
Adaptive Methods: Cons
● Worse performance on unseen data, i.e. the dev/test set (Wilson et al., 2017)
● Convergence issues caused by non-decreasing learning rates (Reddi et al., 2018)
● Convergence issues caused by extreme learning rates (Luo et al., 2019)
Worse Performance on Unseen Data (Wilson et al., 2017)
The authors construct a binary classification example where different algorithms can find entirely different solutions when initialized from the same point; in particular, the adaptive methods find a solution with worse out-of-sample error than SGD.
A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, B. Recht. The marginal value of adaptive gradient methods in machine learning. NeurIPS, 2017.
Convergence Issue Caused by Non-Decreasing Learning Rates (Reddi et al., 2018)
The following quantity is always greater than or equal to zero for SGD, but this is not necessarily the case for Adam and RMSprop, which translates to non-decreasing learning rates. The authors prove that this can result in undesirable convergence behavior in certain cases.
S. J. Reddi, S. Kale, S. Kumar. On the convergence of Adam and beyond. ICLR, 2018.
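The quantity itself was lost in extraction; as defined in the paper (with V_t the second-moment estimate and α_t the step size), it is:

\Gamma_{t+1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_t}}{\alpha_t}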
Convergence Issue Caused by Extreme Learning Rates (Luo et al., 2019)
The authors demonstrate the existence of extreme learning rates when the model is close to convergence, and prove that this can lead to undesirable convergence behavior for Adam and RMSprop in certain cases, regardless of the value of the initial step size.
Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. ICLR, 2019.
Proposals for Improvement
● AMSGrad (Reddi et al., 2018)
● AdaBound (Luo et al., 2019)
● AdaShift (Zhou et al., 2019)
● Padam (Chen & Gu, 2019)
● NosAdam (Huang et al., 2019)
AMSGrad (Reddi et al., 2018)
Guarantees that the quantity Γ_t above stays non-negative, i.e. non-increasing learning rates.
S. J. Reddi, S. Kale, S. Kumar. On the convergence of Adam and beyond. ICLR, 2018.
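The update is missing from the extracted slide; the AMSGrad modification keeps a running maximum of the second-moment estimate and uses it in place of v_t:

\hat{v}_t = \max(\hat{v}_{t-1}, v_t), \qquad \theta_{t+1} = \theta_t - \frac{\alpha_t}{\sqrt{\hat{v}_t} + \epsilon}\, m_t

so that the effective learning rate α_t / √(v̂_t) can only decrease.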
Figure source: https://fdlm.github.io/post/amsgrad/
AdaBound (Luo et al., 2019)
Consider applying the following operation in Adam, which clips the learning rate element-wise so that the output is constrained to a bounded interval. It follows that SGD(M) and Adam can be seen as the two extreme cases of this clipping (spelled out below).
Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. ICLR, 2019.
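The clipping operation and the two limiting cases did not survive extraction; up to notation, with α the Adam step size, v_t the second-moment estimate, and α* the SGD learning rate, they are:

\hat{\eta}_t = \mathrm{Clip}\!\left(\frac{\alpha}{\sqrt{v_t}},\; \eta_l,\; \eta_u\right)

For SGD(M): \eta_l = \eta_u = \alpha^*
For Adam: \eta_l = 0,\; \eta_u = \infty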
Applying bounds on the learning rates:
Employ η_l(t) and η_u(t) as functions of t instead of constant lower and upper bounds, where
● η_l(t) is an increasing function that starts from 0 and converges to α* asymptotically;
● η_u(t) is a decreasing function that starts from ∞ and converges to α* asymptotically.
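A minimal NumPy sketch of the bounded update; the exact form of η_l(t) and η_u(t), the default values, and the function names are illustrative assumptions rather than the schedule from the paper:

```python
import numpy as np

def bound_functions(t, final_lr=0.1, gamma=1e-3):
    # eta_l(t) rises from ~0 toward final_lr; eta_u(t) falls from a large value toward final_lr.
    # t is the global step, starting from 1.
    eta_l = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    eta_u = final_lr * (1.0 + 1.0 / (gamma * t))
    return eta_l, eta_u

def adabound_step(theta, m_hat, v_hat, t, lr=1e-3, final_lr=0.1, gamma=1e-3, eps=1e-8):
    # m_hat, v_hat: bias-corrected first/second moment estimates from Adam.
    eta_l, eta_u = bound_functions(t, final_lr, gamma)
    step_size = np.clip(lr / (np.sqrt(v_hat) + eps), eta_l, eta_u)
    return theta - step_size * m_hat
```

As t grows, both bounds converge to final_lr and the update smoothly transitions from Adam-like to SGD-like behavior.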
The Robustness of AdaBound
The Limitations of AdaBound
● Based on the assumption that SGD would perform better than Adam w.r.t. the final generalization ability
● The form of the bound functions
  ○ gamma as a function of the expected global step?
  ○ other functions?
● Fixed final learning rate
  ○ how to determine the final learning rate automatically?
Any questions?