Efficient Full-Matrix Adaptive Regularization
Naman Agarwal, Brian Bullins, Xinyi Chen, Elad Hazan, Karan Singh, Cyril Zhang, Yi Zhang
Princeton University, Google AI Princeton
Adaptive Preconditioning in ML
● Optimization in ML: training neural nets → minimizing non-convex losses
● Diagonal adaptive optimizers: each coordinate has its own learning rate, set according to past gradients (see the sketch below)
○ AdaGrad, Adam, RMSProp
○ Work well in practice
○ But theory is so far only known for convex losses
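For reference, here is a minimal sketch of a diagonal adaptive update in the AdaGrad style. The function name adagrad_step and the parameters lr and eps are illustrative choices, not taken from the slides.

```python
# Minimal diagonal-AdaGrad-style update: each coordinate's step size shrinks
# with the squared gradients it has accumulated so far.  Illustrative only.
import numpy as np

def adagrad_step(x, g, accum, lr=0.1, eps=1e-8):
    accum = accum + g**2                      # per-coordinate sum of squared gradients
    return x - lr * g / (np.sqrt(accum) + eps), accum

# Toy usage on a 5-dimensional parameter vector.
x = np.zeros(5)
accum = np.zeros(5)
g = np.array([1.0, 0.1, 0.0, -0.5, 2.0])
x, accum = adagrad_step(x, g, accum)
```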
Adaptive Preconditioning: Intuition
● Diagonal: doesn't adapt to a rotated basis
● Full-matrix: learns the correct basis → faster optimization, but expensive!
● Can we have a linear-time algorithm?
Our Results
● GGT: a new adaptive optimizer, an efficient full-matrix (low-rank) AdaGrad
● GPU-friendly implementation
● Experiments: faster training, and sometimes better generalization, on vision and language tasks
● Theory: "adaptive" convergence rate on convex and non-convex functions
● Rate can be up to an O(1/√d) fraction of SGD's, i.e. up to √d× faster
The GGT Trick
● Scalar case: divide the gradient by the square root of the accumulated squared gradients (diagonal AdaGrad does this per coordinate)
● Matrix case: precondition by (GG^T)^(-1/2), where the d × w matrix G holds the last w gradients as columns; naively this requires working with a d × d matrix
● The trick: the inverse square root of GG^T can be applied using only the small w × w matrix G^T G, e.g. via (GG^T)^(-1/2) g = G (G^T G)^(-3/2) G^T g on the span of G, so each step costs time linear in d
● Efficient implementation on the GPU! (see the sketch below)
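Below is a minimal NumPy sketch of one GGT-style preconditioned step, included only to illustrate the trick. It is not the authors' GPU implementation; the function name ggt_step, the step size lr, and the exact regularization (GG^T + eps·I)^(-1/2) are assumptions made for this sketch.

```python
# Illustrative NumPy sketch of a GGT-style step (assumed names/regularization).
import numpy as np

def ggt_step(x, g, G, lr=0.1, eps=1e-4):
    """One step x - lr * (G G^T + eps*I)^(-1/2) g, computed via the small w x w matrix G^T G."""
    d, w = G.shape
    # Eigendecompose the w x w Gram matrix: G^T G = V diag(s) V^T.
    # Its nonzero eigenvalues match those of the d x d matrix G G^T,
    # so nothing of size d x d is ever formed or factored.
    s, V = np.linalg.eigh(G.T @ G)          # O(d w^2 + w^3)
    sigma = np.sqrt(np.maximum(s, 0.0))     # singular values of G
    U = G @ (V / np.maximum(sigma, 1e-12))  # left singular vectors of G (d x w)
    # Decompose g into the span of the recent gradients and its complement.
    coeff = U.T @ g
    g_perp = g - U @ coeff
    # (G G^T + eps I)^(-1/2) scales the span directions by 1/sqrt(sigma^2 + eps)
    # and the orthogonal complement by 1/sqrt(eps).
    precond_g = U @ (coeff / np.sqrt(sigma**2 + eps)) + g_perp / np.sqrt(eps)
    return x - lr * precond_g

# Toy usage: d = 10^3 parameters, window of w = 20 past gradients.
rng = np.random.default_rng(0)
d, w = 1000, 20
G = rng.standard_normal((d, w))      # columns = recent gradients
x, g = rng.standard_normal(d), rng.standard_normal(d)
x_new = ggt_step(x, g, G)
```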
Large-Scale Experiments (CIFAR-10, PTB)
● ResNet-26 for CIFAR-10 and an LSTM for PTB
● Better and faster training
● Initial acceleration when optimizing the LSTM
● Better validation perplexity for the LSTM
Theory
● Define the adaptivity ratio: roughly, the ratio of the adaptive convergence bound to SGD's (see the bounds below)
○ [DHS10] bound this for diagonal AdaGrad; the ratio is sometimes smaller for full-matrix AdaGrad
● Non-convex reduction: GGT* converges to an approximate stationary point in a number of steps that scales with the adaptivity ratio
● First step towards analyzing adaptive methods in non-convex optimization
* Idealized modification of GGT for analysis. See the paper for details.
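For context, these are the classical regret bounds that the adaptivity ratio compares, stated up to constants; the precise definition used for GGT is in the paper.

```latex
% Classical regret bounds, up to constants (D = diameter/comparator term,
% G_\infty = \max_t \|g_t\|).  The adaptivity ratio compares an adaptive
% bound of this form to the SGD bound.
\[
\text{SGD/OGD:}\quad \mathcal{R}_T \lesssim D\, G_\infty \sqrt{T}, \qquad
\text{diagonal AdaGrad [DHS10]:}\quad \mathcal{R}_T \lesssim D_\infty \sum_{i=1}^{d} \|g_{1:T,i}\|_2,
\]
\[
\text{full-matrix AdaGrad [DHS10]:}\quad
\mathcal{R}_T \lesssim D\, \operatorname{tr}\!\Big(\big(\textstyle\sum_{t=1}^{T} g_t g_t^\top\big)^{1/2}\Big).
\]
```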
A note on the important parameters
● Improving the dependence on ε: in practice this yields an improvement of only about 3.1× (see the estimate below)
● Instead, our improvement can be as large as the dimension, which can be 1e7 for language models
● Huge untapped potential for large-scale optimization!
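As a rough sanity check of the magnitudes, assuming the slide refers to improving the ε-dependence from ε^(-2) to ε^(-7/4) and to a practical accuracy of ε ≈ 10^(-2) (both are inferences consistent with the stated 3.1, not stated on the slide):

```latex
% Inferred magnitudes: epsilon-dependence improvement vs. dimension-dependent improvement.
\[
\frac{\epsilon^{-2}}{\epsilon^{-7/4}} \;=\; \epsilon^{-1/4} \;=\; (10^{-2})^{-1/4} \;\approx\; 3.16,
\qquad\text{whereas}\qquad d \;\approx\; 10^{7}\ \text{for large language models.}
\]
```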
Thank You! Poster #209 xinyic@google.com