
COMPRESSING GRADIENT OPTIMIZERS VIA COUNT-SKETCHES



  1. COMPRESSING GRADIENT OPTIMIZERS VIA COUNT-SKETCHES. Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, Anshumali Shrivastava. Rice University, Amazon Search. ICML 2019

  2. Deep Learning is Resource Intensive. Training deep learning models requires large amounts of time and resources.

  3. Data Parallelism for Faster Training! A key tool for reducing training time is to increase the batch size.

  4. Data Parallelism: Memory Limitations. Increasing the batch size requires significant amounts of memory.

  5. Faster Training vs. an Expressive Model. The alternative is to sacrifice batch size for a larger, more expressive model.

  6. Pesky Popular Optimizers
  • The auxiliary parameters used by popular optimizers aggravate the memory issue
  • e.g. Adam, RMSProp, Adagrad, Momentum (a sketch of Adam's auxiliary state follows the slide list)

  7. Optimizers: A Concrete Example
  • Training the BERT Transformer on an Nvidia V100 16 GB GPU*
  • SGD: 10,800 MB; Adam: 13,362 MB
  • The auxiliary variables require 2,562 MB of extra memory!
  * Using activation checkpointing and mixed-precision training

  8. Our Goal
  • Compress the auxiliary variables
  • Maintain the convergence rate and accuracy of the full-sized optimizer

  9. Count-Sketches to the Rescue!
  • Solution: Compress the auxiliary variables with count-sketches
  • Intuition: Map multiple model parameters to the same cell in the count-sketch (see the count-sketch example after the slide list)
  • Outcome: Free memory for a more expressive model and/or a larger batch size

  10. Highlighted Result: LSTM on LM1B

      Metric            Adam     Count-Sketch
      Time (Hrs)        5.28     5.42
      Size (MB)         10,813   7,693
      Test Perplexity   39.90    40.55

  • The Count-Sketch optimizer used 5x fewer parameters
  • Upshot: Reduced memory usage with minimal accuracy or performance loss

  11. Please visit the poster today! 6:30 pm @ Pacific Ballroom #83
  GitHub: https://github.com/rdspring1/Count-Sketch-Optimizers
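
Two brief code notes follow; both are illustrative NumPy sketches, not the authors' implementation.

For slide 6, a textbook Adam step shows where the auxiliary memory goes: the buffers m and v are the auxiliary variables, each the same shape as the parameter vector, so Adam holds roughly three arrays per parameter where plain SGD holds one.

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # One Adam update. The auxiliary buffers m and v each match the
        # shape of the parameters w, which is where the extra memory goes.
        m = b1 * m + (1 - b1) * grad        # first-moment (momentum) estimate
        v = b2 * v + (1 - b2) * grad**2     # second-moment estimate
        m_hat = m / (1 - b1**t)             # bias corrections for early steps
        v_hat = v / (1 - b2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v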
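
For slide 9, here is a minimal, generic count-sketch, with explicit per-index hash tables kept for clarity; the class name and the width/depth choices are illustrative. Multiple parameters collide into the same cell, and the median across rows suppresses the collision noise.

    import numpy as np

    class CountSketch:
        # Compresses n values into a depth x width table.
        def __init__(self, n, width, depth=3, seed=0):
            rng = np.random.default_rng(seed)
            self.table = np.zeros((depth, width))
            # One hash bucket and one random sign per (row, parameter index).
            self.bucket = rng.integers(0, width, size=(depth, n))
            self.sign = rng.choice([-1.0, 1.0], size=(depth, n))

        def update(self, idx, delta):
            # Accumulate the signed update into one cell per row;
            # colliding parameters share cells.
            for j in range(len(self.table)):
                self.table[j, self.bucket[j, idx]] += self.sign[j, idx] * delta

        def query(self, idx):
            # Median across rows cancels most collision noise.
            ests = [self.sign[j, idx] * self.table[j, self.bucket[j, idx]]
                    for j in range(len(self.table))]
            return float(np.median(ests))

    # Illustrative use: a momentum-like buffer for n parameters stored in
    # n/5 cells total, echoing the roughly 5x compression on slide 10.
    n = 1_000_000
    sketch = CountSketch(n, width=n // 15, depth=3)  # 3 * (n // 15) = n / 5
    sketch.update(42, 0.5)
    print(sketch.query(42))  # ~0.5, up to collision noise

Note that storing the bucket and sign tables explicitly would defeat the memory savings; a practical implementation computes them with cheap hash functions on the fly, so only the depth x width table is kept.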
