Communication-efficient Distributed SGD with Sketching


  1. Communication-efficient Distributed SGD with Sketching Nikita Ivkin*, Daniel Rothchild*, Enayat Ullah*, Vladimir Braverman, Ion Stoica, Raman Arora * equal contribution

  2. Going distributed: why?
     ● Large-scale machine learning is moving to the distributed setting due to the growing size of datasets and models, which no longer fit on a single GPU, and due to modern learning paradigms like federated learning.
     ● Master-worker topology: workers compute gradients and communicate them to the master; the master aggregates the gradients, updates the model, and communicates the updated parameters back.
     ● Problem: slow communication overwhelms local computation.
     ● Resolution(s): compress the gradients
       ○ exploit intrinsic low-dimensional structure
       ○ trade off communication against convergence
     ● Examples of compression: sparsification, quantization

  3. Going distributed: how?
     Parallelization strategies: data parallelism (the most popular), model parallelism, hybrid.
     [Diagram: the three parallelization strategies]

  4. Going distributed: how?
     Synchronization topologies: parameter server, all-gather, hybrid.
     [Diagram: workers holding batches 1..m under each synchronization topology]

  5.-13. Going distributed: how? Synchronization with the parameter server:
     - mini-batches are distributed among the workers (batch 1, ..., batch m)
     - each worker makes a forward-backward pass and computes its gradient g_1, g_2, ..., g_m
     - workers send their gradients to the parameter server
     - the parameter server sums them up, G = g_1 + g_2 + ... + g_m, and sends G back to all workers
     - each worker makes a step using G
     [Diagram, built up over slides 5-13: a parameter server connected to workers 1..m, each worker holding one mini-batch of the data]
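A minimal sketch of one such synchronization round, to make the steps above concrete. The helper compute_gradient and the structure worker_batches are illustrative assumptions, not part of the slides; in practice each worker runs on its own machine rather than inside a Python loop.

    import numpy as np

    def sync_sgd_round(params, worker_batches, compute_gradient, lr=0.1):
        """One synchronous round of distributed SGD with a parameter server (toy sketch)."""
        # Each worker computes its local gradient g_i on its own mini-batch
        # (stand-in for the forward-backward pass).
        local_grads = [compute_gradient(params, batch) for batch in worker_batches]

        # Workers send g_1, ..., g_m to the parameter server, which sums them.
        G = np.sum(local_grads, axis=0)

        # The server sends G back; every worker applies the same step,
        # so all model replicas stay synchronized.
        return params - lr * G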

  14.-15. Going distributed: what's the problem?
      ● Slow communication overwhelms local computation:
        ○ the parameter vector of a large model can weigh up to 0.5 GB
        ○ the entire parameter vector is synchronized every fraction of a second
      ● The mini-batch size has a limit to its growth, so computation resources end up being wasted while workers wait on communication.
      [Diagram: parameter server with workers 1..m and their mini-batches]
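A rough back-of-envelope check of why this overwhelms the network. Only the 0.5 GB figure comes from the slide; the number of workers and the time per synchronization below are assumed for illustration.

    # Back-of-envelope communication load at the parameter server.
    param_bytes = 0.5e9     # ~0.5 GB gradient/parameter vector (from the slide)
    num_workers = 8         # assumed
    step_time_s = 0.5       # assumed ("every fraction of a second")

    # Per step, the server receives one gradient from every worker and
    # broadcasts the aggregate back to every worker.
    bytes_per_step = 2 * num_workers * param_bytes
    print(f"{bytes_per_step / step_time_s / 1e9:.0f} GB/s of server bandwidth")  # 16 GB/s here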

  16. Going distributed: how do others deal with it?
      ● Compressing the gradients: quantization, sparsification

  17. Quantization
      ● Quantizing gradients can give a constant-factor decrease in communication cost.
      ● The simplest approach quantizes to 16 bits, but quantization all the way down to 2 bits (TernGrad [1]) and 1 bit (signSGD [2]) has been successful, and error feedback can fix the resulting convergence issues [3].
      ● Quantization techniques can in principle be combined with gradient sparsification.
      [1] Wen, Wei, et al. "TernGrad: Ternary gradients to reduce communication in distributed deep learning." Advances in Neural Information Processing Systems. 2017.
      [2] Bernstein, Jeremy, et al. "signSGD: Compressed optimisation for non-convex problems." arXiv preprint arXiv:1802.04434 (2018).
      [3] Karimireddy, Sai Praneeth, et al. "Error feedback fixes SignSGD and other gradient compression schemes." arXiv preprint arXiv:1901.09847 (2019).
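As a toy illustration of 1-bit quantization in the spirit of signSGD [2] (a sketch of the common scaled-sign variant, not the authors' implementation):

    import numpy as np

    def quantize_sign(g):
        """1-bit (sign) quantization of a gradient with a single scale factor."""
        scale = np.abs(g).mean()      # one float sent alongside the sign bits
        return scale, np.sign(g)

    def dequantize_sign(scale, signs):
        """Reconstruct an approximate gradient from its quantized form."""
        return scale * signs

    # Each coordinate now costs 1 bit instead of 32, plus one shared float.
    g = np.random.randn(10)
    scale, signs = quantize_sign(g)
    g_hat = dequantize_sign(scale, signs)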

  18. Sparsification
      ● Existing techniques either communicate Ω(Wd) in the worst case or are heuristics (W = number of workers, d = dimension of the gradient).
      ● [1] showed that SGD (on 1 machine) with top-k gradient updates and error accumulation has desirable convergence properties.
      ● Q: Can we extend top-k to the distributed setting?
        ○ MEM-SGD [1] (for 1 machine; the extension to the distributed setting is sequential)
        ○ top-k SGD [2] (assumes the global top-k is close to the sum of the local top-k's)
        ○ Deep gradient compression [3] (no theoretical guarantees)
      ● We resolve the above using sketches!
      [1] Stich, Sebastian U., Jean-Baptiste Cordonnier, and Martin Jaggi. "Sparsified SGD with memory." Advances in Neural Information Processing Systems. 2018.
      [2] Alistarh, Dan, et al. "The convergence of sparsified gradient methods." Advances in Neural Information Processing Systems. 2018.
      [3] Lin, Yujun, et al. "Deep gradient compression: Reducing the communication bandwidth for distributed training." arXiv preprint arXiv:1712.01887 (2017).
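A minimal sketch of top-k sparsification with error accumulation in the spirit of [1] (single machine; the function and variable names are illustrative):

    import numpy as np

    def topk_with_memory(g, memory, k):
        """Top-k sparsification with error accumulation (toy sketch)."""
        corrected = g + memory                    # add back previously dropped mass
        idx = np.argsort(np.abs(corrected))[-k:]  # k largest coordinates in magnitude
        sparse_update = np.zeros_like(corrected)
        sparse_update[idx] = corrected[idx]       # the only part that is communicated
        new_memory = corrected - sparse_update    # error kept locally for the next step
        return sparse_update, new_memory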

  19. Want to find: the frequencies of the balls.
      [Figure: a collection of balls of several types with example counts 9, 4, 2, 5, 2, 3]

  20.-21. [Figures: a single counter fed by a stream of +1/-1 updates, where each item is assigned a sign of +1 or -1 equiprobably and independently; the construction is then repeated with several independent sign assignments]

  22. Count Sketch: each coordinate update is mapped by a bucket hash to a bucket and by a sign hash to +1 or -1.
      [Figure: a coordinate update (index 7, value +1) being hashed into the sketch]

  23. Count Sketch
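To make the data structure concrete, a minimal Count Sketch implementation (a standard textbook version, assumed rather than taken from the authors' code; table sizes are illustrative):

    import numpy as np

    class CountSketch:
        """Minimal Count Sketch: `rows` independent rows of `cols` buckets each."""

        def __init__(self, rows, cols, dim, seed=0):
            rng = np.random.default_rng(seed)
            self.table = np.zeros((rows, cols))
            # One bucket hash and one sign hash per row, fixed up front.
            self.buckets = rng.integers(0, cols, size=(rows, dim))
            self.signs = rng.choice([-1.0, 1.0], size=(rows, dim))

        def update(self, index, value):
            """Apply a single coordinate update: add `value` to coordinate `index`."""
            for r in range(self.table.shape[0]):
                self.table[r, self.buckets[r, index]] += self.signs[r, index] * value

        def estimate(self, index):
            """Median-of-rows estimate of the value of coordinate `index`."""
            return np.median([
                self.signs[r, index] * self.table[r, self.buckets[r, index]]
                for r in range(self.table.shape[0])
            ])

Heavy (large-magnitude) coordinates survive the hashing with high probability, which is what later lets the parameter server recover the top-k gradient entries from a merged sketch.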

  24. Mergeability: the Count Sketch is linear, so sketches computed on different workers (with the same hash functions) can be summed entrywise, and the result is the sketch of the sum: S(g_1) + S(g_2) + ... + S(g_m) = S(g_1 + g_2 + ... + g_m).
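Using the toy CountSketch class sketched above, mergeability is just entrywise addition of the tables (illustrative check; it requires all workers to share the same hash seed):

    import numpy as np

    dim = 1000
    g1, g2 = np.random.randn(dim), np.random.randn(dim)

    # The three sketches share hash functions via the common seed.
    s1, s2, s_sum = (CountSketch(rows=5, cols=100, dim=dim, seed=42) for _ in range(3))
    for i in range(dim):
        s1.update(i, g1[i])
        s2.update(i, g2[i])
        s_sum.update(i, g1[i] + g2[i])

    # Entrywise sum of the two tables equals the sketch of the summed gradient.
    assert np.allclose(s1.table + s2.table, s_sum.table)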

  25.-31. Compression scheme. Synchronization with the parameter server:
      - mini-batches are distributed among the workers
      - each worker makes a forward-backward pass, computes its gradient, and sketches it: S(g_1), S(g_2), ..., S(g_m)
      - workers send their sketches to the parameter server
      - the parameter server merges the sketches, S = S_1 + S_2 + ... + S_m, extracts the top-k coordinates, and sends them back
      [Diagram, built up over slides 25-31: a parameter server connected to workers 1..m, each holding one mini-batch; only sketches and the top-k update cross the network]
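Putting the pieces together, a toy sketch of one round of the compression scheme, reusing the CountSketch class from above. compute_gradient is an assumed stand-in for the forward-backward pass and the sketch sizes are illustrative; the full algorithm in the paper adds error accumulation and other refinements omitted here.

    import numpy as np

    def sketched_sgd_round(params, worker_batches, compute_gradient, k,
                           lr=0.1, rows=5, cols=100, seed=42):
        """One round of distributed SGD where only sketches are communicated."""
        dim = params.size

        # Each worker sketches its local gradient with the shared hash functions.
        sketches = []
        for batch in worker_batches:
            g = compute_gradient(params, batch)
            s = CountSketch(rows=rows, cols=cols, dim=dim, seed=seed)
            for i in range(dim):
                s.update(i, g[i])
            sketches.append(s)

        # The parameter server merges the sketches by summing their tables.
        merged = CountSketch(rows=rows, cols=cols, dim=dim, seed=seed)
        merged.table = sum(s.table for s in sketches)

        # Recover the (approximate) top-k coordinates of the summed gradient
        # and broadcast only that sparse update back to the workers.
        estimates = np.array([merged.estimate(i) for i in range(dim)])
        topk_idx = np.argsort(np.abs(estimates))[-k:]
        G = np.zeros(dim)
        G[topk_idx] = estimates[topk_idx]
        return params - lr * G

Only the sketch tables (size rows x cols, independent of the gradient dimension d) and the k recovered coordinates cross the network, which is where the communication savings come from.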
