High-Performance Communication in Machine Learning

  1. spcl.inf.ethz.ch | @spcl_eth
     T. Hoefler: High-Performance Communication in Machine Learning. Lunch talk at ICL/UTK, Knoxville, TN.
     With contributions from Tal Ben-Nun, Dan Alistarh, Shoshana Jakobovits, Cedric Renggli, and others at SPCL and IST Austria.
     https://www.arxiv.org/abs/1802.09941

  2. I'm here during the afternoon – what to talk about? ☺
     ▪ SPCL: 1 professor, 7 scientific staff, 14 PhD students
     ▪ ETH Zurich: 6.5k staff, 20k students, focus on research
     [Slide images: research areas such as Deep Learning and Climate Simulation]

  3. What is Deep Learning good for? Image captioning, digit recognition, object classification, gameplay AI, neural computers, translation, segmentation, …
     A very active area of research – about 23 papers per day!
     [Plot: number of deep-learning papers per year, 1989–2017]

  4. How does Deep Learning work? Deep Learning is Supercomputing!
     [Figures: a network f(x) maps an image to class probabilities (cat, dog, airplane, horse, bicycle, truck) that are compared to the true label for a layer-wise weight update; number of users of deep-learning services (~0.8 bn); accuracy vs. compute after Canziani et al. 2017]
     ▪ ImageNet (1k): 180 GB; ImageNet (22k): a few TB; industry datasets much larger and growing (e.g., face recognition)
     ▪ 10–22k labels
     ▪ 100–200 layers deep, ~100M–2B parameters, 0.1–8 GiB parameter storage
     ▪ weeks to train

  5. A brief theory of supervised deep learning
     ▪ Labeled samples $y \in Y \subset \mathcal{D}$, label domain $Z$, true label $m(y)$
     ▪ Network: $g(y) = g_o(g_{o-1}(g_{o-2}(\dots g_1(y)\dots)))$ with $g: Y \to Z$ – a composition of layers (convolution, pooling, fully connected); the network structure is fixed, the weights $x$ are learned
     ▪ Training loss: $\ell_{tr}(x, y) = \lVert g(y) - m(y) \rVert_2$
     ▪ 0–1 loss: $\ell_{0\text{-}1}(x, y) = \begin{cases} 0 & g(y) = m(y) \\ 1 & g(y) \neq m(y) \end{cases}$
     ▪ Cross-entropy (softmax) loss: $\ell_{ce}(x, y) = -\sum_j m(y)_j \cdot \log\!\big( e^{f(g(y))_j} / \sum_l e^{f(g(y))_l} \big)$
     ▪ Learning problem: $x^* = \operatorname{argmin}_{x \in \mathbb{R}^d} \; \mathbb{E}_{y \sim \mathcal{D}}\, \ell(x, y)$
     [Figure: example class probabilities (cat, dog, airplane, horse, bicycle, truck) compared against the true label, driving a layer-wise weight update]
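
To make the cross-entropy loss concrete, here is a small NumPy sketch (illustrative only, not code from the talk; the function name and numbers are made up) that evaluates the softmax cross-entropy for a single sample:

```python
import numpy as np

def softmax_cross_entropy(logits, true_label):
    """Cross-entropy of a softmax output against the index of the true class.

    logits: raw network outputs f(g(y)), one entry per class
    true_label: index of the correct class, i.e. m(y)
    """
    shifted = logits - logits.max()                  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()  # softmax probabilities
    return -np.log(probs[true_label])

# example: 6 classes (cat, dog, airplane, horse, bicycle, truck), true class = cat (index 0)
logits = np.array([2.3, 0.4, -0.1, 1.1, -1.0, -0.5])
print(softmax_cross_entropy(logits, 0))
```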

  6. Stochastic Gradient Descent: $x^* = \operatorname{argmin}_{x \in \mathbb{R}^d} \mathbb{E}_{y \sim \mathcal{D}}\, \ell(x, y)$
     ▪ Forward pass through the layer chain: $g_1(y)$ (convolution 1), $g_2(g_1(y))$ (convolution 2), pooling, …, convolution 3, fully connected → $g(y)$
     ▪ Layer storage = $x_m + g_m(p_{m-1}) + \nabla x_m + \nabla p_m$ (weights, forward activations, weight gradients, and activation gradients of layer $m$)
     T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
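
The basic SGD iteration behind this objective can be sketched in a few lines of Python (a generic toy implementation with my own naming, not the talk's code):

```python
import numpy as np

def sgd(loss_grad, samples, x0, lr=0.01, epochs=3):
    """Plain SGD: one parameter update per sample.

    loss_grad(x, y) must return the gradient of the loss at parameters x for sample y.
    """
    x, samples = x0.copy(), np.asarray(samples).copy()
    for _ in range(epochs):
        np.random.shuffle(samples)            # visit the samples in a fresh random order
        for y in samples:
            x -= lr * loss_grad(x, y)         # step against the single-sample gradient
    return x

# toy problem: minimize E[(x - y)^2] for samples centered around 3.0
samples = np.random.randn(100) + 3.0
print(sgd(lambda x, y: 2 * (x - y), samples, x0=np.zeros(1)))   # converges near 3.0
```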

  7. Trends in deep learning: hardware and multi-node
     The field is moving fast – trying everything imaginable. Survey results from 227 papers in the area of parallel deep learning.
     [Charts: hardware used; shared vs. distributed memory]
     Deep learning is largely on distributed memory today!
     T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018

  8. Trends in distributed deep learning: node count and communication
     The field is moving fast – trying everything imaginable. Survey results from 227 papers in the area of parallel deep learning.
     [Charts: number of nodes used; communication mode]
     Deep learning research is converging to MPI!
     T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018

  9. Minibatch Stochastic Gradient Descent (SGD)
     [Figure: class probabilities for an example image (cat, dog, airplane, horse, bicycle, truck) compared against the true labels of the minibatch]
     T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
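
The minibatch variant only changes the inner loop: the gradient is averaged over a small batch before each update. A toy sketch under the same assumptions as the SGD example above:

```python
import numpy as np

def minibatch_sgd(loss_grad, samples, x0, batch_size=32, lr=0.01, epochs=3):
    """SGD that averages the gradient over a minibatch before each update."""
    x, samples = x0.copy(), np.asarray(samples)
    for _ in range(epochs):
        perm = np.random.permutation(len(samples))
        for start in range(0, len(samples), batch_size):
            batch = samples[perm[start:start + batch_size]]
            g = np.mean([loss_grad(x, y) for y in batch], axis=0)   # averaged minibatch gradient
            x -= lr * g
    return x

# same toy problem as before; the averaged updates are less noisy per step
samples = np.random.randn(1024) + 3.0
print(minibatch_sgd(lambda x, y: 2 * (x - y), samples, x0=np.zeros(1)))  # moves toward ~3.0
```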

  10. Microbatching (µ-cuDNN) – how to best implement layers in practice?
      ▪ cuDNN offers ~16 convolution implementations; their performance depends on the available temporary memory (workspace) size
      ▪ Key idea: segment the minibatch into microbatches, reuse the workspace, and use different algorithms per microbatch – fast (up to 4.54x faster on DeepBench)
      ▪ How to choose microbatch sizes and algorithms? Microbatching strategies: none (undivided), powers-of-two only, or any (unrestricted) sizes, selected via dynamic programming (space reuse) or integer linear programming (space sharing) – see the sketch below
      Yosuke Oyama, Tal Ben-Nun, T. Hoefler, Satoshi Matsuoka: µ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching, Cluster 2018
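
As a rough illustration of the selection problem (the numbers, names, and the greedy search are all stand-ins; µ-cuDNN itself benchmarks cuDNN's algorithms and uses dynamic programming or ILP), one can enumerate power-of-two microbatch sizes, keep only algorithms whose workspace fits, and take the split with the lowest total time:

```python
# Hypothetical cost model: measured[(microbatch_size, algo)] = (seconds, workspace_bytes).
def best_microbatch_config(minibatch, measured, workspace_limit):
    best = None
    micro = 1
    while micro <= minibatch:                     # powers-of-two microbatch sizes only
        if minibatch % micro == 0:
            # fastest algorithm for this microbatch size whose workspace fits
            fitting = [(t, algo) for (mb, algo), (t, ws) in measured.items()
                       if mb == micro and ws <= workspace_limit]
            if fitting:
                t, algo = min(fitting)
                total = (minibatch // micro) * t  # the kernel runs once per microbatch
                if best is None or total < best[0]:
                    best = (total, micro, algo)
        micro *= 2
    return best   # (total_time, microbatch_size, algorithm) or None

# toy usage with made-up measurements and a 2 GiB workspace limit
measured = {(64, "GEMM"): (0.010, 1 << 30), (64, "FFT"): (0.006, 4 << 30),
            (32, "GEMM"): (0.006, 1 << 29), (32, "WINOGRAD"): (0.004, 2 << 30)}
print(best_microbatch_config(64, measured, workspace_limit=2 << 30))
```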

  11. Model parallelism – limited by network size
      ▪ Parameters can be distributed across processors
      ▪ The minibatch has to be copied to all processors
      ▪ Backpropagation requires all-to-all communication in every layer
      U.A. Muller and A. Gunzinger: Neural Net Simulation on Parallel Computers, IEEE Int'l Conf. on Neural Networks 1994
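
A minimal mpi4py sketch of the idea (the layer, shapes, and names are assumptions, not the talk's code): the output neurons of one fully connected layer are split across ranks, the minibatch is broadcast to everyone, and the full activation is re-assembled with an allgather; the backward pass would need the corresponding all-to-all exchange of gradients.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

in_dim, out_dim, batch = 512, 256, 32
local_out = out_dim // size                  # assumes out_dim is divisible by the rank count
W_local = 0.01 * np.random.randn(in_dim, local_out)   # this rank's slice of the weights

# the minibatch is replicated on every processor
x = np.empty((batch, in_dim))
if rank == 0:
    x[:] = np.random.randn(batch, in_dim)
comm.Bcast(x, root=0)

# forward pass: each rank computes only its slice of the activations
y_local = x @ W_local                        # (batch, local_out)

# re-assemble the full activation on every rank for the next layer
y_parts = np.empty((size, batch, local_out))
comm.Allgather(y_local, y_parts)
y_full = np.concatenate(y_parts, axis=1)     # (batch, out_dim)
```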

  12. Pipeline parallelism – limited by network size
      ▪ Layers/parameters can be distributed across processors
      ▪ Sparse communication pattern (only between pipeline stages)
      ▪ The minibatch has to be copied through all processors
      G. Blelloch and C.R. Rosenberg: Network Learning on the Connection Machine, IJCAI'87
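
A minimal mpi4py sketch of a forward-only pipeline (layer shapes and the ReLU stage are assumptions): each rank owns one stage and passes activations to the next rank.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

batch, width = 32, 256
W = 0.01 * np.random.randn(width, width)        # this stage's layer parameters

for microbatch in range(8):                     # microbatches flow through the pipeline
    if rank == 0:
        act = np.random.randn(batch, width)     # first stage produces/reads the input
    else:
        act = np.empty((batch, width))
        comm.Recv(act, source=rank - 1)         # activations from the previous stage
    act = np.maximum(act @ W, 0.0)              # this stage's forward computation (ReLU layer)
    if rank < size - 1:
        comm.Send(act, dest=rank + 1)           # hand the activations to the next stage
```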

  13. Data parallelism – limited by batch size
      ▪ Simple and efficient solution, easy to implement
      ▪ Parameters are duplicated at all processors
      X. Zhang et al.: An Efficient Implementation of the Back-propagation Algorithm on the Connection Machine CM-2, NIPS'89
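
A minimal mpi4py sketch of a data-parallel update step (the gradient function is a stand-in for real backpropagation): every rank holds a full parameter copy and the gradients are averaged with an allreduce, so all copies stay identical.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

dim, lr = 1000, 0.01
x = np.zeros(dim)                               # replicated parameters

def local_gradient(x):
    return x + np.random.randn(dim)             # stand-in for backprop on this rank's samples

for step in range(10):
    g_local = local_gradient(x)
    g_sum = np.empty_like(g_local)
    comm.Allreduce(g_local, g_sum, op=MPI.SUM)  # sum the gradients across all ranks
    x -= lr * g_sum / size                      # averaged update; x stays identical everywhere
```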

  14. Hybrid parallelism – combining data, model, and layer (pipeline) parallelism
      ▪ Layers/parameters can be distributed across processors, and the minibatch can be distributed as well
      ▪ Often specific to layer types (e.g., distribute fully connected layers but handle convolutional layers data-parallel)
      ▪ Enables arbitrary combinations of data, model, and pipeline parallelism – very powerful! (A sketch of setting up such groups follows below.)
      A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014; J. Dean et al.: Large scale distributed deep networks, NIPS'12; T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv Feb 2018
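
One common way to organize such hybrids (a sketch under the assumption of a fixed model-parallel group size; not the talk's code) is to split MPI_COMM_WORLD into model-parallel and data-parallel sub-communicators:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
model_group_size = 4                                   # assumed: 4 ranks share one model replica

# ranks that together hold one model replica
model_comm = comm.Split(color=rank // model_group_size, key=rank)
# ranks that hold the same shard of the model across replicas
data_comm = comm.Split(color=rank % model_group_size, key=rank)

# fc layers: exchange activations inside model_comm (model parallel);
# conv layers: allreduce gradients inside data_comm (data parallel).
```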

  15. Updating parameters in distributed data parallelism
      ▪ Decentral: collective allreduce of $\nabla x$ among the training agents; cost roughly $U = 2M \log_2 Q + 2\delta n H (Q-1)/Q$
      ▪ Central: a (sharded) parameter server collects $\nabla x$ from each training agent, applies the update $x' = v(x, \nabla x)$, and sends $x$ back; cost roughly $U = 2M + 2Q \delta n H / t$
      ▪ HPC techniques that apply here: collective operations, topologies, neighborhood collectives, RMA?
      ▪ Variants: hierarchical parameter servers (S. Gupta et al.: Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study, ICDM'16) and adaptive minibatch sizes (S. L. Smith et al.: Don't Decay the Learning Rate, Increase the Batch Size, arXiv 2017)
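
For contrast with the allreduce sketch above, here is a minimal mpi4py sketch of the central variant (a single-shard parameter server on rank 0; the update rule, sizes, and gradients are stand-ins, not the talk's code):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()   # assumes at least one worker rank besides rank 0

dim, lr = 1000, 0.01
x = np.zeros(dim)                               # parameters, authoritative copy lives on rank 0

for step in range(10):
    if rank == 0:
        grads = np.empty((size, dim))
        comm.Gather(np.zeros(dim), grads, root=0)   # collect the gradients from all agents
        x -= lr * grads[1:].mean(axis=0)            # apply the update x' = v(x, grad)
    else:
        g = x + np.random.randn(dim)                # stand-in for a real local gradient
        comm.Gather(g, None, root=0)
    comm.Bcast(x, root=0)                           # send the new parameters back to the agents
```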

  16. Parameter (and model) consistency – centralized
      ▪ A (sharded) parameter server applies $x' = v(x, \nabla x)$; the parameter exchange frequency can be controlled while still attaining convergence
      ▪ Consistency spectrum: synchronous, stale-synchronous / bounded asynchronous (updates may lag by at most a maximum staleness), and fully asynchronous
      ▪ Started with Hogwild! [Niu et al. 2011] – shared memory, by chance; DistBelief [Dean et al. 2012] moved the idea to distributed memory
      ▪ Trades off "statistical performance" for "hardware performance"
      [Figure: timelines of agents 1…m exchanging parameters $x^{(0)} \dots x^{(U)}$ with the parameter server under synchronous, stale-synchronous, and asynchronous execution]
      J. Dean et al.: Large scale distributed deep networks, NIPS'12; F. Niu et al.: Hogwild!: A lock-free approach to parallelizing stochastic gradient descent, NIPS'11
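
A toy single-process simulation of the bounded-staleness idea (entirely illustrative; not how any particular system implements it): an agent may only apply further updates while it is at most a fixed number of iterations ahead of the slowest agent.

```python
import numpy as np

num_agents, max_staleness, lr = 4, 2, 0.01
clock = np.zeros(num_agents, dtype=int)        # per-agent iteration counters
x = np.zeros(100)                              # the shared ("server") parameters

def gradient(x):
    return x + 0.1 * np.random.randn(x.size)   # stand-in for a real gradient

rng = np.random.default_rng(0)
for _ in range(200):
    a = rng.integers(num_agents)               # a random agent becomes ready
    if clock[a] - clock.min() < max_staleness: # otherwise the agent must wait for stragglers
        x -= lr * gradient(x)                  # apply its (possibly stale) update
        clock[a] += 1
```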

  17. Parameter (and model) consistency – decentralized
      ▪ With a collective allreduce of $x$, the parameter exchange frequency can again be controlled while still attaining convergence: synchronous allreduce, stale-synchronous / bounded-asynchronous merging (with a maximum staleness), and asynchronous variants
      ▪ May also consider limited/slower distribution – gossip [Jin et al. 2016]
      [Figure: timelines of agents 1…m under synchronous allreduce, bounded-staleness merging, and asynchronous execution]
      Peter H. Jin et al.: How to scale distributed deep learning?, NIPS MLSystems 2016
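
A pairwise-averaging sketch in the spirit of gossip (mpi4py; the deterministic hypercube-style partner schedule is my own simplification and assumes a power-of-two number of ranks): instead of a full allreduce, each agent repeatedly averages its parameters with one partner per round.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

x = np.random.randn(1000)                       # this agent's locally trained parameters

log2_size = size.bit_length() - 1               # number of distinct partners (size = 2**log2_size)
for r in range(4 * log2_size):                  # a few gossip rounds instead of a full allreduce
    partner = rank ^ (1 << (r % log2_size))     # symmetric pairing so the Sendrecv calls match
    x_partner = np.empty_like(x)
    comm.Sendrecv(x, dest=partner, recvbuf=x_partner, source=partner)
    x = 0.5 * (x + x_partner)                   # average with one partner per round
```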
