Fast and Generic Collectives for Distributed ML
Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Jorgen Thelin, Nikhil R. Devanur, Ion Stoica
DNNs empower state-of-the-art results across many different applications: Image Classification, Robot Control, Game Playing, Speech Recognition.
Speed-up DNN training: Data Parallelism
Data-parallel training speed-up on the ImageNet-1K dataset* significantly reduces training time.
Model synchronization combines the gradients computed by the N workers: ∇W = ∇W_1 + ∇W_2 + ⋯ + ∇W_N
* https://software.intel.com/en-us/articles/caffe-training-on-multi-node-distributed-memory-systems
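As a framework-specific illustration of this synchronization step (not prescribed by the slides), here is a minimal sketch using PyTorch's torch.distributed API; it assumes the process group has already been initialized across the N workers:

    import torch.distributed as dist

    def synchronize_gradients(model, world_size):
        # After the backward pass, sum each local gradient across all
        # workers and average, so every replica applies the same update.
        # This is the all-reduce that the slide's formula describes.
        for param in model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size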
Despite many performance optimizations, model synchronization is a big overhead in data-parallel training on cloud servers.
>50% communication overhead: multi-GPU scaling performance using TensorFlow*
Up to 90% communication overhead: communication overhead of data-parallel training with multi-GPU servers using PyTorch^
To alleviate communication bottlenecks, there have recently been big improvements in hardware and software.
*Horovod: fast and easy distributed deep learning in TensorFlow, arXiv:1802.05799, 2018
^PipeDream: Generalized Pipeline Parallelism for DNN Training, SOSP 2019
State of the art (hardware): NVIDIA DGX-1, NVIDIA DGX-2
What is inside?
• Computation: NVIDIA P100, 5.3 Tera-FLOPs double precision; NVIDIA V100, 7.8 Tera-FLOPs double precision
• Faster interconnects:
  • PCIe 3.0 (x16), ~10 GB/s, shared
  • NVLink, point-to-point: 1st Gen (P100) ~18 GB/s, 2nd Gen (V100) ~23 GB/s
  • NVSwitch: fully connected crossbar, 6x NVLink 2nd Gen bandwidth, ~130 GB/s
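To make these numbers concrete, a quick back-of-the-envelope sketch: the ~100 MB payload is an assumed example size (roughly fp32 ResNet-50 gradients, not a figure from the slides); the bandwidths are the ones quoted above.

    # Rough per-traversal transfer time for a ~100 MB gradient payload
    # over each interconnect listed above.
    PAYLOAD_GB = 0.1  # 100 MB (assumed example size)

    link_bandwidth_gbps = {           # GB/s, from the slide above
        "PCIe 3.0 x16 (shared)": 10,
        "NVLink 1st Gen (P100)": 18,
        "NVLink 2nd Gen (V100)": 23,
        "NVSwitch": 130,
    }

    for link, bw in link_bandwidth_gbps.items():
        print(f"{link}: {PAYLOAD_GB / bw * 1e3:.1f} ms per traversal")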
State of the art (software): NCCL (Nvidia Collective Communication Library), ring-based collective communication protocols
Ring-based collectives (e.g., Broadcast)
[Figure: a 4-GPU topology arranged into the ring GPU0 → GPU1 → GPU2 → GPU3 → GPU0; a broadcast from GPU0 is forwarded hop-by-hop around the ring]
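A minimal sketch of the pipelined ring broadcast the animation depicts, assuming hypothetical point-to-point send/recv helpers between neighboring GPUs (this is an illustration of the idea, not NCCL's actual code):

    def ring_broadcast(chunks, rank, world_size, send, recv, root=0):
        # Broadcast `chunks` (the message split into pieces) from `root`
        # around the ring: each rank receives a chunk from its left
        # neighbor and immediately forwards it to its right neighbor,
        # so every link stays busy once the pipeline fills.
        left = (rank - 1) % world_size
        right = (rank + 1) % world_size
        for i in range(len(chunks)):
            if rank != root:
                chunks[i] = recv(left)
            if right != root:
                send(right, chunks[i])
        return chunks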
Can these hardware & software improvements alleviate the communication bottleneck in data-parallel training?
Not really.
High communication overheads even with state-of-the-art hardware (NVLink) and software (NCCL)
There are many different 4-GPU allocations within a server; the figure highlights the 4-GPU allocations with the highest and lowest overhead.
High communication overhead is consistent across different numbers of workers and for a range of DNNs.
Communication overheads become more pronounced with increasing GPU computation power.
We need faster collective communication protocols.
[Figure: cross-GPU communication measured as the percentage of total epoch time when running within a single 8-GPU DGX-1 box]
Talk Outline • Motivation • Challenges to achieving faster collective communication • Design • Evaluation
Challenge 1: Different server configurations
[Figure: 8-GPU NVLink topologies of DGX1-P100 (NVLink 1st Gen, ~18 GB/s) and DGX1-V100 (NVLink 2nd Gen, ~23 GB/s)]
Protocols need to be topology-aware to effectively use hardware links.
Challenge 2: Link heterogeneity
[Figure: NVLink topology vs. PCIe topology of the same 8-GPU server]
Ring-based collectives can only utilize homogeneous links. Why not heterogeneous links? e.g., PCIe would become the bottleneck if included in an NVLink ring.
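A small numeric illustration of this point, using the bandwidth figures from the hardware slide: a ring moves data at the rate of its slowest link, so a single PCIe hop caps an otherwise all-NVLink ring.

    # A ring's effective bandwidth is limited by its slowest link.
    nvlink_only_ring = [23, 23, 23, 23]   # GB/s, all NVLink 2nd Gen hops
    mixed_ring       = [23, 23, 23, 10]   # GB/s, one PCIe hop included

    print(min(nvlink_only_ring))  # 23 GB/s -> full NVLink speed
    print(min(mixed_ring))        # 10 GB/s -> the PCIe hop throttles the whole ring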
Challenge 3: Fragmentation in multi-tenant clusters
[Figure: within each 8-GPU server, # of GPUs allocated to 40,000 multi-GPU jobs at Microsoft]
Examples of fragmented allocation (8-GPU job across 2 servers): 3 + 5, 2 + 6
Why fragmentation? Many cluster schedulers are not topology-aware. Without support for efficient migration, DNN jobs must embrace fragmentation to avoid queuing delays.
Irregular topology → no ring: existing solutions (NCCL) fall back to PCIe if they cannot form an NVLink ring.
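A minimal sketch of the check behind this fallback: given the NVLink adjacency among the GPUs a job was allocated, brute-force whether any ring (a closed cycle visiting every allocated GPU) exists, which is cheap since a server has at most 8 GPUs. The topology and allocation below are illustrative, not the actual DGX-1 wiring.

    from itertools import permutations

    def has_nvlink_ring(gpus, links):
        # `links` is a set of undirected NVLink edges, e.g. {(0, 1), (1, 2)}.
        # Return True if the allocated GPUs can be ordered into a closed
        # ring that uses only NVLink hops.
        def connected(a, b):
            return (a, b) in links or (b, a) in links
        first, rest = gpus[0], gpus[1:]
        for order in permutations(rest):
            cycle = (first,) + order + (first,)
            if all(connected(a, b) for a, b in zip(cycle, cycle[1:])):
                return True
        return False

    # Illustrative 4-GPU allocation whose NVLink edges form a chain, not a ring:
    print(has_nvlink_ring([0, 1, 2, 3], {(0, 1), (1, 2), (2, 3)}))
    # False -> a ring-only protocol would fall back to PCIe here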
Can we do better than the state of the art?
Topology heterogeneity:
1. Different server configurations
2. Link heterogeneity
3. Fragmentation in multi-tenant clusters
Ring-based collective communication protocols do not handle this heterogeneity well; BLINK is our proposal to do better.
Talk Outline • Motivation • Challenges to achieving high-performance collective communication (1. Different server configurations, 2. Link heterogeneity, 3. Fragmentation in multi-tenant clusters) • Design • Evaluation