Fast and Generic Collectives for Distributed ML

  1. Fast and Generic Collectives for Distributed ML. Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Jorgen Thelin, Nikhil R. Devanur, Ion Stoica

  2. DNNs empower state-of-the-art results across many different applications: image classification, robot control, game playing, speech recognition.

  3. Speed-up DNN training: Data Parallelism. Data-parallel training significantly reduces training time, as shown by the speed-up on the ImageNet-1K dataset*. (* https://software.intel.com/en-us/articles/caffe-training-on-multi-node-distributed-memory-systems)

  5. Speed-up DNN training: Data Parallelism. Each worker i computes a local gradient ∇W_i, and model synchronization aggregates them: ∇W = ∇W_1 + ∇W_2 + ⋯ + ∇W_N. Data-parallel training significantly reduces training time on the ImageNet-1K dataset*. (* https://software.intel.com/en-us/articles/caffe-training-on-multi-node-distributed-memory-systems)
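
  In code, this model-synchronization step is typically expressed as an all-reduce over the gradients. The following is a minimal, hedged sketch (not the implementation used in the talk) based on PyTorch's torch.distributed API with the NCCL backend; `synchronize_gradients` and `world_size` are illustrative names, and the process group is assumed to be initialized elsewhere.

```python
# Minimal sketch of data-parallel model synchronization via all-reduce.
# Assumes torch.distributed has already been initialized with the NCCL backend
# (e.g., dist.init_process_group("nccl")) and that each worker has run backward(),
# so every parameter holds its local gradient in param.grad.
import torch
import torch.distributed as dist

def synchronize_gradients(model: torch.nn.Module, world_size: int) -> None:
    """Compute grad_W = (grad_W1 + ... + grad_WN) / N across all N workers."""
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient tensor across all workers (collective all-reduce).
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # Divide by the number of workers to obtain the average gradient.
            param.grad.div_(world_size)
```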

  6. Despite many performance optimizations, model synchronization is a big overhead in data-parallel training on cloud servers: multi-GPU scaling with TensorFlow shows >50% communication overhead*. (* Horovod: fast and easy distributed deep learning in TensorFlow, arXiv:1802.05799, 2018)

  7. Despite many performance optimizations, model synchronization is a big overhead in data-parallel training on cloud servers: multi-GPU scaling with TensorFlow shows >50% communication overhead*, and data-parallel training on multi-GPU servers with PyTorch shows up to 90% communication overhead^. (* Horovod: fast and easy distributed deep learning in TensorFlow, arXiv:1802.05799, 2018; ^ PipeDream: Generalized Pipeline Parallelism for DNN Training, SOSP 2019)

  8. Model synchronization is a big overhead in data-parallel training despite many performance optimizations: >50% communication overhead with TensorFlow*, and up to 90% communication overhead with PyTorch^. To alleviate communication bottlenecks, there have recently been big improvements in hardware and software. (* Horovod: fast and easy distributed deep learning in TensorFlow, arXiv:1802.05799, 2018; ^ PipeDream: Generalized Pipeline Parallelism for DNN Training, SOSP 2019)

  9. State of the art (hardware): NVIDIA DGX-1 and NVIDIA DGX-2.

  10. What is inside? Computation: NVIDIA P100, 5.3 TFLOPs double precision; NVIDIA V100, 7.8 TFLOPs double precision.

  11. What is inside? Computation: NVIDIA P100, 5.3 TFLOPs double precision; NVIDIA V100, 7.8 TFLOPs double precision. Faster interconnects: PCIe 3.0 (x16), ~10 GB/s, shared; NVLink, point-to-point, 1st gen (P100) ~18 GB/s, 2nd gen (V100) ~23 GB/s.

  12. What is inside? Computation: NVIDIA P100, 5.3 TFLOPs double precision; NVIDIA V100, 7.8 TFLOPs double precision. Faster interconnects: PCIe 3.0 (x16), ~10 GB/s, shared; NVLink, point-to-point, 1st gen (P100) ~18 GB/s, 2nd gen (V100) ~23 GB/s; NVSwitch, fully connected crossbar, 6x NVLink 2nd-gen bandwidth, ~130 GB/s.

  13. State of the art (software): NCCL (NVIDIA Collective Communication Library), which uses ring-based collective communication protocols.

  14. Ring-based collectives (e.g. Broadcast). Topology: GPU0, GPU1, GPU2, GPU3 connected in a ring. Ring broadcast (from GPU0): the data is forwarded hop by hop around the ring until every GPU holds a copy.
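
  To make the forwarding steps concrete, here is a small, hedged sketch in plain Python that simulates a pipelined ring broadcast from GPU0: the message is split into chunks so that downstream GPUs can forward earlier chunks while later chunks are still arriving. It only illustrates the idea shown on the slides; it is not NCCL's implementation, and `ring_broadcast` is a hypothetical helper.

```python
# Simulate a pipelined ring broadcast from the first GPU in ring_order.
# With C chunks and N GPUs, the pipelined broadcast completes in C + N - 2 steps.

def ring_broadcast(message_chunks, ring_order=(0, 1, 2, 3)):
    """Return the final per-GPU buffers and the list of transfers at each step."""
    num_gpus = len(ring_order)
    # buffers[g] holds the chunks GPU g has received so far; the root starts with all of them.
    buffers = {gpu: [] for gpu in ring_order}
    buffers[ring_order[0]] = list(message_chunks)

    steps = []
    for _ in range(len(message_chunks) + num_gpus - 2):
        transfers = []
        # Each non-root GPU can receive its next missing chunk from its ring predecessor,
        # provided the predecessor already holds that chunk (decided on the pre-step state).
        for i in range(1, num_gpus):
            src, dst = ring_order[i - 1], ring_order[i]
            next_chunk = len(buffers[dst])
            if next_chunk < len(buffers[src]):
                transfers.append((src, dst, next_chunk))
        # Apply all of this step's transfers simultaneously.
        for src, dst, chunk in transfers:
            buffers[dst].append(buffers[src][chunk])
        steps.append(transfers)
    return buffers, steps

buffers, steps = ring_broadcast(["c0", "c1", "c2", "c3"])
# After C + N - 2 = 6 steps, every GPU holds the full message.
assert all(chunks == ["c0", "c1", "c2", "c3"] for chunks in buffers.values())
```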

  22. Can these hardware & software improvements alleviate the communication bottleneck in data-parallel training?

  23. Can these hardware & software improvements alleviate the communication bottleneck in data-parallel training? Not really.

  24. High communication overheads even with state-of-the-art hardware (NVLink) and software (NCCL). There are many different 4-GPU allocations within a server. (Cross-GPU communication measured as the percentage of total epoch time when running within a single 8-GPU DGX-1 box.)

  25. High communication overheads even with state-of-the-art hardware (NVLink) and software (NCCL). The figure marks the 4-GPU allocation with the highest overhead and the 4-GPU allocation with the lowest overhead. (Cross-GPU communication measured as the percentage of total epoch time when running within a single 8-GPU DGX-1 box.)

  26. High communication overheads even with state-of-the-art hardware (NVLink) and software (NCCL). The high communication overhead is consistent across different numbers of workers and for a range of DNNs. (Cross-GPU communication measured as the percentage of total epoch time when running within a single 8-GPU DGX-1 box.)

  27. High communication overheads even with state-of-the-art hardware (NVLink) and software (NCCL). The high communication overhead is consistent across different numbers of workers and for a range of DNNs, and becomes more pronounced with increasing GPU computation power. (Cross-GPU communication measured as the percentage of total epoch time when running within a single 8-GPU DGX-1 box.)

  28. High communication overheads even with state-of-the-art hardware (NVLink) and software (NCCL). The high communication overhead is consistent across different numbers of workers and for a range of DNNs, and becomes more pronounced with increasing GPU computation power. We need faster collective communication protocols. (Cross-GPU communication measured as the percentage of total epoch time when running within a single 8-GPU DGX-1 box.)

  29. Talk Outline • Motivation • Challenges to achieving faster collective communication • Design • Evaluation

  30. Challenge 1: Different server configurations. (Figure: GPU interconnect topologies of DGX1-P100 with NVLink 1st gen, ~18 GB/s, and DGX1-V100 with NVLink 2nd gen, ~23 GB/s.)

  31. Challenge 1: Different server configurations. Protocols need to be topology-aware to effectively use hardware links. (Figure: GPU interconnect topologies of DGX1-P100 with NVLink 1st gen, ~18 GB/s, and DGX1-V100 with NVLink 2nd gen, ~23 GB/s.)

  32. Challenge 2: Link heterogeneity. Ring-based collectives can only utilize homogeneous links. (Figure: the NVLink topology and the PCIe topology of an 8-GPU server.)

  33. Challenge 2: Link heterogeneity. Ring-based collectives can only utilize homogeneous links. Why not heterogeneous links? Because the slowest link limits the whole ring: e.g., PCIe will be the bottleneck if it is included in an NVLink ring.
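
  A rough back-of-the-envelope illustration of that point, using the approximate link bandwidths quoted on the earlier hardware slides (the model size here is an assumed example value): in a pipelined ring every byte must cross every link, so the ring's sustained throughput is roughly that of its slowest link.

```python
# Illustrative estimate only: why one PCIe link would bottleneck an NVLink ring.
# Bandwidths are the approximate figures from earlier slides; the data size is made up.

NVLINK_GEN1_GBPS = 18.0   # ~18 GB/s per NVLink (1st gen, P100)
PCIE_3_X16_GBPS = 10.0    # ~10 GB/s, PCIe 3.0 x16 (and shared)

def ring_transfer_time_s(data_gb, link_bandwidths_gbps):
    """Approximate time to move data_gb GB around a pipelined ring:
    limited by the slowest link, since every byte traverses every link."""
    return data_gb / min(link_bandwidths_gbps)

model_gb = 0.5  # e.g., a ~500 MB set of gradients (illustrative)
nvlink_ring = ring_transfer_time_s(model_gb, [NVLINK_GEN1_GBPS] * 4)
mixed_ring = ring_transfer_time_s(model_gb, [NVLINK_GEN1_GBPS] * 3 + [PCIE_3_X16_GBPS])
print(f"NVLink-only ring: ~{nvlink_ring * 1e3:.0f} ms per pass")            # ~28 ms
print(f"NVLink ring with one PCIe hop: ~{mixed_ring * 1e3:.0f} ms per pass")  # ~50 ms
```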

  34. Challenge 3: Fragmentation in multi-tenant clusters. Examples of fragmented allocations (an 8-GPU job across 2 servers): 3 + 5, 2 + 6. (Figure: within each 8-GPU server, the number of GPUs allocated to 40,000 multi-GPU jobs at Microsoft.)

  35. Challenge 3: Fragmentation in multi-tenant clusters. Why fragmentation? Many cluster schedulers are not topology-aware. Without support for efficient migration, DNN jobs must embrace fragmentation to avoid queuing delays. (Figure: within each 8-GPU server, the number of GPUs allocated to 40,000 multi-GPU jobs at Microsoft.)

  36. Challenge 3: Fragmentation in multi-tenant clusters. An irregular topology means no NVLink ring can be formed, and existing solutions (NCCL) fall back to PCIe if they cannot form an NVLink ring. Why fragmentation? Many cluster schedulers are not topology-aware. Without support for efficient migration, DNN jobs must embrace fragmentation to avoid queuing delays.

  37. Can we do better than the state of the art? Topology heterogeneity (1. different server configurations, 2. link heterogeneity, 3. fragmentation in multi-tenant clusters) vs. ring-based collective communication protocols.

  38. Can we do better than the state of the art? BLINK is designed to handle topology heterogeneity: 1. different server configurations, 2. link heterogeneity, 3. fragmentation in multi-tenant clusters.

  39. Talk Outline • Motivation • Challenges to achieving high-performance collective communication: 1. Different server configurations, 2. Link heterogeneity, 3. Fragmentation in multi-tenant clusters • Design • Evaluation
