NCCL 2.0 (Sylvain Jeaugey, NVIDIA) - presentation transcript


  1. NCCL 2.0 Sylvain Jeaugey

  2. DEEP LEARNING ON GPUS Making DL training times shorter. Deeper neural networks, larger data sets … training is a very, very long operation ! [diagram: multi-core CPU → GPU (CUDA) → multi-GPU (NCCL 1) → multi-GPU, multi-node (NCCL 2)]

  3. NCCL A multi-GPU communication library. Within a system : NVLink, PCIe, GPU Direct P2P. To other systems : Sockets (Ethernet), InfiniBand with GPU Direct RDMA.

  4. NCCL Architecture [stack diagram: Deep Learning frameworks (Caffe, Caffe2, Torch, TF, MXNET, CNTK) sit on top of NCCL, CUDNN and CUBLAS, which sit on top of CUDA and NVIDIA GPUs]

  5. AGENDA NCCL history, Design, NCCL 2.0 (new features, API changes), Performance, Future

  6. HISTORY Q4 2015 : NCCL 1.x. Open-source research project on GitHub, helping Deep Learning frameworks compute on multiple GPUs with efficient collective operations. Limited to intra-node. Q2 2017 : NCCL 2.x and beyond. NVIDIA library, with multi-node support and an improved API.

  7. DESIGN What is NCCL ? An optimized collective communication library between CUDA devices. Easy to integrate into any DL framework, as well as into traditional HPC apps using MPI. Runs on the GPU using asynchronous CUDA kernels, for faster access to GPU memory, parallel reductions, and NVLink usage. Operates on CUDA pointers. Operations are tied to a CUDA stream. Uses as few threads as possible to let other computation progress simultaneously.
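A minimal sketch of how these design points show up in code (not taken from the slides; single GPU, NCCL 2.x call signatures, error checking omitted): the collective is just another asynchronous operation enqueued on a CUDA stream.

    // Sketch: one GPU, one communicator; ncclAllReduce is enqueued on a stream
    // and runs as a CUDA kernel, so the host thread is free to do other work.
    #include <cuda_runtime.h>
    #include <nccl.h>

    int main(void) {
      int dev = 0;
      ncclComm_t comm;
      ncclCommInitAll(&comm, 1, &dev);          // single-device communicator

      cudaSetDevice(dev);
      const size_t count = 1 << 20;             // number of floats to reduce
      float *sendbuff, *recvbuff;
      cudaMalloc((void**)&sendbuff, count * sizeof(float));
      cudaMalloc((void**)&recvbuff, count * sizeof(float));
      cudaMemset(sendbuff, 0, count * sizeof(float));
      cudaStream_t stream;
      cudaStreamCreate(&stream);

      // NCCL operates on CUDA pointers; the operation is tied to this stream.
      ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);

      cudaStreamSynchronize(stream);            // wait for the collective to finish

      cudaFree(sendbuff);
      cudaFree(recvbuff);
      ncclCommDestroy(comm);
      return 0;
    }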

  8. DESIGN Rings NCCL uses rings to move data across all GPUs and perform reductions.

  9. DESIGN Rings PCIe / QPI : 1 unidirectional ring

  10. DESIGN Rings DGX-1 : 4 unidirectional rings
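As background (not stated explicitly on the slides): rings are used because a ring allreduce is bandwidth-optimal. Splitting the buffer into chunks and passing them around the ring means that, for S bytes reduced across N GPUs, each GPU sends and receives

    \[ \text{bytes per GPU} \;=\; 2\,\frac{N-1}{N}\,S \]

which approaches 2S for large N, so per-GPU traffic is essentially independent of the number of GPUs and every link in the ring stays busy.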

  11. DESIGN Kernels [diagram: each GPU runs a CUDA kernel that receives data from the previous GPU in the ring through a FIFO, reduces it with its own sendbuff, forwards the result to the next GPU in the ring, and writes into recvbuff]
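To make the ring schedule concrete, here is a small host-side C simulation of the reduce-scatter and allgather passes a ring allreduce performs. It is purely illustrative (plain arrays in one process, no CUDA, no FIFOs) and is not NCCL's actual implementation:

    /* Conceptual simulation of a ring allreduce (sum) over N "GPUs".
     * Each rank's buffer is split into N chunks; data circles the ring twice:
     * once to reduce (reduce-scatter), once to broadcast the results (allgather). */
    #include <stdio.h>

    #define N 4        /* number of ranks in the ring */
    #define CHUNK 2    /* elements per chunk */

    static float buf[N][N * CHUNK];   /* buf[r] is rank r's data */

    int main(void) {
      /* Fill each rank with distinct values; the reduced result is 1+2+...+N. */
      for (int r = 0; r < N; r++)
        for (int i = 0; i < N * CHUNK; i++)
          buf[r][i] = r + 1;

      /* Phase 1: reduce-scatter. At step s, rank r receives chunk (r-s-1) mod N
       * from the previous rank and adds it into its own copy. */
      for (int s = 0; s < N - 1; s++)
        for (int r = 0; r < N; r++) {
          int prev = (r - 1 + N) % N;
          int c = ((r - s - 1) % N + N) % N;
          for (int i = 0; i < CHUNK; i++)
            buf[r][c * CHUNK + i] += buf[prev][c * CHUNK + i];
        }

      /* Phase 2: allgather. At step s, rank r receives the already-reduced
       * chunk (r-s) mod N from the previous rank and simply copies it. */
      for (int s = 0; s < N - 1; s++)
        for (int r = 0; r < N; r++) {
          int prev = (r - 1 + N) % N;
          int c = ((r - s) % N + N) % N;
          for (int i = 0; i < CHUNK; i++)
            buf[r][c * CHUNK + i] = buf[prev][c * CHUNK + i];
        }

      /* Every rank now holds the full sum in every position. */
      printf("rank 0, element 0: %g (expected %g)\n",
             buf[0][0], (float)(N * (N + 1) / 2));
      return 0;
    }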

  12. NCCL 2.0

  13. NCCL 2.0 Inter-node communication Inter-node communication uses Sockets or InfiniBand verbs, with multi-rail support, topology detection and automatic use of GPU Direct RDMA. NCCL picks the optimal combination of NVLink, PCIe and network interfaces to maximize bandwidth and create rings across nodes. [diagram: rings spanning nodes over PCIe and InfiniBand; DGX-1 : NVLink, 4x InfiniBand]

  14. NCCL 2.0 Inter-node communication (continued)

  15. NCCL 2.0 Processes, threads and GPUs Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread. [diagram: example topology of n nodes, 2 CPU sockets per node, 4 GPUs per socket, i.e. GPU0-GPU7 on each node]

  16. NCCL 2.0 Processes, threads and GPUs Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread. [diagram: layout with 1 process per GPU; ranks 0 to 8n-1 across n nodes of 8 GPUs each]

  17. NCCL 2.0 Processes, threads and GPUs Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread. [diagram: layout with 1 process per CPU socket (processes 0 to 2n-1) and 1 thread per GPU (threads t0-t3 in each process)]

  18. NCCL 2.0 Processes, threads and GPUs Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread. [diagram: layout with 1 process per node (processes 0 to n-1), 8 GPUs per process]

  19. NCCL 2.0 API Group calls NCCL 2.0 introduces new verbs, ncclGroupStart/ncclGroupEnd, which are mandatory when managing multiple devices from a single thread. [diagram: one process driving 8 GPUs]

NCCL 1.x :

    for (int i=0; i<ngpus; i++) {
      cudaSetDevice(devices[i]);
      ncclAllReduce(…, comms[i], streams[i]);
    }

NCCL 2.0 :

    ncclGroupStart();
    for (int i=0; i<ngpus; i++) {
      ncclAllReduce(…, comms[i], streams[i]);
    }
    ncclGroupEnd();
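Filled out into a self-contained program, the NCCL 2.0 pattern above might look like the following sketch (one process driving every visible GPU; buffer contents and error handling are omitted, and the sizes are arbitrary):

    // Sketch only: single process, grouped per-GPU allreduce calls.
    #include <cuda_runtime.h>
    #include <nccl.h>
    #include <stdlib.h>

    int main(void) {
      int ngpus = 0;
      cudaGetDeviceCount(&ngpus);

      ncclComm_t *comms     = malloc(ngpus * sizeof(ncclComm_t));
      cudaStream_t *streams = malloc(ngpus * sizeof(cudaStream_t));
      float **sendbuff      = malloc(ngpus * sizeof(float*));
      float **recvbuff      = malloc(ngpus * sizeof(float*));
      int *devs             = malloc(ngpus * sizeof(int));
      const size_t count    = 1 << 20;

      // One communicator per GPU, all created by this single process.
      for (int i = 0; i < ngpus; i++) devs[i] = i;
      ncclCommInitAll(comms, ngpus, devs);

      for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);
        cudaMalloc((void**)&sendbuff[i], count * sizeof(float));
        cudaMalloc((void**)&recvbuff[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
      }

      // Group the per-GPU calls: nothing is launched until ncclGroupEnd(),
      // which lets a single thread drive all devices without deadlocking.
      ncclGroupStart();
      for (int i = 0; i < ngpus; i++)
        ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
      ncclGroupEnd();

      for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
      }

      for (int i = 0; i < ngpus; i++) {
        cudaFree(sendbuff[i]);
        cudaFree(recvbuff[i]);
        ncclCommDestroy(comms[i]);
      }
      free(comms); free(streams); free(sendbuff); free(recvbuff); free(devs);
      return 0;
    }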

  20. NCCL 2.0 API Integration with parallel environments Inter-node communicator creation still uses the NCCL 1.x verbs ncclGetUniqueId/ncclCommInitRank. [diagram: one process per GPU, ranks 0 to 7 on 8 GPUs]

    if (rank == 0) ncclGetUniqueId(&id);
    My_Bcast(&id);
    ncclCommInitRank(&comm, nranks, id, rank);

Multi-process + multi-GPU per process (from a single thread) : combine ncclCommInitRank with ncclGroupStart/ncclGroupEnd. [diagram: one process driving 8 GPUs]

    if (rank == 0) ncclGetUniqueId(&id);
    My_Bcast(&id);
    ncclGroupStart();
    for (int i=0; i<ndev; i++) {
      cudaSetDevice(devices[i]);
      ncclCommInitRank(&comms[i], ndev*nranks, id, ndev*rank+i);
    }
    ncclGroupEnd();
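My_Bcast above stands for whatever out-of-band broadcast the application already has. As a sketch, assuming MPI is available (an application choice, not something NCCL requires), the one-process-per-GPU initialization could look like this:

    // Sketch: one MPI rank per GPU; MPI is only used to broadcast the NCCL unique id.
    #include <mpi.h>
    #include <nccl.h>
    #include <cuda_runtime.h>

    int main(int argc, char *argv[]) {
      MPI_Init(&argc, &argv);

      int rank, nranks;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      // Assumption: ranks are placed so that "rank % ndev" picks a local GPU.
      int ndev = 1;
      cudaGetDeviceCount(&ndev);
      cudaSetDevice(rank % ndev);

      // Rank 0 creates the unique id, everyone else receives it out of band.
      ncclUniqueId id;
      if (rank == 0) ncclGetUniqueId(&id);
      MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

      ncclComm_t comm;
      ncclCommInitRank(&comm, nranks, id, rank);

      // ... collectives on CUDA buffers and streams go here ...

      ncclCommDestroy(comm);
      MPI_Finalize();
      return 0;
    }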

  21. NCCL 2.0 API Others Other small API adjustments over the NCCL 1.x API :
- Counts are now of type size_t instead of int.
- The allGather argument order has been fixed to match the other operations (see the sketch below).
- Additions/clarifications on datatypes : integral : int8 (= char), uint8, int32 (= int), uint32, int64, uint64 ; floating point : float16 (= half), float32 (= float), float64 (= double).
- Clarifications and fixes for allgather and reduce_scatter send/receive counts and in-place operations.
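A short sketch of the updated allGather call, based on the NCCL 2.x API as described above (the wrapper function name is made up for illustration):

    // Sketch: NCCL 2.x allGather with a size_t count and the
    // (sendbuff, recvbuff, count, datatype, ...) order shared by the other
    // collectives (in NCCL 1.x, recvbuff came after the datatype).
    // Datatype enums include ncclInt8/ncclUint8/ncclInt32/ncclUint32/ncclInt64/
    // ncclUint64 and ncclFloat16/ncclFloat32/ncclFloat64.
    #include <nccl.h>
    #include <cuda_runtime.h>

    // Gathers `count` floats from every rank into `recvbuff`, which must hold
    // nranks * count elements on each rank.
    ncclResult_t all_gather_floats(const float *sendbuff, float *recvbuff,
                                   size_t count, ncclComm_t comm,
                                   cudaStream_t stream) {
      return ncclAllGather(sendbuff, recvbuff, count, ncclFloat, comm, stream);
    }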

  22. PERFORMANCE

  23. PERFORMANCE Intra-node performance [bar chart: AllReduce bandwidth (OMB, size=128MB, in GB/s), y-axis 0-60, platforms : 4 QPI, 4 CPU, 4 PCI, DGX-1]

  24. PERFORMANCE Inter-node performance [bar chart: AllReduce bandwidth (OMB, size=128MB, in GB/s), y-axis 0-45, comparing MPI, Baidu Allreduce and NCCL on 2 nodes x 4 GPUs (IB EDR, PCI Switch) and on 4 nodes x 8 GPUs (DGX-1 : 4x IB EDR, 4x NVLink)]

  25. PERFORMANCE Deep Learning - CNTK [chart: CNTK scaling, ResNet50, images/s versus number of GPUs (0 to 32), comparing Ideal, MPI and NCCL]

  26. FUTURE

  27. FUTURE Top asked features Additional communication primitives : point-to-point communication ; scatter (1 to N), gather (N to 1), alltoall (N to N) ; neighbor collectives (send/receive in multiple dimensions). User-defined reduction operations ; also, trying to merge computation and communication better. Windows support. Please let us know your needs ! Connect with experts / NCCL session : Wed Apr 10, 4pm
