  1. MULTI-GPU TRAINING WITH NCCL (Sylvain Jeaugey)

  2. MULTI-GPU COMPUTING: Harvesting the power of multiple GPUs. The scale goes from 1 GPU, to multiple GPUs per system, to multiple systems connected together. NCCL: NVIDIA Collective Communication Library.

  3. MULTI-GPU DL TRAINING: Single-GPU training loop. A database holds GBs of input data (images, sound, ...); a batch (e.g. 256 images) goes through a forward/backward pass to produce gradients, which are then used to update the parameters.

  4. MULTI-GPU DL TRAINING: Data parallel. Each GPU holds a copy of the parameters and processes its own batch, producing local gradients. NCCL Allreduce sums the local gradients across GPUs so that every GPU ends up with the same global gradients. A minimal sketch of this pattern follows below.
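
     Not part of the original slides: a minimal sketch of the data-parallel gradient Allreduce, assuming a single process drives all GPUs and the communicators and streams were created elsewhere (e.g. with ncclCommInitAll). The function name allreduce_gradients and its parameters are hypothetical.

      #include <cuda_runtime.h>
      #include <nccl.h>

      // gradients[g] is a device buffer of `count` floats on GPU g,
      // already filled by the backward pass.
      void allreduce_gradients(float** gradients, size_t count, int ngpus,
                               ncclComm_t* comms, cudaStream_t* streams) {
        // Group the per-GPU calls so one thread can post them without deadlocking.
        ncclGroupStart();
        for (int g = 0; g < ngpus; g++) {
          // In-place sum: every GPU ends up with the global gradients.
          ncclAllReduce(gradients[g], gradients[g], count,
                        ncclFloat, ncclSum, comms[g], streams[g]);
        }
        ncclGroupEnd();
        // Wait for the reductions before applying the optimizer update.
        for (int g = 0; g < ngpus; g++) {
          cudaSetDevice(g);
          cudaStreamSynchronize(streams[g]);
        }
      }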

  5. NCCL: A multi-GPU communication library. Within a system: NVLink and PCIe, with GPU Direct P2P. Between systems: sockets (Ethernet) and InfiniBand, with GPU Direct RDMA.

  6. NCCL Architecture: Deep learning frameworks (TensorFlow, PyTorch, MXNet, CNTK, Caffe2, Caffe, optionally through Horovod) sit on top of NCCL, cuDNN and cuBLAS, which in turn run on CUDA and NVIDIA GPUs.

  7. TIMELINE: NCCL history & roadmap. 1.x: intra-node communication. 2.0: inter-node communication. 2.1: improved latency. 2.2: aggregated operations. 2.3: large scale algorithms. 2.4: point-to-point primitives (Send/Recv).

  8. NCCL 2.0: Provide best performance to DL apps. [Bar chart: Allreduce bandwidth (OMB, size=128MB), in GB/s, over QPI, CPU, PCI switch, DGX-1 (Pascal) and DGX-1 (Volta).]

  9. NCCL 2.1: ResNet50 buffer sizes. Latency is important in some workloads, e.g. ResNet50, in particular when reductions are done for each layer. [Histogram: number of occurrences of each reduction buffer size, from 64 bytes to 2 MB, for ResNet50 / MXNet in FP32 and FP16.]

  10. NCCL 2.1: Latency improvement. [Chart: NCCL latency in microseconds, 2.0.5 vs 2.1.0, on 2 to 16 GPUs; roughly 5x lower latency on 1 node (NVLink) and 7x lower on 2 nodes (NVLink + InfiniBand).]

  11. NCCL 2.2: Aggregated operations, principle. Merge multiple operations issued to the same CUDA device, so the launch overhead is paid only once (more operations per second) and multiple NVLinks are used simultaneously (more bandwidth). The DL framework calls ncclAllReduce several times, but NCCL issues a single cudaLaunchKernel to CUDA.

  12. NCCL 2.2: Aggregated operations, overhead. [Chart: per-operation time in microseconds for an 8-byte reduction on 8 GPUs, as a function of the number of aggregated operations, from 1 to 256.]

  13. NCCL 2.2: Aggregated operations, usage. Use ncclGroupStart() / ncclGroupEnd() around the NCCL operations we want to aggregate:

      ncclGroupStart();
      for (int op=0; op<nops; op++) {
        ncclAllReduce(layers[op].localGradients, layers[op].globalGradients,
                      layers[op].gradientSize, ncclFloat, ncclSum,
                      ncclComm, ncclStream);
      }
      ncclGroupEnd();
      // All operations are only guaranteed to be posted on the stream after ncclGroupEnd
      cudaStreamSynchronize(ncclStream);

  14. NCCL 2.2: Aggregated operations, usage. Can be combined/nested with multi-GPU grouping:

      ncclGroupStart();
      for (int op=0; op<nops; op++) {
        for (int gpu=0; gpu<ngpus; gpu++) {
          ncclGroupStart();
          ncclAllReduce(layers[op].localGradients[gpu], layers[op].globalGradients[gpu],
                        layers[op].gradientSize, ncclFloat, ncclSum,
                        ncclComms[gpu], ncclStreams[gpu]);
          ncclGroupEnd();
        }
      }
      ncclGroupEnd();
      // All operations are only guaranteed to be posted on the streams after the last ncclGroupEnd
      for (int gpu=0; gpu<ngpus; gpu++)
        cudaStreamSynchronize(ncclStreams[gpu]);

  15. NCCL 2.2: Aggregated operations, other uses. ReduceScatterV = aggregation of multiple reduce operations:

      ncclGroupStart();
      for (int rank=0; rank<nranks; rank++) {
        ncclReduce(sendbuff+offsets[rank], recvbuff+offsets[rank], recvcounts[rank],
                   datatype, redOp, rank, comm, stream);
      }
      ncclGroupEnd();

      AllGatherV = aggregation of multiple broadcast operations:

      ncclGroupStart();
      for (int rank=0; rank<nranks; rank++) {
        ncclBroadcast(sendbuff+offsets[rank], recvbuff+offsets[rank], recvcounts[rank],
                      datatype, rank, comm, stream);
      }
      ncclGroupEnd();

  16. NCCL 2.3: Large scale algorithms. [Chart: Allreduce latency, 2.2 vs 2.3 (projected), at scales from 2 to 128.]

  17. NCCL 2.4: Point-to-point primitives. Send/Receive, Scatter[v], Gather[v], Alltoall[v,w], neighbor collectives, ... [Diagram: scatter, gather and neighbor alltoall patterns.] A sketch of an alltoall built from send/receive follows below.
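
     Not on the original slides: a minimal sketch of how send/receive primitives compose into an alltoall, assuming the ncclSend/ncclRecv API that shipped in later NCCL releases and grouped calls so all transfers are posted as one operation. The alltoall function name and the float element type are illustrative.

      #include <cuda_runtime.h>
      #include <nccl.h>

      // sendbuff and recvbuff are device buffers of nranks * count floats;
      // chunk `peer` of sendbuff goes to rank `peer`, and chunk `peer` of
      // recvbuff is filled with data received from rank `peer`.
      void alltoall(const float* sendbuff, float* recvbuff, size_t count,
                    int nranks, ncclComm_t comm, cudaStream_t stream) {
        ncclGroupStart();  // post all sends and receives as a single operation
        for (int peer = 0; peer < nranks; peer++) {
          ncclSend(sendbuff + peer * count, count, ncclFloat, peer, comm, stream);
          ncclRecv(recvbuff + peer * count, count, ncclFloat, peer, comm, stream);
        }
        ncclGroupEnd();    // transfers are only guaranteed to be posted here
      }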

  18. NCCL Summary. Optimized inter-GPU communication for DL and HPC. Optimized for all NVIDIA platforms, most OEMs and Cloud. Scales to 100s of GPUs, targeting 10,000s in the near future. Aims at covering all communication needs for multi-GPU computing. Only relies on CUDA; no dependency on MPI or any parallel environment. More questions? Connect with the Experts: NCCL, Wed 28, 3pm.
