
Super GPU & Super Kernels: Make programming of multi-GPU systems easy



  1. Super GPU & Super Kernels: Make programming of multi-GPU systems easy. Michael Frumkin, May 8, 2017

  2. AGENDA: Why a super GPU is needed; Extending the CUDA view into clusters; Example: Sparse Matrix-Vector Multiplication; Implications for SW and HW; Example: FFT; Example: Caffe AlexNet; Take-aways

  3. MULTI-GPU SYSTEMS: Multi-GPU nodes are here and more are coming: DGX-1 has 8 GPUs; the coming Summit and Aurora systems will have about 15 K GPUs; GPU-enabled clusters. We need to orchestrate computations across GPUs. GPUs are connected by an NVLink-based network, supported by the MPI-like library NCCL.
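Slide 3 names NCCL as the MPI-like library behind the NVLink network. As a point of reference, a single-process NCCL AllReduce over several GPUs might look like the minimal sketch below; the fixed device count, device list, and buffer size are illustration-only assumptions, not from the slides.

#include <cuda_runtime.h>
#include <nccl.h>

int main() {
  const int nDev = 4;                              // assumed GPU count
  int devs[4] = {0, 1, 2, 3};
  const size_t count = 1 << 20;                    // elements per GPU (assumed)

  float* sendbuff[4];
  float* recvbuff[4];
  cudaStream_t streams[4];
  for (int i = 0; i < nDev; ++i) {                 // one buffer pair and stream per GPU
    cudaSetDevice(i);
    cudaMalloc((void**)&sendbuff[i], count * sizeof(float));
    cudaMalloc((void**)&recvbuff[i], count * sizeof(float));
    cudaMemset(sendbuff[i], 0, count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  ncclComm_t comms[4];
  ncclCommInitAll(comms, nDev, devs);              // one communicator per device

  // Sum the per-GPU buffers across all GPUs; this is the AllReduce pattern
  // used later (slide 9) to combine weight gradients in data-parallel training.
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}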

  4. WORK RASTERIZATION
     // Kernel invocation with a 3 x 2 grid of 4 x 3 threads
     dim3 grid(3, 2, 1);
     dim3 threads(4, 3, 1);
     MatrixAdd<<<grid, threads>>>(A, B, C);
     Rasterization distributes the computation over blocks and threads. CUDA's success in programming a massive number of threads can be extended to multi-GPU systems; tiling GPUs into a super GPU seems like the logical next step in scaling.
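For context, here is one way the launch on slide 4 could sit in a complete program. The slide only shows the invocation; the MatrixAdd kernel body and the 12 x 6 matrix size are assumptions chosen so the 3 x 2 grid of 4 x 3 blocks covers the matrix exactly.

#include <cuda_runtime.h>

// Assumed matrix shape: 12 columns x 6 rows, matching the 3x2 grid of 4x3 blocks.
#define COLS 12
#define ROWS 6

__global__ void MatrixAdd(const float* A, const float* B, float* C) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
  int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
  if (x < COLS && y < ROWS)
    C[y * COLS + x] = A[y * COLS + x] + B[y * COLS + x];
}

int main() {
  size_t bytes = ROWS * COLS * sizeof(float);
  float *A, *B, *C;
  cudaMallocManaged((void**)&A, bytes);
  cudaMallocManaged((void**)&B, bytes);
  cudaMallocManaged((void**)&C, bytes);
  for (int i = 0; i < ROWS * COLS; ++i) { A[i] = float(i); B[i] = 2.0f * i; }

  dim3 grid(3, 2, 1);      // 3 x 2 grid of blocks
  dim3 threads(4, 3, 1);   // 4 x 3 threads per block
  MatrixAdd<<<grid, threads>>>(A, B, C);
  cudaDeviceSynchronize();

  cudaFree(A); cudaFree(B); cudaFree(C);
  return 0;
}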

  5. WORK SUPER RASTERIZATION
     [Figure: GP100 GPUs tiled into a super GPU under different superblock shapes: dim3 sblock(4, 3); dim3 sblock(8, 1, 1); dim3 sblock(8); dim3 sblock(2, 2, 2)]
     __host__ void SpMVKernelS(dim3 sblock, float** d_matr, int** d_idx, …) {
     #pragma omp parallel for num_threads(sblock.x)
       for (int i = 0; i < sblock.x; ++i) {
         cudaSetDevice(i);
         SpMVKernel<<<grid, threads>>>(d_matr[i], d_idx[i], …);
       }
     }
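A hypothetical host-side driver for the super kernel above might partition the sparse matrix into sblock.x row blocks and stage one block per GPU before the call. Everything here (the function name LaunchSpMVSuper, the naive equal split of nonzeros, the host arrays) is an illustration under assumptions, not the author's code; the super-kernel call itself is left commented because its remaining arguments are elided on the slide.

#include <cuda_runtime.h>
#include <vector>

void LaunchSpMVSuper(const std::vector<float>& h_vals,   // all CSR values
                     const std::vector<int>& h_cols,     // all CSR column indices
                     int num_gpus) {
  dim3 sblock(num_gpus, 1, 1);                 // one superblock element per GPU
  size_t per_gpu = h_vals.size() / num_gpus;   // naive equal split of the nonzeros
  std::vector<float*> d_matr(num_gpus);
  std::vector<int*> d_idx(num_gpus);
  for (int i = 0; i < num_gpus; ++i) {
    cudaSetDevice(i);
    cudaMalloc((void**)&d_matr[i], per_gpu * sizeof(float));
    cudaMalloc((void**)&d_idx[i], per_gpu * sizeof(int));
    cudaMemcpy(d_matr[i], h_vals.data() + i * per_gpu,
               per_gpu * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx[i], h_cols.data() + i * per_gpu,
               per_gpu * sizeof(int), cudaMemcpyHostToDevice);
  }
  // SpMVKernelS(sblock, d_matr.data(), d_idx.data(), ...);  // slide 5's super kernel
}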

  6. SPARSE MATRIX-VECTOR MULTIPLICATION (SPMV): No cross-GPU communication. Single-precision SpMV, 22 M nnz, 1 M rows. Super-linear speedup: 8.4x on the DGX-1 node (13.3 GFLOPS on 1 GPU to 112.3 GFLOPS on 8 GPUs) and 4.3x on PLX-connected K40m (3.48 to 14.96 GFLOPS). [Chart: GFLOPS vs. number of GPUs, 1-8, for the DGX-1 node and PLX-connected K40m]

  7. IMPLICATIONS FOR SW AND HW: The driver has an option to recognize super-kernels and optimize the launch; one option is CudaLaunchKernelCooperative; some of the load needed for kernel launch can be moved to the GPUs. Vectorize kernel launches; PCIe supports broadcast. Allocation of page tables can be directed by the superblock.
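The slide names CudaLaunchKernelCooperative as one option; the closest shipping interface I am aware of is the CUDA 9 runtime call cudaLaunchCooperativeKernelMultiDevice, which issues the same kernel on every listed device in a single call. A minimal sketch, assuming identical GPUs that support cooperative multi-device launch; the Scale kernel, grid size, and data size are illustration only.

#include <cuda_runtime.h>
#include <array>
#include <vector>

// Grid-stride loop so a small grid can cover the data; cooperative launches
// require all blocks of the grid to be co-resident on the device.
__global__ void Scale(float* data, int n) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    data[i] *= 0.5f;
}

int main() {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);
  int n = 1 << 20;                               // assumed per-GPU element count

  std::vector<float*> d_data(nDev);
  std::vector<cudaStream_t> streams(nDev);
  std::vector<std::array<void*, 2>> args(nDev);
  std::vector<cudaLaunchParams> params(nDev);

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaMalloc((void**)&d_data[i], n * sizeof(float));
    cudaStreamCreate(&streams[i]);               // must not be the default stream
    args[i] = {(void*)&d_data[i], (void*)&n};
    params[i].func = (void*)Scale;
    params[i].gridDim = dim3(64);                // small grid, co-residency friendly
    params[i].blockDim = dim3(256);
    params[i].args = args[i].data();
    params[i].sharedMem = 0;
    params[i].stream = streams[i];
  }

  // One API call launches the kernel on every device in the list.
  cudaLaunchCooperativeKernelMultiDevice(params.data(), nDev, 0);

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaDeviceSynchronize();
  }
  return 0;
}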

  8. FFT SUPERKERNEL: F_{rs} = (F_r ⊗ I_s) D_s^r (I_r ⊗ F_s). [Chart: GFLOPS vs. number of GPUs (1-8) for a 32 M-point double-complex FFT on a DGX-1 of GP100s]
     dim3 sblock(4, 2, 1);
     CopyDataToGPUs(sblock, r * s, h_src, d_dst);
     ButterflyRightWing(sblock, r, s, d_dst, d_res);
     GlobalTranspose<T2>(sblock, r, s, d_res, d_tsr, d_tmp, handle);
     ButterflyLeftWing(sblock, r, s, d_tmp, d_dst);
     CopyDataFromGPU(sblock, r * s, d_dst, h_res);
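To make the factorization concrete, here is a small CPU reference sketch of the mixed-radix decomposition behind the superkernel: r-point DFTs over stride-s columns, the twiddle diagonal D, then s-point DFTs over stride-r rows. The naive O(N(r+s)) DFTs and the sizes r = 4, s = 8 are assumptions for illustration; the multi-GPU superkernel replaces each stage with batched device FFTs plus the global transpose.

#include <algorithm>
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;
static const double kPi = std::acos(-1.0);

// Naive DFT: X[k] = sum_n x[n*stride] * exp(-2*pi*i*n*k/N).
std::vector<cd> Dft(const cd* x, int N, int stride) {
  std::vector<cd> X(N);
  for (int k = 0; k < N; ++k)
    for (int n = 0; n < N; ++n)
      X[k] += x[n * stride] * std::polar(1.0, -2.0 * kPi * n * k / N);
  return X;
}

int main() {
  const int r = 4, s = 8, N = r * s;     // assumed factor sizes
  std::vector<cd> x(N);
  for (int n = 0; n < N; ++n) x[n] = cd(std::cos(0.3 * n), std::sin(0.1 * n));

  // Stage 1: for each n2, an r-point DFT over the stride-s column x[n1*s + n2].
  // Stage 2: scale by the twiddle diagonal D, i.e. by w_N^(n2*k1).
  std::vector<cd> inner(N);
  for (int n2 = 0; n2 < s; ++n2) {
    std::vector<cd> col = Dft(&x[n2], r, s);
    for (int k1 = 0; k1 < r; ++k1)
      inner[n2 * r + k1] = col[k1] * std::polar(1.0, -2.0 * kPi * n2 * k1 / N);
  }

  // Stage 3: for each k1, an s-point DFT over the stride-r row; writing to the
  // output index k1 + k2*r plays the role of the global transpose.
  std::vector<cd> X(N);
  for (int k1 = 0; k1 < r; ++k1) {
    std::vector<cd> row = Dft(&inner[k1], s, r);
    for (int k2 = 0; k2 < s; ++k2) X[k1 + k2 * r] = row[k2];
  }

  // Check against a direct N-point DFT.
  std::vector<cd> ref = Dft(x.data(), N, 1);
  double err = 0.0;
  for (int k = 0; k < N; ++k) err = std::max(err, std::abs(X[k] - ref[k]));
  std::printf("max |four-step - direct| = %.3e\n", err);
  return 0;
}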

  9. TRAINING NEURAL NETWORKS (CAFFE): Many layers; big data volumes have to pass through. The most computationally expensive are the convolutional layers. Main ops: GEMM, Winograd, FFT. The forward convolution is Z_{o,l,q,r} = Σ_{d=1..D} Σ_{s=1..S} Σ_{t=1..T} Y_{o,d,q+s,r+t} · X_{l,d,s,t}. Data-parallel distribution requires an AllReduce to update the weights.
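For clarity, a direct-loop sketch of the convolution sum above, reading Y as the input activations and X as the filter bank to match the slide's indices. The row-major layout, the no-padding/stride-1 choice, and the sizes in main are assumptions for illustration.

#include <vector>

// Z[o][l][q][r] = sum over d, s, t of Y[o][d][q+s][r+t] * X[l][d][s][t]
// Y: input  [N][D][H][W], X: filters [L][D][S][T], Z: output [N][L][H-S+1][W-T+1].
void ConvForward(const std::vector<float>& Y, const std::vector<float>& X,
                 std::vector<float>& Z,
                 int N, int D, int H, int W, int L, int S, int T) {
  int Hout = H - S + 1, Wout = W - T + 1;
  for (int o = 0; o < N; ++o)                // image in the batch
    for (int l = 0; l < L; ++l)              // output channel
      for (int q = 0; q < Hout; ++q)         // output row
        for (int r = 0; r < Wout; ++r) {     // output column
          float acc = 0.0f;
          for (int d = 0; d < D; ++d)        // input channel
            for (int s = 0; s < S; ++s)
              for (int t = 0; t < T; ++t)
                acc += Y[((o * D + d) * H + (q + s)) * W + (r + t)] *
                       X[((l * D + d) * S + s) * T + t];
          Z[((o * L + l) * Hout + q) * Wout + r] = acc;
        }
}

int main() {
  int N = 1, D = 3, H = 8, W = 8, L = 2, S = 3, T = 3;    // made-up sizes
  std::vector<float> Y(N * D * H * W, 1.0f), X(L * D * S * T, 1.0f);
  std::vector<float> Z(N * L * (H - S + 1) * (W - T + 1));
  ConvForward(Y, X, Z, N, D, H, W, L, S, T);              // every output = D*S*T = 27
  return 0;
}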

  10. CAFFE SUPERKERNEL: [Chart: AlexNet scalability, iterations per second vs. number of GPUs (1, 2, 4, 8)]
     In train(), create dim3 sblock = get_gpus();
     Refer to sblock.Volume() instead of gpus.size();
     Pass sblock to the P2Psync constructor: P2Psync::P2Psync(solver, root_id, sblock, solver->params);
     Use sblock in P2Psync::Run();
     Alternative: caffe_gpu_gemm<float>(sblock, gemm_params, …)

  11. TAKE-AWAYS: Multi-GPU programming using superblocks is easy. Rasterization of the superkernel is as intuitive as rasterization of CTAs, and it results in good scalability assuming good load balance and little communication (SpMV). It transparently distributes the work and lets you concentrate on optimizing the communications: pipeline communications and computations (see the sketch after this slide). Data distributions can be described by the superblock.
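One common way to realize "pipeline communications and computations", referenced above, is double-buffered CUDA streams: while one chunk is being processed in its stream, the copies for the next chunk proceed in the other. The chunk sizes and the Process kernel are assumptions for illustration, not from the slides.

#include <cuda_runtime.h>

__global__ void Process(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;              // placeholder computation
}

int main() {
  const int kChunks = 8;                   // assumed number of chunks
  const int kChunkElems = 1 << 20;         // assumed chunk size
  const size_t kChunkBytes = kChunkElems * sizeof(float);

  float* h_buf;                            // pinned host memory for async copies
  cudaMallocHost((void**)&h_buf, kChunks * kChunkBytes);
  for (int i = 0; i < kChunks * kChunkElems; ++i) h_buf[i] = 1.0f;

  float* d_buf[2];
  cudaStream_t stream[2];
  for (int b = 0; b < 2; ++b) {            // double buffering: two buffers, two streams
    cudaMalloc((void**)&d_buf[b], kChunkBytes);
    cudaStreamCreate(&stream[b]);
  }

  // Copy in, compute, and copy back each chunk in its own stream; chunk c's
  // kernel overlaps with chunk c+1's host-to-device transfer.
  for (int c = 0; c < kChunks; ++c) {
    int b = c % 2;
    cudaMemcpyAsync(d_buf[b], h_buf + c * kChunkElems, kChunkBytes,
                    cudaMemcpyHostToDevice, stream[b]);
    Process<<<(kChunkElems + 255) / 256, 256, 0, stream[b]>>>(d_buf[b], kChunkElems);
    cudaMemcpyAsync(h_buf + c * kChunkElems, d_buf[b], kChunkBytes,
                    cudaMemcpyDeviceToHost, stream[b]);
  }
  cudaDeviceSynchronize();

  cudaFreeHost(h_buf);
  for (int b = 0; b < 2; ++b) { cudaFree(d_buf[b]); cudaStreamDestroy(stream[b]); }
  return 0;
}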

  12. QUESTIONS? Michael Frumkin mfrumkin@nvidia.com

