Super GPU & Super Kernels: Making programming of multi-GPU systems easy
Michael Frumkin, May 8, 2017
AGENDA
Why a super GPU is needed
Extending the CUDA view into clusters
Example: Sparse Matrix-Vector Multiplication
Implications for SW and HW
Example: FFT
Example: Caffe AlexNet
Take-aways
MULTI-GPU SYSTEMS
Multi-GPU nodes are here, and more are coming
- DGX-1: 8 GPUs
- Coming: Summit and Aurora, about 15 K GPUs
- GPU-enabled clusters
We need to orchestrate computations across GPUs
GPUs are connected by an NVLink-based network
Supported by NCCL, an MPI-like library
WORK RASTERIZATION

// Kernel invocation with a 3 x 2 grid of 4 x 3 threads
dim3 grid(3, 2, 1);
dim3 threads(4, 3, 1);
MatrixAdd<<<grid, threads>>>(A, B, C);

Rasterization allows CUDA to distribute computations
CUDA's success in programming a massive number of threads can be extended to multi-GPU systems
Tiling GPUs into a super-GPU seems like a logical step in scaling
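For reference, a minimal runnable sketch of the launch on this slide; the MatrixAdd kernel body, the 12 x 6 matrix size, and the use of managed memory are illustrative assumptions, not code from the talk.

#include <cuda_runtime.h>

// Matrix size chosen to match the 3 x 2 grid of 4 x 3 threads exactly.
#define W 12   // 3 blocks * 4 threads in x
#define H 6    // 2 blocks * 3 threads in y

__global__ void MatrixAdd(const float* A, const float* B, float* C) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < W && y < H) C[y * W + x] = A[y * W + x] + B[y * W + x];
}

int main() {
  float *A, *B, *C;
  cudaMallocManaged(&A, W * H * sizeof(float));
  cudaMallocManaged(&B, W * H * sizeof(float));
  cudaMallocManaged(&C, W * H * sizeof(float));
  for (int i = 0; i < W * H; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

  dim3 grid(3, 2, 1);     // 3 x 2 grid of thread blocks
  dim3 threads(4, 3, 1);  // 4 x 3 threads per block
  MatrixAdd<<<grid, threads>>>(A, B, C);
  cudaDeviceSynchronize();

  cudaFree(A); cudaFree(B); cudaFree(C);
  return 0;
}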
WORK SUPER RASTERIZATION

[Figure: GP100 GPUs tiled into super-GPUs of different superblock shapes: dim3 sblock(4, 3); dim3 sblock(8, 1, 1); dim3 sblock(8); dim3 sblock(2, 2, 2)]

__host__ void SpMVKernelS(dim3 sblock, float** d_matr, int** d_idx, ...) {
  #pragma omp parallel for num_threads(sblock.x)
  for (int i = 0; i < sblock.x; ++i) {
    cudaSetDevice(i);
    SpMVKernel<<<grid, threads>>>(d_matr[i], d_idx[i], ...);
  }
}
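To make the pattern concrete, here is a self-contained variant of the loop above applied to a trivial Scale kernel. The kernel, the launch sizes, and the assumption that d_x[i] was allocated on device i are illustrative, not code from the talk.

#include <cuda_runtime.h>
#include <omp.h>

__global__ void Scale(float* x, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

// One OpenMP thread per GPU in a 1-D superblock; each thread binds to its
// device and launches that device's share of the work. d_x[i] is assumed
// to be a buffer already resident on device i.
__host__ void ScaleKernelS(dim3 sblock, float** d_x, int n_per_gpu, float a) {
  #pragma omp parallel for num_threads(sblock.x)
  for (int i = 0; i < (int)sblock.x; ++i) {
    cudaSetDevice(i);
    int threads = 256;
    int blocks = (n_per_gpu + threads - 1) / threads;
    Scale<<<blocks, threads>>>(d_x[i], n_per_gpu, a);
  }
  // Wait for all GPUs before returning.
  for (int i = 0; i < (int)sblock.x; ++i) {
    cudaSetDevice(i);
    cudaDeviceSynchronize();
  }
}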
SPARSE MATRIX-VECTOR MULTIPLICATION (SPMV)
No cross-GPU communications
Super-linear speedup
- 8.4x on a DGX-1 node
- 4.3x on PLX-connected K40m
[Chart: single-precision SpMV, 22 M nnz, 1 M rows; GFLOPs vs. number of GPUs (1 to 8) for PLX-connected K40m and a DGX-1 node; the DGX-1 curve peaks at 112.3 GFLOPs on 8 GPUs]
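One plausible reading of the "no cross-GPU communications" point is that each GPU owns a block of CSR rows plus a full copy of x, so every output entry is computed locally. A minimal sketch of such a per-GPU kernel follows; the names and the row-block partitioning are assumptions, not the talk's implementation.

// One thread per row of this GPU's CSR row block; x is replicated on every
// GPU, so no inter-GPU traffic is needed during the multiply.
__global__ void SpMVKernel(int rows, const int* row_ptr, const int* col_idx,
                           const float* vals, const float* x, float* y) {
  int r = blockIdx.x * blockDim.x + threadIdx.x;
  if (r < rows) {
    float sum = 0.0f;
    for (int j = row_ptr[r]; j < row_ptr[r + 1]; ++j)
      sum += vals[j] * x[col_idx[j]];
    y[r] = sum;
  }
}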
IMPLICATIONS FOR SW AND HW
The driver has an option to recognize super-kernels and optimize their launch
- One option: CudaLaunchKernelCooperative
- Move some of the load needed for kernel launch to the GPUs
Vectorize kernel launches
- PCIe supports broadcast
Allocation of page tables can be directed by the superblock
FFT SUPERKERNEL
F_rs = (F_r ⊗ I_s) D_s^r (I_r ⊗ F_s)

[Chart: 32 M-point double-complex FFT on a DGX-1 with GP100 GPUs; GFLOPS vs. number of GPUs (1 to 8)]

dim3 sblock(4, 2, 1);
CopyDataToGPUs(sblock, r * s, h_src, d_dst);
ButterflyRightWing(sblock, r, s, d_dst, d_res);
GlobalTranspose<T2>(sblock, r, s, d_res, d_tsr, d_tmp, handle);
ButterflyLeftWing(sblock, r, s, d_tmp, d_dst);
CopyDataFromGPU(sblock, r * s, d_dst, h_res);
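As a sketch of how one wing could map onto the superblock: the (I_r ⊗ F_s) factor is r independent length-s FFTs, so each GPU can run its share as a batched cuFFT plan on its local slice. The helper below is an assumption about the structure of ButterflyRightWing, not the talk's implementation; the twiddle-factor multiplication D_s^r and error checking are omitted.

#include <cufft.h>
#include <cuda_runtime.h>
#include <omp.h>

// d_in[i] and d_out[i] are assumed to hold this GPU's contiguous slice of
// r / sblock.x rows of length s, already resident on device i.
void ButterflyRightWingSketch(dim3 sblock, int r, int s,
                              cufftDoubleComplex** d_in,
                              cufftDoubleComplex** d_out) {
  int batch_per_gpu = r / sblock.x;  // assumes r is divisible by the GPU count
  #pragma omp parallel for num_threads(sblock.x)
  for (int i = 0; i < (int)sblock.x; ++i) {
    cudaSetDevice(i);
    cufftHandle plan;                               // plan is bound to device i
    cufftPlan1d(&plan, s, CUFFT_Z2Z, batch_per_gpu);
    cufftExecZ2Z(plan, d_in[i], d_out[i], CUFFT_FORWARD);
    cudaDeviceSynchronize();
    cufftDestroy(plan);
  }
}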
TRAINING NEURAL NETWORKS (CAFFE)
Many layers; big data volumes have to pass through them
The most computationally expensive are the convolutional layers
Main ops: GEMM, Winograd, FFT

out(n, i, j) = Σ_{c=1..D} Σ_{s=1..R} Σ_{t=1..S} I(c, i + s, j + t) · W(n, c, s, t)

Data-parallel distribution requires an AllReduce to update the weights
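A hedged sketch of the AllReduce step this last point refers to, driven from a single host thread with NCCL; the gradient buffer names, the communicator and stream setup, and the in-place reduction are assumptions for illustration, not Caffe's code.

#include <nccl.h>
#include <cuda_runtime.h>

// Sum the per-GPU gradient buffers in place across the superblock.
// comms[i] and streams[i] are assumed to have been created for device i
// (e.g., with ncclCommInitAll and cudaStreamCreate).
void AllReduceGradients(dim3 sblock, float** d_grad, size_t count,
                        ncclComm_t* comms, cudaStream_t* streams) {
  ncclGroupStart();  // required when one thread drives several GPUs
  for (int i = 0; i < (int)sblock.x; ++i) {
    ncclAllReduce(d_grad[i], d_grad[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  }
  ncclGroupEnd();
  for (int i = 0; i < (int)sblock.x; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }
}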
CAFFE SUPERKERNEL
In train(), create dim3 sblock = get_gpus();
Refer to sblock.Volume() instead of gpus.size();
Pass sblock to the P2Psync constructor:
  P2Psync::P2Psync(solver, root_id, sblock, solver->params);
Use sblock in P2Psync::Run();
Alternative: caffe_gpu_gemm<float>(sblock, gemm_params, ...)

[Chart: AlexNet scalability in Caffe; iterations per second vs. number of GPUs (1, 2, 4, 8)]
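The talk does not show get_gpus(); one guess at a minimal helper that builds a 1-D superblock over all visible devices (the name and the flat shape are assumptions):

#include <cuda_runtime.h>

// Build a 1-D superblock with one slot per visible GPU. A different shape
// (e.g., 2 x 2 x 2 on a DGX-1) could be chosen to match the topology.
dim3 get_gpus() {
  int n = 0;
  cudaGetDeviceCount(&n);
  return dim3(n, 1, 1);
}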
TAKE-AWAYS
Multi-GPU programming using superblocks is easy
Rasterization of the superkernel is as intuitive as rasterization of CTAs
• Results in good scalability, assuming good load balance and small communications (SpMV)
Transparently distributes work and lets you concentrate on optimizing the communications
• Pipeline communications and computations
Data distributions can be described by the superblock
QUESTIONS? Michael Frumkin mfrumkin@nvidia.com