dCUDA: Hardware Supported Overlap of Computation and Communication
Tobias Gysi, Jeremia Bär, and Torsten Hoefler
SPCL, ETH Zürich (spcl.inf.ethz.ch, @spcl_eth)
GPU computing gained a lot of popularity in various application domains
• weather & climate
• machine learning
• molecular dynamics
GPU cluster programming using MPI and CUDA

[Figure: two nodes, each with a device and device memory attached via PCI-Express to host memory; the hosts are connected by the interconnect]

// run compute kernel
__global__ void mykernel( … ) { }

// launch compute kernel
mykernel<<<64,128>>>( … );

// on-node data movement
cudaMemcpy(&size, psize, sizeof(int),
           cudaMemcpyDeviceToHost);

// inter-node data movement
MPI_Send(pdata, size, MPI_FLOAT, … );
MPI_Recv(pdata, size, MPI_FLOAT, … );
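To make the pattern concrete, here is a minimal, self-contained sketch of this MPI-CUDA style (illustrative only: the kernel body, buffer sizes, and the pairwise exchange with rank ^ 1 are assumptions, not from the slide):

#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

__global__ void mykernel(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;               // some on-device computation
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n = 64 * 128;                   // one element per thread
  std::vector<float> host(n, 1.0f);
  float *dev; cudaMalloc(&dev, n * sizeof(float));
  cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  mykernel<<<64, 128>>>(dev, n);            // launch compute kernel
  cudaDeviceSynchronize();                  // device sync before communication

  // on-node data movement: device -> host
  cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);

  // inter-node data movement: exchange with a neighbor rank
  // (assumes an even number of ranks, e.g. mpirun -np 2)
  int peer = rank ^ 1;
  MPI_Sendrecv_replace(host.data(), n, MPI_FLOAT, peer, 0, peer, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  cudaFree(dev);
  MPI_Finalize();
  return 0;
}

Note how the device must be synchronized and drained before every MPI call — exactly the sequential kernel/copy/send/recv cycle the next slide criticizes.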
Disadvantages of the MPI-CUDA approach

complexity
• two programming models
• duplicated functionality

performance
• encourages sequential execution
• low utilization of the costly hardware

[Figure: host/device timeline — mykernel<<< >>>( … ) runs on the device; cudaMemcpy( … ) and a device sync follow; then mpi_send( … ), mpi_recv( … ), and a cluster sync on the host before the next mykernel launch]
Achieve high resource utilization using oversubscription & hardware threads

GPU cores use "parallel slack" to hide instruction pipeline latencies.

[Figure: three hardware threads each issue ld %r0,%r1 / mul %r0,%r0,3 / st %r0,%r1; while one thread stalls in the instruction pipeline, another ready thread issues, keeping the pipeline full]
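A minimal CUDA sketch of this oversubscription idea (kernel and sizes are illustrative): launching far more threads than the GPU has cores gives the scheduler enough ready warps to cover the ld/mul/st latencies shown above.

#include <cuda_runtime.h>

// Each thread performs the ld -> mul -> st sequence from the figure.
// While one warp stalls on its load, the SM issues another ready warp,
// so the pipeline stays busy ("parallel slack").
__global__ void scale(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 3.0f;
}

int main() {
  const int n = 1 << 24;                    // ~16M elements: many more
  float *in, *out;                          // threads than physical cores
  cudaMalloc(&in, n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));

  int threads = 128;
  int blocks = (n + threads - 1) / threads; // oversubscribe every SM
  scale<<<blocks, threads>>>(in, out, n);
  cudaDeviceSynchronize();

  cudaFree(in);
  cudaFree(out);
  return 0;
}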
Use oversubscription & hardware threads to hide remote memory latencies

Introduce put & get operations to access distributed memory; while one thread waits for a remote access to complete, other ready threads keep the core busy.

[Figure: three hardware threads issue get … / mul %r0,%r0,3 / put …; remote accesses stall far longer than pipeline latencies, leaving gaps where no thread is ready — more parallel slack is needed]
How much "parallel slack" is necessary to fully utilize the interconnect?

Little's law: concurrency = latency × throughput

              device memory   interconnect
latency       1 µs            19 µs
bandwidth     200 GB/s        6 GB/s
concurrency   200 kB          114 kB
#threads      ~12000          ~7000
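Plugging the table values into Little's law shows where the concurrency numbers come from (the ~16 B of data in flight per thread used to convert concurrency into a thread count is an assumption, not stated on the slide):

device memory: 1 µs × 200 GB/s = 200 kB in flight
interconnect: 19 µs × 6 GB/s = 114 kB in flight

With roughly 16 B outstanding per thread, 200 kB / 16 B ≈ 12000 threads for device memory and 114 kB / 16 B ≈ 7000 threads for the interconnect.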
dCUDA (distributed CUDA) extends CUDA with MPI-3 RMA and notifications

• iterative stencil kernel
• thread-specific computation
• map ranks to blocks
• device-side put/get operations
• notifications for synchronization
• shared and distributed memory

for (int i = 0; i < steps; ++i) {
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] +
               in[idx + 1] + in[idx - 1] +
               in[idx + jstride] + in[idx - jstride];

  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1,
                     len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1,
                     0, jstride, &out[len], tag);

  dcuda_wait_notifications(ctx, wout,
                           DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out); swap(win, wout);
}
Advantages of the dCUDA approach

performance
• avoid device synchronization
• latency hiding at cluster scale

complexity
• unified programming model
• one communication mechanism

[Figure: timelines for two devices — with MPI-CUDA, each device runs stencil( … ), then syncs, puts, and syncs again before the next iteration; with dCUDA, ranks 1–4 on the two devices continuously issue stencil( … ), put( … ), and wait( … ) with no device synchronization]
Implementation of the dCUDA runtime system

host-side: an event handler and one block manager per block, communicating via MPI and GPUDirect
device-side: a device-library instance per block providing put( … ), get( … ), and wait( … )

[Figure: event handler and block managers on the host; device-library instances on the device issuing put/get/wait]
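The diagram suggests that host-side block managers forward device-issued operations over MPI. The following is a deliberately simplified sketch of that forwarding idea — not the actual dCUDA runtime: the request struct, flag protocol, and single-slot queue are assumptions for illustration (relies on unified virtual addressing, standard on 64-bit CUDA):

#include <mpi.h>
#include <cuda_runtime.h>

struct Request {                 // one slot, written by the device
  volatile int ready;            // 0 = empty, 1 = request pending
  int target;                    // destination rank
  int count;                     // number of floats to send
  float payload[256];
};

// Single device thread deposits a "put" request in zero-copy host memory.
__global__ void produce(Request *req, int target) {
  for (int i = 0; i < 256; ++i) req->payload[i] = (float)i;
  req->target = target;
  req->count = 256;
  __threadfence_system();        // make the payload visible to the host
  req->ready = 1;                // publish the request
}

int main(int argc, char **argv) {   // run with: mpirun -np 2 ...
  MPI_Init(&argc, &argv);
  int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  Request *req;                  // pinned, device-mapped host memory
  cudaHostAlloc(&req, sizeof(Request), cudaHostAllocMapped);
  req->ready = 0;

  if (rank == 0) {
    produce<<<1, 1>>>(req, 1);   // device issues the "put"
    while (!req->ready) { }      // host-side block manager polls ...
    MPI_Send(req->payload, req->count, MPI_FLOAT,
             req->target, 0, MPI_COMM_WORLD);  // ... and forwards via MPI
    cudaDeviceSynchronize();
  } else if (rank == 1) {
    float buf[256];
    MPI_Recv(buf, 256, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

  cudaFreeHost(req);
  MPI_Finalize();
  return 0;
}

The real runtime batches such requests per block and also moves data directly between GPU and NIC via GPUDirect; this sketch only shows why the device never has to synchronize to communicate.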
Overlap of a copy kernel with halo exchange communication
benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

[Plot: execution time [ms] (0–1000) vs. # of copy iterations per exchange (30, 60, 90); curves for no overlap, compute & exchange, halo exchange, and compute only — compute & exchange tracks compute only, indicating full overlap]
Weak scaling of MPI-CUDA and dCUDA for a stencil program
benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

[Plot: execution time [ms] (0–100) vs. # of nodes (2–8); curves for MPI-CUDA, dCUDA, and halo exchange]
Weak scaling of MPI-CUDA and dCUDA for a particle simulation
benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

[Plot: execution time [ms] (0–200) vs. # of nodes (2–8); curves for MPI-CUDA, dCUDA, and halo exchange]
Weak scaling of MPI-CUDA and dCUDA for sparse-matrix vector multiplication
benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)

[Plot: execution time [ms] (0–200) vs. # of nodes (1, 4, 9); curves for dCUDA, MPI-CUDA, and communication]
Conclusions

unified programming model for GPU clusters
• device-side remote memory access operations with notifications
• transparent support of shared and distributed memory

extend the latency hiding technique of CUDA to the full cluster
• inter-node communication without device synchronization
• use oversubscription & hardware threads to hide remote memory latencies

automatic overlap of computation and communication
• synthetic benchmarks demonstrate perfect overlap
• example applications demonstrate the applicability to real codes

https://spcl.inf.ethz.ch/Research/Parallel_Programming/dCUDA/