Progress in automatic GPU compilation and why you want to run MPI on your GPU



  1. Torsten Hoefler: Progress in automatic GPU compilation and why you want to run MPI on your GPU, with Tobias Grosser and Tobias Gysi @ SPCL. Presented at CCDSC, Lyon, France, 2016.

  2. #pragma ivdep
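     The slide shows only the directive itself; as a minimal illustrative sketch (not taken from the deck), #pragma ivdep is placed in front of a loop to assert to compilers such as icc that the iterations carry no dependences, so the loop may be vectorized without a dependence proof:

         // Hypothetical example, not from the slides: the programmer promises
         // there are no loop-carried dependences, letting the compiler vectorize.
         void scale(float* a, const float* b, int n) {
         #pragma ivdep
           for (int i = 0; i < n; ++i)
             a[i] = 2.0f * b[i];
         }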

  3. !$ACC DATA &
     !$ACC PRESENT(density1,energy1) &
     !$ACC PRESENT(vol_flux_x,vol_flux_y,volume,mass_flux_x,mass_flux_y,vertexdx,vertexdy) &
     !$ACC PRESENT(pre_vol,post_vol,ener_flux)
     !$ACC KERNELS
     IF(dir.EQ.g_xdir) THEN
       IF(sweep_number.EQ.1) THEN
         !$ACC LOOP INDEPENDENT
         DO k=y_min-2,y_max+2
           !$ACC LOOP INDEPENDENT
           DO j=x_min-2,x_max+2
             pre_vol(j,k)=volume(j,k)+(vol_flux_x(j+1,k)-vol_flux_x(j,k)+vol_flux_y(j,k+1)-vol_flux_y(j,k))
             post_vol(j,k)=pre_vol(j,k)-(vol_flux_x(j+1,k)-vol_flux_x(j,k))
           ENDDO
         ENDDO
       ELSE
         !$ACC LOOP INDEPENDENT
         DO k=y_min-2,y_max+2
           !$ACC LOOP INDEPENDENT
           DO j=x_min-2,x_max+2
             pre_vol(j,k)=volume(j,k)+vol_flux_x(j+1,k)-vol_flux_x(j,k)
             post_vol(j,k)=volume(j,k)
           ENDDO
         ENDDO
       ENDIF

  4. Heitlager et al.: A Practical Model for Measuring Maintainability

  5. !$ACC DATA &
     !$ACC COPY(chunk%tiles(1)%field%density0) &
     !$ACC COPY(chunk%tiles(1)%field%density1) &
     !$ACC COPY(chunk%tiles(1)%field%energy0) &
     !$ACC COPY(chunk%tiles(1)%field%energy1) &
     !$ACC COPY(chunk%tiles(1)%field%pressure) &
     !$ACC COPY(chunk%tiles(1)%field%soundspeed) &
     !$ACC COPY(chunk%tiles(1)%field%viscosity) &
     !$ACC COPY(chunk%tiles(1)%field%xvel0) &
     !$ACC COPY(chunk%tiles(1)%field%yvel0) &
     !$ACC COPY(chunk%tiles(1)%field%xvel1) &
     !$ACC COPY(chunk%tiles(1)%field%yvel1) &
     !$ACC COPY(chunk%tiles(1)%field%vol_flux_x) &
     !$ACC COPY(chunk%tiles(1)%field%vol_flux_y) &
     !$ACC COPY(chunk%tiles(1)%field%mass_flux_x) &
     !$ACC COPY(chunk%tiles(1)%field%mass_flux_y) &
     !$ACC COPY(chunk%tiles(1)%field%volume) &
     !$ACC COPY(chunk%tiles(1)%field%work_array1) &
     !$ACC COPY(chunk%tiles(1)%field%work_array2) &
     !$ACC COPY(chunk%tiles(1)%field%work_array3) &
     !$ACC COPY(chunk%tiles(1)%field%work_array4) &
     !$ACC COPY(chunk%tiles(1)%field%work_array5) &
     !$ACC COPY(chunk%tiles(1)%field%work_array6) &
     !$ACC COPY(chunk%tiles(1)%field%work_array7) &
     !$ACC COPY(chunk%tiles(1)%field%cellx) &
     !$ACC COPY(chunk%tiles(1)%field%celly) &
     !$ACC COPY(chunk%tiles(1)%field%celldx) &
     !$ACC COPY(chunk%tiles(1)%field%celldy) &
     !$ACC COPY(chunk%tiles(1)%field%vertexx) &
     !$ACC COPY(chunk%tiles(1)%field%vertexdx) &
     !$ACC COPY(chunk%tiles(1)%field%vertexy) &
     !$ACC COPY(chunk%tiles(1)%field%vertexdy) &
     !$ACC COPY(chunk%tiles(1)%field%xarea) &
     !$ACC COPY(chunk%tiles(1)%field%yarea) &
     !$ACC COPY(chunk%left_snd_buffer) &
     !$ACC COPY(chunk%left_rcv_buffer) &
     !$ACC COPY(chunk%right_snd_buffer) &
     !$ACC COPY(chunk%right_rcv_buffer) &
     !$ACC COPY(chunk%bottom_snd_buffer) &
     !$ACC COPY(chunk%bottom_rcv_buffer) &
     !$ACC COPY(chunk%top_snd_buffer) &
     !$ACC COPY(chunk%top_rcv_buffer)

     Sloccount (*.f90): 6,440 lines; !$ACC directives: 833 lines (13%)

  6. (image-only slide, no text)

  7. do i = 0, N
       do j = 0, i
         y(i,j) = ( y(i,j) + y(i,j+1) ) / 2
       end do
     end do

  8. Some results: Polybench 3.2, speedup over icc -O3. Geometric mean: ~6x, arithmetic mean: ~30x. Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop). T. Grosser, TH: Polly-ACC: Transparent Compilation to Heterogeneous Hardware, ACM ICS'16.

  9. Compiles all of SPEC CPU 2006. Example: LBM. (Figure: runtime (m:s) of icc, icc -openmp, clang, and Polly-ACC on a Mobile system, "essentially my 4-core x86 laptop with the (free) GPU that's in there", and on a Workstation, Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop); improvements of ~20% and ~4x are marked on the chart.) T. Grosser, TH: Polly-ACC: Transparent Compilation to Heterogeneous Hardware, ACM ICS'16.

  10. (image-only slide, no text)

  11. GPU latency hiding vs. MPI
     CUDA:
     • over-subscribe hardware
     • use spare parallel slack for latency hiding
     MPI:
     • host controlled
     • full device synchronization
     (Figure: interleaved ld/st instruction traces; legend: device, compute core, active thread, instruction, latency.)
     T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint on the SPCL page)
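     A minimal sketch of what "over-subscribe hardware" means in CUDA (an illustration, not code from the deck): launch far more blocks than the GPU has SMs, so the warp scheduler can switch to ready warps while others wait on memory.

         // Hypothetical CUDA kernel: the grid typically has many more blocks than
         // the device has SMs, giving the scheduler spare parallel slack to hide
         // global-memory latency.
         __global__ void saxpy(int n, float a, const float* x, float* y) {
           int i = blockIdx.x * blockDim.x + threadIdx.x;
           if (i < n)
             y[i] = a * x[i] + y[i];   // while this warp waits on loads, others run
         }

         void launch(int n, float a, const float* x, float* y) {
           int threads = 256;
           int blocks = (n + threads - 1) / threads;   // usually far more blocks than SMs
           saxpy<<<blocks, threads>>>(n, a, x, y);
         }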

  12. Hardware latency hiding at the cluster level?
     dCUDA (distributed CUDA):
     • unified programming model for GPU clusters
     • avoid unnecessary device synchronization to enable system-wide latency hiding
     (Figure: interleaved ld/st/put instruction traces; legend: device, compute core, active thread, instruction, latency.)
     T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint on the SPCL page)

  13. dCUDA: MPI-3 RMA extensions
     • iterative stencil kernel with thread-specific idx computation
     • communication: map ranks to blocks, device-side put/get operations
     • notifications for synchronization
     • shared and distributed memory

     for (int i = 0; i < steps; ++i) {
       for (int idx = from; idx < to; idx += jstride)
         out[idx] = -4.0 * in[idx] + in[idx + 1] + in[idx - 1]
                    + in[idx + jstride] + in[idx - jstride];
       if (lsend)
         dcuda_put_notify(ctx, wout, rank - 1, len + jstride, jstride, &out[jstride], tag);
       if (rsend)
         dcuda_put_notify(ctx, wout, rank + 1, 0, jstride, &out[len], tag);
       dcuda_wait_notifications(ctx, wout, DCUDA_ANY_SOURCE, tag, lsend + rsend);
       swap(in, out);
       swap(win, wout);
     }

     T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint on the SPCL page)

  14. Hardware-supported communication overlap. (Figure: per-core block schedules for dCUDA vs. traditional MPI-CUDA; legend: device, compute core, active block.) T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint on the SPCL page)

  15. The dCUDA runtime system. (Block diagram: host side with event handler, block manager, MPI, and logging; device side with the device library and context; connected through ack, notification, command, and "more blocks" queues.) T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint on the SPCL page)
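     To make the diagram concrete, here is a minimal sketch, under stated assumptions and not the actual dCUDA implementation: a host-side manager drains a command queue that device code fills (for example in mapped, pinned memory allocated with cudaHostAlloc) and forwards each entry to MPI. The struct fields and function names are hypothetical.

         #include <mpi.h>

         // Illustrative command record posted by a device block.
         struct Command {
           volatile int ready;           // set by device code when the entry is complete
           int target_rank, tag, count;
           double* payload;              // host-visible buffer
         };

         // Host-side "block manager" loop: poll the queue, issue MPI calls on
         // behalf of device blocks, then acknowledge completion.
         void block_manager(Command* queue, int capacity, volatile int* shutdown) {
           int head = 0;
           while (!*shutdown) {
             Command* cmd = &queue[head % capacity];
             if (cmd->ready) {
               MPI_Send(cmd->payload, cmd->count, MPI_DOUBLE,
                        cmd->target_rank, cmd->tag, MPI_COMM_WORLD);
               cmd->ready = 0;           // ack back to the device
               ++head;
             }
           }
         }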

  16. (Very) simple stencil benchmark. Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node. (Figure: execution time [ms] vs. number of copy iterations per exchange (30, 60, 90); series: no overlap, compute & exchange, halo exchange, compute only.) T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint on the SPCL page)

  17. Real stencil (COSMO weather/climate code). Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node. (Figure: execution time [ms] vs. number of nodes (2-8); series: MPI-CUDA, dCUDA, halo exchange.) T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint on the SPCL page)

  18. Particle simulation code (Barnes-Hut). Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node. (Figure: execution time [ms] vs. number of nodes (2-8); series: MPI-CUDA, dCUDA, halo exchange.) T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint on the SPCL page)

  19. Sparse matrix-vector multiplication. Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node. (Figure: execution time [ms] vs. number of nodes (1, 4, 9); series: dCUDA, MPI-CUDA, communication.) T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint on the SPCL page)

  20. Summary: dCUDA for distributed memory (the slide repeats the device-side put/notify stencil loop from slide 13) and Polly-ACC (http://spcl.inf.ethz.ch/Polly-ACC). Keywords on the slide: Automatic, Overlap, "Regression Free", High Performance.

  21. LLVM Nightly Test Suite. (Figure: number of SCoPs on a log scale from 1 to 10,000, broken down into 0-dim, 1-dim, 2-dim, and 3-dim, with and without heuristics.)
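     For context (an illustration, not from the slides): a SCoP (static control part) is a code region with affine loop bounds, conditions, and array subscripts that Polly can model in the polyhedral framework; its dimensionality corresponds roughly to the depth of the deepest loop nest it contains. A 2-dim SCoP might look like:

         // Hypothetical 2-dimensional SCoP: affine bounds and affine subscripts.
         void smooth(int n, int m, const float* A, float* B) {
           for (int i = 1; i < n - 1; ++i)
             for (int j = 1; j < m - 1; ++j)
               B[i*m + j] = 0.25f * (A[(i-1)*m + j] + A[(i+1)*m + j]
                                   + A[i*m + (j-1)] + A[i*m + (j+1)]);
         }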

  22. Cactus ADM (SPEC 2006). (Figure: results for the Workstation and Mobile systems.)

  23. Evading various "ends" – the hardware view
