  1. Tuned and Wildly Asynchronous Stencil Kernels for Heterogeneous CPU/GPU systems
     Sundaresan Venkatasubramanian, Prof. Richard Vuduc (Georgia Tech)
     Int’l. Conference on Supercomputing, June 10, 2009

  2. Motivation and Goals
     - Regular, memory bandwidth-bound kernel
     - GPU with high memory bandwidth
     - Seems simple, but…
       - Tuning
       - Constraints on parallelism and data movement
       - Real systems mix CPU-like and GPU-like processors
     - Goal of paper: solve the mapping problem

  3. Key ideas and results
     - Hand-tuned implementations to understand how to map the kernel
     - 98% of empirical streaming bandwidth
     - Model-driven hybrid CPU/GPU implementation
     - Asynchronous algorithms to avoid global syncs [Chazan & Miranker ’69]
     - 1.2 to 2.5x speedup, even while performing up to 1.7x the flops!

  4. Problem
     Solve Poisson’s equation in 2-D on a square grid, using a centered finite-difference
     approximation on an (N+2) x (N+2) grid with step size h.
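
The equations on this slide were not preserved in the transcript. For reference, the standard 2-D Poisson problem and its centered 5-point discretization, reconstructed here rather than copied from the slide, are:

```latex
% 2-D Poisson problem and its centered (5-point) finite-difference
% discretization on an (N+2) x (N+2) grid with step size h.
% Standard textbook form, reconstructed; the slide's equation images were lost.
\nabla^2 u = f \ \text{in } \Omega, \qquad u = g \ \text{on } \partial\Omega,
\qquad
\frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4\,u_{i,j}}{h^2} = f_{i,j},
\quad 1 \le i, j \le N .
```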

  5. Algorithm
     - Memory bandwidth bound
     - Used as a subroutine in more complex algorithms, e.g. multigrid

     for t = 1, 2, ..., T do
         for all unknowns (i,j) in the grid do      (embarrassingly parallel)
             U^{t+1}_{i,j} = ¼ * (U^t_{i+1,j} + U^t_{i-1,j} + U^t_{i,j-1} + U^t_{i,j+1})
         end for
     end for

     - 2 copies of the grid
     - Implicit global sync required between iterations
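
A minimal serial reference for the sweep above, written as plain host code (the grid layout and the omission of the source term follow the slide; everything else is an illustrative sketch, not the paper's code):

```cuda
// Serial reference for the Jacobi-style sweep on slide 5. u_old/u_new are
// (N+2) x (N+2) grids stored row-major; the outer ring holds the boundary
// values. The source term is omitted, as it is on the slide.
#include <algorithm>

void jacobi_sweep(const float* u_old, float* u_new, int N) {
    int stride = N + 2;
    for (int i = 1; i <= N; ++i)
        for (int j = 1; j <= N; ++j) {
            int idx = i * stride + j;
            u_new[idx] = 0.25f * (u_old[idx + stride] + u_old[idx - stride] +
                                  u_old[idx + 1]      + u_old[idx - 1]);
        }
}

// T iterations ping-pong the two grid copies (the "2 copies" on the slide);
// finishing a full sweep before starting the next is the implicit global sync.
void jacobi(float* u0, float* u1, int N, int T) {
    for (int t = 0; t < T; ++t) {
        jacobi_sweep(u0, u1, N);
        std::swap(u0, u1);
    }
}
```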

  6. CPU and GPU Baselines

  7. Tuned CPU Baseline*
     - SIMD vectorization
     - Parallelized using pthreads
     - Binding on NUMA architectures
     *See paper for details
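
A pthreads sketch of the kind of row-partitioned sweep these bullets describe (the structure, names, and barrier-per-sweep scheme are illustrative assumptions, not the paper's code; SIMD comes from vectorizing the inner loop, and NUMA locality from having each thread first-touch and then always update its own block of rows):

```cuda
#include <pthread.h>

struct SweepArgs {
    float* grids[2];             // two copies of the (N+2) x (N+2) grid
    int N, row_begin, row_end;   // this thread owns interior rows [row_begin, row_end)
    int T;
    pthread_barrier_t* barrier;  // shared by all worker threads
};

static void* sweep_worker(void* p) {
    SweepArgs* a = (SweepArgs*)p;
    const int stride = a->N + 2;
    // NUMA note: pinning the thread (e.g. pthread_setaffinity_np) and first-touch
    // initializing its rows keeps this block of the grid in local memory.
    for (int t = 0; t < a->T; ++t) {
        const float* src = a->grids[t & 1];
        float*       dst = a->grids[(t + 1) & 1];
        for (int i = a->row_begin; i < a->row_end; ++i)
            for (int j = 1; j <= a->N; ++j) {      // contiguous, vectorizable inner loop
                int idx = i * stride + j;
                dst[idx] = 0.25f * (src[idx + stride] + src[idx - stride] +
                                    src[idx + 1]      + src[idx - 1]);
            }
        pthread_barrier_wait(a->barrier);          // all threads finish the sweep
    }
    return nullptr;
}
```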

  8. Tuned GPU Baseline*
     - Exploit shared memory
     - Bank conflicts: access pattern
     - Non-coalesced accesses: padding
     - Loop unrolling
     - Occupancy: proper block size
     *See paper for details
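
A sketch of a shared-memory tiled kernel in the spirit of these bullets (the block shape, padding choice, and the assumption that N is a multiple of the tile dimensions are ours, not the paper's exact configuration):

```cuda
// Tiled Jacobi sweep: each block stages a (TILE_Y+2) x (TILE_X+2) tile (with
// halo) in shared memory, then computes its interior. Assumes N is a multiple
// of TILE_X and TILE_Y and a launch grid of (N/TILE_X, N/TILE_Y) blocks.
#define TILE_X 32                       // warp-wide rows give coalesced loads
#define TILE_Y 8                        // block size chosen with occupancy in mind

__global__ void jacobi_tiled(const float* __restrict__ u_old,
                             float* __restrict__ u_new, int N) {
    // The +1 padding column is a common precaution against shared-memory bank
    // conflicts; the slide's padding for coalescing likely refers to the
    // global-array pitch, which is omitted here.
    __shared__ float tile[TILE_Y + 2][TILE_X + 2 + 1];

    int gx = blockIdx.x * TILE_X + threadIdx.x + 1;   // interior column
    int gy = blockIdx.y * TILE_Y + threadIdx.y + 1;   // interior row
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;
    int stride = N + 2;

    // Load the center cell, plus halo cells for threads on the tile edges.
    tile[ly][lx] = u_old[gy * stride + gx];
    if (threadIdx.x == 0)          tile[ly][0]          = u_old[gy * stride + gx - 1];
    if (threadIdx.x == TILE_X - 1) tile[ly][TILE_X + 1] = u_old[gy * stride + gx + 1];
    if (threadIdx.y == 0)          tile[0][lx]          = u_old[(gy - 1) * stride + gx];
    if (threadIdx.y == TILE_Y - 1) tile[TILE_Y + 1][lx] = u_old[(gy + 1) * stride + gx];
    __syncthreads();

    u_new[gy * stride + gx] = 0.25f * (tile[ly][lx - 1] + tile[ly][lx + 1] +
                                       tile[ly - 1][lx] + tile[ly + 1][lx]);
}
```

It would be launched as, e.g., `jacobi_tiled<<<dim3(N/TILE_X, N/TILE_Y), dim3(TILE_X, TILE_Y)>>>(d_old, d_new, N)`; the loop-unrolling item would apply once each thread produces several output points, which this sketch does not do.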

  9. Experimental set-up

  10. Results
      Device: NVIDIA Tesla C870; grid size: 4096; no. of iterations: 32

  11. Results

  12. 98% of empirical streaming bandwidth; approx. 37 GFLOPS; 66% of true peak

  13. Half the work of single: approx. 17 GFLOPS

  14. CPU/GPU and Multi-GPU Methods

  15. CPU-GPU implementation

  16. CPU-GPU implementation: boundary (ghost) cells need to be exchanged between the CPU and
      GPU partitions

  17. Algorithm
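
As a concrete illustration only, here is one way the hybrid iteration could be organized, assuming a row-wise split in which the GPU owns the top gpu_rows interior rows and the CPU owns the rest; cpu_sweep stands in for the pthreads code of slide 7, jacobi_tiled for the kernel sketch of slide 8, and all names, the split, and the synchronous exchange are our assumptions, not the paper's code:

```cuda
// Hypothetical helper: the pthreads sweep of slide 7, restricted to interior
// rows [row_begin, row_end) of the destination grid.
void cpu_sweep(const float* u_old, float* u_new, int N, int row_begin, int row_end);

// Hybrid loop with a one-row ghost exchange per iteration. h_u[0..1] are the
// host copies, d_u[0..1] the device copies; assumes N % 32 == 0 and
// gpu_rows % 8 == 0 so the kernel sketch's tiling fits.
void hybrid_jacobi(float* h_u[2], float* d_u[2], int N, int T, int gpu_rows) {
    size_t row_bytes = (size_t)(N + 2) * sizeof(float);
    for (int t = 0; t < T; ++t) {
        int src = t & 1, dst = (t + 1) & 1;

        // GPU updates interior rows 1 .. gpu_rows (launch is asynchronous).
        dim3 block(32, 8), grid(N / 32, gpu_rows / 8);
        jacobi_tiled<<<grid, block>>>(d_u[src], d_u[dst], N);

        // CPU updates interior rows gpu_rows+1 .. N while the GPU works.
        cpu_sweep(h_u[src], h_u[dst], N, gpu_rows + 1, N + 1);
        cudaDeviceSynchronize();

        // Exchange the ghost rows across the CPU/GPU boundary ("need to exchange").
        cudaMemcpy(h_u[dst] + (size_t)gpu_rows * (N + 2),          // GPU's last row -> CPU
                   d_u[dst] + (size_t)gpu_rows * (N + 2),
                   row_bytes, cudaMemcpyDeviceToHost);
        cudaMemcpy(d_u[dst] + (size_t)(gpu_rows + 1) * (N + 2),    // CPU's first row -> GPU
                   h_u[dst] + (size_t)(gpu_rows + 1) * (N + 2),
                   row_bytes, cudaMemcpyHostToDevice);
    }
}
```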

  18. CPU-GPU model graph: x-axis is the fraction of the grid assigned to the CPU (0% to 100%);
      y-axis is the fraction of baseline CPU-only time (0% to 100%). Curves: baseline CPU-only,
      baseline GPU-only, and the hybrid (GPU part + CPU part + exchange time); the hybrid’s
      minimum marks the optimal fraction.
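
Our reading of this graph is that the hybrid time is roughly max(CPU part, GPU part) plus the exchange time, minimized where the two parts balance. A toy sketch of that assumed model (not a formula stated on the slide):

```cuda
#include <algorithm>

// Assumed model: hybrid time = max(CPU part, GPU part) + exchange time, with
// t_cpu and t_gpu the full-grid times of the tuned CPU and GPU baselines and
// f the fraction of the grid assigned to the CPU.
double hybrid_time(double f, double t_cpu, double t_gpu, double t_exchange) {
    return std::max(f * t_cpu, (1.0 - f) * t_gpu) + t_exchange;
}

// The two parts balance where f * t_cpu == (1 - f) * t_gpu.
double optimal_cpu_fraction(double t_cpu, double t_gpu) {
    return t_gpu / (t_cpu + t_gpu);
}
```

Slide 19 then corresponds to a large t_cpu: even at the optimal fraction, the hybrid time stays above the GPU-only time once the exchange term is added, so the hybrid never pays off.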

  19. CPU-GPU model graph with a slower CPU: same axes and curves as slide 18; in this case
      the hybrid never beats the pure GPU version.

  20. ~1.1x speedup

  21. (results figure)

  22. (results figure)

  23. ~1.8x speedup

  24. Asynchronous algorithms

  25. TunedSync: review (diagram of data movement between global and shared memory, and of
      synchronization). Each of the T iterations fetches from one global grid into shared
      memory, computes, and writes the result to the other global grid; the two global grids
      alternate roles from iteration to iteration, with a global synchronization in between.

  26. TunedSync: review (diagram of a single iteration). Fetch from global grid 1 into shared
      memory, compute, and write to global grid 2.
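
As we read the two TunedSync diagrams, the driver amounts to one kernel launch per sweep with the two global grids swapping roles; a host-side sketch (reusing the jacobi_tiled sketch from slide 8 and our assumed block shape):

```cuda
// One kernel launch per sweep: launches on the same stream run in issue
// order, so each launch boundary acts as the global synchronization between
// iterations. Assumes N divisible by the tile dimensions (32 x 8 here).
void tuned_sync(float* d_u[2], int N, int T) {
    dim3 block(32, 8), grid(N / 32, N / 8);
    for (int t = 0; t < T; ++t)
        jacobi_tiled<<<grid, block>>>(d_u[t & 1], d_u[(t + 1) & 1], N);
    cudaDeviceSynchronize();
}
```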

  27. Async0 (diagram): fetch from global grid 1 into shared memory once, then compute and
      write within shared memory, repeating α/2 times (α iterations in total) before writing
      back to global grid 2; the outer loop runs T’ iterations.
      - Effective number of iterations = T’ * α
      - Greater than T? More iterations means more FLOPS
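
A sketch of the pattern this diagram describes: one global fetch, α local sweeps inside shared memory, one global write. How the paper's Async0 refreshes ghost cells between the local sweeps is not recoverable from this transcript, so this sketch simply keeps the halo values loaded at the start (edge cells therefore use increasingly stale neighbors, which is the essence of the asynchronous relaxation):

```cuda
#define TILE_X 32
#define TILE_Y 8
#define ALPHA  4     // local sweeps per global round trip (even, as on the slide)

__global__ void jacobi_async0_like(const float* __restrict__ u_old,
                                   float* __restrict__ u_new, int N) {
    // Two shared-memory grids, as slide 28 says Async0 uses.
    __shared__ float buf[2][TILE_Y + 2][TILE_X + 2];

    int gx = blockIdx.x * TILE_X + threadIdx.x + 1;
    int gy = blockIdx.y * TILE_Y + threadIdx.y + 1;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;
    int stride = N + 2;

    // One global fetch: tile plus one-cell halo, replicated into both buffers.
    for (int b = 0; b < 2; ++b) {
        buf[b][ly][lx] = u_old[gy * stride + gx];
        if (threadIdx.x == 0)          buf[b][ly][0]          = u_old[gy * stride + gx - 1];
        if (threadIdx.x == TILE_X - 1) buf[b][ly][TILE_X + 1] = u_old[gy * stride + gx + 1];
        if (threadIdx.y == 0)          buf[b][0][lx]          = u_old[(gy - 1) * stride + gx];
        if (threadIdx.y == TILE_Y - 1) buf[b][TILE_Y + 1][lx] = u_old[(gy + 1) * stride + gx];
    }
    __syncthreads();

    // ALPHA local sweeps, ping-ponging the two shared grids. Async0 keeps a
    // local barrier between sweeps; the halo cells are never refreshed here.
    for (int k = 0; k < ALPHA; ++k) {
        int s = k & 1, d = (k + 1) & 1;
        buf[d][ly][lx] = 0.25f * (buf[s][ly][lx - 1] + buf[s][ly][lx + 1] +
                                  buf[s][ly - 1][lx] + buf[s][ly + 1][lx]);
        __syncthreads();
    }

    // One global write of the tile's interior.
    u_new[gy * stride + gx] = buf[ALPHA & 1][ly][lx];
}
```

The host then launches this kernel T’ times, ping-ponging the two global grids, so the effective iteration count is T’ * α as the slide states.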

  28. How is Async0 different from TunedSync?
      - Reduces the number of global memory accesses
      - Fewer global synchronizations
      - Expect T_eff ≥ T, by a little (but it can’t be less!)
      - Uses 2 shared-memory grids

  29. Motivation for Async1 (same diagram as Async0): do you need so many local syncs?

  30. Async1 (diagram): same structure as Async0, but with fewer local synchronizations between
      the α shared-memory sweeps.

  31. Motivation for Async2 (same diagram): the exchange of ghost cells is not assured anyway,
      so why not get rid of it?

  32. Async2 (diagram): as Async1, but without the ghost-cell exchange during the α
      shared-memory sweeps.

  33. Motivation for Async3 (same diagram): why not have a single shared-memory grid instead
      of two?

  34. Async3 (diagram): a single shared-memory grid, updated in place, with the compute-and-write
      step repeated α/2 times (α iterations in total) over T’ outer iterations.
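
For illustration, a single-shared-grid, in-place variant in the spirit of this diagram, also dropping the per-sweep local barrier as Async1 suggests (this combination and every detail below are our assumptions; the transcript does not preserve the actual kernels):

```cuda
#define TILE_X 32
#define TILE_Y 8

__global__ void jacobi_async3_like(const float* __restrict__ u_old,
                                   float* __restrict__ u_new, int N, int alpha) {
    // Single shared grid, declared volatile so each access really goes to
    // shared memory: without local syncs, a thread may read a neighbor's value
    // from either the previous or the current sweep, and the chaotic
    // relaxation [Chazan & Miranker '69] tolerates whichever it observes.
    __shared__ volatile float tile[TILE_Y + 2][TILE_X + 2];

    int gx = blockIdx.x * TILE_X + threadIdx.x + 1;
    int gy = blockIdx.y * TILE_Y + threadIdx.y + 1;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;
    int stride = N + 2;

    tile[ly][lx] = u_old[gy * stride + gx];
    if (threadIdx.x == 0)          tile[ly][0]          = u_old[gy * stride + gx - 1];
    if (threadIdx.x == TILE_X - 1) tile[ly][TILE_X + 1] = u_old[gy * stride + gx + 1];
    if (threadIdx.y == 0)          tile[0][lx]          = u_old[(gy - 1) * stride + gx];
    if (threadIdx.y == TILE_Y - 1) tile[TILE_Y + 1][lx] = u_old[(gy + 1) * stride + gx];
    __syncthreads();                       // only barrier: make the initial load visible

    for (int k = 0; k < alpha; ++k)        // in-place sweeps, no local synchronization
        tile[ly][lx] = 0.25f * (tile[ly][lx - 1] + tile[ly][lx + 1] +
                                tile[ly - 1][lx] + tile[ly + 1][lx]);

    u_new[gy * stride + gx] = tile[ly][lx];
}
```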

  35. (results figure)

  36. (results figure)

  37. Conclusion
      - Extensive (automatable) tuning is needed to achieve near-peak performance, even for a
        simple kernel
      - Simple performance models can guide CPU-GPU and Multi-GPU designs
      - “Fast and loose” asynchronous algorithms yield non-trivial speedups on the GPU

  38. Future work
      - Extension of the chaotic relaxation technique to other domains
      - Extending the Multi-GPU study to GPU clusters
      - Systems that decide “on the go”, based on performance models, whether CPU-GPU or
        Multi-GPU execution will pay off
      - Automatic “asynchronous” code generator for “arbitrary” iterative methods

  39. Thank you!
