Tuned and Wildly Asynchronous Stencil Kernels for Heterogeneous CPU/GPU Systems
Sundaresan Venkatasubramanian, Prof. Richard Vuduc
Georgia Tech
International Conference on Supercomputing, June 10, 2009
Motivation and Goals
- Regular, memory-bandwidth-bound kernel
- GPU with high memory bandwidth
- Seems simple, but...
  - Tuning: constraints on parallelism and data movement
  - Real systems mix CPU-like and GPU-like processors
- Goal of this paper: solve the mapping problem
Key ideas and results
- Hand-tuned implementations to understand how to map the kernel: 98% of empirical streaming bandwidth
- Model-driven hybrid CPU/GPU implementation
- Asynchronous algorithms to avoid global syncs [Chazan & Miranker '69]
- 1.2x to 2.5x speedups, even while performing up to 1.7x the flops!
Problem
To solve Poisson's equation in 2-D on a square grid:

$\nabla^2 u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x, y)$

Centered finite-difference approximation on an (N+2) × (N+2) grid with step size h:

$\frac{U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1} - 4\,U_{i,j}}{h^2} = f_{i,j}$
Algorithm
- Memory bandwidth bound
- Used as a subroutine in more complex algorithms, e.g., multigrid

for t = 1, 2, 3, ..., T do
  for all unknowns in the grid do  (embarrassingly parallel)
    $U^{t+1}_{i,j} = \tfrac{1}{4}\,\bigl(U^{t}_{i+1,j} + U^{t}_{i-1,j} + U^{t}_{i,j-1} + U^{t}_{i,j+1}\bigr)$
  end for
end for

- 2 copies of the grid
- Implicit global sync required between iterations
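To make the scheme concrete, here is a minimal CUDA sketch of one synchronous sweep; the kernel name, launch geometry, and single-precision padded layout are illustrative assumptions, not the paper's code.

```cuda
// Minimal sketch of the synchronous Jacobi sweep (illustrative).
// u_old and u_new are (N+2)x(N+2) row-major grids whose outermost
// rows/columns hold the fixed boundary; unknowns live at indices 1..N.
__global__ void jacobi_step(const float *u_old, float *u_new, int N)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y + 1;  // row index, 1..N
    int j = blockIdx.x * blockDim.x + threadIdx.x + 1;  // column index, 1..N
    if (i <= N && j <= N) {
        int w = N + 2;  // padded row width
        u_new[i * w + j] = 0.25f * (u_old[(i + 1) * w + j] +
                                    u_old[(i - 1) * w + j] +
                                    u_old[i * w + (j - 1)] +
                                    u_old[i * w + (j + 1)]);
    }
}

// Host loop over the T iterations. The two device grids swap roles each
// sweep; the kernel-launch boundary supplies the implicit global sync.
void run_jacobi(float *d_a, float *d_b, int N, int T)
{
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    for (int t = 0; t < T; ++t) {
        jacobi_step<<<grid, block>>>(d_a, d_b, N);
        float *tmp = d_a; d_a = d_b; d_b = tmp;  // swap the two grid copies
    }
    cudaDeviceSynchronize();  // wait for the final sweep
}
```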
CPU and GPU Baselines
Tuned CPU Baseline*
- SIMD vectorization
- Parallelized using pthreads
- Thread binding on NUMA architectures
*See paper for details
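As an illustration of the SIMD piece only, a host-side sketch of one row of the sweep using SSE intrinsics; this is an assumed vectorization strategy, and the paper's baseline additionally layers pthreads parallelism and NUMA-aware binding on top.

```cuda
// Host-side SSE sketch of one row of the sweep (illustrative assumption
// of the vectorization strategy; single precision).
#include <xmmintrin.h>

void sweep_row_sse(const float *up, const float *row, const float *down,
                   float *out, int N)
{
    const __m128 quarter = _mm_set1_ps(0.25f);
    // Four unknowns per step; unaligned loads absorb the +/-1 column shifts.
    int j = 1;
    for (; j + 3 <= N; j += 4) {
        __m128 sum = _mm_add_ps(_mm_loadu_ps(&up[j]), _mm_loadu_ps(&down[j]));
        sum = _mm_add_ps(sum, _mm_loadu_ps(&row[j - 1]));  // west neighbors
        sum = _mm_add_ps(sum, _mm_loadu_ps(&row[j + 1]));  // east neighbors
        _mm_storeu_ps(&out[j], _mm_mul_ps(sum, quarter));
    }
    for (; j <= N; ++j)  // scalar remainder when N is not a multiple of 4
        out[j] = 0.25f * (up[j] + down[j] + row[j - 1] + row[j + 1]);
}
```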
Tuned GPU Baseline*
- Exploit shared memory
- Bank conflicts: choose the access pattern carefully
- Non-coalesced accesses: fix with padding
- Loop unrolling
- Occupancy: choose a proper block size
*See paper for details
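A sketch of how these points might combine in a shared-memory tiled kernel; the tile size, the +1 column of padding against bank conflicts, and the halo-loading scheme are illustrative choices, not necessarily the paper's.

```cuda
// Shared-memory tiled sweep (illustrative). Each block stages a TILE x TILE
// tile plus a one-cell halo; padding the tile rows by one float is one
// standard way to avoid shared-memory bank conflicts.
#define TILE 16

__global__ void jacobi_tiled(const float *u_old, float *u_new, int N)
{
    __shared__ float tile[TILE + 2][TILE + 2 + 1];  // +1 column of padding
    int w  = N + 2;
    int i  = blockIdx.y * TILE + threadIdx.y + 1;
    int j  = blockIdx.x * TILE + threadIdx.x + 1;
    int ty = threadIdx.y + 1, tx = threadIdx.x + 1;

    // Coalesced load of the tile; edge threads also fetch the halo cells.
    if (i <= N + 1 && j <= N + 1) {
        tile[ty][tx] = u_old[i * w + j];
        if (ty == 1)              tile[0][tx]        = u_old[(i - 1) * w + j];
        if (ty == TILE && i <= N) tile[TILE + 1][tx] = u_old[(i + 1) * w + j];
        if (tx == 1)              tile[ty][0]        = u_old[i * w + j - 1];
        if (tx == TILE && j <= N) tile[ty][TILE + 1] = u_old[i * w + j + 1];
    }
    __syncthreads();  // the whole tile (plus halo) is now in shared memory

    if (i <= N && j <= N)
        u_new[i * w + j] = 0.25f * (tile[ty - 1][tx] + tile[ty + 1][tx] +
                                    tile[ty][tx - 1] + tile[ty][tx + 1]);
}
```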
Experimental set-up
Results
- Device: NVIDIA Tesla C870
- Grid size: 4096
- No. of iterations: 32
Results
- 98% of empirical streaming bandwidth
- Approx. 37 GFLOPS
- 66% of true peak
- Half the work of single: approx. 17 GFLOPS
CPU/GPU and Multi-GPU Methods
CPU-GPU implementation
CPU-GPU implementation
- Need to exchange ghost cells between the CPU and GPU partitions at each iteration
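One plausible host-side sketch of that exchange, assuming a row-wise split of the grid at row `split`; the function name and split scheme are our illustration, not the paper's code.

```cuda
// Illustrative: the grid is split by rows, the CPU owning rows [1, split]
// and the GPU owning rows [split+1, N]. After each iteration, the two rows
// adjacent to the split are exchanged as ghost cells.
void exchange_ghost_rows(float *h_grid, float *d_grid, int N, int split)
{
    int w = N + 2;                        // padded row width
    size_t row_bytes = w * sizeof(float);
    // The CPU's last owned row becomes the GPU's ghost row...
    cudaMemcpy(d_grid + split * w, h_grid + split * w,
               row_bytes, cudaMemcpyHostToDevice);
    // ...and the GPU's first owned row becomes the CPU's ghost row.
    cudaMemcpy(h_grid + (split + 1) * w, d_grid + (split + 1) * w,
               row_bytes, cudaMemcpyDeviceToHost);
}
```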
Algorithm
CPU-GPU Model graph
[Model graph: y-axis is the fraction of the baseline CPU-only time (0-100%); x-axis is the fraction of the grid assigned to the CPU (0-100%). The hybrid curve combines a GPU part, a CPU part, and the exchange time; it dips below both the baseline CPU-only and baseline GPU-only lines, and its minimum marks the optimal fraction.]
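Reading the graph as a formula is a sketch on our part, under the assumption that the CPU and GPU parts execute concurrently: with f the fraction assigned to the CPU, T_c and T_g the baseline CPU-only and GPU-only times, and T_x the exchange time,

```latex
% Sketch of the model the graph suggests (assuming concurrent execution).
\begin{align*}
T_{\text{hybrid}}(f) &= \max\bigl(f\,T_c,\ (1-f)\,T_g\bigr) + T_x, \\
f^{*}\,T_c &= (1-f^{*})\,T_g
  \;\Longrightarrow\;
  f^{*} = \frac{T_g}{T_c + T_g}.
\end{align*}
```

That is, the optimal fraction balances the two parts so they finish together.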
CPU-GPU – Slower CPU
[Same model graph with a slower CPU: the hybrid curve stays above the baseline GPU-only line everywhere.]
- The hybrid never beats the pure GPU version
~1.1x speedup
~1.8x speedup
Asynchronous algorithms
TunedSync - Review
[Pipeline diagram: in each of the T iterations, a block fetches its tile from global grid 1 into shared memory, computes, and writes the result to global grid 2; the two global grids swap roles between iterations.]
- Data moves between global and shared memory every iteration
- Global synchronization after every iteration
TunedSync - Review
[Diagram detail: fetch from global grid 1 → compute in shared memory → write to global grid 2.]
Async0
[Diagram: fetch from global grid 1 into shared memory once; repeat the compute-and-write pair α/2 times (α local iterations in total); write back to global grid 2. The outer loop runs T' iterations.]
- Effective number of iterations = T' · α
- Greater than T? More iterations → more FLOPS
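A CUDA sketch of the Async0 idea; ALPHA, the buffer handling, and the double halo load are our simplifications. The key property is that halo values from neighboring tiles go stale between global round trips, which is what makes the method asynchronous in the chaotic-relaxation sense.

```cuda
// Async0 sketch (illustrative): one global fetch, ALPHA local iterations
// entirely in shared memory with block-level syncs only, one global write.
#define TILE  16
#define ALPHA 8   // local iterations per global round trip (assumed even)

__global__ void async0_sweep(const float *u_old, float *u_new, int N)
{
    __shared__ float s[2][TILE + 2][TILE + 2];  // the two shared-memory grids
    int w  = N + 2;
    int i  = blockIdx.y * TILE + threadIdx.y + 1;
    int j  = blockIdx.x * TILE + threadIdx.x + 1;
    int ty = threadIdx.y + 1, tx = threadIdx.x + 1;

    // Load tile + halo into BOTH buffers (the halo stays fixed thereafter;
    // a tuned version would touch global memory only once per cell).
    for (int b = 0; b < 2; ++b) {
        if (i <= N + 1 && j <= N + 1) {
            s[b][ty][tx] = u_old[i * w + j];
            if (ty == 1)              s[b][0][tx]        = u_old[(i - 1) * w + j];
            if (ty == TILE && i <= N) s[b][TILE + 1][tx] = u_old[(i + 1) * w + j];
            if (tx == 1)              s[b][ty][0]        = u_old[i * w + j - 1];
            if (tx == TILE && j <= N) s[b][ty][TILE + 1] = u_old[i * w + j + 1];
        }
    }
    __syncthreads();

    int src = 0;
    for (int k = 0; k < ALPHA; ++k) {           // ALPHA iterations in total
        if (i <= N && j <= N)
            s[1 - src][ty][tx] = 0.25f * (s[src][ty - 1][tx] + s[src][ty + 1][tx] +
                                          s[src][ty][tx - 1] + s[src][ty][tx + 1]);
        __syncthreads();                        // local (block-level) sync only
        src = 1 - src;
    }

    if (i <= N && j <= N)
        u_new[i * w + j] = s[src][ty][tx];      // single global write-back
}
```

The host launches this kernel T' times, so the effective iteration count is T' · α.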
How is Async0 different from TunedSync?
- Reduces the number of global memory accesses
- Fewer global synchronizations
- Expect T_eff ≥ T, by a little (but it can't be less!)
- Uses 2 shared-memory grids
Motivation for Async1
[Same diagram as Async0: one fetch, α local iterations in shared memory, one write-back, over T' outer iterations.]
- Do you really need so many local syncs?
Async1
[Same structure as Async0, but with fewer block-level (local) synchronizations in the inner loop.]
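In terms of the Async0 sketch above, Async1's change is to the inner loop only; this is our reading of the slide, and the race it introduces is exactly what chaotic relaxation tolerates.

```cuda
// Async1 sketch: the same inner loop as in the Async0 sketch, with the
// block-level synchronization removed. A thread may now read a neighbor's
// value from the previous or the current local iteration -- a benign race
// under the chaotic-relaxation model [Chazan & Miranker '69].
int src = 0;
for (int k = 0; k < ALPHA; ++k) {
    if (i <= N && j <= N)
        s[1 - src][ty][tx] = 0.25f * (s[src][ty - 1][tx] + s[src][ty + 1][tx] +
                                      s[src][ty][tx - 1] + s[src][ty][tx + 1]);
    // no __syncthreads() here: fewer local syncs, chaotic updates
    src = 1 - src;
}
```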
Motivation for Async2
[Same diagram as Async1.]
- The exchange of ghost cells is not assured anyway. Why not get rid of it?
Async2
[Same structure as Async1, but without the ghost-cell exchange.]
Motivation for Async3
[Same diagram as Async2.]
- Why not have a single shared-memory grid instead of two?
Async3
[Same structure as Async2, but with a single shared-memory grid updated in place.]
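Again relative to the earlier sketches, Async3 collapses the two shared-memory buffers into one and updates it in place; this fragment is our reading of the slide, reusing the names from the Async0 sketch.

```cuda
// Async3 sketch: a single shared-memory grid, updated in place and without
// synchronization. Each thread overwrites its own cell, so neighbors read
// an unpredictable mix of old and new values (Gauss-Seidel-like ordering,
// chosen by the hardware scheduler rather than the program).
__shared__ float s[TILE + 2][TILE + 2];
// ... one global fetch of the tile + halo into s, as in the earlier sketches ...
for (int k = 0; k < ALPHA; ++k) {
    if (i <= N && j <= N)
        s[ty][tx] = 0.25f * (s[ty - 1][tx] + s[ty + 1][tx] +
                             s[ty][tx - 1] + s[ty][tx + 1]);
    // no sync and no second buffer: halves the shared-memory footprint
}
// ... one global write-back of the tile interior to u_new ...
```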
Conclusion
- Extensive (automatable) tuning is needed to achieve near-peak performance, even for a simple kernel
- Simple performance models can guide CPU-GPU and multi-GPU designs
- "Fast and loose" asynchronous algorithms yield non-trivial speedups on the GPU
Future work
- Extend the chaotic-relaxation technique to other domains
- Extend the multi-GPU study to GPU clusters
- Systems that decide "on the go" whether CPU-GPU or multi-GPU execution will pay off, based on performance models
- An automatic "asynchronous" code generator for "arbitrary" iterative methods
Thank you!