Tuned and Wildly Asynchronous Stencil Kernels for Heterogeneous CPU/GPU Systems
Sundaresan Venkatasubramanian and Prof. Richard Vuduc, Georgia Tech
Int'l Conference on Supercomputing, June 10, 2009
Motivation and Goals
- Regular, memory bandwidth-bound kernel
- GPUs offer high memory bandwidth
- Seems simple, but: tuning, plus constraints on parallelism and data movement
- Real systems mix CPU-like and GPU-like processors
- Goal of this paper: solve the mapping problem
Key ideas and results
- Hand-tuned implementations to understand the mapping: 98% of empirical streaming bandwidth
- Model-driven hybrid CPU/GPU implementation
- Asynchronous algorithms to avoid global syncs [Chazan & Miranker '69]
- 1.2x to 2.5x speedup, even while performing up to 1.7x the flops!
Problem
- Solve Poisson's equation in 2-D on a square grid
- Centered finite-difference approximation on an (N+2) x (N+2) grid with step size h
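The equations on this slide were rendered as images and did not survive extraction; the standard forms they refer to are sketched below (usual notation, with f the right-hand side):

```latex
% Poisson's equation in 2-D:
\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x, y)

% Centered finite-difference approximation with step size h:
\frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4\,u_{i,j}}{h^2} = f_{i,j}
```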
Algorithm
- Memory bandwidth bound
- Used as a subroutine in more complex kernels and algorithms, e.g., multigrid

for t = 1, 2, ..., T do
  for all unknowns (i,j) in the grid do   (embarrassingly parallel)
    U^{t+1}_{i,j} = 1/4 * (U^t_{i+1,j} + U^t_{i-1,j} + U^t_{i,j-1} + U^t_{i,j+1})
  end for
end for

- Requires 2 copies of the grid
- Implicit global sync required between iterations
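The sweep above can be sketched in plain Python/NumPy. This is a minimal sketch of the two-grid Jacobi update, assuming a zero right-hand side and fixed boundary values (the paper's actual kernels are tuned C/CUDA, not this code):

```python
import numpy as np

def jacobi_sweeps(u, T):
    """Run T Jacobi sweeps on grid u (boundary rows/cols held fixed).

    Two copies of the grid are kept, matching the algorithm on the slide:
    each sweep reads one copy and writes the other, so all interior
    updates within a sweep are independent (embarrassingly parallel).
    """
    cur = u.copy()
    nxt = u.copy()
    for _ in range(T):
        nxt[1:-1, 1:-1] = 0.25 * (cur[2:, 1:-1] + cur[:-2, 1:-1] +
                                  cur[1:-1, 2:] + cur[1:-1, :-2])
        cur, nxt = nxt, cur  # swap grid roles; the implicit global sync sits here
    return cur
```

With fixed boundary values of 1 and a zero interior, repeated sweeps relax the interior toward the boundary value, which makes the two-grid structure easy to check.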
CPU and GPU Baselines
Tuned CPU Baseline*
- SIMD vectorization
- Parallelized using pthreads
- Thread binding on NUMA architectures
*See paper for details
Tuned GPU Baseline*
- Exploit shared memory
- Bank conflicts: access pattern
- Non-coalesced accesses: padding
- Loop unrolling
- Occupancy: proper block size
*See paper for details
Experimental set-up
Results
- Device: NVIDIA Tesla C870
- Grid size: 4096
- No. of iterations: 32
Results (figure)
- 98% of empirical streaming bandwidth
- Approx. 37 GFLOPS
- 66% of true peak
- Half the work of single precision: approx. 17 GFLOPS
CPU/GPU and Multi-GPU Methods
CPU-GPU implementation
CPU-GPU implementation (diagram)
- Need to exchange boundary (ghost) cells between the CPU and GPU partitions
Algorithm
CPU-GPU model (graph)
- x-axis: fraction of the grid assigned to the CPU (0% to 100%)
- y-axis: fraction of baseline CPU-only time (0% to 100%)
- Curves: baseline CPU-only, baseline GPU-only, and hybrid time = GPU part + CPU part + exchange time
- The hybrid curve dips below both baselines at the optimal fraction
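The shape of this graph can be captured by a small performance model, sketched below. The time constants here are made-up placeholders, not measured values: CPU and GPU work concurrently on their fractions, so the hybrid time is the maximum of the two partial times plus the exchange cost.

```python
def hybrid_time(f, t_cpu, t_gpu, t_exch):
    """Predicted time when a fraction f of the grid is assigned to the CPU.

    CPU and GPU run concurrently, so total time is the slower of the two
    partial times, plus the boundary-exchange cost.
    """
    return max(f * t_cpu, (1.0 - f) * t_gpu) + t_exch

def optimal_fraction(t_cpu, t_gpu, t_exch, steps=1000):
    """Scan f in [0, 1] and return (best_f, best_time)."""
    best_time, best_f = min((hybrid_time(i / steps, t_cpu, t_gpu, t_exch), i / steps)
                            for i in range(steps + 1))
    return best_f, best_time

# Illustrative numbers only: a CPU 10x slower than the GPU, small exchange cost.
f, t = optimal_fraction(t_cpu=10.0, t_gpu=1.0, t_exch=0.05)
```

With these placeholder constants the optimum lands near f = t_gpu / (t_cpu + t_gpu), giving the hybrid a small edge over the pure-GPU baseline, consistent with the modest speedups the deck reports.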
CPU-GPU model, slower CPU (graph)
- Same axes and curves as the previous graph
- With a slower CPU, the hybrid never beats the pure GPU version
~1.1x speedup
~1.8x speedup
Asynchronous algorithms
TunedSync - Review (diagram)
- Each iteration: fetch from global grid 1 into shared memory, compute, and write back to global grid 2; the two global grids swap roles each iteration
- Global synchronization between each of the T iterations, so data moves through global memory every iteration
TunedSync - Review (diagram, detail)
- One step: fetch from global grid 1 into shared memory, compute, write to global grid 2
Async0 (diagram)
- Fetch from global grid 1 into shared memory; compute and write locally, repeating α/2 times (α iterations in total); then write back to global grid 2
- Outer loop runs T' iterations
- Effective number of iterations = T' * α
- Greater than T? More iterations means more FLOPS
How is Async0 different from TunedSync?
- Reduces the number of global memory accesses
- Fewer global synchronizations
- Expect T_eff ≥ T, by a little (but it can't be less!)
- Uses 2 shared-memory grids
Motivation for Async1 (diagram as in Async0)
- Do you really need so many local syncs?
Async1 (diagram)
- Same structure as Async0: fetch from global grid 1, compute and write in shared memory, repeating α/2 times (α iterations total), then write to global grid 2
- Fewer local synchronizations between the local iterations
- Outer loop runs T' iterations
Motivation for Async2 (diagram as in Async0)
- The exchange of ghost cells is not assured anyway; why not get rid of it?
Async2 (diagram)
- Same structure as Async1, but without the ghost-cell exchange
- Outer loop runs T' iterations
Motivation for Async3 (diagram as in Async2)
- Why not have a single shared-memory grid instead of two?
Async3 (diagram)
- A single shared-memory grid: fetch from global grid 1, compute and write in place, repeating α/2 times (α iterations total), then write to global grid 2
- Outer loop runs T' iterations
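Why these increasingly relaxed schemes can still converge follows from chaotic relaxation [Chazan & Miranker '69]: updates may read stale neighbor values, within bounds, without destroying convergence. Below is a minimal block-asynchronous Jacobi sketch in NumPy; the block layout, α, and zero right-hand side are illustrative assumptions, not the paper's actual kernels.

```python
import numpy as np

def block_async_jacobi(u, outer_T, alpha, block=4):
    """Block-asynchronous Jacobi (updates u in place and returns it).

    Each block runs `alpha` local sweeps against a frozen snapshot of its
    neighbors (stale ghost cells), exchanging with the rest of the grid
    only once per outer iteration, mimicking the Async schemes: many
    local iterations per global synchronization.
    """
    n = u.shape[0]
    for _ in range(outer_T):
        snapshot = u.copy()  # stale values seen by every block this round
        for bi in range(1, n - 1, block):
            for bj in range(1, n - 1, block):
                i1 = min(bi + block, n - 1)
                j1 = min(bj + block, n - 1)
                local = snapshot.copy()
                for _ in range(alpha):  # alpha local sweeps, no global sync
                    local[bi:i1, bj:j1] = 0.25 * (
                        local[bi - 1:i1 - 1, bj:j1] + local[bi + 1:i1 + 1, bj:j1] +
                        local[bi:i1, bj - 1:j1 - 1] + local[bi:i1, bj + 1:j1 + 1])
                u[bi:i1, bj:j1] = local[bi:i1, bj:j1]
    return u
```

Even though every block's ghost cells are stale for α sweeps, the iteration still relaxes toward the same fixed point as synchronous Jacobi, which is the property the Async variants trade extra flops for.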
Conclusion
- Extensive (automatable) tuning is needed to achieve near-peak performance, even for a simple kernel
- Simple performance models can guide CPU-GPU and multi-GPU designs
- "Fast and loose" asynchronous algorithms yield non-trivial speedups on the GPU
Future work
- Extend the chaotic relaxation technique to other domains
- Extend the multi-GPU study to GPU clusters
- Systems that decide "on the go" whether CPU-GPU or multi-GPU execution will pay off, based on performance models
- Automatic "asynchronous" code generator for "arbitrary" iterative methods
Thank you!