Memory Access Patterns: The Missing Piece of the Multi-GPU Puzzle
Tal Ben-Nun, Ely Levy, Amnon Barak and Eri Rubin
The Hebrew University of Jerusalem, Israel
Supercomputing '15, November 2015
Introduction
• Developing efficient parallel algorithms for GPUs is challenging
• Memory I/O recurs as the bottleneck
• Code clutter caused by device and memory management
• Current programming models for multi-GPU nodes are often:
  • Insufficient for specific programming needs
  • Overly complex
  • Hard to debug
[Figure: a host node with CPU 1, CPU 2, RAM, and GPUs 1–4 connected over PCI-Express]
Case Study – The Game of Life
• Famous cellular automaton
• Each cell requires its 3×3 neighborhood to compute the next generation
• On GPUs, each thread computes one (or several) cells
• Similar to stencil operators, the Jacobi method and many more
[Figure: input and output grids, one generation apart]
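The rules above can be sketched on the CPU as follows. This is a minimal reference implementation with wrapped (toroidal) boundaries, not the GPU kernel from the slides, and all names are illustrative.

```cpp
#include <cassert>
#include <vector>

// One Game of Life generation over a width x height grid with wrapped
// (toroidal) boundaries. Row-major layout, one int (0 or 1) per cell.
std::vector<int> GameOfLifeStep(const std::vector<int>& in,
                                int width, int height) {
    std::vector<int> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int neighbors = 0;
            for (int ly = -1; ly <= 1; ++ly) {
                for (int lx = -1; lx <= 1; ++lx) {
                    if (lx == 0 && ly == 0) continue;
                    int ny = (y + ly + height) % height;  // wrap coords
                    int nx = (x + lx + width) % width;
                    neighbors += in[ny * width + nx];
                }
            }
            int live = in[y * width + x];
            // Live cell survives with 2-3 neighbors; dead cell is born
            // with exactly 3; everything else dies or stays dead.
            out[y * width + x] =
                (neighbors == 3 || (live && neighbors == 2)) ? 1 : 0;
        }
    }
    return out;
}
```

The same neighbor loop appears in the GPU thread pseudocode on the next slide, only reading from shared memory instead of the full grid.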
Case Study – The Game of Life (Single GPU)
Thread pseudocode:

    // Load data to shared memory
    smem[tidy * BW + tidx] =
        M[(bidy * BH + tidy) * STRIDE + (bidx * BW + tidx)];
    // Wrap coordinates at the boundaries
    // ...
    __syncthreads();

    neighbors = 0;
    current_gen = smem[tidy * BW + tidx];
    for (int ly = -1; ly <= 1; ++ly) {
        for (int lx = -1; lx <= 1; ++lx) {
            if (lx == 0 && ly == 0) continue;
            neighbors += smem[(ly + tidy) * BW + (lx + tidx)];
        }
    }
    outM[(bidy * BH + tidy) * STRIDE + (bidx * BW + tidx)] = ...;

[Figure: a thread block of width BW and height BH reading a spatially-local window with wrapped boundaries on a single GPU]
Case Study – The Game of Life (Multi-GPU)
[Figure: the grid partitioned row-wise across GPUs 1–4, with boundary exchanges between adjacent GPUs]
Case Study – The Game of Life (Multi-GPU)
Index mapping between per-GPU buffers and node memory (grid split into four row blocks, wrapped boundaries):

    M1(0,0)  = M[STRIDE]
    M1(x,-1) = M4(x, HEIGHT/4) = M[(HEIGHT-1)*STRIDE + x]
    M3(0,0)  = M2(0, HEIGHT/4 + 1) = M[((HEIGHT/4)*2 + 1)*STRIDE]
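The wrap relation in the second line above (a halo row on GPU 1 mapping to the last row of the node buffer) boils down to one wrapped index computation. A hedged sketch, with illustrative function and parameter names (not from MAPS-Multi), assuming HEIGHT divides evenly among the GPUs and local row -1 denotes the top halo row:

```cpp
#include <cassert>

// Maps a per-GPU row index (including halo rows -1 and rows_per_gpu)
// to a row in the node-level buffer, wrapping at the grid boundaries.
int GlobalRow(int gpu, int local_row, int height, int num_gpus) {
    int rows_per_gpu = height / num_gpus;
    int row = gpu * rows_per_gpu + local_row;  // may be -1 or height
    return (row + height) % height;            // wrap (toroidal grid)
}
```

The point of the slide stands: every such mapping is a separate opportunity for an off-by-one error, and the multi-GPU version needs many of them.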
Case Study – The Game of Life (Multi-GPU)
• 3 indexing systems:
  • Node memory
  • Per-GPU memory
  • Shared memory/registers
• Error-prone
  • Index-wise
  • Synchronization-wise
• Difficult to debug/maintain
• Many lines of code
Input Memory Access Patterns

Access Pattern       | Description                                                  | Typical Examples
Block (ND)           | Each thread requires an entire dimension of a buffer         | Matrix multiplication, exact N-body simulation, matrix transposition
Window (ND)          | Each thread-block requires a spatially-local N-dimensional window | ND convolution, Jacobi method, stencil operators
Adjacency            | Sporadic access of a dense data structure with a fixed pattern | SpMV, cloth simulation
Traversal (BFS, DFS) | Thread operates on neighbors of a vertex                     | Barnes-Hut N-body algorithm
Permutation          | Thread-block operates on a permutation of the original data  | Fast Fourier transform
Irregular            | Patterns that cannot be determined in advance                | Finite state machines
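For the Window pattern in particular, a row-wise partitioner must extend each GPU's segment by the window radius on both sides, which is exactly the halo-row bookkeeping from the Game of Life slides. A minimal sketch, assuming an even row split; the function name and signature are hypothetical:

```cpp
#include <cassert>
#include <utility>

// For a 2D window of radius r over a height-row grid split row-wise
// across num_gpus devices, returns the [first, last] global rows
// (inclusive, unwrapped) that GPU `gpu` must hold, including halos.
// Negative rows / rows >= height would wrap on a toroidal grid.
std::pair<int, int> SegmentWithHalo(int gpu, int height,
                                    int num_gpus, int r) {
    int rows = height / num_gpus;
    return {gpu * rows - r, (gpu + 1) * rows - 1 + r};
}
```
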
Output Memory Access Patterns
• Based on all possible mappings between the number of threads (n) and the number of outputs per buffer:
  • Structured Injective – n outputs at predictable indices
  • Unstructured Injective – O(n) outputs
  • Reductive Static – m < n outputs, known in advance
  • Reductive Dynamic – m < n outputs, determined at runtime
  • Irregular – unpredictable
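The injective/reductive distinction can be illustrated with two small CPU functions (illustrative names, not the MAPS-Multi API): a structured-injective computation writes exactly one output per input at a predictable index, while a static-reductive one combines n inputs into a fixed m < n outputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structured injective: output i depends only on input i (n -> n),
// so the output can be partitioned exactly like the input.
std::vector<int> SquareAll(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * in[i];
    return out;
}

// Reductive (static): n inputs are combined into m < n outputs, where
// m is known before execution; outputs must be merged across devices.
std::vector<int> SumIntoBins(const std::vector<int>& in, int m) {
    std::vector<int> out(m, 0);
    for (std::size_t i = 0; i < in.size(); ++i) out[i % m] += in[i];
    return out;
}
```

A dynamic-reductive output would differ in that the bin count or bin assignment only becomes known at runtime.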
MAPS-Multi
• An automatic multi-GPU task partitioning framework:
  • By expressing the input/output access patterns of each task, it automatically segments and copies memory
  • Based on concepts from the Partitioned Global Address Space (PGAS) model
• No source-to-source compilation or other intrusive actions
• Header-only, standard C++11 (over CUDA) library
• Can work in conjunction with other systems (e.g. MPI) and device-level libraries (e.g. CUBLAS)
Framework Components
• Host-level infrastructure:
  • Task – datum, access pattern, kernel, constants, dimensions
  • Multi-device input/output containers – datum, access pattern, segmenters
  • Scheduler – block scheduler, segment location monitor, memory allocator, invoker threads, input/output controllers
• Device-level infrastructure (per GPU 1..n):
  • Input/output containers – datum, access pattern, aggregators
  • Shared containers, device-level iterators, warp analyzer
  • Global memory
[Figure: host-level components feeding a multiple-device abstraction across GPUs 1..n]
Game of Life Code Sample

Host Code:

    Scheduler sched;
    typedef Window2D<T,1,WRAP,ILPX,ILPY> Win2D;
    typedef StructuredInjective<T,2,ILPX,ILPY> SMat;

    // Define data structures to be used
    Matrix<T> A(width, height), B(width, height);

    // Use existing host buffers as matrices
    A.Bind(hbuffer_A);
    B.Bind(hbuffer_B);

    // Analyze memory access patterns for allocation
    sched.AnalyzeCall(Win2D(A), SMat(B));
    sched.AnalyzeCall(Win2D(B), SMat(A));

    // Invoke the kernels
    for (int i = 0; i < iterations; ++i)
        sched.Invoke(GameOfLifeTick,
                     Win2D((i % 2) ? B : A),
                     SMat((i % 2) ? A : B));

    // Gather processed data back to host
    if ((iterations % 2) == 0) sched.Gather(A);
    else                       sched.Gather(B);

Device Code:

    template <typename T, int ILPX, int ILPY>
    __global__ void GameOfLifeTick MAPS_MULTIDEF (
        Window2D<T,1,WRAP,ILPX,ILPY> current_gen,
        StructuredInjective<T,2,ILPX,ILPY> next_gen)
    {
        MAPS_MULTI_INIT(current_gen, next_gen);

        #pragma unroll
        MAPS_FOREACH(nextgen_iter, next_gen) {
            int live_neighbors = 0, is_live = 0;
            #pragma unroll
            MAPS_FOREACH_ALIGNED(iter, current_gen, nextgen_iter) {
                // Set variables according to the rules
                if (iter.index() == 4) is_live = *iter;
                else live_neighbors += *iter;
            }
            int result = GameOfLifeConditions(...);
            *nextgen_iter = result;
        }
        next_gen.commit();
    }
Code Sample – Host

    Scheduler sched;
    typedef Window2D<T,1,WRAP,ILPX,ILPY> Win2D;
    typedef StructuredInjective<T,2,ILPX,ILPY> SMat;

    // Define data structures to be used
    Matrix<T> A(width, height), B(width, height);

    // Use existing host buffers as matrices
    A.Bind(hbuffer_A);
    B.Bind(hbuffer_B);

    // Analyze memory access patterns for allocation
    sched.AnalyzeCall(Win2D(A), SMat(B));
    sched.AnalyzeCall(Win2D(B), SMat(A));

    // Invoke the kernels
    for (int i = 0; i < iterations; ++i)
        sched.Invoke(GameOfLifeTick,
                     Win2D((i % 2) ? B : A),
                     SMat((i % 2) ? A : B));

    // Gather processed data back to host
    if ((iterations % 2) == 0) sched.Gather(A);
    else                       sched.Gather(B);
Code Sample – Host
[Figure: the same host code, with the Scheduler distributing the Invoke calls across GPUs 1–4]