

  1. Parallelization Strategies
     ASD Accelerator HPC Workshop
     Computer Systems Group, Research School of Computer Science
     Australian National University, Canberra, Australia
     May 01, 2019

  2. Issues to Consider When Parallelizing a Problem

     Given a serial solution to the problem:
     - which loops are data parallel? (see the sketch below)
     - how is the data laid out in memory?
     - what are the data dependencies?
     - how would thread block sizes and shape affect performance?

     When the first parallel solution has been made:
     - what are the main issues limiting performance?
     - how can these be mitigated?
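     As an illustration of the first question: a loop whose iterations are independent is data parallel and maps directly onto one GPU thread per element. A minimal sketch (the vector-scale example is hypothetical, not from the slides):

     // serial loop: iterations are independent, hence data parallel
     void scale(float *a, float s, int n) {
       for (int i = 0; i < n; i++)
         a[i] = s * a[i];
     }

     // CUDA equivalent: one thread per element, a bounds check replaces the loop
     __global__ void scale_k(float *a, float s, int n) {
       int i = threadIdx.x + blockDim.x * blockIdx.x;
       if (i < n)
         a[i] = s * a[i];
     }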

  3. Outline

     1 Dynamic Parallelism – Mandelbrot Set Revisited
     2 Stencil Computations
     3 Dynamic Programming – The Knapsack Problem

  4. Adaptive Parallelism

     (code and graphics in this section are from https://devblogs.nvidia.com/introduction-cuda-dynamic-parallelism by Andy Adinets)

     - many computations (e.g. adaptive mesh refinement, AMR) require more work in certain areas
     - CUDA 5.0 introduced Dynamic Parallelism to support this: 'coarse-grain' kernels can invoke 'finer-grain' kernels
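     A minimal sketch of what this looks like in code (the kernels and sizes are hypothetical, not from the slides); with dynamic parallelism, the launch syntax inside a kernel is the same as on the host:

     // child ('finer-grain') kernel
     __global__ void fine_k(int *out) {
       out[threadIdx.x] = threadIdx.x;   // caller must allocate at least 32 ints
     }

     // parent ('coarse-grain') kernel: one thread launches a child grid
     __global__ void coarse_k(int *out) {
       if (threadIdx.x == 0)
         fine_k<<<1, 32>>>(out);
     }

     (this requires compiling with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true)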

  5. Naive GPU Solution to the Mandelbrot Set

     - the 'Escape Time' algorithm is based on computing the dwell: the number of iterations of z ← z² + c, at each pixel (x,y) in a w × h image, before |z| exceeds 2 (capped at MAX_DWELL)
     - cmin (cmax) are the bottom-left (top-right) image corners

     __host__ __device__ int pixel_dwell(int w, int h, complex cmin, complex cmax, int x, int y) {
       complex dc = cmax - cmin;
       float fx = (float)x / w, fy = (float)y / h;
       complex c = cmin + complex(fx * dc.re, fy * dc.im);
       complex z = c;
       int dwell = 0;
       while (dwell < MAX_DWELL && abs2(z) < 2 * 2) {
         z = z * z + c;
         dwell++;
       }
       return dwell;
     }
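     The snippet assumes a device-side complex type and an abs2() helper, which are not shown on the slide; a minimal sketch consistent with the calls above (the MAX_DWELL value is an assumption):

     #define MAX_DWELL 512   // assumed iteration cap

     struct complex {
       float re, im;
       __host__ __device__ complex(float r = 0, float i = 0) : re(r), im(i) {}
     };
     __host__ __device__ complex operator+(complex a, complex b)
       { return complex(a.re + b.re, a.im + b.im); }
     __host__ __device__ complex operator-(complex a, complex b)
       { return complex(a.re - b.re, a.im - b.im); }
     __host__ __device__ complex operator*(complex a, complex b)   // complex multiply
       { return complex(a.re * b.re - a.im * b.im, a.im * b.re + a.re * b.im); }
     __host__ __device__ float abs2(complex a)                     // |z|^2, avoids sqrt
       { return a.re * a.re + a.im * a.im; }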

  6. Naive GPU Solution to the Mandelbrot Set (II)

     the kernel can be simply expressed with the help of pixel_dwell():

     __global__ void mandelbrot_k(int *dwells, int w, int h, complex cmin, complex cmax) {
       int x = threadIdx.x + blockDim.x * blockIdx.x;
       int y = threadIdx.y + blockDim.y * blockIdx.y;
       if (x < w && y < h)
         dwells[y * w + x] = pixel_dwell(w, h, cmin, cmax, x, y);
     }
     ...
     // kernel launch
     int w = 4096, h = 4096;
     dim3 bs(64, 4), grid(divup(w, bs.x), divup(h, bs.y));
     mandelbrot_k<<<grid, bs>>>(d_dwells, w, h, complex(-1.5, -1), complex(0.5, 1));

     although embarrassingly parallel, it suffers from load imbalance within a thread block, as pixels may have differing dwell values (also, there are large areas of constant dwell that need not be computed point by point)
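     divup() is not defined on the slide; it is assumed to be ceiling division, so that the grid covers the whole image even when w or h is not a multiple of the block size:

     __host__ __device__ int divup(int x, int y) { return (x + y - 1) / y; }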

  7. The Mariani-Silver Algorithm

     solves the above problems by using recursive subdivision in areas of non-constant dwell:

     mariani_silver(rectangle)
       if (border(rectangle) has common dwell)
         fill rectangle with common dwell
       else if (rectangle size < threshold)
         per-pixel evaluation of the rectangle
       else
         for each sub_rectangle in subdivide(rectangle)
           mariani_silver(sub_rectangle)

  8. CUDA Implementation of the Mariani-Silver Algorithm

     __global__ void mandelbrot_block_k(int *dwells, int w, int h, complex cmin, complex cmax,
                                        int x0, int y0, int d, int depth) {
       x0 += d * blockIdx.x, y0 += d * blockIdx.y;
       int common_dwell = border_dwell(w, h, cmin, cmax, x0, y0, d);
       if (threadIdx.x == 0 && threadIdx.y == 0) {
         if (common_dwell != DIFF_DWELL) {  // uniform dwell, just fill
           dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
           dwell_fill<<<grid, bs>>>(dwells, w, x0, y0, d, common_dwell);
         } else if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE) {  // subdivide recursively
           dim3 bs(blockDim.x, blockDim.y), grid(SUBDIV, SUBDIV);
           mandelbrot_block_k<<<grid, bs>>>
             (dwells, w, h, cmin, cmax, x0, y0, d / SUBDIV, depth + 1);
         } else {  // leaf, per-pixel kernel
           dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
           mandelbrot_pixel_k<<<grid, bs>>>
             (dwells, w, h, cmin, cmax, x0, y0, d);
         }
       }
     }
     ...
     int width = 8192, height = 8192;
     mandelbrot_block_k<<<dim3(I_SUBDIV, I_SUBDIV), dim3(BSX, BSY)>>>
       (dwells, width, height, complex(-1.5, -1), complex(0.5, 1), 0, 0, width / I_SUBDIV, 1);
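     The helper kernels and constants are not shown on the slide. A minimal sketch of dwell_fill consistent with the launch above (the DIFF_DWELL value and the kernel body are assumptions; the blog post's actual implementation may differ):

     #define DIFF_DWELL (-1)   // assumed marker returned by border_dwell for a non-uniform border

     __global__ void dwell_fill(int *dwells, int w, int x0, int y0, int d, int dwell) {
       int x = threadIdx.x + blockDim.x * blockIdx.x;
       int y = threadIdx.y + blockDim.y * blockIdx.y;
       if (x < d && y < d)
         dwells[(y0 + y) * w + (x0 + x)] = dwell;  // fill one pixel of the d x d square
     }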

  9. Dynamic Parallelism - Closing Remarks

     - the Mariani-Silver algorithm ran 1.3× to almost 6× faster than the naive version (depending on image size)
     - a dynamic kernel launch can fail (lack of resources); you should perform a cucheck_dev(cudaGetLastError()) after each launch (a usage sketch follows at the end of this slide):

     #define cucheck_dev(call) \
       { cudaError_t cucheck_err = (call); \
         if (cucheck_err != cudaSuccess) { \
           const char *err_str = cudaGetErrorString(cucheck_err); \
           printf("%s (%d): %s\n", __FILE__, __LINE__, err_str); \
           assert(0); \
         } }

     - kernel launch is asynchronous; a successful launch only means the kernel is queued
     - must compile for Compute Capability 3.5 or higher (-arch=sm_35)
     - dynamic parallelism is generally useful for recursive algorithms, including tree-based algorithms (e.g. quad-tree re-ordering, tree traversal)
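     A usage sketch inside the parent kernel (the launch shown is the recursive one from the previous slide):

     mandelbrot_block_k<<<grid, bs>>>
       (dwells, w, h, cmin, cmax, x0, y0, d / SUBDIV, depth + 1);
     cucheck_dev(cudaGetLastError());   // catches launches that failed to queue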

  10. Outline

      1 Dynamic Parallelism – Mandelbrot Set Revisited
      2 Stencil Computations
      3 Dynamic Programming – The Knapsack Problem

  11. Overview: Stencil Computations

      - degrees of synchronization
      - synchronous example: Heat Diffusion
        - serial and parallel code
        - comparison of block and strip thread block shapes
        - optimization 1: iterating over contiguous indices
        - optimization 2: use of shared data

  12. Degrees of Synchronization

      - from fully to loosely synchronous: the more synchronous your computation, the more potential overhead
      - SIMD: synchronized at the instruction level
        - provides ease of programming (one program)
        - well suited for data decomposition
        - applicable to many numerical problems
      - the forall statement was introduced to specify data parallel operations:

        forall (i = 0; i < n; i++) {
          data parallel work
        }

      - e.g. the Jacobi iteration, which solves a system of linear equations (Ax = b) iteratively: x^{(t+1)} = (b - (A - diag(A)) x^{(t)}) / diag(A) (see the sketch after this slide)
      - e.g. an s-point stencil computation: A^{t+1}_{i,j} = f(A^t_{i+c_1,j+d_1}, ..., A^t_{i+c_s,j+d_s})
      - occurs in many physical problems (e.g. advection), image processing, etc.
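      For concreteness, a minimal serial sketch of one Jacobi sweep for Ax = b (dense row-major storage; the function and variable names are assumptions, not from the slides):

      // one Jacobi sweep: x_new[i] = (b[i] - sum_{j != i} A[i][j] * x[j]) / A[i][i]
      void jacobi_sweep(int n, const double *A, const double *b,
                        const double *x, double *x_new) {
        for (int i = 0; i < n; i++) {
          double s = b[i];
          for (int j = 0; j < n; j++)
            if (j != i)
              s -= A[i * n + j] * x[j];   // off-diagonal part of row i
          x_new[i] = s / A[i * n + i];    // divide by the diagonal entry
        }
      }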

  13. Locally Synchronous Example: Heat Diffusion

      Consider a metal sheet with a fixed temperature along the sides but unknown temperatures in the middle – find the temperatures in the middle.

      - finite difference approximation to the Laplace equation:

        ∂²T(x,y)/∂x² + ∂²T(x,y)/∂y² = 0

        (T(x+δx,y) - 2T(x,y) + T(x-δx,y))/δx² + (T(x,y+δy) - 2T(x,y) + T(x,y-δy))/δy² ≈ 0

      - assuming an even grid (i.e. δx = δy) of n × n points (denoted h_{i,j}) and solving for T(x,y), the temperature at any point is the average of the surrounding points:

        h_{i,j} = (h_{i-1,j} + h_{i+1,j} + h_{i,j-1} + h_{i,j+1}) / 4

      - the problem is very similar to the Game of Life, i.e. what happens in a cell depends upon its NSEW neighbours

  14. Array Ordering

      [figure: a k × k grid flattened row by row into a 1-D array, with x_1 ... x_k in the first row, x_{k+1} ... x_{2k} in the second, and so on; the neighbours of an interior x_i are x_{i-1}, x_{i+1} in its row and x_{i-k}, x_{i+k} in the rows above and below]

      we will solve iteratively:

        x_i = (x_{i-1} + x_{i+1} + x_{i-k} + x_{i+k}) / 4

      but this problem may also be written as a system of linear equations:

        x_{i-k} + x_{i-1} - 4x_i + x_{i+1} + x_{i+k} = 0
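      In the flattened ordering, one sweep over the interior might look like the following serial sketch (the names and boundary layout are assumptions; boundary entries hold the fixed temperatures):

      // x is a k x k grid flattened row-major; only interior points are updated
      void sweep(int k, const float *x, float *x_new) {
        for (int r = 1; r < k - 1; r++)
          for (int c = 1; c < k - 1; c++) {
            int i = r * k + c;                        // flattened index
            x_new[i] = 0.25f * (x[i - 1] + x[i + 1]   // left/right neighbours
                              + x[i - k] + x[i + k]); // up/down neighbours
          }
      }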

  15. Heat Equation: Sequential Code

      - assume a fixed number of iterations and a square mesh
      - beware of what happens at the edges!

      for (iter = 0; iter < max_iter; iter++) {
        for (i = 1; i < n; i++)
          for (j = 1; j < n; j++)
            g[i][j] = 0.25 * (h[i-1][j] + h[i+1][j] +
                              h[i][j-1] + h[i][j+1]);
        for (i = 1; i < n; i++)
          for (j = 1; j < n; j++)
            h[i][j] = g[i][j];
      }
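      Anticipating the serial/parallel comparison listed in the overview, a minimal CUDA sketch of one iteration, with one thread per interior point (this is an assumed translation, not the workshop's parallel code):

      // h and g are (n+1) x (n+1) arrays flattened row-major, matching the serial loops
      __global__ void heat_step(const float *h, float *g, int n) {
        int i = threadIdx.y + blockDim.y * blockIdx.y + 1;  // skip the fixed boundary
        int j = threadIdx.x + blockDim.x * blockIdx.x + 1;
        if (i < n && j < n)
          g[i * (n + 1) + j] = 0.25f * (h[(i - 1) * (n + 1) + j] + h[(i + 1) * (n + 1) + j]
                                      + h[i * (n + 1) + j - 1]  + h[i * (n + 1) + j + 1]);
      }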
