

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures. José M. Andión. PhD Advisors: Gabriel Rodríguez and Manuel Arenaz.

Outline
1. Introduction
2. A Novel Compiler Support for Multicore Systems


  1. Building the KIR (I)

  1 for (i = 0; i < n; i++) {
  2   t = 0;
  3   for (j = 0; j < m; j++) {
  4     t = t + A[i][j] * x[j];
  5   }
  6   y[i] = t;
  7 }

[Figure: control-flow graph of the code (basic blocks BB0-BB5: i = 0; t = 0; j = 0; t = t + A[i][j] * x[j]; j++; if (j < m); y[i] = t; i++; if (i < n)) next to the resulting KIR nodes K<i_BB0>, K<i_BB4>, K<j_BB1>, K<j_BB2>, K<t_BB1>, K<t_BB2>, K<y_BB4>. Edges (1), (2) are abstracted with diKernels.]

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures, 16 / 118

  2. diKernel-level Flow Dependences
• Identification of the flow of information across the program
• Based on statement-level dominance
• And on the range of values of variable x produced and used throughout the execution of statements: there is a diKernel-level flow dependence from K<x_i> to K<y_j> if x_i dominates y_j and DEF(x, x_i) ⊇ USE(x, y_j).

  3. Building the KIR (II)
• Example: i = 0 dominates i++, and DEF(i, i=0) ⊇ USE(i, i++)

[Figure: the CFG (BB0-BB5) and the KIR of the running example, highlighting the flow dependence between K<i_BB0> and K<i_BB4>.]

  4. Hierarchy of Execution Scopes
• Exposes the computational stages of the program
• Based on the hierarchy of loops: one execution scope for each perfect loop nest
• The root execution scope is a special node that represents the program as a whole
• diKernels belong to the innermost execution scope that contains all of their statements
• diKernels that compute the loop indices belong to the execution scope of the corresponding loop

  5. Building the KIR (and III)

  1 for (i = 0; i < n; i++) {
  2   t = 0;
  3   for (j = 0; j < m; j++) {
  4     t = t + A[i][j] * x[j];
  5   }
  6   y[i] = t;
  7 }

ROOT EXECUTION SCOPE
  ES_for_i (Figure 2.2, lines 1-7)
    K<t_BB1>  scalar assignment
    ES_for_j (Figure 2.2, lines 3-5)
      K<t_BB2>  scalar reduction
    K<y_BB4>  regular assignment
(K<i_BB0>, K<i_BB4>, K<j_BB1>, K<j_BB2> compute the loop indices)

  6. Outline
2. A Novel Compiler Support for Multicore Systems
• KIR: A diKernel-based IR
• Automatic Partitioning driven by the KIR
• Automatic Parallelization of the Benchmark Suite
• Experimental Evaluation

  7. Spurious diKernel-level Dependences
• Dependences that do not prevent parallelization
• Example: t is a privatizable scalar variable

  1 for (i = 0; i < n; i++) {
  2   t = 0;
  3   for (j = 0; j < m; j++) {
  4     t = t + A[i][j] * x[j];
  5   }
  6   y[i] = t;
  7 }

ROOT EXECUTION SCOPE
  ES_for_i (Figure 2.2, lines 1-7)
    K<t_BB1>  scalar assignment
    ES_for_j (Figure 2.2, lines 3-5)
      K<t_BB2>  scalar reduction
    K<y_BB4>  regular assignment

  8. OpenMP-enabled Parallelization Strategy
• Find the critical path in the KIR: the chain of diKernel-level flow dependences
• Parallelizing transformations for each type of diKernel
• Optimizations for the joint parallelization of loops:
  - Minimize synchronization between diKernels
  - Minimize thread creation/destruction

  9. Parallelizing Transformations

FULLY PARALLEL LOOP
  1. #pragma omp parallel for
  2. for (i = 0; i < n; i++) {
  3.   A[i] = 2;
  4. }

PARTIALLY PARALLEL LOOP
  1. r = 0;
  2. #pragma omp parallel for reduction (+:r)
  3. for (i = 0; i < n; i++) {
  4.   r = r + A[i];
  5. }

Array Expansion

  10. Automatic Partitioning driven by the KIR (I)

  1 for (i = 0; i < n; i++) {
  2   t = 0;
  3   for (j = 0; j < m; j++) {
  4     t = t + A[i][j] * x[j];
  5   }
  6   y[i] = t;
  7 }

ROOT EXECUTION SCOPE
  ES_for_i (Figure 2.2, lines 1-7)
    K<t_BB1>  scalar assignment
    ES_for_j (Figure 2.2, lines 3-5)
      K<t_BB2>  scalar reduction
    K<y_BB4>  regular assignment
(critical path: K<t_BB1> → K<t_BB2> → K<y_BB4>)

  11. Automatic Partitioning driven by the KIR (and II)
FULLY PARALLEL LOOP (critical path covered by ES_for_i):

  1 #pragma omp parallel shared (A,x,y) private (i,j,t)
  2 {
  3 #pragma omp for schedule ( static )
  4 for (i = 0; i < n; i = i + 1) {
  5   t = 0;
  6   for (j = 0; j < m; j = j + 1) {
  7     t = (t) + ((A[i][j]) * (x[j]));
  8   }
  9   y[i] = t;
 10 }
 11 }

  12. Outline
2. A Novel Compiler Support for Multicore Systems
• KIR: A diKernel-based IR
• Automatic Partitioning driven by the KIR
• Automatic Parallelization of the Benchmark Suite
• Experimental Evaluation

  13. Automatic Parallelization of the Benchmark Suite
• Synthetic Benchmarks
• Dense/Sparse Matrix-Vector Multiplication
• Sobel Edge Filter
• SWIM from SPEC CPU2000
• EQUAKE from SPEC CPU2000

  14. DenseAMUX & SparseAMUX

DenseAMUX
  1 for (i = 0; i < n; i++) {
  2   t = 0;
  3   for (j = 0; j < m; j++) {
  4     t = t + A[i][j] * x[j];
  5   }
  6   y[i] = t;
  7 }

SparseAMUX
  1 for (i = 0; i < n; i++) {
  2   t = 0;
  3   for (j = ia[i]; j < ia[i+1]-1; j++) {
  4     t = t + A[j] * x[ja[j]];
  5   }
  6   y[i] = t;
  7 }

ROOT EXECUTION SCOPE
  ES_for_i (Figures 2.8a and 2.8b, lines 1-7)
    K<t_2>  scalar assignment
    ES_for_j (Figures 2.8a and 2.8b, lines 3-5)
      K<t_4>  scalar reduction
    K<y_6>  regular assignment

  15. SparseAMUX
FULLY PARALLEL LOOP (ES_for_i; same KIR as in slide 14):

  1 #pragma omp parallel shared (A,ia,ja,x,y) private (i,j,t)
  2 {
  3 #pragma omp for schedule ( static )
  4 for (i = 0; i < n; i++) {
  5   t = 0;
  6   for (j = ia[i]; j < (ia[i+1] - 1); j = j + 1) {
  7     t = (t) + ((A[j]) * (x[ja[j]]));
  8   }
  9   y[i] = t;
 10 }
 11 }

  16. AMUXMS & ATMUX

AMUXMS
  1 for (i = 0; i < n; i++) {
  2   y[i] = A[i] * x[i];
  3 }
  4 for (j = 0; j < n; j++) {
  5   for (l = ja[j]; l < ja[j+1]-1; l++) {
  6     y[j] = y[j] + A[l] * x[ja[l]];
  7   }
  8 }

ATMUX
  1 for (i = 0; i < n; i++) {
  2   y[i] = 0;
  3 }
  4 for (j = 0; j < n; j++) {
  5   for (l = ia[j]; l < ia[j+1]-1; l++) {
  6     y[ja[l]] = y[ja[l]] + x[j] * A[l];
  7   }
  8 }

ROOT EXECUTION SCOPE
  ES_for_i (Figures 2.9a and 2.9b, lines 1-3)
    <y_2>  regular assignment
  ES_for_j,l (Figures 2.9a and 2.9b, lines 4-8)
    <y_6>  irregular reduction (ATMUX) / regular reduction (AMUXMS)

  17. AMUXMS
FULLY PARALLEL LOOPs (both loops; nowait removes the barrier between them):

  1 #pragma omp parallel shared (A,x,ja,y) private (i,j,l,t)
  2 {
  3 #pragma omp for schedule ( static ) nowait
  4 for (i = 0; i < n; i = i + 1) {
  5   y[i] = (A[i]) * (x[i]);
  6 }
  7 #pragma omp for schedule ( static )
  8 for (j = 0; j < n; j = j + 1) {
  9   for (l = ja[j]; l < (ja[j+1] - 1); l = l + 1) {
 10     y[j] = (y[j]) + ((A[l]) * (x[ja[l]]));
 11   }
 12 }
 13 }

  18. ATMUX
PARTIALLY PARALLEL LOOP (irregular reduction <y_6>, parallelized by privatizing y):

  1 #pragma omp parallel shared (A,ia,ja,x,y) private (i,j,l,y___private)
  2 {
  3 if (omp_get_thread_num() == 0) {
  4   y___private = y;
  5 } else {
  6   y___private = ( float *) malloc(n * sizeof ( float ));
  7 }
  8 for (i = 0; i < n; i = i + 1) {           /* Initialization */
  9   y___private[i] = 0;
 10 }
 11 #pragma omp for schedule ( static )
 12 for (j = 0; j < n; j = j + 1) {           /* Computation */
 13   for (l = ia[j]; l < (ia[j+1] - 1); l = l + 1) {
 14     y___private[ja[l]] = (y___private[ja[l]]) + ((x[j]) * (A[l]));
 15   }
 16 }
 17 #pragma omp critical
 18 if (omp_get_thread_num() != 0) {          /* Reduction */
 19   for (i = 0; i < n; i = i + 1) {
 20     y[i] += y___private[i];
 21   }
 22 }
 23 if (omp_get_thread_num() != 0) {
 24   free(y___private);
 25 }
 26 }

  19. EQUAKE (I)

ROOT EXECUTION SCOPE
  ES_for_iter (Figure 2.18, lines 1-46)
    ES_for_i,j (lines 2-4):      <disp_4>  regular assignment
    ES_for_i,while (lines 5-27): <disp_26> irregular reduction
    ES_for_i,j (lines 29-31):    <disp_31> regular reduction
    ES_for_i,j (lines 32-37):    <disp_34> regular reduction
    ES_for_i,j (lines 38-40):    <disp_40> regular reduction
    ES_for_i,j (lines 41-44):    <vel_43>  regular assignment

  1 for (iter = 1; iter <= timesteps; iter++) {
  2   for (i = 0; i < ARCHnodes; i++)
  3     for (j = 0; j < 3; j++)
  4       disp[disptplus][i][j] = 0.0;
  5   for (i = 0; i < ARCHnodes; i++) {
  6     Anext = ARCHmatrixindex[i]; Alast = ARCHmatrixindex[i+1];
  7     sum0 = K[Anext][0][0] * disp[dispt][i][0]
  8          + K[Anext][0][1] * disp[dispt][i][1]
  9          + K[Anext][0][2] * disp[dispt][i][2];
 10     sum1 = K[Anext][1][0] * ...; sum2 = K[Anext][2][0] * ...;
 11     Anext++;
 12     while (Anext < Alast) {
 13       col = ARCHmatrixcol[Anext];
 14       sum0 += K[Anext][0][0] * disp[dispt][col][0]
 15            + K[Anext][0][1] * disp[dispt][col][1]
 16            + K[Anext][0][2] * disp[dispt][col][2];
 17       sum1 += K[Anext][1][0]*...; sum2 += K[Anext][2][0]*...;
 18       disp[disptplus][col][0] +=
 19           K[Anext][0][0] * disp[dispt][i][0]
 20         + K[Anext][1][0] * disp[dispt][i][1]
 21         + K[Anext][2][0] * disp[dispt][i][2];
 22       disp[disptplus][col][1] += K[Anext][0][1] ...
 23       disp[disptplus][col][2] += K[Anext][0][2] ...
 24       Anext++;
 25     }
 26     disp[disptplus][i][0] += sum0; ...
 27   }

  20. EQUAKE (II)
(KIR as in EQUAKE (I))

 28   time = iter * Exc.dt;
 29   for (i = 0; i < ARCHnodes; i++)
 30     for (j = 0; j < 3; j++)
 31       disp[disptplus][i][j] *= - Exc.dt * Exc.dt;
 32   for (i = 0; i < ARCHnodes; i++)
 33     for (j = 0; j < 3; j++)
 34       disp[disptplus][i][j] +=
 35         2.0 * M[i][j] * disp[dispt][i][j]
 36         - (M[i][j] - Exc.dt / 2.0 * C[i][j])
 37         * disp[disptminus][i][j] - ...
 38   for (i = 0; i < ARCHnodes; i++)
 39     for (j = 0; j < 3; j++)
 40       disp[disptplus][i][j] /= (M[i][j] + Exc.dt / 2.0 * C[i][j]);
 41   for (i = 0; i < ARCHnodes; i++)
 42     for (j = 0; j < 3; j++)
 43       vel[i][j] = 0.5 / Exc.dt * (disp[disptplus][i][j]
 44                 - disp[disptminus][i][j]);
 45   i = disptminus; disptminus = dispt; dispt = disptplus; disptplus = i;
 46 }

  21. EQUAKE (III)
• Minimization of thread creation/destruction
PARTIALLY PARALLEL LOOP (irregular reduction on disp[disptplus], privatized):

  1 #pragma omp parallel shared (disp) private (disp___disptplus___private,...)
  2 {
  3 if (omp_get_thread_num() == 0) {
  4   disp___disptplus___private = disp[disptplus];
  5 } else {
  6   disp___disptplus___private = ( double **) malloc (ARCHnodes * sizeof ( double *));
  7   for (i = 0; i < ARCHnodes; i = i + 1)
  8     disp___disptplus___private[i] = ( double *) malloc(3 * sizeof ( double ));
  9 }
 10 for (iter = 1; iter < (timesteps + 1); iter = iter + 1) {
 11   #pragma omp barrier
 12   for (i = 0; i < ARCHnodes; i = i + 1)        /* Initialization */
 13     for (j = 0; j < 3; j = j + 1)
 14       disp___disptplus___private[i][j] = 0.0;
 15   #pragma omp for schedule ( static )
 16   for (i = 0; i < ARCHnodes; i = i + 1) {      /* Computation */
 17     Anext = ARCHmatrixindex[i]; Alast = ARCHmatrixindex[i+1];
 18     sum0 = K[Anext][0][0] * ...
 19     Anext++;
 20     while (Anext < Alast) {
 21       col = ARCHmatrixcol[Anext];
 22       sum0 += K[Anext][0][0] * ...
 23       disp___disptplus___private[col][0] += K[Anext][0][0] * ...
 24       Anext++;
 25     }
 26     disp___disptplus___private[i][0] += sum0; ...
 27   }
 28   #pragma omp critical
 29   if (omp_get_thread_num() != 0)               /* Reduction */
 30     for (i = 0; i < ARCHnodes; i = i + 1)
 31       for (j = 0; j < 3; j = j + 1)
 32         disp[disptplus][i][j] += disp___disptplus___private[i][j];
 33   #pragma omp barrier
 34   time = iter * Exc.dt;

  22. EQUAKE (and IV)

 34   time = iter * Exc.dt;
 35   #pragma omp for schedule ( static ) nowait   /* FULLY PARALLEL LOOP */
 36   for (i = 0; i < ARCHnodes; i = i + 1)
 37     for (j = 0; j < 3; j = j + 1)
 38       disp[disptplus][i][j] *= - Exc.dt * Exc.dt;
 39   #pragma omp for schedule ( static ) nowait   /* FULLY PARALLEL LOOP */
 40   for (i = 0; i < ARCHnodes; i = i + 1)
 41     for (j = 0; j < 3; j = j + 1)
 42       disp[disptplus][i][j] += ...
 43   #pragma omp for schedule ( static ) nowait   /* FULLY PARALLEL LOOP */
 44   for (i = 0; i < ARCHnodes; i = i + 1)
 45     for (j = 0; j < 3; j = j + 1)
 46       disp[disptplus][i][j] /= ...
 47   #pragma omp for schedule ( static ) nowait   /* FULLY PARALLEL LOOP */
 48   for (i = 0; i < ARCHnodes; i = i + 1)
 49     for (j = 0; j < 3; j = j + 1)
 50       vel[i][j] = ...
 51   i = disptminus; disptminus = dispt; dispt = disptplus; disptplus = i;
 52 } /* for iter */
 53 if (omp_get_thread_num() != 0) {
 54   for (i = 0; i < ARCHnodes; i = i + 1)
 55     free(disp___disptplus___private[i]);
 56   free(disp___disptplus___private);
 57 }
 58 }

  23. Outline
2. A Novel Compiler Support for Multicore Systems
• KIR: A diKernel-based IR
• Automatic Partitioning driven by the KIR
• Automatic Parallelization of the Benchmark Suite
• Experimental Evaluation

  24. Program Characteristics and Effectiveness

Benchmark → detected diKernel:
  Synthetic: reg. assig. → regular assignment; irreg. assig. → irregular assignment; sc. reduc. 1, 2, 3 → scalar reduction; reg. reduc. → regular reduction; irreg. reduc. → irregular reduction; reg. recurr. → regular recurrence
  Algebra: DenseAMUX → regular assignment; AMUX → regular assignment; AMUXMS → regular reduction; ATMUX → irregular reduction
  Image: sobel1 → regular assignment; sobel2 → regular assignment
  Apps: SWIM → regular recurrence; EQUAKE → irregular reduction

[Table: for each benchmark, the program characteristics that hinder parallelization (unknown loop bounds, complex control flow, irregular writes, irregular reads, temporary variables) and the effectiveness (√, ≈, U) of PLUTO, GCC, ICC, and the KIR-based approach.]

Experimental platform: 2 Intel Xeon E5520 quad-core Nehalem processors at 2.26 GHz, with 8 MB of cache memory per processor and 8 GB of RAM.

  25. Performance: EQUAKE (Execution Time)

[Chart: execution time (s) and speedup of EQUAKE with 1, 2, 4, and 8 threads, for KIR/ICC vs. ICC and workloads WL x 1, WL x 2, WL x 3; bars broken down into Irregular computation, Remaining computation, and Overhead.]

  26. Outline
1. Introduction
2. A Novel Compiler Support for Multicore Systems
3. Locality-Aware Automatic Parallelization for GPGPU
4. Trace-Based Affine Reconstruction of Code
5. Conclusions

  27. Outline
3. Locality-Aware Automatic Parallelization for GPGPU
• GPGPU with CUDA and OpenHMPP
• Locality-Aware Generation of Efficient GPGPU Code
• CONV3D & SGEMM
• Experimental Evaluation

  28. Outline
3. Locality-Aware Automatic Parallelization for GPGPU
• GPGPU with CUDA and OpenHMPP
• Locality-Aware Generation of Efficient GPGPU Code
• CONV3D & SGEMM
• Experimental Evaluation

  29. GPGPU with CUDA
• The first GPGPU programs looked like graphics applications
• CUDA enables the use of C: a CUDA kernel specifies the operation of a single GPU thread
• Main ideas:
  1. Lightweight parallel threads organized in a hierarchy: grid, block
  2. Shared memory
  3. Barriers

  30. Example of CUDA-enabled GPU architecture
[Figure: block diagram of a CUDA-enabled GPU.]

  31. GPU Memories

                   Location   Access         Scope
  registers        SM         read & write   one GPU thread
  local memory     DRAM       read & write   one GPU thread
  shared memory    SM         read & write   all GPU threads in a block
  global memory    DRAM       read & write   all GPU threads & CPU

• Explicit allocations and transfers

  32. GPU Programming Features in CUDA
(ordered from highest to lowest impact on performance)
1. Threadification
2. Thread grouping: warps
3. Minimization of CPU-GPU data transfers
4. Coalescing
5. Maximization of the usage of registers and shared memory
6. Divergency
7. Occupancy
8. Threads per block

  33. GPGPU with OpenHMPP

[Table comparing GPGPU programming models. Common to all: RPC-style CPU-GPU interaction, disjoint address spaces, and automatic & manual data transfers. They differ in the handling of software-managed caches (automatic vs. explicit), in how parallelism is specified (gangs, workers, SIMD / loop iterations, tasks, SIMD / loop iterations), and in the support for standard loop transformations (directives vs. none).]

  34. Outline
3. Locality-Aware Automatic Parallelization for GPGPU
• GPGPU with CUDA and OpenHMPP
• Locality-Aware Generation of Efficient GPGPU Code
• CONV3D & SGEMM
• Experimental Evaluation

  35. GPU Programming Features addressed by our Automatic Technique
(ordered from highest to lowest impact on performance)
1. Threadification
2. Thread grouping: warps
3. Minimization of CPU-GPU data transfers
4. Coalescing
5. Maximization of the usage of registers and shared memory
6. Divergency
7. Occupancy
8. Threads per block

  36. Chains of Recurrences (chrecs)
• Algebraic formalism
• Definition 3.3.1. Given a constant φ ∈ ℤ, a function g : ℕ₀ → ℤ, and the operator +, the chrec f = {φ, +, g} is defined as a function f : ℕ₀ → ℤ such that:

  {φ, +, g}(i) = φ + Σ_{j=0}^{i-1} g(j)

• Useful for representing the iterations of a loop and array access patterns:

  2 for (i = 0; i <= N; i++) {
  3   for (j = 0; j <= N; j++) {
  4     ... x[i][j] ...
  5   }
  6 }

  CHRECS_x_k = [{0,+,1}][{0,+,1}]

• We instantiate (particularize) them for each GPU thread

  37. Detection of Coalesced Accesses to the GPU Global Memory

(a) Source code S1 (row-major storage):
  1 // only for_i is threadified
  2 for (i = 0; i <= N; i++) {
  3   for (j = 0; j <= N; j++) {
  4     ... x[i][j] ...
  5   }
  6 }
  CHRECS_x_k = [{0,+,1}][{0,+,1}]

(b) Non-coalesced accesses:
  threads T0 (i=0), T1 (i=1), T2 (i=2)
  j = 0:  x[0][0]  x[1][0]  x[2][0]
  j = 1:  x[0][1]  x[1][1]  x[2][1]
  j = 2:  x[0][2]  x[1][2]  x[2][2]
  1st-dim chrecs: {0}, {1}, {2}
  2nd-dim chrecs: {0,+,1}, {0,+,1}, {0,+,1}

(c) Source code S2 (column-major traversal):
  1 // only for_j is threadified
  2 for (j = 0; j <= N; j++) {
  3   for (i = 0; i <= N; i++) {
  4     ... x[i][j] ...
  5   }
  6 }
  CHRECS_x_k = [{0,+,1}][{0,+,1}]

(d) Coalesced accesses:
  threads T0 (j=0), T1 (j=1), T2 (j=2)
  i = 0:  x[0][0]  x[0][1]  x[0][2]
  i = 1:  x[1][0]  x[1][1]  x[1][2]
  i = 2:  x[2][0]  x[2][1]  x[2][2]
  1st-dim chrecs: {0,+,1}, {0,+,1}, {0,+,1}   (the same for all threads)
  2nd-dim chrecs: {0}, {1}, {2}               (a convex set)

  38. Detection of whether an Access to the GPU Global Memory can be Coalesced

 1: FUNCTION IsCoalescedAccess
    Input: access x_k[i_k,1][i_k,2]...[i_k,n] to an n-dimensional array x stored in row-major order
    Input: loop nest L = L1, L2, ..., Ll where L1 is the threadified loop
    Output: returns whether the given access x_k can be coalesced after threadifying the loop nest L
 2: CHRECS_x_k ← [{φ_k,1,+,g_k,1}][{φ_k,2,+,g_k,2}]...[{φ_k,n,+,g_k,n}]
 3: W ← warp of GPU threads {T0, T1, T2, ...}
 4: for each thread Ti in W do
 5:   CHRECS_x_k^Ti ← [{φ_k,1^Ti,+,g_k,1^Ti}][{φ_k,2^Ti,+,g_k,2^Ti}]...[{φ_k,n^Ti,+,g_k,n^Ti}]
 6: end for
 7: if (∃ d ∈ {1...n-1}, Tj ∈ W - {T0} : {φ_k,d^Tj,+,g_k,d^Tj} ≠ {φ_k,d^T0,+,g_k,d^T0}) then
 8:   return false                ▷ first n-1 chrecs differ
 9: end if
10: CHRECS_RANGE_x_k,n ← ∪_Ti {φ_k,n^Ti,+,g_k,n^Ti}
11: if CHRECS_RANGE_x_k,n defines a convex set then
12:   return true                 ▷ threads of the warp access consecutive locations
13: else
14:   return (∀ Tj ∈ W - {T0} : {φ_k,n^Tj,+,g_k,n^Tj} = {φ_k,n^T0,+,g_k,n^T0})
15:                               ▷ threads of the warp access the same location
16: end if
17: end FUNCTION

  39. Usage of Registers to Store Reused Data within a GPU Thread

 1: PROCEDURE StoreReusedDataInRegisters
    Input: n-dimensional array x[s1][s2]...[sn]
    Input: loop nest L = L1, L2, ..., Ll where L1 is the threadified loop
    Output: a modified program that exploits reused data to maximize the usage of the GPU registers
 2: collect accesses x_k[i_k,1][i_k,2]...[i_k,n] with k ∈ {1, ..., m}
 3: CHRECS_x_k ← [{φ_k,1,+,g_k,1}][{φ_k,2,+,g_k,2}]...[{φ_k,n,+,g_k,n}]
 4: for each thread Ti do
 5:   CHRECS_x_k^Ti ← [{φ_k,1^Ti,+,g_k,1^Ti}][{φ_k,2^Ti,+,g_k,2^Ti}]...[{φ_k,n^Ti,+,g_k,n^Ti}]
 6:   REUSED_DATA_x^Ti ← ∩_{k=1}^{m} CHRECS_x_k^Ti
 7:   if (REUSED_DATA_x^Ti ≠ ∅) then
 8:     store reused data between the accesses made by Ti in its set of registers if data are private
 9:   end if
10: end for
11: end PROCEDURE

  40. Usage of the GPU Shared Memory for Data Shared between the Threads of a Block

 1: PROCEDURE StoreSharedDataInSharedMemory
    Input: n-dimensional array x[s1][s2]...[sn]
    Input: loop nest L = L1, L2, ..., Ll where L1 is the threadified loop
    Output: a modified program using the GPU shared memory to share data between the threads of a block
 2: collect accesses x_k[i_k,1][i_k,2]...[i_k,n] with k ∈ {1, ..., m}
 3: CHRECS_x_k ← [{φ_k,1,+,g_k,1}][{φ_k,2,+,g_k,2}]...[{φ_k,n,+,g_k,n}]
 4: for each block B do
 5:   for each thread Ti in B do
 6:     CHRECS_x_k^Ti ← [{φ_k,1^Ti,+,g_k,1^Ti}][{φ_k,2^Ti,+,g_k,2^Ti}]...[{φ_k,n^Ti,+,g_k,n^Ti}]
 7:   end for
 8:   SHDATA_x ← ∩_Ti CHRECS_x_k^Ti with k ∈ {1, ..., m}
 9:   if (SHDATA_x ≠ ∅) then
10:     store data shared between the threads of block B in the shared memory
11:   end if
12: end for
13: end PROCEDURE

  41. Increase the Computational Load of a GPU Thread

 1: PROCEDURE IncreaseLoad
    Input: access x_k[i_k,1][i_k,2]...[i_k,n] to an n-dimensional array x stored in row-major order
    Input: loop nest L = L1, L2, ..., Ll where both L1, L2 are threadified
    Input: amount of data ∆ to be processed by a GPU thread
    Output: a modified program after applying loop tiling under the OpenHMPP programming model
 2: increment the step of the outer loop L1 to ∆
 3: for each scalar variable s in L do
 4:   promote s to an array s[∆]
 5:   transform reads and writes to s into loops of ∆ iterations
 6: end for
 7: end PROCEDURE

  42. Use Scalar Variables to Enable GPU Compiler Optimizations

 1: PROCEDURE IncreaseLoad (continued)
    Input: loop nest L = L1, L2, L3, ..., Ll that results from Algorithm 3.4, where both L1, L2 are threadified, the step of L1 is ∆, and L3 is the created loop with ∆ iterations
    Output: a modified program that uses more scalar variables to enable GPU compiler optimizations
 2: apply loop fission to L3, the loop created in line 5 of Algorithm 3.4
 3: for each loop L3' resulting from the fission of L3 do
 4:   interchange loops until L3' is the innermost one
 5:   insert a fullunroll directive before L3'
 6: end for
 7: end PROCEDURE

  43. Outline
3. Locality-Aware Automatic Parallelization for GPGPU
• GPGPU with CUDA and OpenHMPP
• Locality-Aware Generation of Efficient GPGPU Code
• CONV3D & SGEMM
• Experimental Evaluation

  44. CONV3D & SGEMM
• Benchmark versions: conv3d-cpu and sgemm-cpu (sequential), sgemm-mkl (Intel MKL), sgemm-cublas (CUBLAS), and the incrementally optimized OpenHMPP versions conv3d-hmpp1, conv3d-hmpp2, conv3d-hmpp3 and sgemm-hmpp1 to sgemm-hmpp4.
• GPU features exploited by each version:
  - Coalescing: all the conv3d-hmpp and sgemm-hmpp versions
  - Registers: from conv3d-hmpp2 and the later sgemm-hmpp versions onwards
  - Shared memory: conv3d-hmpp3 and the last sgemm-hmpp versions

  45. CONV3D (I)
(shaded statements in the figure are omitted in the discovery of parallelism)

  1 int sizex, sizey, sizez, bound = 4;
  2
  3 void conv3d( float output[sizex][sizey][sizez],
  4     float input[bound+sizex+bound][4+sizey+4][4+sizez+4],
  5     float coefx, float coefy, float coefz) {
  6
  7   for ( int i = 0; i < sizex; i++) {
  8     for ( int j = 0; j < sizey; j++) {
  9       for ( int k = 0; k < sizez; k++) {
 10         float tempx = input[i][j][k] + coefx *
 11           (
 12             input[i-1][j][k] + input[i+1][j][k] +
 13             input[i-2][j][k] + input[i+2][j][k] +
 14             input[i-3][j][k] + input[i+3][j][k] +
 15             input[i-4][j][k] + input[i+4][j][k]
 16           );
 17         float tempy = input[i][j][k] + coefy *
 18           (
 19             input[i][j-1][k] + input[i][j+1][k] +
 20             input[i][j-2][k] + input[i][j+2][k] +
 21             input[i][j-3][k] + input[i][j+3][k] +
 22             input[i][j-4][k] + input[i][j+4][k]
 23           );
 24         float tempz = input[i][j][k] + coefz *
 25           (
 26             input[i][j][k-1] + input[i][j][k+1] +
 27             input[i][j][k-2] + input[i][j][k+2] +
 28             input[i][j][k-3] + input[i][j][k+3] +
 29             input[i][j][k-4] + input[i][j][k+4]
 30           );
 31         output[i][j][k] =
 32           output[i][j][k] + tempx + tempy + tempz;
 33       }
 34     }
 35   }
 36 }

ROOT EXECUTION SCOPE
  ES_for_i,j,k (Figure 3.4, lines 7-35): FULLY PARALLEL LOOP
    K<tempx_10>  scalar assignment
    K<tempy_17>  scalar assignment
    K<tempz_24>  scalar assignment
    K<output_31> regular reduction

  46. CONV3D (II)
• conv3d-cpu: sequential code
• conv3d-hmpp1: coalescing

  1. int i, j, k, size_x, size_y, size_z;
  2. float coefx,coefy,coefz,*input,*output;
  3.
  4. for (i = 0; i < size_x; i++) {
  5.   for (j = 0; j < size_y; j++) {
  6.     for (k = 0; k < size_z; k++) {
  7.       float tempx = input[i][j][k]+coefx*
  8.       ...

  CHRECS_input1 = [{0,+,1}][{0,+,1}][{0,+,1}]

• Default OpenHMPP policy:
  CHRECS_input1^T0 = [{0}][{0}][{0,+,1}]
  CHRECS_input1^T1 = [{0}][{1}][{0,+,1}]
  (the second dimension differs between threads, so the access is not coalesced)
• Loop nest is permuted to for_j, for_k, for_i:
  CHRECS_input1^T0 = [{0,+,1}][{0}][{0}]
  CHRECS_input1^T1 = [{0,+,1}][{0}][{1}]
  (only the last dimension differs, by consecutive values: coalesced)

  47. CONV3D (III)
• conv3d-hmpp2: registers

  4. for (j = 0; j < size_y; j++) {
  5.   for (k = 0; k < size_z; k++) {
  6.     for (i = 0; i < size_x; i++) {
  7.       float tempx = input[i][j][k]+coefx*
  8.       (
  9.         input[i-1][j][k]+input[i+1][j][k]+
 10.         ...

  CHRECS_input1 = [{0,+,1}][{0,+,1}][{0,+,1}]
  CHRECS_input2 = [{-1,+,1}][{0,+,1}][{0,+,1}]
  CHRECS_input3 = [{1,+,1}][{0,+,1}][{0,+,1}]

  Instantiated for thread T0:
  CHRECS_input1^T0 = [{0,+,1}][{0}][{0}]
  CHRECS_input2^T0 = [{-1,+,1}][{0}][{0}]
  CHRECS_input3^T0 = [{1,+,1}][{0}][{0}]

  ∩ ≠ ∅ : the intersection is non-empty, so the data reused across the i-iterations of a thread are stored in registers

  48. CONV3D (IV)

  1 #pragma hmpp conv3d___hmpp2 codelet
  2 void conv3d___hmpp2( float output[sizex][sizey][sizez],
  3     float input[bound+sizex+bound][4+sizey+4][4+sizez+4],
  4     float coefx, float coefy, float coefz) {
  5
  6 #pragma hmppcg gridify (j, k)
  7   for ( int j = 0; j < sizey; j++) {
  8     for ( int k = 0; k < sizez; k++) {
  9       float i___minus4 = 0;
 10       float i___minus3 = input[-4][j][k];
 11       float i___minus2 = input[-3][j][k];
 12       float i___minus1 = input[-2][j][k];
 13       float i___plus0 = input[-1][j][k];
 14       float i___plus1 = input[0][j][k];
 15       float i___plus2 = input[1][j][k];
 16       float i___plus3 = input[2][j][k];
 17       float i___plus4 = input[3][j][k];
 18       for ( int i = 0; i < sizex; i++) {
 19         i___minus4 = i___minus3;
 20         i___minus3 = i___minus2;
 21         i___minus2 = i___minus1;
 22         i___minus1 = i___plus0;
 23         i___plus0 = i___plus1;
 24         i___plus1 = i___plus2;
 25         i___plus2 = i___plus3;
 26         i___plus3 = i___plus4;
 27         i___plus4 = input[i+4][j][k];
 28         float tempx = i___plus0 + coefx *
 29           (
 30             i___minus1 + i___plus1 +
 31             i___minus2 + i___plus2 +
 32             i___minus3 + i___plus3 +
 33             i___minus4 + i___plus4
 34           );
 35         float tempy = ...
 36         float tempz = ...
 37         output[i][j][k] =
 38           output[i][j][k] + tempx + tempy + tempz;
 39       }
 40     }
 41   }
 42 }

49. CONV3D (V) • conv3d-hmpp3: Shared memory

4. for (j = 0; j < size_y; j++) {
5.   for (k = 0; k < size_z; k++) {
6.     for (i = 0; i < size_x; i++) {
         ...
21.      float tempz = input[i][j][k]+coefz*
22.      (
23.        input[i][j][k-1]+input[i][j][k+1]+
24.        input[i][j][k-2]+input[i][j][k+2]+
25.        input[i][j][k-3]+input[i][j][k+3]+
26.        input[i][j][k-4]+input[i][j][k+4]
27.      );
         ...

                          T0                            T1
CHRECS            1st dim   2nd dim  3rd dim    1st dim   2nd dim  3rd dim
CHRECS_input19    {0,+,1}   {0}      {0}        {0,+,1}   {0}      {1}
CHRECS_input20    {0,+,1}   {0}      {-1}       {0,+,1}   {0}      {0}
CHRECS_input21    {0,+,1}   {0}      {1}        {0,+,1}   {0}      {2}
CHRECS_input22    {0,+,1}   {0}      {-2}       {0,+,1}   {0}      {-1}
CHRECS_input23    {0,+,1}   {0}      {2}        {0,+,1}   {0}      {3}
CHRECS_input24    {0,+,1}   {0}      {-3}       {0,+,1}   {0}      {-2}
CHRECS_input25    {0,+,1}   {0}      {3}        {0,+,1}   {0}      {4}
CHRECS_input26    {0,+,1}   {0}      {-4}       {0,+,1}   {0}      {-3}
CHRECS_input27    {0,+,1}   {0}      {4}        {0,+,1}   {0}      {5}

→ threads of the same block access overlapping data → shared clause of the gridify directive

50. CONV3D (and VI)

#pragma hmpp conv3d___hmpp3 codelet
void conv3d___hmpp3( float output[sizex][sizey][sizez],
                     float input[bound+sizex+bound][4+sizey+4][4+sizez+4],
                     float coefx, float coefy, float coefz) {
  float input___shared[bound+8+bound][bound+32+bound];
  #pragma hmppcg gridify (j,k), blocksize (32x8), shared (input___shared), unguarded
  for ( int j = 0; j < sizey; j++) {
    for ( int k = 0; k < sizez; k++) {
      int tx = 0;
      int ty = 0;
      #pragma hmppcg set tx = RankInBlockX()
      #pragma hmppcg set ty = RankInBlockY()
      int rk = tx + bound;
      int rj = ty + bound;
      float i___minus4 = ...
      for ( int i = 0; i < sizex; i++) {
        i___minus4 = ...
        #pragma hmppcg grid barrier
        input___shared[rj-bound][rk-bound] = input[i][j-bound][k-bound];
        input___shared[rj+bound][rk-bound] = input[i][j+bound][k-bound];
        input___shared[rj-bound][rk+bound] = input[i][j-bound][k+bound];
        input___shared[rj+bound][rk+bound] = input[i][j+bound][k+bound];
        #pragma hmppcg grid barrier
        float tempx = ...
        float tempy = i___plus0 + coefy *
          (
            input___shared[rj-1][rk] + input___shared[rj+1][rk] +
            input___shared[rj-2][rk] + input___shared[rj+2][rk] +
            input___shared[rj-3][rk] + input___shared[rj+3][rk] +
            input___shared[rj-4][rk] + input___shared[rj+4][rk]
          );
        float tempz = i___plus0 + coefz *
          (
            input___shared[rj][rk-1] + input___shared[rj][rk+1] +
            input___shared[rj][rk-2] + input___shared[rj][rk+2] +
            input___shared[rj][rk-3] + input___shared[rj][rk+3] +
            input___shared[rj][rk-4] + input___shared[rj][rk+4]
          );
        output[i][j][k] =
          output[i][j][k] + tempx + tempy + tempz;
      }
    }
  }
}

51. SGEMM (I)

 1 int m, n, k;
 2 void sgemm( float C[m][n], float alpha, float A[m][k],
 3             float B[k][n], float beta) {
 4
 5   for ( int i = 0; i < m; i++) {
 6     for ( int j = 0; j < n; j++) {
 7       float prod = 0;
 8       for ( int l = 0; l < k; l++) {
 9         prod += A[i][l] * B[l][j];
10       }
11       C[i][j] = alpha * prod + beta * C[i][j];
12     }
13   }
14 }

KIR:
ROOT EXECUTION SCOPE
• ES_for_i,j (Figure 3.8, lines 5-13) — FULLY PARALLEL LOOP
  • K<prod_7>: scalar assignment
  • ES_for_l (Figure 3.8, lines 8-10)
    • K<prod_9>: scalar reduction
  • K<C_11>: regular reduction

52. SGEMM (II) • sgemm-cpu: Sequential code • sgemm-mkl: Intel MKL • sgemm-hmpp1: Offloading (and check coalescing)

 1 int m, n, k;
 2 void sgemm( float C[m][n], float alpha, float A[m][k],
 3             float B[k][n], float beta) {
 4
 5   for ( int i = 0; i < m; i++) {
 6     for ( int j = 0; j < n; j++) {
 7       float prod = 0;
 8       for ( int l = 0; l < k; l++) {
 9         prod += A[i][l] * B[l][j];
10       }
11       C[i][j] = alpha * prod + beta * C[i][j];
12     }
13   }
14 }

            not instantiated        T0                  T1
CHRECS      1st dim   2nd dim       1st dim   2nd dim   1st dim   2nd dim
CHRECS_A    {0,+,1}   {0,+,1}       {0}       {0,+,1}   {0}       {0,+,1}
CHRECS_B    {0,+,1}   {0,+,1}       {0,+,1}   {0}       {0,+,1}   {1}
CHRECS_C    {0,+,1}   {0,+,1}       {0}       {0}       {0}       {1}

53. SGEMM (III) • sgemm-hmpp2: Tiling preserving coalescing

int m, n, k;
#define DELTA 16

#pragma hmpp sgemm___hmpp2 codelet
void sgemm___hmpp2( float C[m][n], float alpha, float A[m][k],
                    float B[k][n], float beta) {

  #pragma hmppcg gridify (i,j), blocksize (128x1)
  for ( int i = 0; i < m; i = i + DELTA) {
    for ( int j = 0; j < n; j++) {
      float prod[DELTA];
      for ( int t = 0; t < DELTA; t++) {
        prod[t] = 0;
        for ( int l = 0; l < k; l++) {
          prod[t] += A[i+t][l] * B[l][j];
        }
        C[i+t][j] = alpha * prod[t] + beta * C[i+t][j];
      }
    }
  }
}

54. SGEMM (and IV)

• sgemm-hmpp3: Let the compiler use the registers (fullunroll)
• sgemm-hmpp4: Use the shared memory for B
• sgemm-cublas: NVIDIA CUBLAS library

int m, n, k;
#define DELTA 16

#pragma hmpp sgemm___hmpp3 codelet
void sgemm___hmpp3( float C[m][n], float alpha, float A[m][k],
                    float B[k][n], float beta) {

  #pragma hmppcg gridify (i,j), blocksize (128x1)
  for ( int i = 0; i < m; i = i + DELTA) {
    for ( int j = 0; j < n; j++) {
      float prod[DELTA];
      #pragma hmppcg fullunroll
      for ( int t = 0; t < DELTA; t++) {
        prod[t] = 0;
      }
      for ( int l = 0; l < k; l++) {
        #pragma hmppcg fullunroll
        for ( int t = 0; t < DELTA; t++) {
          prod[t] += A[i+t][l] * B[l][j];
        }
      }
      #pragma hmppcg fullunroll
      for ( int t = 0; t < DELTA; t++) {
        C[i+t][j] = alpha * prod[t] + beta * C[i+t][j];
      }
    }
  }
}

55. Outline

3. Locality-Aware Automatic Parallelization for GPGPU
• GPGPU with CUDA and OpenHMPP
• Locality-Aware Generation of Efficient GPGPU Code
• CONV3D & SGEMM
• Experimental Evaluation

56. Performance Evaluation: CONV3D

• sizex, sizey and sizez in 128, 256, 384, 512, 640 and 768
• GFLOPS of conv3d-cpu, conv3d-hmpp1, conv3d-hmpp2 and conv3d-hmpp3 on CPU (nova), GPU Tesla S1070 (nova) and GPU Tesla S2050 (pluton)
• Fermi cards introduced memory caches

57. Performance Evaluation: SGEMM (I)

• m, n and k in 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 4096, 6144 and 8192
• GFLOPS of sgemm-cpu, sgemm-mkl, sgemm-hmpp1, sgemm-hmpp2, sgemm-hmpp3, sgemm-hmpp4 and sgemm-cublas on CPU (nova), GPU Tesla S1070 (nova) and GPU Tesla S2050 (pluton)
• The biggest improvement factor is the usage of the GPU shared memory

58. Performance Evaluation: SGEMM (and II)

• Performance surface over m, n and k (128 up to 8192) on GPU Tesla S2050 (pluton)
• blue: sgemm-cublas, red: sgemm-hmpp4, black: sgemm-mkl

59. Outline

1. Introduction
2. A Novel Compiler Support for Multicore Systems
3. Locality-Aware Automatic Parallelization for GPGPU
4. Trace-Based Affine Reconstruction of Code
5. Main Contributions and Future Research Lines

60. Outline

4. Trace-Based Affine Reconstruction of Code
• Problem Formulation
• Problem Resolution with CHOLESKY
• Extensions for Supporting Nonlinear Traces
• Experimental Evaluation


62. Problem Statement

1. for (i = 0; i <= 29; i++) {
2.   for (j = 0; j <= 29-i; j++) {
3.     for (k = 0; k < i; k++) {
4.       ... A[i][k] ...
5.     }
6.   }
7. }

Memory trace (access number, address):
 1 0x1e2d140
 2 0x1e2d140
 ...
30 0x1e2d140
31 0x1e2d240
32 0x1e2d248
33 0x1e2d240
34 0x1e2d248
 ...
88 0x1e2d248
89 0x1e2d340
90 0x1e2d348
91 0x1e2d350
92 0x1e2d340
93 0x1e2d348
94 0x1e2d350
 ...

• We assume that:
  • The addresses are generated by a single instruction
  • The instruction is enclosed in an affine loop nest
• Once the loop nest is reconstructed, existing memory optimization techniques based on the polyhedral model, as well as any other static or dynamic optimization technique, can be applied even in the absence of source and/or binary code.

63. Problem Formulation (I)

DO i_1 = 0, u_1(i)
  DO i_2 = 0, u_2(i)
    ...
    DO i_n = 0, u_n(i)
      V[f_1(i)]...[f_m(i)]

where i = (i_1, ..., i_n) is the iteration vector.

• For an affine access, the multidimensional reference is equivalent to the linearized one:

  V[f_1(i)]...[f_m(i)] = V[c_0 + i_1 c_1 + ... + i_n c_n]

64. Problem Formulation (II)

• In our model, only three variations of a loop index between two consecutive iterations are allowed:
  1. i_j does not change       ⇒ δ^k_j = 0
  2. i_j is increased by one   ⇒ δ^k_j = 1
  3. i_j is reset to 0         ⇒ δ^k_j = -i^k_j

  δ^k = i^{k+1} - i^k = [i^{k+1}_1 - i^k_1, ..., i^{k+1}_n - i^k_n]^T

65. Problem Formulation (and III)

• The stride between two consecutive accesses is a linear combination of the coefficients of the loop indices:

  σ_k = V(i^{k+1}) - V(i^k)
      = (c_0 + c_1 i^{k+1}_1 + ... + c_n i^{k+1}_n) - (c_0 + c_1 i^k_1 + ... + c_n i^k_n)
      = c_1 δ^k_1 + ... + c_n δ^k_n
      = c · δ^k

66. Outline

4. Trace-Based Affine Reconstruction of Code
• Problem Formulation
• Problem Resolution with CHOLESKY
• Extensions for Supporting Nonlinear Traces
• Experimental Evaluation

67. Solution Space

• 2n + 1 candidates

68. Problem Resolution (I)

• A solution S_k = {c, I_k, U, w} for the subtrace {a_1, ..., a_k}, where n is the number of loops:
  • Coefficients of the loop indices: vector c ∈ Z^n
  • Iteration indices: matrix I_k = [i^1 | ... | i^k] ∈ Z^{n×k}
  • Bounds: matrix U ∈ Z^{n×n} and vector w ∈ Z^n such that U i + w ≥ 0, encoding bounds of the form u_j(i) = w_j + u_{j,1} i_1 + ... + u_{j,j-1} i_{j-1}:

        [ -1       0        0     ...  0 ]        [ w_1 ]
        [ u_{2,1} -1        0     ...  0 ]        [ w_2 ]
    U = [ u_{3,1}  u_{3,2} -1     ...  0 ]    w = [ ... ]
        [ ...      ...      ...   ...   ]        [ w_n ]
        [ u_{n,1}  u_{n,2}  u_{n,3} ... -1 ]

69. Problem Resolution (II)

• To be a valid solution:
  • Each consecutive pair of indices must be sequential and stay within the loop bounds:

    U I_k + w 1_{1×k} ≥ 0_{n×k}

  • The observed strides must be coherent with the reconstructed ones:

    c δ^k = c (i^{k+1} - i^k) = σ_k

70. Problem Resolution: CHOLESKY (I)

 1 #define N 32
 2 double p[N], A[N][N], x;
 3 int i, j, k;
 4
 5 #pragma scop
 6 for (i = 0; i < N; ++i) {
 7   x = A[i][i];
 8   for (j = 0; j <= i - 1; ++j)
 9     x = x - A[i][j] * A[i][j];
10   p[i] = 1.0 / sqrt(x);
11   for (j = i + 1; j < N; ++j) {
12     x = A[i][j];
13     for (k = 0; k <= i - 1; ++k)
14       x = x - A[j][k] * A[i][k];
15     A[j][i] = x * p[i];
16   }
17 }
18 #pragma endscop

Trace: 1 0x1e2d140, 2 0x1e2d140, ..., 30 0x1e2d140, 31 0x1e2d240, 32 0x1e2d248, 33 0x1e2d240, 34 0x1e2d248, ..., 88 0x1e2d248, 89 0x1e2d340, 90 0x1e2d348, 91 0x1e2d350, 92 0x1e2d340, 93 0x1e2d348, 94 0x1e2d350, ...

• Solution after the first two accesses:
  σ_1 = a_2 - a_1 = 0
  c = [0]
  I_2 = [i^1 | i^2] = [0, 1]
  U = [-1]
  w = [1]^T

71. Problem Resolution: Building the System

• Calculate the observed stride: σ_k = a_{k+1} - a_k
• Build a diophantine linear equation system:

  c (i^{k+1} - i^k) = c δ^k = σ_k

• One or more solutions: explore them independently
• No solution under the current boundaries:
  • Increase dimensionality by adding a new loop
  • Modify the boundaries
  • Discard this branch

72. Problem Resolution: Solving the System

  c (i^{k+1} - i^k) = c δ^k = σ_k

• As indices must be sequential, there are at most n solutions:

  { i^{k+1} = +(l, i^k), 0 < l ≤ n }

  where +(l, i) increments index i_l by one and resets the inner indices i_{l+1}, ..., i_n to 0
• We only need to calculate the predicted stride σ̂_{l,k} = c δ^k_l for each of the n valid indices and compare it with the observed stride

73. Problem Resolution: CHOLESKY (II)

• Solution for the first two accesses:
  σ_1 = a_2 - a_1 = 0
  c = [0], I_2 = [i^1 | i^2] = [0, 1], U = [-1], w = [1]^T
• Processing the third access:
  a_3 = 0x1e2d140
  σ_2 = a_3 - a_2 = 0x1e2d140 - 0x1e2d140 = 0
  σ̂_{1,2} = c δ^2_1 = [0][1]^T = 0 = σ_2
  I_3 = [I_2 | +(1, i^2)] = [0 1 2]
• The reconstruction continues until
  σ_30 = a_31 - a_30 = 0x1e2d240 - 0x1e2d140 = 0x100 = 256

74. Problem Resolution: Increasing the Solution Dimensionality (I)

• Add a new loop at position p: i^{k+1} = f(p, i^k)

            [ I_k(1:p, :)   |         ]
  I_{k+1} = [ 0_{1×k}       | i^{k+1} ]
            [ I_k(p+1:n, :) |         ]

• In CHOLESKY:

  I = [ 0_{1×30}  1 ] = [ 0 ...  0  1 ]
      [ I(1:30)   0 ]   [ 0 ... 29  0 ]

75. Problem Resolution: Increasing the Solution Dimensionality (and II)

• The coefficient of the new loop can be derived from the observed stride. Since the new index goes from 0 to 1 while the inner indices reset, i^{k+1} - i^k = [0, ..., 0, 1, -i^k_{p+1}, ..., -i^k_n]^T and:

  c' (i^{k+1} - i^k) = σ_k  ⇒  c'_p = σ_k + Σ_{r=p+1..n} c_r i^k_r

  c' = [c_1, ..., c_p, c'_p, c_{p+1}, ..., c_n]

• In CHOLESKY:

  c'_1 = σ_30 + c_1 i^30_1 = 256 + 0 · 29 = 256  ⇒  c' = [256, 0]^T

76. Problem Resolution: Updating the Loop Bounds

• Loop indices must be sequential and stay within the loop bounds:

  U' I_{k+1} + w' 1_{1×(k+1)} ≥ 0_{n×(k+1)}

• Inconsistent system: the branch is discarded
• System with solutions: the system is overdetermined; the update of w' is O(1), while the update of U' is O(n^2)

77. Problem Resolution: Accelerating the Traversal

• In the general case, the complexity of exploring the solution space for a trace with A addresses generated by n loops is O(n^A)

  γ^k = U i^k + w

• Each element of γ^k indicates how many more iterations of each index are left before it resets under the current bounds
• The most plausible value for the next index is i^{k+1} = +(l, i^k), where l is the position of the innermost positive element of γ^k
• Several accesses can be recognized in a single block

78. Problem Resolution: CHOLESKY (III)

  I = [ 0_{1×30}  1 ] = [ 0 ...  0  1 ]
      [ I(1:30)   0 ]   [ 0 ... 29  0 ]

  U = [ -1   0          ] = [ -1  0 ]      w = [ 1 | w(1:1) ]^T = [ 1, 29 ]^T
      [  0   U(1:1,1:1) ]   [  0 -1 ]

• Checking U I + w 1_{1×31} ≥ 0_{2×31}:

  [ -1  0 ] [ 0 ...  0  1 ] + [  1 ] [ 1 ... 1 ]
  [  0 -1 ] [ 0 ... 29  0 ]   [ 29 ]

  = [ 0   0 ...   0  -1 ] + [  1 ...  1 ] = [  1  1 ...  1   0 ] ≥ 0_{2×31}
    [ 0  -1 ... -29   0 ]   [ 29 ... 29 ]   [ 29 28 ...  0  29 ]

79. Problem Resolution: CHOLESKY (and IV)

 1 #define N 32
 2 double p[N], A[N][N], x;
 3 int i, j, k;
 4
 5 #pragma scop
 6 for (i = 0; i < N; ++i) {
 7   x = A[i][i];
 8   for (j = 0; j <= i - 1; ++j)
 9     x = x - A[i][j] * A[i][j];
10   p[i] = 1.0 / sqrt(x);
11   for (j = i + 1; j < N; ++j) {
12     x = A[i][j];
13     for (k = 0; k <= i - 1; ++k)
14       x = x - A[j][k] * A[i][k];
15     A[j][i] = x * p[i];
16   }
17 }
18 #pragma endscop

• The reconstructed nest matches the access A[i][k] (line 14), linearized as A[256*i + 8*k]:

  c = [ 256  0  8 ]

      [ -1  0  0 ]
  U = [ -1 -1  0 ]      w = [ 29  29  0 ]^T
      [  1  0 -1 ]

80. Outline

4. Trace-Based Affine Reconstruction of Code
• Problem Formulation
• Problem Resolution with CHOLESKY
• Extensions for Supporting Nonlinear Traces
• Experimental Evaluation

81. Supporting Nonlinearity: Input Noise

• Some trace files mainly contain references issued by a single access, but mixed with unrelated ones (e.g., nearly affine or unlabeled traces)
• The exploration of the solution space can be modified to discard up to max observed accesses, checking whether

  σ̂_k = Σ_{r=0..e} σ_{k+r}, 0 < e ≤ max

  i.e., whether the predicted stride matches the accumulated stride after skipping e noisy accesses
• A tolerance parameter controls when a branch is discarded
