Improved Static Analysis to Generate More Efficient Code for Execution of Loop Nests in GPUs


  1. Improved Static Analysis to Generate More Efficient Code for Execution of Loop Nests in GPUs. J. Nelson Amaral, Department of Computing Science

  2. Nathan Michael Z Jacky Michael N. Braedy Sarah Rebecca Eldon Dylan Thomas Ben Hao Ben Hao Tristan Brad Simons Artem Chikin Lyle Roleman Artem H.

  3. September 2014

  4. Artem Chikin

  5. September 2014 July 2018

  6. Taylor Lloyd Artem Chikin

  7. November 2014

  8. http://www.extremetech.com/computing/194283-ibm-and-nvidia-will-build-two-ultra-efficient-150-petaflop-supercomputers-for-the-doe http://www.cnet.com/news/ibm-nvidia-land-325-million-supercomputer-deal/ http://www.anandtech.com/show/8727/nvidia-ibm-supercomputers http://www.zdnet.com/article/ibm-nvidia-tapped-to-build-worlds-fastest-supercomputers/

  9. [Chart: 200000 Linpack TFLOP/S vs 88000 Linpack TFLOP/S; 15000 kW vs 9000 kW] "Summit … is expected to deliver more than five times the system-level application performance of Titan while consuming only 10% more power." http://info.nvidianews.com/rs/nvidia/images/Coral%20White%20Paper%20Final-3-2.pdf http://www.top500.org/list/2014/11/

  10. Technology? Nvidia Volta GPU IBM Power9 Nvidia NVlink

  11. Programming Model? OpenMP OpenACC MPI OpenCL CUDA

  12. Compiler Technology? LLVM IBM XL Compiler

  13. May 2015

  14. Ettore Tiotto, J. Nelson Amaral, Artem Chikin, Taylor Lloyd. Science Internship Program, IBM Canada Software Laboratory, Markham, ON

  15. Programming Model ➔ OpenMP 3.x OpenMP 4.x

  16. Three versions of vecAdd. Sequential on one CPU core:

      void vecAdd(double *a, double *b, double *c, int n) {
          for (int i = 0; i < n; i++) {
              c[i] = a[i] + b[i];
          }
      }

      Multicore CPU:

      void vecAdd(double *a, double *b, double *c, int n) {
          #pragma omp parallel for
          for (int i = 0; i < n; i++) {
              c[i] = a[i] + b[i];
          }
      }

      Offloaded to a target device (one team):

      void vecAdd(double *a, double *b, double *c, int n) {
          #pragma omp target map(to: a[:n], b[:n]) \
                             map(from: c[:n])
          for (int i = 0; i < n; i++) {
              c[i] = a[i] + b[i];
          }
      }

  17. Target region with a parallel loop (one team):

      void vecAdd(double *a, double *b, double *c, int n) {
          #pragma omp target map(to: a[:n], b[:n]) \
                             map(from: c[:n])
          #pragma omp parallel for
          for (int i = 0; i < n; i++) {
              c[i] = a[i] + b[i];
          }
      }

      With teams but without distribute:

      void vecAdd(double *a, double *b, double *c, int n) {
          #pragma omp target map(to: a[:n], b[:n]) \
                             map(from: c[:n])
          #pragma omp teams parallel for
          for (int i = 0; i < n; i++) {
              c[i] = a[i] + b[i];
          }
      }

      These two teams are executing the same computation on the same memory locations without synchronization. Likely wrong result.

  18. void vecAdd(double *a, double *b, double *c, int n) {
          #pragma omp target map(to: a[:n], b[:n]) \
                             map(from: c[:n])
          #pragma omp teams distribute parallel for
          for (int i = 0; i < n; i++) {
              c[i] = a[i] + b[i];
          }
      }

      The iterations of the loop are distributed to the two teams and the result is correct.

  19. Memory Coalescing

  20. A warp of 32 threads issues 32 accesses in one cycle; if they fall within one 128-byte L1 cache line, the accesses are coalesced into a single transaction. Accesses must be aligned to the boundary of a cache line, but the threads of a warp may access addresses in any order within the cache line.

  21. With 32-byte L2 cache lines, the same warp of 32 threads requires four transactions (four 32-byte cache lines).

  22. [Diagrams] A warp of 32 threads spanning multiple 32-byte cache lines, in two ways: an inter-thread stride (consecutive threads access addresses far apart) and an intra-thread stride (each thread's own successive accesses are strided).

  23. October 2016

  24. Karim Ali, Taylor Lloyd. Computing Science Centre (CSC building)

  25. Taint Analysis

  26. void main() {
          long int a = readCreditCardNumber();   // a is tainted
          long int b = 0;
          b = foo(a);
          print(b);                              // is b tainted?
      }

      long int foo(int p) {
          if (p != 0)
              print(p);
      }

  27. Taylor Lloyd. Arithmetic Control Form (ACF) Analysis

  28. int readBounded(int* a) {
          int tx = threadIdx.x;
          if (tx > 256)
              tx = 256;
          int *addr = a + tx;
          return *addr;
      }

      ACF_T(*addr) = (T > 256)*([a] + 4*256) + (T <= 256)*([a] + 4*T)
      ACF_0(*addr) = (0 > 256)*([a] + 4*256) + (0 <= 256)*([a] + 4*0)
      ACF_0(*addr) = [a]
      ACF_1(*addr) - ACF_0(*addr) = 4
      ACF_1(*addr) = [a] + 4

  29. June/July 2017

  30. Taylor Lloyd Artem Chikin Dhruv Jain Sanket Kedia

  31. August 2017

  32. Artem Chikin IBM Canada Sofuware Laboratory Computjng Science Centre Markham, ON

  33. Iteration Point Difference Analysis (IPDA). ACF can be used for any pair of expressions. ACF can be based on the induction variables in a loop nest. ACF is useful when applied to address expressions in a loop nest. ACF can make a Data Dependence Graph more precise. The compiler can transform code based on IPDA.

  34. IPDA Analysis in an example

  35. conv2D: two-dimensional convolution

  36. for (CIVI = 0; CIVI < NI - 2; ++CIVI) {
          i = CIVI + 1;
          for (CIVJ = 0; CIVJ < NJ - 2; ++CIVJ) {
              B[i*NJ + CIVJ + 1] = … ;
          }
      }

      Address of the store: B + 8*((CIVI+1)*NJ + CIVJ + 1), where B is the base address of the array and the data type size is assumed to be 8 bytes. IPDA propagates symbolic expressions from the dominant definition to each use.

  37. for (CIVI = 0; CIVI < NI - 2; ++CIVI) {
          i = CIVI + 1;
          for (CIVJ = 0; CIVJ < NJ - 2; ++CIVJ) {
              B[i*NJ + CIVJ + 1] = … ;
          }
      }

      Address of the store: B + 8*((CIVI+1)*NJ + CIVJ + 1). Iteration Point Algebraic Difference between iteration points (CIVI', CIVJ') and (CIVI, CIVJ):

      B + 8*((CIVI'+1)*NJ + CIVJ' + 1) - (B + 8*((CIVI+1)*NJ + CIVJ + 1))
      = 8*(CIVI'*NJ + CIVJ') - 8*(CIVI*NJ + CIVJ)
      = 8*((CIVI'-CIVI)*NJ + (CIVJ'-CIVJ))
      = 8*(∆CIVI*NJ + ∆CIVJ)  = 0 ?

  38. For the same loop nest, when can 8*(∆CIVI*NJ + ∆CIVJ) = 0 for two distinct iteration points?

      ∆CIVI    ∆CIVJ
      = 0      ≠ 0
      ≠ 0      = 0
      ≠ 0      ≠ 0

      Since 0 <= CIVJ < NJ - 2, we have |∆CIVJ| < NJ, so in all three cases 8*(∆CIVI*NJ + ∆CIVJ) ≠ 0: the difference is zero only at the same iteration point.

  39. Loop Collapsing and Loop Interchange

  40. for (i = 0; i < N; ++i) {
          for (j = 0; j < N; ++j) {
              A[i+j*N] = A[i+j*N] * A[i+j*N];
          }
      }

      Loop Collapse:

      for (c = 0; c < N*N; ++c) {
          i = c / N;
          j = c % N;
          A[i+j*N] = A[i+j*N] * A[i+j*N];
      }

  41. Loop Interchange:

      for (i = 0; i < N; ++i) {
          for (j = 0; j < N; ++j) {
              A[i+j*N] = A[i+j*N] * A[i+j*N];
          }
      }

      becomes

      for (j = 0; j < N; ++j) {
          for (i = 0; i < N; ++i) {
              A[i+j*N] = A[i+j*N] * A[i+j*N];
          }
      }

      Loop Collapse:

      for (c = 0; c < N*N; ++c) {
          i = c / N;
          j = c % N;
          A[i+j*N] = A[i+j*N] * A[i+j*N];
      }

  42. A detailed example of how IPDA Analysis helps

  43. 557.pcsp

  44. 557.pcsp: SP = Pentadiagonal Solver. It is an OpenMP program written in C.

  45. A 4-dimensional loop nest; outer-dimension range: 0, 1, 2.

      for (k = 1; k <= gp2-2; k++) {
          for (j = 0; j <= gp1-3; j++) {
              j1 = j + 1;
              j2 = j + 2;
              for (i = 1; i <= gp0-2; i++) {
                  ...
                  for (m = 0; m < 3; m++) { ... }
                  ...
              }
          }
      }

  46. i is innermost loop and last coordinate

  47. Elements from three rows (j, j1, j2) are accessed: there is a data dependence on loop j ⇒ the j loop is sequential.

  48. The loop nest is not perfect.

  49. Expression Re-materialization

  50. We will focus on m=3

  51. Sequential execution:

      for (k = 1; k <= gp2-2; k++) {
          for (j = 0; j <= gp1-3; j++) {
              for (i = 1; i <= gp0-2; i++) {
                  ...
                  lhsY[3][k][j][i] = fac1 * lhsY[3][k][j][i];
                  ...
              }
          }
      }

  52. Parallelizing loop k. Intra-thread access pattern for lhsY[3][k][j][i].

  53. Parallelizing loop k across a warp of 32 threads. Inter-thread access pattern? None of the accesses to lhsY[3][k][j][i] are coalesced.

  54. Interchange loops j and i

  55. Parallelizing loop k across a warp of 32 threads. Inter-thread access pattern? None of the accesses to lhsY[3][k][j][i] are coalesced.

  56. Collapse loops k and i

  57. Parallelizing loop c across a warp of 32 threads. Inter-thread access pattern? Perfect coalescing of the accesses to lhsY[3][k][j][i].

  58. Execution time on Nvidia Pascal (P100): from 41.13 ms down to 1.4 ms after the IPDA-enabled transformations; the kernel is 29.4 times faster. On Nvidia Volta (V100): 16.4 times faster. Benchmark speedups: Pascal (P100): 1.53x; Volta (V100): 1.26x. *The benchmark was not passing verification.

  59. After bug fixes, with the benchmark verifying: the execution-time chart adds bars at 88.5 ms and 111 ms alongside the 41.13 ms and 1.4 ms bars. Kernel on Nvidia Pascal (P100): 29.4 times faster after the IPDA-enabled transformations; on Nvidia Volta (V100): 16.4 times faster. Benchmark speedups: Pascal (P100): 3.33x (previously 1.53x); Volta (V100): 2.3x (previously 1.26x).

  60. Ettore Tiotto, J. Nelson Amaral, Artem Chikin. October 2017

  61. March 2018

  62. Opportunities in three other Benchmarks
