Improved Static Analysis to Generate More Efficient Code for Execution of Loop Nests in GPUs. J. Nelson Amaral, Department of Computing Science
Nathan, Michael Z, Jacky, Michael N., Braedy, Sarah, Rebecca, Eldon, Dylan, Thomas, Ben Hao, Tristan, Brad Simons, Artem Chikin, Lyle Roleman, Artem H.
September 2014
Artem Chikin
September 2014 – July 2018
Taylor Lloyd Artem Chikin
November 2014
http://www.extremetech.com/computing/194283-ibm-and-nvidia-will-build-two-ultra-efficient-150-petaflop-supercomputers-for-the-doe
http://www.cnet.com/news/ibm-nvidia-land-325-million-supercomputer-deal/
http://www.anandtech.com/show/8727/nvidia-ibm-supercomputers
http://www.zdnet.com/article/ibm-nvidia-tapped-to-build-worlds-fastest-supercomputers/
[Chart: projected Summit vs. Titan — 200,000 vs. 88,000 Linpack TFLOP/s; 15,000 kW vs. 9,000 kW]
"Summit … is expected to deliver more than five times the system-level application performance of Titan while consuming only 10% more power."
http://info.nvidianews.com/rs/nvidia/images/Coral%20White%20Paper%20Final-3-2.pdf
http://www.top500.org/list/2014/11/
Technology? Nvidia Volta GPU, IBM POWER9, Nvidia NVLink
Programming Model? OpenMP, OpenACC, MPI, OpenCL, CUDA
Compiler Technology? LLVM, IBM XL Compiler
May 2015
Ettore Tiotto, J. Nelson Amaral, Artem Chikin, Taylor Lloyd. Science Internship Program, IBM Canada Software Laboratory, Markham, ON
Programming Model: OpenMP 3.x ➔ OpenMP 4.x
CPU core (sequential):

  void vecAdd(double *a, double *b, double *c, int n) {
    for (int i = 0; i < n; i++) {
      c[i] = a[i] + b[i];
    }
  }

Multicore CPU:

  void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
      c[i] = a[i] + b[i];
    }
  }

CPU + TARGET device (separate memories), one team:

  void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp target map(to: a[:n], b[:n]) \
                       map(from: c[:n])
    for (int i = 0; i < n; i++) {
      c[i] = a[i] + b[i];
    }
  }

[Diagrams: a CPU core with its cache; two cores with caches; CPU and TARGET device, each with its own memory]
CPU + TARGET device, one team running the parallel loop:

  void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp target map(to: a[:n], b[:n]) \
                       map(from: c[:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
      c[i] = a[i] + b[i];
    }
  }

CPU + TARGET device, two teams:

  void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp target map(to: a[:n], b[:n]) \
                       map(from: c[:n])
    #pragma omp teams parallel for
    for (int i = 0; i < n; i++) {
      c[i] = a[i] + b[i];
    }
  }

These two teams execute the same computation on the same memory locations without synchronization, so the result is likely wrong.
CPU + TARGET device, two teams with distribute:

  void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp target map(to: a[:n], b[:n]) \
                       map(from: c[:n])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < n; i++) {
      c[i] = a[i] + b[i];
    }
  }

The iterations of the loop are distributed to the two teams and the result is correct.
Memory Coalescing
A warp of 32 threads issues its 32 accesses in one cycle; when coalesced, they become a single transaction against one 128-byte L1 cache line. Accesses must be aligned to the boundary of a cache line, but the threads of a warp may access addresses in any order within the cache line.
A warp of 32 threads whose accesses span four 32-byte L2 cache lines requires four transactions.
[Diagrams: a warp of 32 threads over 32-byte cache lines, contrasting an inter-thread stride with an intra-thread stride]
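To make the two access patterns concrete, here is a minimal CUDA sketch; the kernel names and the float element type are illustrative assumptions, not from the slides:

  __global__ void unitStride(float *a) {
    // Thread t touches a[t]: the warp's 32 consecutive 4-byte elements fall
    // into one 128-byte L1 line, so its 32 accesses coalesce into a single
    // transaction.
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    a[t] += 1.0f;
  }

  __global__ void interThreadStride(float *a, int stride) {
    // Thread t touches a[t*stride]: once the stride reaches 32 elements,
    // every thread in the warp lands in a different cache line and the
    // warp needs one transaction per thread.
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    a[(size_t)t * stride] += 1.0f;
  }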
October 2016
Karim Ali, Taylor Lloyd. Computing Science Centre (CSC building)
Taint Analysis
a is tainted. Is b tainted?

  void main() {
    long int a = readCreditCardNumber();   // a is tainted
    long int b = 0;
    b = foo(a);
    print(b);                              // is b tainted?
  }

  long int foo(int p) {
    if (p != 0)
      print(p);
  }
Taylor Lloyd: Arithmetic Control Form (ACF) Analysis
  int readBounded(int* a) {
    int tx = threadIdx.x;
    if (tx > 256)
      tx = 256;
    int *addr = a + tx;
    return *addr;
  }

[Control-flow graph: the two reaching definitions of tx (tx = threadIdx.x and tx = 256) merge before *addr = a + tx and return *addr]

ACF_T(*addr) = (T > 256) * ([a] + 4*256) + (T <= 256) * ([a] + 4*T)
ACF_0(*addr) = (0 > 256) * ([a] + 4*256) + (0 <= 256) * ([a] + 4*0) = [a]
ACF_1(*addr) - ACF_0(*addr) = 4, hence ACF_1(*addr) = [a] + 4
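As a sanity check on the arithmetic above, here is a minimal host-side C sketch (the function name acf and the 0x1000 base address are assumptions for illustration) that substitutes concrete thread IDs into the guarded ACF expression and recovers the inter-thread stride of 4 bytes:

  #include <stdio.h>
  #include <stdint.h>

  /* ACF_T(*addr) = (T > 256)*([a] + 4*256) + (T <= 256)*([a] + 4*T) */
  static uintptr_t acf(uintptr_t a, long T) {
    return (uintptr_t)(T > 256) * (a + 4 * 256)
         + (uintptr_t)(T <= 256) * (a + 4 * (uintptr_t)T);
  }

  int main(void) {
    uintptr_t a = 0x1000;                       /* hypothetical base of a[] */
    printf("ACF_0  = %#lx\n", (unsigned long)acf(a, 0));      /* prints [a] */
    printf("stride = %ld\n", (long)(acf(a, 1) - acf(a, 0)));  /* prints 4   */
    return 0;
  }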
June/July 2017
Taylor Lloyd Artem Chikin Dhruv Jain Sanket Kedia
August 2017
Artem Chikin. IBM Canada Software Laboratory, Markham, ON / Computing Science Centre
Iteration Point Difference Analysis (IPDA)
ACF can be used for any pair of expressions.
ACF can be based on the induction variables in a loop nest.
ACF is useful when applied to address expressions in a loop nest.
ACF can make a Data Dependence Graph more precise.
The compiler can transform code based on IPDA.
IPDA Analysis in an example
conv2D: Two-dimensional convolution
  for (CIVI = 0; CIVI < NI - 2; ++CIVI) {
    i = CIVI + 1;
    for (CIVJ = 0; CIVJ < NJ - 2; ++CIVJ) {
      B[i*NJ + CIVJ + 1] = … ;
    }
  }

Store address: B + 8*((CIVI+1)*NJ + CIVJ + 1)
(B is the base address of array B; the data type size is assumed to be 8 bytes.)
IPDA propagates symbolic expressions from the dominating definition to each use.
  for (CIVI = 0; CIVI < NI - 2; ++CIVI) {
    i = CIVI + 1;
    for (CIVJ = 0; CIVJ < NJ - 2; ++CIVJ) {
      B[i*NJ + CIVJ + 1] = … ;
    }
  }

Store address: B + 8*((CIVI+1)*NJ + CIVJ + 1)

Iteration Point Algebraic Difference:
  B + 8*((CIVI'+1)*NJ + CIVJ' + 1) - (B + 8*((CIVI+1)*NJ + CIVJ + 1))
  = 8*(CIVI'*NJ + CIVJ') - 8*(CIVI*NJ + CIVJ)
  = 8*((CIVI'-CIVI)*NJ + (CIVJ'-CIVJ))
  = 8*(∆CIVI*NJ + ∆CIVJ)     Can this be 0?
  for (CIVI = 0; CIVI < NI - 2; ++CIVI) {
    i = CIVI + 1;
    for (CIVJ = 0; CIVJ < NJ - 2; ++CIVJ) {
      B[i*NJ + CIVJ + 1] = … ;
    }
  }

Iteration Point Algebraic Difference: when is 8*(∆CIVI*NJ + ∆CIVJ) = 0?

  ∆CIVI    ∆CIVJ
   = 0      ≠ 0    difference ≠ 0
   ≠ 0      = 0    difference ≠ 0
   ≠ 0      ≠ 0    difference ≠ 0 (|∆CIVJ| < NJ, so the terms cannot cancel)

Only ∆CIVI = ∆CIVJ = 0 makes the difference zero: distinct iteration points never write the same element of B.
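The table's conclusion can also be checked exhaustively for small bounds. A minimal C sketch; the concrete NI and NJ values are arbitrary assumptions:

  #include <stdio.h>

  int main(void) {
    int NI = 6, NJ = 7;   /* small, arbitrary sizes for the check */
    int clashes = 0;
    /* Compare the store offsets 8*((CIVI+1)*NJ + CIVJ + 1) of every pair of
       distinct iteration points; IPDA predicts that none coincide. */
    for (int i = 0; i < NI - 2; ++i)
      for (int j = 0; j < NJ - 2; ++j)
        for (int i2 = 0; i2 < NI - 2; ++i2)
          for (int j2 = 0; j2 < NJ - 2; ++j2)
            if ((i != i2 || j != j2) &&
                8L * ((i2 - i) * NJ + (j2 - j)) == 0)
              ++clashes;
    printf("clashing iteration pairs: %d\n", clashes);  /* prints 0 */
    return 0;
  }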
Loop Collapsing and Loop Interchange
  for (i = 0; i < N; ++i) {
    for (j = 0; j < N; ++j) {
      A[i+j*N] = A[i+j*N] * A[i+j*N];
    }
  }

Loop Collapse:

  for (c = 0; c < N*N; ++c) {
    i = c / N;
    j = c % N;
    A[i+j*N] = A[i+j*N] * A[i+j*N];
  }

[Diagram: iteration-space traversal over (i, j) before, and over c after]
Loop Interchange:

  for (i = 0; i < N; ++i) {
    for (j = 0; j < N; ++j) {
      A[i+j*N] = A[i+j*N] * A[i+j*N];
    }
  }

becomes

  for (j = 0; j < N; ++j) {
    for (i = 0; i < N; ++i) {
      A[i+j*N] = A[i+j*N] * A[i+j*N];
    }
  }

Loop Collapse (as before):

  for (c = 0; c < N*N; ++c) {
    i = c / N;
    j = c % N;
    A[i+j*N] = A[i+j*N] * A[i+j*N];
  }
A detailed example of how IPDA Analysis helps
557.pcsp
557.pcsp: SP = Pentadiagonal Solver. It is an OpenMP program written in C.
A 4-dimensional loop nest; the outer array dimension (the m loop) ranges over 0, 1, 2:

  for (k = 1; k <= gp2-2; k++) {
    for (j = 0; j <= gp1-3; j++) {
      j1 = j + 1;
      j2 = j + 2;
      for (i = 1; i <= gp0-2; i++) {
        ・・・
        for (m = 0; m < 3; m++) { }
        ・・・
      }
    }
  }
i is the innermost loop and the last array coordinate
Elements from three rows (j, j1, j2) are accessed: data dependence on loop j ⇒ the j loop is sequential
The loop nest is not perfect: j1 and j2 are defined between the j and i loop headers.
Expression Re-materialization
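On the 557.pcsp loop skeleton shown earlier, re-materialization recomputes j1 = j + 1 and j2 = j + 2 at their uses, so nothing remains between the loop headers. A rough sketch; the bodies elided in the slides stay elided:

  /* Before: j1 and j2 are defined between the j and i headers,
     so the nest is imperfect. */
  for (k = 1; k <= gp2-2; k++) {
    for (j = 0; j <= gp1-3; j++) {
      j1 = j + 1;
      j2 = j + 2;
      for (i = 1; i <= gp0-2; i++) {
        /* ... uses of j1 and j2 ... */
      }
    }
  }

  /* After re-materialization: each use recomputes j + 1 and j + 2 directly,
     making the nest perfect and thus legal to collapse or interchange. */
  for (k = 1; k <= gp2-2; k++) {
    for (j = 0; j <= gp1-3; j++) {
      for (i = 1; i <= gp0-2; i++) {
        /* ... uses of (j + 1) and (j + 2) ... */
      }
    }
  }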
We will focus on m=3
Sequential execution:

  for (k = 1; k <= gp2-2; k++) {
    for (j = 0; j <= gp1-3; j++) {
      for (i = 1; i <= gp0-2; i++) {
        ・・・
        lhsY[3][k][j][i] = fac1 * lhsY[3][k][j][i];
        ・・・
      }
    }
  }

[Diagram: iteration space with axes k, i, j]
Parallelizing loop k. [Diagram: intra-thread access pattern for lhsY[3][k][j][i] over axes k, i, j]
Parallelizing loop k. [Diagram: inter-thread access pattern of a warp of 32 threads over axes k, i, j] None of the accesses to lhsY[3][k][j][i] are coalesced.
Interchange loops j and i
After interchanging loops j and i, still parallelizing loop k. [Diagram: inter-thread access pattern of a warp of 32 threads] None of the accesses to lhsY[3][k][j][i] are coalesced.
Collapse loops k and i
Parallelizing loop c. [Diagram: inter-thread access pattern of a warp of 32 threads] Perfect coalescing of the accesses to lhsY[3][k][j][i].
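Putting the pieces together: the slides show only the access pattern, so the OpenMP pragma and the index-recovery arithmetic below are a reconstruction, not the benchmark's actual code. Collapsing k and i into one parallel loop c makes consecutive threads take consecutive values of i, the last coordinate of lhsY, which is what yields the coalesced pattern; the dependence-carrying j loop stays sequential inside each thread.

  #pragma omp target teams distribute parallel for
  for (c = 0; c < (gp2-2) * (gp0-2); ++c) {
    int k = c / (gp0 - 2) + 1;   /* recovered outer index                   */
    int i = c % (gp0 - 2) + 1;   /* consecutive c gives consecutive i, the
                                    last coordinate: warp accesses coalesce */
    for (int j = 0; j <= gp1-3; j++) {  /* j carries a dependence: stays
                                           sequential within each thread    */
      lhsY[3][k][j][i] = fac1 * lhsY[3][k][j][i];
    }
  }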
Execution time on an Nvidia Pascal (P100): from 41.13 ms to 1.4 ms after the IPDA-enabled transformations; the kernel is 29.4 times faster. On an Nvidia Volta (V100): 16.4 times faster.
Benchmark speedups: Pascal (P100): 1.53x; Volta (V100): 1.26x.
*The benchmark was not verifying at this point.
Execution time on an Nvidia Pascal (P100): from 41.13 ms to 1.4 ms after the IPDA-enabled transformations; the kernel is 29.4 times faster. On an Nvidia Volta (V100): 16.4 times faster. [The chart also marks the values 88.5 and 111.]
Benchmark speedups: Pascal (P100): 3.33x (previously 1.53x); Volta (V100): 2.3x (previously 1.26x).
*After bug fixes, with the benchmark verifying.
Ettore Tiotto, J. Nelson Amaral, Artem Chikin. October 2017
March 2018
Opportunities in three other Benchmarks