Atelier Numérique OMP
Code Optimization: Vectorization
Bertrand Putigny
July 5, 2016

HPC Hardware Architecture Overview

Cluster:
[Figure: a cluster made up of many CMP nodes]

Increasing Cluster (Computing) Power

⋄ node performance:
  ◦ increase the number of cores
    ◮ memory system (cache hierarchy, prefetcher)
  ◦ increase core computing power:
    ◮ increase frequency (over since 2005: heat, power consumption)
    ◮ instruction-level parallelism (out-of-order execution, superscalar execution, ...)
    ◮ data parallelism
⋄ number of nodes:
  ◦ communication

Exploiting Such Hardware

⋄ node performance:
  ◦ increase the number of cores
    ◮ memory system (cache hierarchy, prefetcher)
  ◦ increase core computing power:
    ◮ increase frequency (over since 2005: heat, power consumption)
    ◮ instruction-level parallelism (out-of-order execution, superscalar execution, ...)
    ◮ data parallelism
⋄ number of nodes:
  ◦ communication

How each level is exploited:
⋄ number of nodes: MPI
⋄ number of cores: OpenMP
⋄ core computing power: compiler optimization

Outline

⋄ Introduction
⋄ Vectorization
  ◦ Vector Instruction
  ◦ Code Transformation and Optimization
⋄ Code Vectorization Tools
  ◦ Vector Advisor
  ◦ Usage
⋄ Conclusion

Vector Instruction

SIMD: Single Instruction Multiple Data
⋄ exploits data parallelism
⋄ operation on vectors
  ◦ arithmetic
  ◦ binary

      A0      A1      A2      A3
  +   B0      B1      B2      B3
  = A0+B0   A1+B1   A2+B2   A3+B3

SIMD Instruction Sets

⋄ SSE: 128 bits
  ◦ 2 double-precision reals
  ◦ 4 single-precision reals
⋄ AVX: 256 bits
  ◦ 4 double-precision reals
  ◦ 8 single-precision reals
⋄ coming up: AVX-512: 512 bits

SIMD is here to stay. Trends:
⋄ larger vectors
⋄ more instructions (FMA, gather, ...)

⇒ need to optimize code for SIMD

Using SIMD Instructions

⋄ automatic code vectorization (compiler)
⋄ hand vectorization (assembly, intrinsics)
  ◦ poor portability (depends on both the hardware and the compiler)
  ◦ hard to write
  ◦ hard to read
  ⇒ not a good option

Solution: understand the basics of compiler code vectorization:
⋄ understand why automatic code vectorization failed
⋄ help the compiler with high-level code transformations

Notation

Note:
⋄ the C-like code only illustrates the transformations
⋄ they are actually performed by the compiler on its intermediate representation (IR)

Automatic Code Vectorization

Code transformation: do the same thing "differently":
⋄ keep the same semantics
⋄ different code versions
⋄ can be done at several levels
  ◦ source code level (source-to-source compilers)
  ◦ intermediate representation (most of the time)
  ◦ instruction level

Code transformation examples:
⋄ instruction scheduling (optimize ILP, at assembly level)
⋄ scalar promotion (IR level)

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            A[i][j] = (1 / (double) i) * A[i][j];
        }
    }

⋄ loop tiling (cache access optimization, most of the time by hand)
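One reading of the scalar promotion example is that the loop-invariant factor is computed once per row and kept in a scalar temporary. The sketch below shows the loop before and after that transformation. The function names, the flat row-major array, and the use of `i + 1` instead of `i` (to avoid dividing by zero on the first row) are assumptions made for illustration; they are not taken from the slides.

```c
#include <stddef.h>

/* Before: 1.0 / (i + 1) does not depend on j, yet it is written
 * inside the inner loop. */
void scale_rows_naive(double *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            a[i * n + j] = (1.0 / (double)(i + 1)) * a[i * n + j];
        }
    }
}

/* After: the invariant factor is held in the scalar `inv`,
 * computed once per row. */
void scale_rows_promoted(double *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const double inv = 1.0 / (double)(i + 1);
        for (size_t j = 0; j < n; j++) {
            a[i * n + j] = inv * a[i * n + j];
        }
    }
}
```

Both functions compute the same values; an optimizing compiler will usually perform this rewrite on its own.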

Automatic Code Vectorization

Code transformation:
1. rely on loop unrolling
2. turn a set of (scalar) instructions into a single vector instruction

Original code:

    for (i = 0; i < SIZE; i++) {
        y[i] = x[i] + y[i];
    }

1. Unrolled loop:

    // peeling (if need be)
    for (i = 0; i < SIZE - SIZE % 4; i += 4) {
        y[i]   = x[i]   + y[i];
        y[i+1] = x[i+1] + y[i+1];
        y[i+2] = x[i+2] + y[i+2];
        y[i+3] = x[i+3] + y[i+3];
    }
    // remainder ...

2. Vectorized pseudo-code:

    for (i = 0; i < SIZE - SIZE % 4; i += 4) {
        y[i:i+3] = x[i:i+3] + y[i:i+3];
    }
    // remainder ...
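The unrolled form, including the remainder loop that the slide elides, can be sketched as runnable C. The function name and the fixed unroll factor of 4 are assumptions for illustration, and no peeling loop is included.

```c
#include <stddef.h>

/* y[i] += x[i], unrolled by 4 with a scalar remainder loop. */
void add_unrolled(double *y, const double *x, size_t size)
{
    const size_t main_end = size - size % 4; /* largest multiple of 4 <= size */
    size_t i;

    /* Main body: each group of four statements maps onto one
     * 4-wide vector addition. */
    for (i = 0; i < main_end; i += 4) {
        y[i]     = x[i]     + y[i];
        y[i + 1] = x[i + 1] + y[i + 1];
        y[i + 2] = x[i + 2] + y[i + 2];
        y[i + 3] = x[i + 3] + y[i + 3];
    }

    /* Remainder: at most three leftover scalar iterations. */
    for (; i < size; i++) {
        y[i] = x[i] + y[i];
    }
}
```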

Factors Affecting Code Vectorization: Trip Count

Scalar code (≈ 7 cycles):

    for (i = 0; i < 7; i++) {
        y[i] = x[i] + y[i];
    }

Vectorized (≈ 4 cycles):

    for (i = 0; i < 4; i += 4) {
        y[i:i+3] = x[i:i+3] + y[i:i+3];
    }
    y[4] = x[4] + y[4];
    y[5] = x[5] + y[5];
    y[6] = x[6] + y[6];

Vectorized with padding (≈ 2 cycles):

    for (i = 0; i < 8; i += 4) {
        y[i:i+3] = x[i:i+3] + y[i:i+3];
    }
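Padding amounts to rounding the array length up to the next multiple of the vector width, so that the loop has no scalar remainder. A minimal sketch follows; the helper names are hypothetical, and it assumes that processing the extra zero-filled elements is harmless, which holds for the addition above.

```c
#include <stddef.h>
#include <stdlib.h>

/* Round n up to the next multiple of width (width must be > 0). */
size_t padded_length(size_t n, size_t width)
{
    return (n + width - 1) / width * width;
}

/* Allocate n doubles plus zero-filled padding up to a multiple of width.
 * Returns NULL on allocation failure; the caller frees the result. */
double *alloc_padded(size_t n, size_t width)
{
    return calloc(padded_length(n, width), sizeof(double));
}
```

With a width of 4, an array of 7 elements is stored in 8 slots, so the loop can run two full vector iterations.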

Factors Affecting Code Vectorization: Dependencies

Loop-carried data dependencies:
⋄ cannot be vectorized:

    for (i = 1; i < SIZE; i++) {
        y[i] = y[i-1] - y[i];
    }

⋄ can be vectorized if vector length ≤ 4:

    for (i = 4; i < SIZE; i++) {
        y[i] = y[i-4] - y[i];
    }

[Figure: the vector iteration starting at i reads y[i-4..i-1] and y[i..i+3]; the one starting at i+4 reads y[i..i+3] and y[i+4..i+7]]

⇒ use OpenMP 4.0 pragma omp simd safelen(n)
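The second loop can be annotated as follows. This is a sketch with a hypothetical function name. `safelen(4)` tells the compiler that vectors of up to four iterations are safe, which matches the dependence distance of 4 in this loop. Compilers that are not invoked with OpenMP support typically ignore the pragma and emit the scalar loop, which gives the same result.

```c
#include <stddef.h>

/* y[i] depends on y[i - 4]: iterations less than 4 apart are independent. */
void diff_lag4(double *y, size_t size)
{
    #pragma omp simd safelen(4)
    for (size_t i = 4; i < size; i++) {
        y[i] = y[i - 4] - y[i];
    }
}
```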

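Factors Affecting Code Vectorization: Aliasing

Pointer aliasing:

    void foo(double *x, double *y, int n) {
        for (int i = 0; i < n; i++) {
            x[i] = y[i] - x[i];
        }
    }

    void bar() {
        foo(x, x+1, n-1);
    }

In bar(), x and y overlap, so the compiler cannot assume in general that the two pointers refer to distinct memory.

⇒ use the compiler's -fno-alias option (if your code does not use aliasing)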
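Besides a compiler-wide option, C99 offers the `restrict` qualifier to make the same promise for individual pointers. This is an alternative that the slides do not mention; the function name below is hypothetical. Calling it with overlapping arrays, as bar() does above, would be undefined behavior.

```c
#include <stddef.h>

/* restrict promises that x and y do not overlap, so the compiler
 * may vectorize the loop without a runtime overlap check. */
void sub_no_alias(double *restrict x, const double *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = y[i] - x[i];
    }
}
```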

Factors Affecting Code Vectorization: Data Layout

Poor memory access:

    struct coord {
        double x;
        double y;
    };

    for (i = 0; i < n; i++) {
        points[i].x += v.x;
        points[i].y += v.y;
    }

    MEM: p[0].x p[0].y p[1].x p[1].y p[2].x p[2].y p[3].x p[3].y ...
    REG: p[0].x p[1].x p[2].x p[3].x

Optimal memory access:

    struct coord {
        double *x;
        double *y;
    };

    for (i = 0; i < n; i++) {
        points.x[i] += v.x[0];
        points.y[i] += v.y[0];
    }

    MEM: p[0].x p[1].x p[2].x p[3].x p[4].x p[5].x p[6].x p[7].x ...
    REG: p[0].x p[1].x p[2].x p[3].x
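The two layouts can be written as complete C functions. This is a sketch: the type and function names are hypothetical, and the translation vector is passed as two scalars rather than as the slide's `v`.

```c
#include <stddef.h>

/* Array of structures: x and y are interleaved in memory, so loading
 * four consecutive x values requires strided access. */
struct point_aos { double x; double y; };

void translate_aos(struct point_aos *p, size_t n, double dx, double dy)
{
    for (size_t i = 0; i < n; i++) {
        p[i].x += dx;
        p[i].y += dy;
    }
}

/* Structure of arrays: all x values are contiguous, so four of them
 * fill a vector register with one unit-stride load. */
struct points_soa { double *x; double *y; };

void translate_soa(struct points_soa *p, size_t n, double dx, double dy)
{
    for (size_t i = 0; i < n; i++) {
        p->x[i] += dx;
        p->y[i] += dy;
    }
}
```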

Factors Affecting Code Vectorization: Control Flow

Conditionals:

    for (i = 0; i < n; i++) {
        if (x[i] > threshold) {
            x[i] = y[i];
        }
    }

           y[i]   y[i+1]  y[i+2]  y[i+3]
    mask:  true   true    false   true
           x[i]   x[i+1]  x[i+2]  x[i+3]

⇒ can be vectorized using masks

Function calls:

    for (i = 0; i < n; i++) {
        x[i] = f(y[i]);
    }

⇒ use OpenMP 4.0 pragma omp declare simd
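Both cases can be sketched in C. The body of `square_plus_one` stands in for the slide's `f` and is purely hypothetical, as are the other function names. The conditional is written as a select, which is one common way to express a masked update. Without OpenMP support enabled, compilers typically ignore the pragmas and the code stays scalar but correct.

```c
#include <stddef.h>

/* Ask the compiler to also generate a vector version of this function. */
#pragma omp declare simd
double square_plus_one(double v)
{
    return v * v + 1.0;
}

/* The call can now be vectorized along with the loop. */
void apply_f(double *x, const double *y, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        x[i] = square_plus_one(y[i]);
    }
}

/* Conditional update written as a select: every lane computes a value
 * and the comparison result acts as the mask. */
void masked_copy(double *x, const double *y, size_t n, double threshold)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = (x[i] > threshold) ? y[i] : x[i];
    }
}
```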

Factors Affecting Code Vectorization: Reduction

Sum:

    r = .0;
    for (i = 0; i < n; i++) {
        r += x[i];
    }

⇒ use the OpenMP reduction(+: r) clause, e.g. pragma omp simd reduction(+: r)
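The sum can be written with the clause attached to a `simd` construct. This is a sketch with a hypothetical function name. The clause lets each vector lane accumulate a partial sum, and the partial sums are combined at the end, so the order of the floating-point additions may differ from the scalar loop.

```c
#include <stddef.h>

/* Sum of x[0..n-1] with a vectorizable reduction. */
double sum_array(const double *x, size_t n)
{
    double r = 0.0;
    #pragma omp simd reduction(+:r)
    for (size_t i = 0; i < n; i++) {
        r += x[i];
    }
    return r;
}
```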

Performance Analysis

Static code analysis:
⋄ characterize loops
  ◦ vectorized
  ◦ scalar

Profiling:
⋄ program instrumentation
⋄ record performance metrics
  ◦ time spent in loop
  ◦ number of executions
  ◦ ...

Intel Vector Advisor

Features:
⋄ static code analysis
⋄ binary code instrumentation
  ◦ user friendly (no need to change source code)
  ◦ instrumentation after optimization
⋄ developed by the hardware manufacturer ⇒ good hardware knowledge
⋄ handy optimization tips

Vector Advisor Usage

1. Find hotspots (survey):
   ⋄ focus on the small part of the code that matters
   ⋄ find performance issues from static code analysis
     ◦ vectorized loops vs. scalar loops (SSE or AVX?)
     ◦ reasons preventing vectorization
     ◦ inefficient vectorization (instructions such as shuffle)
2. Run deeper analysis:
   ⋄ find performance issues based on data collected at runtime
     ◦ memory access pattern
     ◦ trip count
     ◦ inefficient loop peeling or remainder
     ◦ check runtime dependencies
3. Make modifications accordingly
   ⋄ go back to 1.

Analysis: Summary

⋄ percentage of time spent in the vectorized body
⋄ percentage of time spent in peeling or remainder

Vectorization efficiency: an estimate based on:
⋄ static code analysis
⋄ runtime metrics
⋄ simulation

Analysis: Survey

⋄ which loops were vectorized and which were not
  ◦ reason ⇒ should help with vectorizing some loops
⋄ vectorization efficiency
  ◦ low efficiency: peeling or remainder too long? ⇒ run trip count analysis
  ◦ if in a loop nest: should we vectorize another loop?
⋄ traits (not shown above): instructions that can affect performance:
  ◦ insert
  ◦ extract
  ◦ shuffle
  ◦ division
  ◦ ...
  ⇒ change data layout? (the memory access pattern can provide more insight)

Analysis: Trip Count

Count the number of iterations of a loop:
⋄ mark the loop for deeper analysis in the GUI
⋄ run the analysis again

Reading the result (example):
⋄ no peeling: good memory alignment
⋄ body executed 62 times
⋄ remainder vectorized and executed once

Analysis: Memory Access Pattern

⋄ access to memory: stride 1 / constant stride / non-constant stride
⋄ non-constant stride
  ◦ work on the data layout
  ◦ in a loop nest: should you vectorize another loop?

Analysis: Runtime Dependency Check

Check data dependencies at runtime:
⋄ this is for one run only!
⋄ helps when forcing vectorization of a loop (with the simd pragma)
⋄ but make sure there is really no dependency at the algorithmic level

Summary

Iterative optimization process:
1. find hotspots
2. characterize issues
3. make changes accordingly
4. compare with the initial code

⋄ only spend time on code that matters (hotspots)
⋄ understand why vectorization failed or does not perform well
⋄ compiler optimizations are complex and can be unpredictable
  ◦ don't try to guess: check performance metrics
