OpenMP Kenjiro Taura 1 / 74
Contents
1 Overview
2 A Running Example: SpMV
3 parallel pragma
4 Work sharing constructs
  - loops (for)
  - scheduling
  - task parallelism (task and taskwait)
5 Data sharing clauses
6 SIMD constructs
2 / 74
Goal
- learn OpenMP, by far the most widespread standard API for shared memory parallel programming
- learn that various schedulers execute your parallel programs differently
4 / 74
A running example: Sparse Matrix Vector Multiply (SpMV)
- sparse matrix: a matrix whose elements are mostly zeros, i.e., the number of non-zero elements (nnz) ≪ the number of all elements (M × N)
  - M: the number of rows
  - N: the number of columns
[figure: y = Ax, with A an M × N sparse matrix]
6 / 74
Sparse matrices appear everywhere
- meshes in scientific simulation: A_{i,j} = a weight connecting nodes i and j in the mesh
- graphs, which in turn appear in many applications: A_{i,j} = the weight of the edge i → j (or j → i); Web, social networks, road/traffic networks, metabolic pathways, etc.
- many problems can be formulated as SpMV or can be solved using SpMV: eigenvalues (including PageRank, graph partitioning, etc.), partial differential equations, ...
7 / 74
What makes a “sparse” matrix different from an ordinary (dense) matrix?
- the number of non-zero elements is so small that representing the matrix as an M × N array is too wasteful (or just impossible)
- → use a data structure that takes memory/computation only (or mostly) for non-zero elements (coordinate list, compressed sparse row, etc.)
[figure: y = Ax, with A an M × N sparse matrix]
8 / 74
Coordinate list (COO)
- represents a matrix as a list of (i, j, A_{i,j}) triples
- data format:
  struct coo {
    int n_rows, n_cols, nnz;
    struct { int i, j; double Aij; } *elems;  /* nnz elements */
  };
- SpMV (y = Ax):
  for (k = 0; k < A.nnz; k++) {
    i = A.elems[k].i; j = A.elems[k].j; Aij = A.elems[k].Aij;
    y[i] += Aij * x[j];
  }
9 / 74
Compressed sparse row (CSR)
- puts the elements of a single row in a contiguous range
- an index (number) specifies where a particular row begins in the elems array → no need to store i for every single element
- data format:
  struct csr {
    int n_rows, n_cols, nnz;
    struct { int j; double Aij; } *elems;  /* nnz elements */
    int *row_start;                        /* n_rows + 1 elements */
  };
- elems[row_start[i]] ... elems[row_start[i+1] - 1] are the elements in the i-th row
- SpMV (y = Ax):
  for (i = 0; i < A.n_rows; i++) {
    for (k = A.row_start[i]; k < A.row_start[i+1]; k++) {
      j = A.elems[k].j; Aij = A.elems[k].Aij;
      y[i] += Aij * x[j];
    }
  }
10 / 74
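To make the data structure concrete, here is a minimal, self-contained C sketch of CSR and the sequential SpMV above; the concrete field types (int indices, double values) and the function name spmv_csr are assumptions for illustration, not taken from the slides.

  typedef struct {
    int n_rows, n_cols, nnz;
    struct { int j; double Aij; } *elems;  /* nnz elements, grouped by row */
    int *row_start;                        /* n_rows + 1 offsets into elems */
  } csr;

  /* y += A x for a CSR matrix (sequential) */
  void spmv_csr(const csr *A, const double *x, double *y) {
    for (int i = 0; i < A->n_rows; i++) {
      double s = 0.0;
      for (int k = A->row_start[i]; k < A->row_start[i + 1]; k++) {
        s += A->elems[k].Aij * x[A->elems[k].j];
      }
      y[i] += s;
    }
  }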
OpenMP
- the de facto standard model for programming shared memory machines
- C/C++/Fortran + parallel directives + APIs
  - by #pragma in C/C++
  - by comments in Fortran
- many free/vendor compilers, including GCC
11 / 74
OpenMP reference
- official home page: http://openmp.org/
- specification: http://openmp.org/wp/openmp-specifications/
- latest version is 4.5 (http://www.openmp.org/mp-documents/openmp-4.5.pdf)
- section numbers below refer to those in OpenMP spec 4.0 (http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf)
12 / 74
GCC and OpenMP
- http://gcc.gnu.org/wiki/openmp
- gcc 4.2 → OpenMP spec 2.5
- gcc 4.4 → OpenMP spec 3.0 (task parallelism)
- gcc 4.7 → OpenMP spec 3.1
- gcc 4.9 → OpenMP spec 4.0 (SIMD)
13 / 74
Compiling/running OpenMP programs with GCC
- compile with -fopenmp:
  $ gcc -Wall -fopenmp program.c
- run the executable, specifying the number of threads with the OMP_NUM_THREADS environment variable:
  $ OMP_NUM_THREADS=1 ./a.out   # use 1 thread
  $ OMP_NUM_THREADS=4 ./a.out   # use 4 threads
- see 2.5.1 “Determining the Number of Threads for a parallel Region” for other ways to control the number of threads
14 / 74
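Besides the environment variable, the team size can also be requested from within the program itself; a minimal sketch using omp_set_num_threads() (the printed message is just for illustration):

  #include <stdio.h>
  #include <omp.h>

  int main() {
    omp_set_num_threads(4);   /* request 4 threads for subsequent parallel regions */
  #pragma omp parallel
    printf("hello from one of %d threads\n", omp_get_num_threads());
    return 0;
  }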
Two pragmas you must know first
- #pragma omp parallel, to launch a team of threads (2.5)
- then #pragma omp for, to distribute iterations to threads (2.7.1)
  #pragma omp parallel
  #pragma omp for
  for (i = 0; i < n; i++) { ... }
- note: all OpenMP pragmas have the common format: #pragma omp ...
16 / 74
#pragma omp parallel
- basic syntax:
  ...
  #pragma omp parallel
    S
  ...
- basic semantics:
  - create a team of OMP_NUM_THREADS threads
  - the current thread becomes the master of the team
  - S will be executed by each member of the team
  - the master thread waits for all members to finish S and then continues
[figure: the statement S executed by every thread of the team]
17 / 74
parallel pragma example
  #include <stdio.h>
  int main() {
    printf("hello\n");
  #pragma omp parallel
    printf("world\n");
    return 0;
  }

  $ OMP_NUM_THREADS=1 ./a.out
  hello
  world
  $ OMP_NUM_THREADS=4 ./a.out
  hello
  world
  world
  world
  world
18 / 74
Remarks: what does parallel do?
- you may assume an OpenMP thread ≈ an OS-supported thread (e.g., a Pthread)
- that is, if you write this program
  int main() {
  #pragma omp parallel
    worker();
  }
  and run it as follows,
  $ OMP_NUM_THREADS=50 ./a.out
  you will get 50 OS-level threads, each doing worker()
19 / 74
How to distribute work among threads?
- #pragma omp parallel creates threads, all executing the same statement
- it's not a means to parallelize work per se, but just a means to create a number of similar threads (SPMD)
- so how to distribute (or partition) work among them?
  1 do it yourself
  2 use work sharing constructs
20 / 74
Do it yourself: functions to get the number/id of threads
- omp_get_num_threads() (3.2.2): the number of threads in the current team
- omp_get_thread_num() (3.2.4): the current thread's id (0, 1, ...) in the team
- they are primitives with which you may partition work yourself in whichever way you prefer, e.g.,
  #pragma omp parallel
  {
    int t = omp_get_thread_num();
    int nt = omp_get_num_threads();
    /* divide n iterations evenly among nt threads */
    for (i = t * n / nt; i < (t + 1) * n / nt; i++) {
      ...
    }
  }
21 / 74
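For concreteness, the same manual partitioning applied to the CSR SpMV running example, as a self-contained sketch (it reuses the csr struct sketched earlier; the function name and the even split of rows are my own choices, not from the slides):

  #include <omp.h>

  void spmv_csr_manual(const csr *A, const double *x, double *y) {
  #pragma omp parallel
    {
      int t  = omp_get_thread_num();    /* my id within the team */
      int nt = omp_get_num_threads();   /* team size */
      /* thread t handles rows [t*n_rows/nt, (t+1)*n_rows/nt); each row is
         written by exactly one thread, so there is no race on y[i] */
      int begin = (int)((long)t       * A->n_rows / nt);
      int end   = (int)((long)(t + 1) * A->n_rows / nt);
      for (int i = begin; i < end; i++) {
        for (int k = A->row_start[i]; k < A->row_start[i + 1]; k++) {
          y[i] += A->elems[k].Aij * x[A->elems[k].j];
        }
      }
    }
  }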
Work sharing constructs
- in theory, the parallel construct is all you need to do things in parallel, but it is too inconvenient
- OpenMP defines ways to partition work among threads (work sharing constructs):
  - for
  - task
  - section
24 / 74
#pragma omp for (work-sharing for)
- basic syntax (inside #pragma omp parallel):
  #pragma omp for
  for (i = ...; i < ...; i += ...) {
    S
  }
- basic semantics: the threads in the team divide the iterations among them
- but how? ⇒ scheduling
25 / 74
#pragma omp for restrictions
- not every for statement is allowed after a for pragma
- strong syntactic restrictions apply, so that the iteration count can easily be determined at the beginning of the loop
- roughly, it must be of the form:
  #pragma omp for
  for (i = init; i < limit; i += incr)
    S
  except that < and += may be other similar operators
- init, limit, and incr must be loop invariant
26 / 74
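To illustrate the restriction, a small sketch contrasting a conforming loop with a non-conforming one (n, a, list_head, and process are hypothetical names used only for this example):

  /* OK: init, test, and increment are in the required form, and n is loop invariant */
  #pragma omp for
  for (i = 0; i < n; i += 2) {
    a[i] = 0.0;
  }

  /* NOT allowed after a for pragma: the iteration count cannot be
     computed up front from a pointer-chasing loop like this */
  /*
  #pragma omp for
  for (p = list_head; p != NULL; p = p->next) {
    process(p);
  }
  */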
Parallel SpMV for CSR using #pragma omp for
- it suffices to work-share the outer for loop:
  // assume inside #pragma omp parallel
  ...
  #pragma omp for
  for (i = 0; i < A.n_rows; i++) {
    for (k = A.row_start[i]; k < A.row_start[i+1]; k++) {
      j = A.elems[k].j; Aij = A.elems[k].Aij;
      y[i] += Aij * x[j];
    }
  }
- note: the inner loop (k) is executed sequentially
27 / 74
Parallel SpMV for COO using #pragma omp for?
- the following code does not work (why?)
  // assume inside #pragma omp parallel
  ...
  #pragma omp for
  for (k = 0; k < A.nnz; k++) {
    i = A.elems[k].i; j = A.elems[k].j; Aij = A.elems[k].Aij;
    y[i] += Aij * x[j];
  }
- different iterations may update the same y[i], so concurrent threads race on it
- a possible workaround will be described later
28 / 74
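For reference, one common fix (not necessarily the workaround the slides describe later) is to make each update of y[i] atomic, trading some speed for correctness; a sketch:

  // assume inside #pragma omp parallel
  #pragma omp for
  for (k = 0; k < A.nnz; k++) {
    i = A.elems[k].i; j = A.elems[k].j; Aij = A.elems[k].Aij;
  #pragma omp atomic
    y[i] += Aij * x[j];   /* the update itself is now race-free */
  }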
Scheduling (2.7.1)
- the schedule clause of a work-sharing for loop determines how iterations are divided among threads
- there are three alternatives (static, dynamic, and guided)
30 / 74
static, dynamic, and guided
- schedule(static [, chunk]): iterations are assigned to threads in a predictable round-robin fashion (e.g., schedule(static), schedule(static,3))
- schedule(dynamic [, chunk]): each thread repeatedly fetches the next chunk iterations (e.g., schedule(dynamic), schedule(dynamic,2))
- schedule(guided [, chunk]): threads grab many iterations in early stages and gradually reduce the number of iterations fetched at a time (e.g., schedule(guided), schedule(guided,2))
- chunk specifies the minimum granularity (iteration count)
[figures: how iterations are assigned to threads 0-3 under each schedule]
31 / 74
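As an illustration on the running example: the rows of a sparse matrix can have very different numbers of non-zeros, so a dynamic schedule may balance the load better than a static one; a sketch (the chunk size 64 is an arbitrary choice for illustration, not a recommendation from the slides):

  // assume inside #pragma omp parallel
  #pragma omp for schedule(dynamic, 64)
  for (i = 0; i < A.n_rows; i++) {
    for (k = A.row_start[i]; k < A.row_start[i+1]; k++) {
      j = A.elems[k].j; Aij = A.elems[k].Aij;
      y[i] += Aij * x[j];
    }
  }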