OpenMP Kenjiro Taura 1 / 74
Contents
1 Overview
2 A Running Example: SpMV
3 parallel pragma
4 Work sharing constructs
  - loops (for)
  - scheduling
  - task parallelism (task and taskwait)
5 Data sharing clauses
6 SIMD constructs
2 / 74
Goal
- learn OpenMP, by far the most widespread standard API for shared memory parallel programming
- learn that various schedulers execute your parallel programs differently
4 / 74
A running example: Sparse Matrix Vector Multiply (SpMV)
- sparse matrix: a matrix whose elements are mostly zeros, i.e., the number of non-zero elements (nnz) ≪ the number of all elements (M × N)
  - M: the number of rows
  - N: the number of columns
[figure: y = Ax, with A an M × N sparse matrix]
6 / 74
Sparse matrices appear everywhere
- meshes in scientific simulation: A_{i,j} = a weight connecting nodes i and j in the mesh
- graphs, which in turn appear in many applications: A_{i,j} = the weight of the edge i → j (or j → i); Web, social networks, road/traffic networks, metabolic pathways, etc.
- many problems can be formulated as SpMV or can be solved using SpMV: eigenvalues (including PageRank, graph partitioning, etc.), partial differential equations, ...
7 / 74
What makes a “sparse” matrix different from an ordinary (dense) matrix?
- the number of non-zero elements is so small that representing the matrix as an M × N array is too wasteful (or just impossible)
- → use a data structure that takes memory/computation only (or mostly) for non-zero elements (coordinate list, compressed sparse row, etc.)
[figure: y = Ax, with A an M × N sparse matrix]
8 / 74
Coordinate list (COO)
- represents a matrix as a list of (i, j, A_{i,j}) triples
- data format:
  struct coo {
    int n_rows, n_cols, nnz;
    struct { int i, j; double Aij; } *elems;  /* nnz elements */
  };
- SpMV (y = Ax):
  for (k = 0; k < A.nnz; k++) {
    i = A.elems[k].i; j = A.elems[k].j; Aij = A.elems[k].Aij;
    y[i] += Aij * x[j];
  }
9 / 74
Compressed sparse row (CSR)
- puts the elements of a single row in a contiguous range
- an index (number) specifies where a particular row begins in the elems array → no need to store i for every single element
- data format:
  struct csr {
    int n_rows, n_cols, nnz;
    struct { int j; double Aij; } *elems;  /* nnz elements */
    int *row_start;                        /* n_rows + 1 elements */
  };
- elems[row_start[i]] ... elems[row_start[i+1] - 1] are the elements in the i-th row
- SpMV (y = Ax):
  for (i = 0; i < A.n_rows; i++) {
    for (k = A.row_start[i]; k < A.row_start[i+1]; k++) {
      j = A.elems[k].j; Aij = A.elems[k].Aij;
      y[i] += Aij * x[j];
    }
  }
10 / 74
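To make the data structure concrete, here is a minimal, self-contained C sketch of CSR and the sequential SpMV above; the concrete field types (int indices, double values) and the function name spmv_csr are assumptions for illustration, not taken from the slides.

  typedef struct {
    int n_rows, n_cols, nnz;
    struct { int j; double Aij; } *elems;  /* nnz elements, grouped by row */
    int *row_start;                        /* n_rows + 1 offsets into elems */
  } csr;

  /* y += A x for a CSR matrix (sequential) */
  void spmv_csr(const csr *A, const double *x, double *y) {
    for (int i = 0; i < A->n_rows; i++) {
      double s = 0.0;
      for (int k = A->row_start[i]; k < A->row_start[i + 1]; k++) {
        s += A->elems[k].Aij * x[A->elems[k].j];
      }
      y[i] += s;
    }
  }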
OpenMP
- the de facto standard model for programming shared memory machines
- C/C++/Fortran + parallel directives + APIs
  - by #pragma in C/C++
  - by comments in Fortran
- many free/vendor compilers, including GCC
11 / 74
OpenMP reference
- official home page: http://openmp.org/
- specification: http://openmp.org/wp/openmp-specifications/
- latest version is 4.5 (http://www.openmp.org/mp-documents/openmp-4.5.pdf)
- section numbers below refer to those in OpenMP spec 4.0 (http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf)
12 / 74
GCC and OpenMP
- http://gcc.gnu.org/wiki/openmp
- gcc 4.2 → OpenMP spec 2.5
- gcc 4.4 → OpenMP spec 3.0 (task parallelism)
- gcc 4.7 → OpenMP spec 3.1
- gcc 4.9 → OpenMP spec 4.0 (SIMD)
13 / 74
Compiling/running OpenMP programs with GCC
- compile with -fopenmp:
  $ gcc -Wall -fopenmp program.c
- run the executable, specifying the number of threads with the OMP_NUM_THREADS environment variable:
  $ OMP_NUM_THREADS=1 ./a.out   # use 1 thread
  $ OMP_NUM_THREADS=4 ./a.out   # use 4 threads
- see 2.5.1 “Determining the Number of Threads for a parallel Region” for other ways to control the number of threads
14 / 74
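Besides the environment variable, the team size can also be requested from within the program itself; a minimal sketch using omp_set_num_threads() (the printed message is just for illustration):

  #include <stdio.h>
  #include <omp.h>

  int main() {
    omp_set_num_threads(4);   /* request 4 threads for subsequent parallel regions */
  #pragma omp parallel
    printf("hello from one of %d threads\n", omp_get_num_threads());
    return 0;
  }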
Two pragmas you must know first
- #pragma omp parallel, to launch a team of threads (2.5)
- then #pragma omp for, to distribute iterations to threads (2.7.1)
  #pragma omp parallel
  #pragma omp for
  for (i = 0; i < n; i++) { ... }
- note: all OpenMP pragmas have the common format: #pragma omp ...
16 / 74
#pragma omp parallel
- basic syntax:
  ...
  #pragma omp parallel
    S
  ...
- basic semantics:
  - create a team of OMP_NUM_THREADS threads
  - the current thread becomes the master of the team
  - S will be executed by each member of the team
  - the master thread waits for all members to finish S and then continues
[figure: the statement S executed by every thread of the team]
17 / 74
parallel pragma example
  #include <stdio.h>
  int main() {
    printf("hello\n");
  #pragma omp parallel
    printf("world\n");
    return 0;
  }

  $ OMP_NUM_THREADS=1 ./a.out
  hello
  world
  $ OMP_NUM_THREADS=4 ./a.out
  hello
  world
  world
  world
  world
18 / 74
Remarks: what does parallel do?
- you may assume an OpenMP thread ≈ an OS-supported thread (e.g., a Pthread)
- that is, if you write this program
  int main() {
  #pragma omp parallel
    worker();
  }
  and run it as follows,
  $ OMP_NUM_THREADS=50 ./a.out
  you will get 50 OS-level threads, each doing worker()
19 / 74
How to distribute work among threads?
- #pragma omp parallel creates threads, all executing the same statement
- it's not a means to parallelize work per se, but just a means to create a number of similar threads (SPMD)
- so how to distribute (or partition) work among them?
  1 do it yourself
  2 use work sharing constructs
20 / 74
Do it yourself: functions to get the number/id of threads
- omp_get_num_threads() (3.2.2): the number of threads in the current team
- omp_get_thread_num() (3.2.4): the current thread's id (0, 1, ...) in the team
- they are primitives with which you may partition work yourself in whichever way you prefer, e.g.,
  #pragma omp parallel
  {
    int t = omp_get_thread_num();
    int nt = omp_get_num_threads();
    /* divide n iterations evenly among nt threads */
    for (i = t * n / nt; i < (t + 1) * n / nt; i++) {
      ...
    }
  }
21 / 74
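For concreteness, the same manual partitioning applied to the CSR SpMV running example, as a self-contained sketch (it reuses the csr struct sketched earlier; the function name and the even split of rows are my own choices, not from the slides):

  #include <omp.h>

  void spmv_csr_manual(const csr *A, const double *x, double *y) {
  #pragma omp parallel
    {
      int t  = omp_get_thread_num();    /* my id within the team */
      int nt = omp_get_num_threads();   /* team size */
      /* thread t handles rows [t*n_rows/nt, (t+1)*n_rows/nt); each row is
         written by exactly one thread, so there is no race on y[i] */
      int begin = (int)((long)t       * A->n_rows / nt);
      int end   = (int)((long)(t + 1) * A->n_rows / nt);
      for (int i = begin; i < end; i++) {
        for (int k = A->row_start[i]; k < A->row_start[i + 1]; k++) {
          y[i] += A->elems[k].Aij * x[A->elems[k].j];
        }
      }
    }
  }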
Work sharing constructs
- in theory, the parallel construct is all you need to do things in parallel, but it is too inconvenient
- OpenMP defines ways to partition work among threads (work sharing constructs):
  - for
  - task
  - section
24 / 74
#pragma omp for (work-sharing for)
- basic syntax (inside #pragma omp parallel):
  #pragma omp for
  for (i = ...; i < ...; i += ...) {
    S
  }
- basic semantics: the threads in the team divide the iterations among them
- but how? ⇒ scheduling
25 / 74
#pragma omp for restrictions
- not every for statement is allowed after a for pragma
- strong syntactic restrictions apply, so that the iteration count can easily be determined at the beginning of the loop
- roughly, it must be of the form:
  #pragma omp for
  for (i = init; i < limit; i += incr)
    S
  except that < and += may be other similar operators
- init, limit, and incr must be loop invariant
26 / 74
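To illustrate the restriction, a small sketch contrasting a conforming loop with a non-conforming one (n, a, list_head, and process are hypothetical names used only for this example):

  /* OK: init, test, and increment are in the required form, and n is loop invariant */
  #pragma omp for
  for (i = 0; i < n; i += 2) {
    a[i] = 0.0;
  }

  /* NOT allowed after a for pragma: the iteration count cannot be
     computed up front from a pointer-chasing loop like this */
  /*
  #pragma omp for
  for (p = list_head; p != NULL; p = p->next) {
    process(p);
  }
  */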
Parallel SpMV for CSR using #pragma omp for
- it suffices to work-share the outer for loop:
  // assume inside #pragma omp parallel
  ...
  #pragma omp for
  for (i = 0; i < A.n_rows; i++) {
    for (k = A.row_start[i]; k < A.row_start[i+1]; k++) {
      j = A.elems[k].j; Aij = A.elems[k].Aij;
      y[i] += Aij * x[j];
    }
  }
- note: the inner loop (k) is executed sequentially
27 / 74
Parallel SpMV for COO using #pragma omp for?
- the following code does not work (why?)
  // assume inside #pragma omp parallel
  ...
  #pragma omp for
  for (k = 0; k < A.nnz; k++) {
    i = A.elems[k].i; j = A.elems[k].j; Aij = A.elems[k].Aij;
    y[i] += Aij * x[j];
  }
- different iterations may update the same y[i], so concurrent threads race on it
- a possible workaround will be described later
28 / 74
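For reference, one common fix (not necessarily the workaround the slides describe later) is to make each update of y[i] atomic, trading some speed for correctness; a sketch:

  // assume inside #pragma omp parallel
  #pragma omp for
  for (k = 0; k < A.nnz; k++) {
    i = A.elems[k].i; j = A.elems[k].j; Aij = A.elems[k].Aij;
  #pragma omp atomic
    y[i] += Aij * x[j];   /* the update itself is now race-free */
  }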
Scheduling (2.7.1)
- the schedule clause of a work-sharing for loop determines how iterations are divided among threads
- there are three alternatives (static, dynamic, and guided)
30 / 74
static, dynamic, and guided
- schedule(static [, chunk]): iterations are assigned to threads in a predictable round-robin fashion (e.g., schedule(static), schedule(static,3))
- schedule(dynamic [, chunk]): each thread repeatedly fetches the next chunk iterations (e.g., schedule(dynamic), schedule(dynamic,2))
- schedule(guided [, chunk]): threads grab many iterations in early stages and gradually reduce the number of iterations fetched at a time (e.g., schedule(guided), schedule(guided,2))
- chunk specifies the minimum granularity (iteration count)
[figures: how iterations are assigned to threads 0-3 under each schedule]
31 / 74
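As an illustration on the running example: the rows of a sparse matrix can have very different numbers of non-zeros, so a dynamic schedule may balance the load better than a static one; a sketch (the chunk size 64 is an arbitrary choice for illustration, not a recommendation from the slides):

  // assume inside #pragma omp parallel
  #pragma omp for schedule(dynamic, 64)
  for (i = 0; i < A.n_rows; i++) {
    for (k = A.row_start[i]; k < A.row_start[i+1]; k++) {
      j = A.elems[k].j; Aij = A.elems[k].Aij;
      y[i] += Aij * x[j];
    }
  }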