More Advanced OpenMP
• This is an abbreviated form of Tim Mattson's and Larry Meadows' (both at Intel) SC '08 tutorial, located at http://openmp.org/mp-documents/omp-hands-on-SC08.pdf
• All errors are my responsibility
Topics (only OpenMP 3 in these slides)
OpenMP 3:
• Creating Threads
• Synchronization
• Runtime library calls
• Data environment
• Scheduling for and sections
• Memory Model
• OpenMP 3.0 and Tasks
OpenMP 4:
• Extensions to tasking
• User defined reduction operators
• Construct cancellation
• Portable SIMD directives
• Thread affinity
Creating Tasks
• We already know about
• parallel regions (omp parallel)
• parallel sections (omp parallel sections)
• parallel for (omp parallel for), or omp for when in a parallel region
• We will now talk about Tasks
Tasks
• OpenMP before OpenMP 3.0 has always had tasks
• A parallel construct created implicit tasks, one per thread
• A team of threads was created to execute the tasks
• Each thread in the team is assigned (and tied) to one task
• A barrier holds the original master thread until all tasks are finished (note that the master may also execute a task)
Tasks
• OpenMP 3.0 allows us to explicitly create tasks.
• Every part of an OpenMP program is part of some task, with the master task executing the program even if there is no explicit task
task construct syntax

#pragma omp task [clause[[,] clause] ...]
    structured-block

clauses:
• if (expression)
• untied
• shared (list)
• private (list)
• firstprivate (list)
• default (shared | none)

Notes:
• if(false) says execute the task by the spawning thread; this is a user optimization for cache affinity and for the cost of executing on a different thread. The task is still a different task with respect to synchronization, and its data environment is still local to the task
• untied says the task can be executed by more than one thread, i.e., different threads may execute different parts of the task
• The data-sharing clauses (shared, private, firstprivate, default) are as before and say whether storage is shared or private
When do we know a task is finished?
• At explicit or implicit thread barriers
• All tasks generated in the current parallel region are finished when the barrier for that parallel region finishes
• Matches what you expect, i.e., when a barrier is reached the work preceding the barrier is finished
• At task barriers
• Wait until all tasks defined in the current task are finished: #pragma omp taskwait
• Applies to tasks T directly generated in the current task, not to tasks generated by the tasks T
Example: parallel pointer chasing with parallel region

#pragma omp parallel
{
    #pragma omp single private(p)
    {
        p = listhead;
        while (p) {
            #pragma omp task
            workfct(p);
            p = next(p);
        }
    }
}

• The value of p passed to each task is the value of p at the time of the task's invocation, saved on the stack like with any function call
• workfct is an ordinary user function
Example: parallel pointer chasing with for

#pragma omp parallel
{
    #pragma omp for private(p)
    for (int i = 0; i < numlists; i++) {
        p = listheads[i];
        while (p) {
            #pragma omp task
            workfct(p);
            p = next(p);
        }
    }
}
Example: parallel postorder graph traversal

void postorder(node *p) {
    if (p->left)
        #pragma omp task
        postorder(p->left);
    if (p->right)
        #pragma omp task
        postorder(p->right);
    #pragma omp taskwait   // wait for descendants
    workfct(p->data);
}

• The parent task is suspended until its child tasks finish
• The taskwait is a task scheduling point
Postorder graph traversal in parallel: postorder is called from within an omp parallel region, and at each node the taskwait ensures the child subtrees are finished before workfct is applied to the node's own data.
Task scheduling points
• Certain constructs contain task scheduling points: task constructs, taskwait constructs, taskyield (#pragma omp taskyield) constructs, barriers (implicit and explicit), and the end of a task region
• Threads at task scheduling points can suspend their task and begin executing another task in the task pool (task switching)
• At the completion of that task, or at another task scheduling point, the thread can resume executing the original task
Example: task switching

#pragma omp single
{
    for (i = 0; i < ONEZILLION; i++)
        #pragma omp task
        process(item[i]);
}

• Many tasks are rapidly generated -- eventually there are more tasks than threads
• Generated tasks will have to suspend until a thread can execute them
• With task switching, the executing thread can
• execute an already generated task, draining the task pool
• execute the encountered task (could be cache friendly)
Example: thread switching

#pragma omp single
{
    #pragma omp task untied
    for (i = 0; i < ONEZILLION; i++)
        #pragma omp task   // tied
        process(item[i]);
}

The task generating the other tasks is untied; the tasks executing process() are tied.

• Eventually too many tasks are generated
• The task that is generating tasks is suspended, and the thread that was executing it switches to (for example) a long task
• Other threads execute all of the already generated tasks and begin starving for work
• With thread switching, the task that generates tasks can be resumed by a different thread and continue generating tasks, ending the starvation
• The programmer must specify this behavior with untied
Sharing data among tasks
• Supported, but you have to be careful.
• Let p be a variable in a task T1
• Let task T1 spawn task T2
• Let T2 access p as shared or lastprivate
• If there is no taskwait, T1 can finish before T2 does. When T1 finishes, p no longer exists to be accessed or copied back to.
Synchronization • Locks • Nested locks
Simple locks
• A simple lock is available if it is not set
• Lock manipulation routines include:
• omp_init_lock(...)
• omp_set_lock(...)
• omp_unset_lock(...)
• omp_test_lock(...)
• omp_destroy_lock(...)
Simple lock example

omp_lock_t lck;
omp_init_lock(&lck);              // lck is available (0)
#pragma omp parallel private(tmp, id)
{
    id = omp_get_thread_num();
    tmp = do_lots_of_work(id);
    omp_set_lock(&lck);           // lck is set (1): others wait here
    printf("%d %d", id, tmp);
    omp_unset_lock(&lck);         // lck is available (0) again
}
omp_destroy_lock(&lck);
Consider the code below . . .

void* items[100000000];
init(items);
#pragma omp parallel for
for (int i = 0; i < 100000000; i++) {
    #pragma omp critical
    update(items[i]);
}

void* items[100000000];
init(items);
omp_lock_t lck;
omp_init_lock(&lck);
#pragma omp parallel for
for (int i = 0; i < 100000000; i++) {
    omp_set_lock(&lck);
    update(items[i]);
    omp_unset_lock(&lck);
}
omp_destroy_lock(&lck);

The critical version (top) and the lock version (bottom) are pretty much the same and will essentially serialize the for loop.
Let's try and do this with some actual parallelism

void* items[100000000];   // items[i] and items[j] may point to the same thing
init(items);

omp_lock_t lck[100000000];
for (int i = 0; i < 100000000; i++)
    omp_init_lock(&(lck[i]));

#pragma omp parallel for
for (int i = 0; i < 100000000; i++) {
    omp_set_lock(&(lck[i]));
    update(items[i]);
    omp_unset_lock(&(lck[i]));
}

for (int i = 0; i < 100000000; i++)
    omp_destroy_lock(&(lck[i]));

This doesn't work. Why? Hint: what is being changed by update, and what does the set lock correspond to?
Why it is wrong
• items[u] and items[v] may point to the same storage/object
• Two different locks are acquired/set:
omp_set_lock(&(lck[u]));
omp_set_lock(&(lck[v]));
• The locks are therefore not providing exclusive access to the object
• Also, there are implementation limits on the number of locks