Tasking in OpenMP
Paolo Burgio
paolo.burgio@unimore.it
Outline
› Expressing parallelism
  – Understanding parallel threads
› Memory and data management
  – Data clauses
› Synchronization
  – Barriers, locks, critical sections
› Work partitioning
  – Loops, sections, single work, tasks…
› Execution devices
  – Target
A history of OpenMP
› Thread-centric: regular, loop-based parallelism
  – 1997 – OpenMP for Fortran 1.0
  – 1998 – OpenMP for C/C++ 1.0
  – 2000 – OpenMP for Fortran 2.0
  – 2002 – OpenMP for C/C++ 2.0
› Task-centric: irregular parallelism ➔ tasking
  – 2008 – OpenMP 3.0
  – 2011 – OpenMP 3.1
› Devices: heterogeneous parallelism, à la GP-GPU
  – 2015 – OpenMP 4.5
OpenMP programming patterns
› "Traditional" OpenMP has a thread-centric execution model
  – Fork/join
  – Master-slave
› Create a team of threads…
  – …then partition the work among them
  – Using work-sharing constructs
OpenMP programming patterns

    #pragma omp sections
    {
      #pragma omp section
      { A(); }
      #pragma omp section
      { B(); }
      #pragma omp section
      { C(); }
      #pragma omp section
      { D(); }
    }

    #pragma omp for
    for (int i=0; i<8; i++)
    {
      // ...
    }

    #pragma omp single
    {
      work();
    }
Let's code! Exercise
› Traverse a tree
  – Perform the same operation on all elements
  – Download the sample code
› Recursive
Let's code! Exercise
› Now, parallelize it!
  – From the example

    void traverse_tree(node_t *n)
    {
      doYourWork(n);
      if(n->left)  traverse_tree(n->left);
      if(n->right) traverse_tree(n->right);
    }

    ...
    traverse_tree(root);
Solved: traversing a tree in parallel
› Recursive
  – A parreg + a section for each call
  – Nested parallelism
› Assume the very first time we call traverse_tree
  – Root node

    void traverse_tree(node_t *n)
    {
      #pragma omp parallel sections
      {
        #pragma omp section
        doYourWork(n);
        #pragma omp section
        if(n->left) traverse_tree(n->left);
        #pragma omp section
        if(n->right) traverse_tree(n->right);
      }
    }

    ...
    traverse_tree(root);
Catches (1)
› Cannot nest worksharing constructs without an intervening parreg
  – And its barrier…
  – Costly

    void traverse_tree(node_t *n)
    {
      doYourWork(n);
      #pragma omp parallel sections
      {
        #pragma omp section
        if(n->left) traverse_tree(n->left);
        #pragma omp section
        if(n->right) traverse_tree(n->right);
      } // Barrier
    } // Parreg barrier

    ...
    traverse_tree(root);
Catches (2)
› The number of threads grows exponentially
  – Harder to manage
  – (Same code as in Catches (1): every recursive call opens a new parreg)
Catches (3)
› Code is not easy to understand
› Even harder to modify
  – What if I add a third child node?
Limitations of "traditional" WS
Cannot nest worksharing constructs without an intervening parreg
› Parregs are traditionally costly
  – A lot of operations to create a team of threads
  – Barrier…

    Parreg prologue     ~30k cycles
    Static loops        10-150 cycles
    Dyn loops start     5-6k cycles

› The number of threads explodes and it's harder to manage
  – Parreg => create new threads
Limitations of "traditional" WS
It is cumbersome to create parallelism dynamically
› In loops, sections
  – Work is statically determined!
  – Before entering the construct
  – Even in dynamic loops
› "if <condition>, then create work"

    #pragma omp for
    for (int i=0; i<8; i++)
    {
      // ...
    }
Limitations of "traditional" WS
Poor semantics for irregular workloads
› Section-based parallelism is anyway cumbersome to write
  – OpenMP was born for loop-based parallelism
› Code is not scalable
  – Even a small modification causes you to re-think the strategy

    #pragma omp sections
    {
      #pragma omp section
      { A(); }
      #pragma omp section
      { B(); }
      #pragma omp section
      { C(); }
      #pragma omp section
      { D(); }
    }
A different parallel paradigm
A work-oriented paradigm for partitioning workloads
› Implements a producer-consumer paradigm
  – As opposed to OpenMP's thread-centric model
› Introduces the task pool
  – Where units of work (OpenMP tasks)
  – are pushed by threads
  – and pulled and executed by threads
› E.g., implemented as a FIFO queue (aka task queue)
The task directive

    #pragma omp task [clause [[,] clause]...] new-line
      structured-block

Where clauses can be:

    if([ task : ] scalar-expression)
    final(scalar-expression)
    untied
    default(shared | none)
    mergeable
    private(list)
    firstprivate(list)
    shared(list)
    depend(dependence-type : list)
    priority(priority-value)

› We will see only the data-sharing clauses
  – Same as parallel, but… THE DEFAULT IS NOT SHARED!!!!
Two sides
› Tasks are produced
› Tasks are consumed

Let's code!
› Try this!
  – t0 and t1 are printfs
  – Also, print who produces

    /* Create threads */
    #pragma omp parallel num_threads(2)
    {
      /* Push a task in the q */
      #pragma omp task
      { t0(); }

      /* Push another task in the q */
      #pragma omp task
      t1();
    } // Implicit barrier
I cheated a bit
› How many producers?
  – So, how many tasks?
  – Every thread of the team executes the task constructs: with 2 threads, 2 producers push 2 tasks each ➔ 4 tasks!
Let's make it simpler
› Work is produced in parallel by threads
› Work is consumed in parallel by threads
› A lot of confusion!
  – The number of tasks grows
  – Hard to control the producers
› How to make this simpler?
Single-producer, multiple consumers
› A paradigm! Typically preferred by programmers
  – Code more understandable
  – Simple
  – More manageable
› How to do this? Wrap the task constructs in a single region:

    /* Create threads */
    #pragma omp parallel num_threads(2)
    {
      #pragma omp single
      {
        #pragma omp task
        t0();

        #pragma omp task
        t1();
      }
    } // Implicit barrier
The task directive
Can be used
› in a nested manner
  – Before doing work, produce two other tasks
  – Only need one parreg "outside"
› in an irregular manner
  – See cond
  – Barriers are not involved!
  – Unlike parregs'

    /* Create threads */
    #pragma omp parallel num_threads(2)
    {
      #pragma omp single
      {
        /* Push a task in the q */
        #pragma omp task
        {
          /* Push a (child) task in the q */
          #pragma omp task
          t1();

          /* Conditionally push a task in the q */
          if(cond) {
            #pragma omp task
            t2();
          }

          /* After producing t1 and t2, do some work */
          t0();
        }
      }
    } // Implicit barrier
The task directive
› A task graph (same code as the previous slide)

    t0
    ├── t1
    └── cond? t2

› Edges are "father-son" relationships
› Not timing/precedence!!!
It's a matter of time
› The task directive represents the push into the WQ
  – And the pull???
› It's not "where" it is in the code
  – But when!
› With OpenMP tasks, we separate the moment in time
  – when we produce work (push – #pragma omp task)
  – when we consume the work (pull – ????)
Timing de-coupling
› One thread produces
› All of the threads consume
› ..but, when????

    /* Create threads */
    #pragma omp parallel num_threads(2)
    {
      #pragma omp single
      {
        #pragma omp task
        t0();

        #pragma omp task
        t1();
      } // Implicit barrier
    } // Implicit barrier