  1. Tasking in OpenMP Paolo Burgio paolo.burgio@unimore.it

2. Outline
› Expressing parallelism
  – Understanding parallel threads
› Memory / data management
  – Data clauses
› Synchronization
  – Barriers, locks, critical sections
› Work partitioning
  – Loops, sections, single, tasks…
› Execution devices
  – Target

3. A history of OpenMP
› Thread-centric (regular, loop-based parallelism)
  – 1997 – OpenMP for Fortran 1.0
  – 1998 – OpenMP for C/C++ 1.0
  – 2000 – OpenMP for Fortran 2.0
  – 2002 – OpenMP for C/C++ 2.0
› Task-centric (irregular parallelism ➔ tasking)
  – 2008 – OpenMP 3.0
  – 2011 – OpenMP 3.1
› Devices (heterogeneous parallelism, à la GP-GPU)
  – 2013 – OpenMP 4.0
  – 2015 – OpenMP 4.5

4. OpenMP programming patterns
› "Traditional" OpenMP has a thread-centric execution model
  – Fork/join
  – Master-slave
› Create a team of threads…
  – …then partition the work among them
  – Using work-sharing constructs

5. OpenMP programming patterns

    #pragma omp sections
    {
      #pragma omp section
      { A(); }
      #pragma omp section
      { B(); }
      #pragma omp section
      { C(); }
      #pragma omp section
      { D(); }
    }

    #pragma omp for
    for (int i=0; i<8; i++)
    {
      // ...
    }

    #pragma omp single
    {
      work();
    }

(figure: the sections A–D, the 8 loop iterations and the single's work are distributed among the threads of the team)
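To make the three work-sharing constructs above concrete, here is a minimal, self-contained sketch of my own (not from the deck), where A(), B() and work() are replaced by printfs so you can see which thread gets what:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel num_threads(4)
        {
            /* Loop iterations are split among the team */
            #pragma omp for
            for (int i = 0; i < 8; i++)
                printf("iter %d by thread %d\n", i, omp_get_thread_num());

            /* Each section is executed by exactly one thread */
            #pragma omp sections
            {
                #pragma omp section
                printf("A by thread %d\n", omp_get_thread_num());
                #pragma omp section
                printf("B by thread %d\n", omp_get_thread_num());
            }

            /* Only one thread executes the single block */
            #pragma omp single
            printf("single by thread %d\n", omp_get_thread_num());
        }
        return 0;
    }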

6. Let's code!
› Exercise: traverse a tree
  – Perform the same operation on all elements
  – Download sample code
› Recursive

7. Let's code!
› Now, parallelize it!
  – From the example

    void traverse_tree(node_t *n)
    {
      doYourWork(n);
      if(n->left)  traverse_tree(n->left);
      if(n->right) traverse_tree(n->right);
    }
    ...
    traverse_tree(root);

(figure: a binary tree with nodes 0–6)
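In case you cannot download the sample code, here is a possible self-contained version of the serial exercise. Only traverse_tree() comes from the slide; the node_t layout, the build() helper and the node numbering 0–6 are my assumptions:

    #include <stdio.h>
    #include <stdlib.h>

    /* Assumed node layout: the slides only show 'left' and 'right' */
    typedef struct node {
        int value;
        struct node *left, *right;
    } node_t;

    void doYourWork(node_t *n) { printf("visiting node %d\n", n->value); }

    void traverse_tree(node_t *n)
    {
        doYourWork(n);
        if (n->left)  traverse_tree(n->left);
        if (n->right) traverse_tree(n->right);
    }

    /* Build a complete tree with 'levels' levels, numbering nodes as in the figure */
    node_t *build(int value, int levels)
    {
        if (levels == 0) return NULL;
        node_t *n = malloc(sizeof(*n));
        n->value = value;
        n->left  = build(2 * value + 1, levels - 1);
        n->right = build(2 * value + 2, levels - 1);
        return n;
    }

    int main(void)
    {
        node_t *root = build(0, 3);   /* nodes 0..6 */
        traverse_tree(root);
        return 0;
    }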

8. Solved: traversing a tree in parallel
› Recursive
  – Parreg + sections for each call
  – Nested parallelism
› Assume the very first time we call traverse_tree
  – Root node

    void traverse_tree(node_t *n)
    {
      #pragma omp parallel sections
      {
        #pragma omp section
        doYourWork(n);
        #pragma omp section
        if(n->left) traverse_tree(n->left);
        #pragma omp section
        if(n->right) traverse_tree(n->right);
      }
    }
    ...
    traverse_tree(root);
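One practical detail worth spelling out: nested parallel regions are typically disabled by default, so a runnable version of this solution has to enable them, otherwise every inner parreg gets a team of one thread. A hedged sketch, reusing the assumed node layout and build() helper from the previous example:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    typedef struct node { int value; struct node *left, *right; } node_t;

    void doYourWork(node_t *n)
    {
        printf("node %d on thread %d\n", n->value, omp_get_thread_num());
    }

    void traverse_tree(node_t *n)
    {
        #pragma omp parallel sections
        {
            #pragma omp section
            doYourWork(n);
            #pragma omp section
            if (n->left)  traverse_tree(n->left);
            #pragma omp section
            if (n->right) traverse_tree(n->right);
        }
    }

    node_t *build(int value, int levels)
    {
        if (levels == 0) return NULL;
        node_t *n = malloc(sizeof(*n));
        n->value = value;
        n->left  = build(2 * value + 1, levels - 1);
        n->right = build(2 * value + 2, levels - 1);
        return n;
    }

    int main(void)
    {
        omp_set_nested(1);   /* without this, nested regions run with one thread each */
        traverse_tree(build(0, 3));
        return 0;
    }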

9. Catches (1)
› Cannot nest worksharing constructs without an intervening parreg
  – And its barrier…
  – Costly

    void traverse_tree(node_t *n)
    {
      doYourWork(n);
      #pragma omp parallel sections
      {
        #pragma omp section
        if(n->left) traverse_tree(n->left);
        #pragma omp section
        if(n->right) traverse_tree(n->right);
      } // Barrier
    } // Parreg barrier
    ...
    traverse_tree(root);

10. Catches (2)
› #threads grows exponentially
  – Harder to manage

    void traverse_tree(node_t *n)
    {
      doYourWork(n);
      #pragma omp parallel sections
      {
        #pragma omp section
        if(n->left) traverse_tree(n->left);
        #pragma omp section
        if(n->right) traverse_tree(n->right);
      } // Barrier
    } // Parreg barrier
    ...
    traverse_tree(root);

(figure: a new team of threads is created at every level of the recursion)
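A quick way to observe the explosion for yourself, assuming the parallel-sections traversal from the sketch above: redefine doYourWork() so that every node reports its nesting depth and the size of the team it runs in:

    #include <stdio.h>
    #include <omp.h>

    /* Drop-in replacement for doYourWork() in the previous sketch */
    void doYourWork(node_t *n)
    {
        /* omp_get_level() returns the number of enclosing parallel regions:
           it grows with the recursion depth, and every level spawns yet
           another team of threads */
        printf("node %d: nesting level %d, team of %d threads\n",
               n->value, omp_get_level(), omp_get_num_threads());
    }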

11. Catches (3)
› Code is not easy to understand
› Even harder to modify
  – What if I add a third child node?

    void traverse_tree(node_t *n)
    {
      doYourWork(n);
      #pragma omp parallel sections
      {
        #pragma omp section
        if(n->left) traverse_tree(n->left);
        #pragma omp section
        if(n->right) traverse_tree(n->right);
      } // Barrier
    } // Parreg barrier
    ...
    traverse_tree(root);

12. Limitations of "traditional" WS
Cannot nest worksharing constructs without an intervening parreg
› Parregs are traditionally costly
  – A lot of operations to create a team of threads
  – Barrier…

    Parreg prologue    Static loops     Dyn loops start
    ~30k cycles        10-150 cycles    5-6k cycles

› The number of threads explodes and it's harder to manage
  – Parreg => create new threads
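The cycle counts above are platform-specific. If you want a rough figure for your own machine, a simple (admittedly crude) sketch of my own is to time many nearly-empty parallel regions with omp_get_wtime():

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        enum { REPS = 10000 };

        double t0 = omp_get_wtime();
        for (int i = 0; i < REPS; i++) {
            /* A nearly empty parallel region: what we pay is team
               creation/wake-up plus the implicit barrier at the end */
            #pragma omp parallel
            {
                (void)omp_get_thread_num();   /* trivial body */
            }
        }
        double t1 = omp_get_wtime();

        printf("average cost of one parallel region: %.2f us\n",
               (t1 - t0) / REPS * 1e6);
        return 0;
    }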

13. Limitations of "traditional" WS
It is cumbersome to create parallelism dynamically
› In loops, sections
  – Work is statically determined!
  – Before entering the construct
  – Even in dynamic loops
› "if <condition>, then create work"

    #pragma omp for
    for (int i=0; i<8; i++)
    {
      // ...
    }

(figure: iterations 0–7 statically split between two threads)
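A concrete case where the work is not known before entering the construct is pointer chasing over a linked list: omp for requires a canonical loop whose trip count is known on entry, so you are forced into workarounds such as flattening the list first. The sketch below is my own illustration; item_t, process() and the list setup are made up for the demo:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    typedef struct item { int payload; struct item *next; } item_t;

    void process(item_t *it)
    {
        printf("item %d by thread %d\n", it->payload, omp_get_thread_num());
    }

    int main(void)
    {
        /* Build a short list 0..7 (hypothetical setup) */
        item_t *head = NULL;
        for (int i = 7; i >= 0; i--) {
            item_t *it = malloc(sizeof(*it));
            it->payload = i; it->next = head; head = it;
        }

        /* 'omp for' cannot iterate the list directly: we must first flatten
           it into an array of pointers ... */
        int n = 0;
        item_t *ptrs[64];
        for (item_t *it = head; it; it = it->next) ptrs[n++] = it;

        /* ... and only then parallelize the index loop */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            process(ptrs[i]);

        return 0;
    }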

14. Limitations of "traditional" WS
Poor semantics for irregular workloads
› Sections-based parallelism is anyway cumbersome to write
  – OpenMP was born for loop-based parallelism
› Code not scalable
  – Even a small modification causes you to re-think the strategy

    #pragma omp sections
    {
      #pragma omp section
      { A(); }
      #pragma omp section
      { B(); }
      #pragma omp section
      { C(); }
      #pragma omp section
      { D(); }
    }

(figure: sections A–D unevenly distributed between two threads)

15. A different parallel paradigm
A work-oriented paradigm for partitioning workloads
› Implements a producer-consumer paradigm
  – As opposed to OpenMP's thread-centric model
› Introduces the task pool
  – Where units of work (OpenMP tasks) are pushed by threads, and pulled and executed by threads
› E.g., implemented as a FIFO queue (aka task queue)

(figure: producer threads push tasks into the queue, consumer threads pull and execute them)

16. The task directive

    #pragma omp task [clause [[,] clause]...] new-line
      structured-block

Where clauses can be:
  if([ task : ] scalar-expression)
  final(scalar-expression)
  untied
  default(shared | none)
  mergeable
  private(list)
  firstprivate(list)
  shared(list)
  depend(dependence-type : list)
  priority(priority-value)

› We will see only the data-sharing clauses
  – Same as parallel, but… THE DEFAULT IS NOT SHARED!!!!
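A small sketch of that data-sharing point, under my own variable names: a variable that is private in the generating context becomes firstprivate in the task (its value is captured when the task is created), while a variable that is shared in the enclosing region stays shared. Both clauses are written out explicitly here just for clarity:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int shared_counter = 0;            /* shared in the parallel region */

        #pragma omp parallel num_threads(2)
        #pragma omp single
        {
            for (int i = 0; i < 4; i++) {
                /* 'i' is private here, so inside the task it is firstprivate
                   by default: its value is captured at task creation.
                   'shared_counter' is shared in the enclosing region, so the
                   task sees the original variable. */
                #pragma omp task firstprivate(i) shared(shared_counter)
                {
                    printf("task sees i = %d\n", i);
                    #pragma omp atomic
                    shared_counter++;
                }
            }
        }   /* implicit barriers: all tasks are done here */

        printf("tasks executed: %d\n", shared_counter);
        return 0;
    }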

17. Two sides
› Tasks are produced
› Tasks are consumed
› Let's code! Try this!
  – t0 and t1 are printfs
  – Also, print who produces

    /* Create threads */
    #pragma omp parallel num_threads(2)
    {
      /* Push a task in the q */
      #pragma omp task
      { t0(); }

      /* Push another task in the q */
      #pragma omp task
      t1();
    } // Implicit barrier

(figure: producer threads push t0 and t1 into the queue; consumer threads pull them)
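One possible way to do the suggested experiment: t0 and t1 become printfs that report the executing thread, and each thread also reports when it reaches the task directives. The exact output format is my choice, not the deck's:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel num_threads(2)
        {
            printf("thread %d reaches the task directives (producer)\n",
                   omp_get_thread_num());

            #pragma omp task
            printf("t0 executed by thread %d\n", omp_get_thread_num());

            #pragma omp task
            printf("t1 executed by thread %d\n", omp_get_thread_num());
        }   /* implicit barrier: all pending tasks complete here */
        return 0;
    }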

18. I cheated a bit
› How many producers?
  – So, how many tasks?

    /* Create threads */
    #pragma omp parallel num_threads(2)
    {
      /* Push a task in the q */
      #pragma omp task
      { t0(); }

      /* Push another task in the q */
      #pragma omp task
      t1();
    } // Implicit barrier

19. I cheated a bit
› How many producers?
  – So, how many tasks?

(same code as the previous slide; figure: both threads push t0 and t1, so four tasks end up in the queue)
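To convince yourself of the answer the figure suggests, note that both threads of the team execute both task directives, so four tasks run in total. A small sketch of mine that counts them with an atomic increment:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int tasks_run = 0;

        #pragma omp parallel num_threads(2)
        {
            #pragma omp task
            {
                #pragma omp atomic
                tasks_run++;            /* "t0" */
            }
            #pragma omp task
            {
                #pragma omp atomic
                tasks_run++;            /* "t1" */
            }
        }   /* implicit barrier */

        /* With 2 threads each executing both task directives, this prints 4 */
        printf("tasks executed: %d\n", tasks_run);
        return 0;
    }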

20. Let's make it simpler
› Work is produced in parallel by threads
› Work is consumed in parallel by threads
› A lot of confusion!
  – Number of tasks grows
  – Hard to control producers
› How to make this simpler?

21. Single-producer, multiple consumers
› A paradigm! Typically preferred by programmers
  – Code more understandable
  – Simple
  – More manageable
› How to do this?

Before (every thread produces):

    /* Create threads */
    #pragma omp parallel num_threads(2)
    {
      #pragma omp task
      t0();
      #pragma omp task
      t1();
    } // Implicit barrier

After (only one thread produces):

    /* Create threads */
    #pragma omp parallel num_threads(2)
    {
      #pragma omp single
      {
        #pragma omp task
        t0();
        #pragma omp task
        t1();
      }
    } // Implicit barrier

(figure: a single producer pushes t0 and t1; all threads can consume)
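A compilable version of the single-producer variant, with printfs added (my assumption of what the demo looks like) to check that only one thread produces while either thread may consume:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel num_threads(2)
        {
            #pragma omp single
            {
                printf("producer is thread %d\n", omp_get_thread_num());

                #pragma omp task
                printf("t0 runs on thread %d\n", omp_get_thread_num());

                #pragma omp task
                printf("t1 runs on thread %d\n", omp_get_thread_num());
            }   /* implicit barrier of single: both tasks are done by here */
        }       /* implicit barrier of parallel */
        return 0;
    }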

22. The task directive
Can be used
› in a nested manner
  – Before doing work, produce two other tasks
  – Only need one parreg "outside"
› in an irregular manner
  – See cond
  – Barriers are not involved!
  – Unlike parregs'

    /* Create threads */
    #pragma omp parallel num_threads(2)
    {
      #pragma omp single
      {
        /* Push a task in the q */
        #pragma omp task
        {
          /* Push a (child) task in the q */
          #pragma omp task
          t1();

          /* Conditionally push a task in the q */
          if(cond)
          {
            #pragma omp task
            t2();
          }

          /* After producing t1 and t2, do some work */
          t0();
        }
      }
    } // Implicit barrier

23. The task directive
› A task graph
› Edges are "father-son" relationships
› Not timing/precedence!!!

(same code as the previous slide; figure: task graph with t0 as the father of t1 and, if cond holds, of t2)
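A runnable sketch of this nested/conditional example: t0, t1, t2 and cond are the slide's names, but turning them into printfs and a command-line flag is purely my choice:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    void t0(void) { printf("t0 on thread %d\n", omp_get_thread_num()); }
    void t1(void) { printf("t1 on thread %d\n", omp_get_thread_num()); }
    void t2(void) { printf("t2 on thread %d\n", omp_get_thread_num()); }

    int main(int argc, char **argv)
    {
        int cond = (argc > 1) ? atoi(argv[1]) : 0;   /* assumed: cond from the command line */

        #pragma omp parallel num_threads(2)
        {
            #pragma omp single
            {
                /* Push the "father" task */
                #pragma omp task
                {
                    /* The father pushes a child task ... */
                    #pragma omp task
                    t1();

                    /* ... conditionally pushes another one ... */
                    if (cond) {
                        #pragma omp task
                        t2();
                    }

                    /* ... and only then does its own work */
                    t0();
                }
            }
        }   /* implicit barrier: t0, t1 and (if cond) t2 are all done here */
        return 0;
    }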

24. It's a matter of time
› The task directive represents the push into the WQ
  – And the pull???
› It's not about "where" it is in the code
  – But about when!
› With OpenMP tasks, we separate the moments in time
  – when we produce work (push – #pragma omp task)
  – when we consume the work (pull – ????)

25. Timing de-coupling
› One thread produces
› All of the threads consume
› ..but, when????

    /* Create threads */
    #pragma omp parallel num_threads(2)
    {
      #pragma omp single
      {
        #pragma omp task
        t0();
        #pragma omp task
        t1();
      } // Implicit barrier
    } // Implicit barrier

(figure: t0 has already been pulled by a consumer; t1 is still in the queue)

26. Timing de-coupling
› One thread produces
› All of the threads consume
› ..but, when????

(same code as the previous slide; figure: now t1 has been pulled too, so both tasks have been consumed)
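The deck leaves the "when?" question hanging, but you can already probe it with what has been introduced so far. In the sketch below (the timestamps, the sleep() and the POSIX unistd.h dependency are my additions), the producer stays busy inside the single block, so the other thread of the team is free to pull the tasks before the barrier is ever reached; the only hard guarantee is that both tasks have completed by the implicit barrier at the end of the single:

    #include <stdio.h>
    #include <unistd.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel num_threads(2)
        {
            #pragma omp single
            {
                printf("[%.3f] thread %d pushes t0 and t1\n",
                       omp_get_wtime(), omp_get_thread_num());

                #pragma omp task
                printf("[%.3f] t0 pulled by thread %d\n",
                       omp_get_wtime(), omp_get_thread_num());

                #pragma omp task
                printf("[%.3f] t1 pulled by thread %d\n",
                       omp_get_wtime(), omp_get_thread_num());

                sleep(1);   /* keep the producer busy: the idle thread may
                               pull the tasks well before the barrier */
            }   /* implicit barrier: whatever is still pending runs by here */
        }
        return 0;
    }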
