
  1. Barriers in OpenMP Paolo Burgio paolo.burgio@unimore.it

  2. Outline › Expressing parallelism – Understanding parallel threads › Memory Data management – Data clauses › Synchronization – Barriers, locks, critical sections › Work partitioning – Loops, sections, single work, tasks… › Execution devices – Target

  3. OpenMP synchronization › OpenMP provides the following synchronization constructs: – barrier – flush – master – critical – atomic – taskwait – taskgroup – ordered – ...and OpenMP locks

  4-7. Creating a parreg › Master-slave, fork-join execution model – Master thread spawns a team of Slave threads – They all perform computation in parallel – At the end of the parallel region, implicit barrier (animation: a single thread T runs the sequential code, a team of four threads T T T T runs the parallel code, and a single thread T continues after the barrier)

      int main() {
        /* Sequential code */
        #pragma omp parallel num_threads(4)
        {
          /* Parallel code */
        } // Parreg end: (implicit) barrier
        /* (More) sequential code */
      }
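A minimal sketch of the fork-join model, assuming a C compiler with OpenMP enabled (e.g. GCC with -fopenmp). The function count_team_threads is hypothetical: every thread of the team bumps a shared counter, and thanks to the implicit barrier at the end of the parreg the master can read the final count safely afterwards.

```c
/* Hypothetical helper: returns how many threads ran the parallel region. */
int count_team_threads(int nthreads)
{
    int count = 0;                          /* declared outside: shared */

    /* Sequential code: only the master thread runs here. */
    #pragma omp parallel num_threads(nthreads)
    {
        /* Parallel code: every thread of the team runs this block. */
        #pragma omp atomic
        count++;
    }   /* Parreg end: implicit barrier, the team joins here. */

    /* (More) sequential code: all the increments are complete and visible. */
    return count;
}
```

Without OpenMP the pragmas are ignored and the function simply returns 1; with OpenMP it returns the actual team size (normally the 4 requested).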

  8. OpenMP explicit barriers › #pragma omp barrier new-line (a standalone directive) › All threads in a team must wait for all the other threads before going on – "Each barrier region must be encountered by all threads in a team or by none at all" – "The sequence of barrier regions encountered must be the same for every thread in a team" – Why? › Binding set is the team of threads from the innermost enclosing parreg – i.e., the set of threads it applies to › Also, it enforces a consistent view of the shared memory – We'll see this later
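To see why every thread of the team must reach the barrier, here is a sketch in which each thread writes its own slot of a shared array and, only after the barrier, reads its neighbour's slot. The function neighbour_sum and the MAX_THREADS cap are hypothetical; without the barrier a thread could read a slot that has not been written yet.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define MAX_THREADS 64   /* hypothetical cap on the team size */

/* Each thread publishes a value, waits at the barrier, then reads its
   right neighbour's value. Returns the sum of what the threads read. */
long neighbour_sum(int nthreads)
{
    int slot[MAX_THREADS];                  /* shared */
    long total = 0;

    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;

    #pragma omp parallel num_threads(nthreads) reduction(+:total)
    {
        int id = omp_get_thread_num();      /* private */
        int n  = omp_get_num_threads();

        slot[id] = id + 1;                  /* phase 1: write own slot */

        /* Nobody reads before everybody has written; the barrier also
           makes the writes to slot[] visible to the whole team. */
        #pragma omp barrier

        total += slot[(id + 1) % n];        /* phase 2: read a neighbour */
    }
    return total;
}
```

With a team of n threads every slot is read exactly once, so the result is 1 + 2 + ... + n (10 for a team of 4).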

  9. Exercise: let's code! › Spawn a team of (many) parallel threads – Print "Hello World" – Put a #pragma omp barrier – Reprint "Hello World" afterwards › What do you see? – Now, remove the barrier construct › Now, put the barrier inside an if – E.g., if(omp_get_thread_num() == 0) { ... } – What do you see? – Error!!!! Only thread 0 encounters the barrier, which violates the rule above: the program is non-conforming and may hang
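One possible solution to the exercise, sketched under the assumption of an OpenMP-enabled C compiler (the #ifdef _OPENMP fallback lets the same file also build serially). The name hello_twice and its return value are illustrative additions: the function also checks that, past the barrier, every thread sees the whole first round already completed.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Returns 1 if no thread started the second round before the whole
   team had finished the first one, 0 otherwise. */
int hello_twice(int nthreads)
{
    int printed = 0;   /* shared: first-round greetings printed so far */
    int ok = 1;        /* shared: cleared if the ordering is ever violated */

    #pragma omp parallel num_threads(nthreads)
    {
        int id = omp_get_thread_num();      /* private to each thread */

        printf("Hello World from thread %d\n", id);
        #pragma omp atomic
        printed++;

        /* Nobody goes on until the whole team has printed once.
           Remove this barrier and the two rounds can interleave. */
        #pragma omp barrier

        int seen;
        #pragma omp atomic read
        seen = printed;
        if (seen != omp_get_num_threads()) {
            #pragma omp atomic write
            ok = 0;
        }

        printf("Hello World (again) from thread %d\n", id);

        /* Do NOT do this: a barrier reached only by thread 0 breaks the
           "all threads or none" rule, so the program may hang:
           if (omp_get_thread_num() == 0) { #pragma omp barrier }        */
    }
    return ok;
}
```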

  10. Effects on memory › Besides synchronization, a barrier has the effect of making threads' temporary view of the shared memory consistent – You cannot trust any (potentially modified) shared vars before a barrier – Of course, there are no problems with private vars › ..what???

  11. The OpenMP memory model › Shared memory with relaxed consistency – Threads have access to "a place to store and to retrieve variables, called the memory" – Threads can have a temporary view of the memory › Caches, registers, scratchpads… › Can still be accessed by other threads (Figure: a process with threads T, each with private storage (Priv.) and a temporary view (Temp) of the shared memory; variables declared private(a) vs. shared(a))

  12. A bit of architecture…

  13. Caches in a nutshell › A fast memory connected to the processor core – ...and to the main memory – A few KB of data › (If present,) caches are a pure hardware mechanism – Used to store a copy of the most frequently accessed data – To speed up execution, even by 10-20 times – Instruction caches/Data caches › They perform their work automatically – And transparently – Little or no control at all at application level – Extremely dangerous in multi- and many-cores

  14. Caches › en.wikipedia.org: "A cache is a hardware or software component that stores data so future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation, or the duplicate of data stored elsewhere." (Figure: threads T on four CPUs, numbered 0 to 3, each CPU with its own data cache D$ and instruction cache I$; a Level-2 $; off-chip main memory, or L3 cache)

  15. The catch(es) › Caches are power hungry – Some embedded architectures do not have a D$ › They are not suitable for critical systems – E.g., BOSCH removed I$s › Hardware mechanism, little control over them – Flush command (typically, the whole cache) – Cache coloring (assign cache regions to threads) – Prefetch (move data into the cache before it's actually needed) › Coherency problem in multi/many-cores!!

  16-19. An example: read stale data › Code: a = 5; b = a; // ... c = a; (two threads T on different CPUs, each CPU with its own D$; a is initially 11 in main memory) › Animation: the reader's D$ holds a copy of a = 11; main memory is then updated to a = 5; the reader's c = a hits its cached copy and reads the stale 11

  20-24. An example: read stale data (with cache flush) › Code: a = 5; b = a; // ... dcache_flush(); c = a; › Animation: as before, the reader's D$ holds a = 11 and main memory is then updated to a = 5; dcache_flush() empties the reader's D$, so c = a misses in the cache, reloads a from main memory and reads 5
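The stale-read scenario can be replayed with a toy model in C. It is not real hardware: a is a single int in "main memory", each CPU has a one-entry D$ with no coherence protocol, and toy_dcache_flush is a hypothetical stand-in for the slides' dcache_flush(). It assumes the interleaving shown in the animation: the reader runs b = a and c = a, and the writer's a = 5 reaches memory between the two reads.

```c
#include <stdbool.h>

/* Toy model, not real hardware: the shared variable a lives in main memory
   and each CPU has a one-entry data cache, with no coherence protocol. */
typedef struct {
    bool valid;   /* does this D$ hold a copy of a? */
    int  value;   /* the cached copy (may be stale) */
} dcache_line;

static int main_memory_a;                   /* a, as stored in main memory */

/* Load a through a CPU's D$: fetch from memory only on a miss. */
static int load_a(dcache_line *d)
{
    if (!d->valid) {
        d->value = main_memory_a;
        d->valid = true;
    }
    return d->value;
}

/* Hypothetical stand-in for the slides' dcache_flush(): drop the copy. */
static void toy_dcache_flush(dcache_line *d)
{
    d->valid = false;
}

/* Replays the assumed interleaving and returns the value read for c. */
int stale_read_demo(bool flush_before_read)
{
    dcache_line reader = { false, 0 };
    main_memory_a = 11;

    int b = load_a(&reader);                /* reader: b = a, caches 11 */
    main_memory_a = 5;                      /* writer: a = 5 reaches memory */
    if (flush_before_read)
        toy_dcache_flush(&reader);          /* reader: dcache_flush(); */
    int c = load_a(&reader);                /* reader: c = a */
    (void)b;
    return c;
}
```

stale_read_demo(false) returns the stale 11, while stale_read_demo(true) returns 5.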

  25-28. An(other) example: $ writing policies › Write-through › Code: a = 5; b = a; (two threads T on different CPUs; a is initially 11 in main memory) › Animation: the writer's a = 5 updates both its D$ and main memory at once; the reader's b = a then loads 5 into its own D$

  29-32. An(other) example: $ writing policies › Write-back › Code: a = 5; b = a; › Animation: the writer's a = 5 only updates its D$, while main memory still holds 11; the reader's b = a loads the stale 11 into its D$; main memory is updated to 5 only later, when the line is written back

  33-36. An(other) example: $ writing policies › Write-back w/cache flush › Code: a = 5; dcache_flush(); b = a; › Animation: the writer's a = 5 updates its D$; dcache_flush() writes the value back to main memory (a = 5) and empties the D$; the reader's b = a then loads 5
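The three write policies can be compared with a similar toy model (again not real hardware; every name is hypothetical). The writer runs a = 5, optionally followed by a flush that writes back and invalidates its line; then the reader, whose D$ is still empty, runs b = a.

```c
#include <stdbool.h>

/* Toy model, not real hardware: the shared variable a lives in main memory
   and each CPU has a one-entry data cache, with no coherence protocol. */
typedef enum { WRITE_THROUGH, WRITE_BACK } write_policy;

typedef struct {
    bool valid;   /* does this D$ hold a copy of a? */
    bool dirty;   /* write-back only: copy is newer than main memory */
    int  value;   /* the cached copy */
} dcache_line;

static int main_memory_a;                   /* a, as stored in main memory */

/* Store to a through a CPU's D$ under the given policy. */
static void store_a(dcache_line *d, write_policy p, int v)
{
    d->value = v;
    d->valid = true;
    if (p == WRITE_THROUGH) {
        main_memory_a = v;                  /* memory updated immediately */
        d->dirty = false;
    } else {
        d->dirty = true;                    /* memory updated only on write-back */
    }
}

/* Load a through a CPU's D$: fetch from memory only on a miss. */
static int load_a(dcache_line *d)
{
    if (!d->valid) {
        d->value = main_memory_a;
        d->valid = true;
        d->dirty = false;
    }
    return d->value;
}

/* Hypothetical stand-in for dcache_flush(): write back a dirty copy,
   then invalidate the line. */
static void toy_dcache_flush(dcache_line *d)
{
    if (d->valid && d->dirty)
        main_memory_a = d->value;
    d->valid = false;
    d->dirty = false;
}

/* Writer: a = 5; [dcache_flush();]  then reader (empty D$): b = a.
   Returns b. */
int write_policy_demo(write_policy p, bool flush_after_write)
{
    dcache_line writer = { false, false, 0 };
    dcache_line reader = { false, false, 0 };
    main_memory_a = 11;

    store_a(&writer, p, 5);                 /* writer: a = 5; */
    if (flush_after_write)
        toy_dcache_flush(&writer);          /* writer: dcache_flush(); */
    return load_a(&reader);                 /* reader: b = a; */
}
```

write_policy_demo(WRITE_THROUGH, false) and write_policy_demo(WRITE_BACK, true) return 5, while write_policy_demo(WRITE_BACK, false) returns the stale 11.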

  37. The flush directive › #pragma omp flush [ ( list ) ] new-line › Binding thread set is the encountering thread – More "relaxed" › "It executes the OpenMP flush operation" – Makes its temporary view of the shared memory consistent with other threads – In practice, something like calls to dcache_flush() › Enforces an order on the memory operations on the variables specified in list
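A sketch of how flush is typically used, in the spirit of the producer/consumer pattern from the OpenMP examples (the function flush_handoff is hypothetical): thread 0 writes data and then raises a flag, while thread 1 spins on the flag. The flushes order the write of data before the write of flag and make data visible to the consumer; atomic read/write on flag avoid a data race on the flag itself (these clauses assume OpenMP 3.1 or later).

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Returns what the consumer read, or -1 if the team had a single thread
   (no consumer ran). */
int flush_handoff(void)
{
    int data = 0, flag = 0, result = -1;

    #pragma omp parallel num_threads(2) shared(data, flag, result)
    {
        if (omp_get_thread_num() == 0) {            /* producer */
            data = 42;
            #pragma omp flush                       /* data before flag */
            #pragma omp atomic write
            flag = 1;
            #pragma omp flush(flag)                 /* publish the flag */
        } else if (omp_get_thread_num() == 1) {     /* consumer */
            int ready = 0;
            while (!ready) {
                #pragma omp flush(flag)             /* refresh view of flag */
                #pragma omp atomic read
                ready = flag;
            }
            #pragma omp flush                       /* now data is up to date */
            result = data;
        }
    }   /* implicit barrier: result is visible to the master */
    return result;
}
```

With a team of two threads the function returns 42; if only one thread is available (e.g. when built without OpenMP) there is no consumer and it returns -1.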

  38. Semantics: barrier vs flush
      #pragma omp barrier › Joins the threads of a team › Applies to all threads of a team › Forces consistency of threads' temporary view of the shared memory
      #pragma omp flush › Applies to one thread › Forces consistency of its temporary view of the shared memory › Much lighter!

  39. OpenMP software stack › Multi-layer stack – Engineered for portability
      User code (thread T):   a = 5;  #pragma omp flush
      OpenMP runtime:         void GOMP_flush() { dcache_flush(); }
      Operating System:       void dcache_flush() { asm("mov r15, #1"); }
      Hardware:               D$
