Deterministic OpenMP

Deterministic OpenMP. Amittai Aviram, Dissertation Defense, Department of Computer Science, Yale University, 20 September 2012. Committee: Bryan Ford, Yale University (Advisor); Zhong Shao, Yale University; Ramakrishna Gummadi, Yale


  1. Outline ● The Big Picture √ ● Background √ ● Analysis ● Design and Semantics ● Implementation ● Evaluation ● Conclusion 20 September 2012 Amittai Aviram | Yale University CS 39

  2. How easily could real programs conform to DOMP's deterministic programming model? 20 September 2012 Amittai Aviram | Yale University CS 40

  3. Method ● Used three parallel benchmark suites ● SPLASH2, NPB-OMP, PARSEC ● Total of 35 benchmarks ● Hand-counted instances of synchronization constructs ● Recorded instances of deterministic constructs ● Classified and recorded instances of nondeterministic constructs by their use 20 September 2012 Amittai Aviram | Yale University CS 41

  4. Deterministic Constructs ● Fork/join ● Barrier ● OpenMP work sharing constructs ● Loop ● Master ● (Sections) ● (Task) 20 September 2012 Amittai Aviram | Yale University CS 42
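
A minimal OpenMP sketch (an assumed example, not from the dissertation) of the deterministic constructs just listed: a parallel region gives fork/join, with an explicit barrier and a master block inside it.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel              /* fork: deterministic thread creation */
        {
            printf("hello from thread %d\n", omp_get_thread_num());
            #pragma omp barrier           /* every thread waits here */
            #pragma omp master            /* only the master thread runs this */
            printf("past the barrier\n");
        }                                 /* join */
        return 0;
    }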

  5. Nondeterministic Constructs ● Mutex lock/unlock ● Condition variable wait/broadcast ● (Semaphore wait/post) ● OpenMP critical ● OpenMP atomic ● (OpenMP flush) 20 September 2012 Amittai Aviram | Yale University CS 43

  6. Use in Idioms. Example from barnes (SPLASH2), classified under the work-sharing idiom: long ProcessId; /* Get unique ProcessId */ LOCK(Global->CountLock); ProcessId = Global->current_id++; UNLOCK(Global->CountLock); (a deterministic equivalent is sketched below) 20 September 2012 Amittai Aviram | Yale University CS 44
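
A minimal sketch (not taken from the benchmark) of how this idiom becomes deterministic: each thread derives its unique ProcessId from its position in the thread team instead of from a lock-protected shared counter.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            /* Unique, repeatable per-thread ID: no lock, no ordering race. */
            long ProcessId = omp_get_thread_num();
            printf("worker %ld initialized\n", ProcessId);
        }
        return 0;
    }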

  7. Idioms ● Work sharing ● Reduction ● Pipeline ● Task queue ● Legacy ● Obsolete: Making I/O or heap allocation thread safe ● Nondeterministic ● Load balancing, random simulated interaction … 20 September 2012 Amittai Aviram | Yale University CS 45

  8. Work Sharing (diagram) ● “Data Parallelism”: a loop of n iterations is split across t threads, thread 0 taking iterations 0...n/t-1, thread 1 taking n/t...2n/t-1, and so on up to thread t-1 taking (t-1)n/t...n-1; cf. the OpenMP loop work-sharing construct ● “Task Parallelism”: distinct tasks A, B, C, D are assigned to threads 0, 1, 2, 3; cf. the OpenMP sections and task work-sharing constructs (see the sketch below) 20 September 2012 Amittai Aviram | Yale University CS 46
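
A minimal standard-OpenMP sketch (an assumed example, not from the benchmarks) of the two shapes above: a parallel loop for data parallelism and sections for task parallelism.

    #include <stdio.h>

    #define N 8

    static void do_task_a(void) { printf("task A\n"); }
    static void do_task_b(void) { printf("task B\n"); }

    int main(void) {
        int a[N];

        /* Data parallelism: the N iterations are divided among the team. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = i * i;

        /* Task parallelism: each section runs exactly once, on some thread. */
        #pragma omp parallel sections
        {
            #pragma omp section
            do_task_a();
            #pragma omp section
            do_task_b();
        }

        printf("a[%d] = %d\n", N - 1, a[N - 1]);
        return 0;
    }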

  9. Reduction (diagram): values v0 ... v7 are folded into X with an operator *, computing ((((((((X * v0) * v1) * v2) * v3) * v4) * v5) * v6) * v7) 20 September 2012 Amittai Aviram | Yale University CS 47

  10. Reduction (diagram): values v0 ... v7 are folded into X with an operator *, computing ((((((((X * v0) * v1) * v2) * v3) * v4) * v5) * v6) * v7). Pthreads (low-level threading) has no reduction construct. OpenMP's reduction construct allows only scalar types and simple operations. 20 September 2012 Amittai Aviram | Yale University CS 48
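
For contrast, a minimal sketch of the scalar reduction that OpenMP does support; anything richer than a scalar accumulator and a built-in operator falls outside this construct.

    #include <stdio.h>

    int main(void) {
        double v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double x = 10.0;                  /* the initial value X */

        /* Built-in OpenMP reduction: scalar variable, fixed operator (*). */
        #pragma omp parallel for reduction(*: x)
        for (int i = 0; i < 8; i++)
            x *= v[i];

        printf("x = %g\n", x);            /* 10 * 8! = 403200 */
        return 0;
    }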

  11. Pipeline 20 September 2012 Amittai Aviram | Yale University CS 49

  12. Pipeline 20 September 2012 Amittai Aviram | Yale University CS 50

  13. Task Queue 20 September 2012 Amittai Aviram | Yale University CS 51

  14. Idioms ● Deterministic idioms: work sharing, reduction, pipeline, task queue ● Legacy ● Obsolete: Making I/O or heap allocation thread safe ● Nondeterministic ● Load balancing, random simulated interaction … 20 September 2012 Amittai Aviram | Yale University CS 52

  15. SPLASH2 (totals across water-nsquared, water-spatial, cholesky, radiosity, raytrace, volrend, barnes, ocean, radix, fmm, fft, lu)
      Deterministic constructs:  fork/join 20 (7%) | barrier 126 (46%) | work sharing 0 (0%) | reduction 0 (0%)
      Deterministic idioms:      work sharing 23 (8%) | reduction 21 (8%) | pipeline 5 (2%) | task queue 9 (3%)
      Legacy: 28 (10%)   Nondeterministic: 43 (16%)
      20 September 2012 Amittai Aviram | Yale University CS 53

  16. NPB-OMP (totals across BT, CG, DC, EP, FT, IS, LU, MG, SP, UA; 538 construct instances in all)
      Deterministic constructs:  fork/join 134 (25%) | barrier 13 (2%) | work sharing 280 (52%) | reduction 17 (3%)
      Deterministic idioms:      work sharing 0 (0%) | reduction 89 (17%) | pipeline 5 (1%) | task queue 0 (0%)
      Legacy: 0 (0%)   Nondeterministic: 0 (0%)
      20 September 2012 Amittai Aviram | Yale University CS 54

  17. PARSEC (totals across streamcluster, blackscholes, fluidanimate, swaptions, bodytrack, freqmine, raytrace, canneal, facesim, dedup, ferret, x264, vips)
      Deterministic constructs:  fork/join 48 (23%) | barrier 52 (25%) | work sharing 28 (14%) | reduction 0 (0%)
      Deterministic idioms:      work sharing 3 (1%) | reduction 7 (3%) | pipeline 21 (10%) | task queue 25 (12%)
      Legacy: 0 (0%)   Nondeterministic: 21 (10%)
      20 September 2012 Amittai Aviram | Yale University CS 55

  18. Aggregate (pie chart): Work Sharing Constructs 32.77%, Fork/Join 17.87%, Barrier 14.79%, Reduction Idioms 11.70%, Nondeterministic 8.40%, Task Queue Idioms 3.62%, Pipeline Idioms 3.30%, Legacy 2.98%, Work Sharing Idioms 2.77%, Reduction Constructs 1.81% 20 September 2012 Amittai Aviram | Yale University CS 56

  19. OpenMP Benchmarks (pie chart; all NPB-OMP plus PARSEC blackscholes, bodytrack, and freqmine): Work Sharing 52.47%, Fork/Join 25.21%, Reduction Idioms 16.35%, Simple Reductions 2.90%, Barrier 2.21%, Pipeline Idioms 0.85% 20 September 2012 Amittai Aviram | Yale University CS 57

  20. Nondeterministic Synchronization, by the idiom it implements (pie chart): Reduction Idioms 35.71%, Nondeterministic 25.65%, Task Queue Idioms 11.04%, Pipeline Idioms 10.06%, Legacy 9.09%, Work Sharing Idioms 8.44% 20 September 2012 Amittai Aviram | Yale University CS 58

  21. Conclusions ● A deterministic parallel programming model is compatible with many programs ● Reductions can help increase the number of programs that fit the model 20 September 2012 Amittai Aviram | Yale University CS 59

  22. Outline ● The Big Picture √ ● Background √ ● Analysis √ ● Design and Semantics ● Implementation ● Evaluation ● Conclusion 20 September 2012 Amittai Aviram | Yale University CS 60

  23. Outline ● The Big Picture √ ● Background √ ● Analysis √ ● Design and Semantics ● Extended Reduction ● Implementation ● Evaluation ● Conclusion 20 September 2012 Amittai Aviram | Yale University CS 61

  24. Foundations ● Workspace consistency ● Memory consistency model ● Naturally deterministic synchronization ● Working Copies Determinism ● Programming model ● Based on workspace consistency 20 September 2012 Amittai Aviram | Yale University CS 62

  25. “Parallel Swap” Example ● Initially x := 42 and y := 33 ● barrier ● Thread 0: x := y; Thread 1: y := x ● Intended effect: (x,y) := (y,x) ● Under conventional shared memory the assignments race, so the outcome is either x = y = 33 or x = y = 42 (see the sketch below) 20 September 2012 Amittai Aviram | Yale University CS 63
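
A minimal sketch (an assumed example, not from the slides) of the same swap under ordinary OpenMP shared memory: the two post-barrier assignments race, so the program may print 33 33, 42 42, or the intended 33 42. Under workspace consistency, each thread would read the other's pre-barrier value and the swap would be deterministic.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int x = 42, y = 33;
        #pragma omp parallel num_threads(2)
        {
            int id = omp_get_thread_num();
            #pragma omp barrier
            if (id == 0)
                x = y;   /* may read the old y (33) or the value thread 1 just wrote */
            else
                y = x;   /* may read the old x (42) or the value thread 0 just wrote */
        }
        printf("x = %d, y = %d\n", x, y);
        return 0;
    }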

  26. Memory Consistency Model Communication Events ● Acquire ● Acquires access to a location in shared memory ● Involves a read ● Release ● Enables access to a location in shared memory for other threads ● Involves a write 20 September 2012 Amittai Aviram | Yale University CS 64

  27. Workspace Consistency WoDet '11 ● Pair each release with a determinate acquire ● Delay visibility of updates until the next synchronization event 20 September 2012 Amittai Aviram | Yale University CS 65

  28. WC “Parallel Swap” (diagram): under workspace consistency, at the barrier each thread releases its pre-barrier value and acquires the other thread's, every release paired with a determinate acquire, so the swap completes deterministically 20 September 2012 Amittai Aviram | Yale University CS 66

  29. WC Fork/Join (diagram): at FORK, thread 0 issues a release that each child thread (1, 2, 3) acquires as it starts; the threads then compute independently; at JOIN, each child issues a release that thread 0 acquires before the children exit, so every release is paired with a determinate acquire 20 September 2012 Amittai Aviram | Yale University CS 67

  30. WC Barrier (diagram): a barrier behaves as a JOIN immediately followed by a FORK, with the same determinate release/acquire pairings between thread 0 and threads 1-3 20 September 2012 Amittai Aviram | Yale University CS 68

  31. Kahn Process Network (diagram: the Master is connected to Workers A and B by dedicated task and result channels)
      Master:
          while (true) {
              send(new_task(), out_1);
              send(next_task(), out_2);
              result = wait(in_1);
              store(result);
              result = wait(in_2);
              store(result);
          }
      Worker:
          while (true) {
              task = receive(in);
              result = process(task);
              send(result, out);
          }
      (a runnable sketch of this pattern follows)
      20 September 2012 Amittai Aviram | Yale University CS 69
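
A minimal C sketch (an assumed example, independent of DOMP) of the deterministic master/worker pattern above, with POSIX pipes standing in for Kahn-style channels: every worker has its own task and result channel, and the master always sends and waits in a fixed order, so the communication schedule cannot vary from run to run.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define NWORKERS 2
    #define NROUNDS  3

    /* Worker: receive a task, process it, send the result back; exit on -1. */
    static void worker(int task_fd, int result_fd) {
        int task;
        while (read(task_fd, &task, sizeof task) == (ssize_t)sizeof task) {
            if (task < 0)
                _exit(0);
            int result = task * task;                  /* process(task) */
            write(result_fd, &result, sizeof result);  /* send(result, out) */
        }
        _exit(0);
    }

    int main(void) {
        int task_pipe[NWORKERS][2], result_pipe[NWORKERS][2];
        for (int w = 0; w < NWORKERS; w++) {
            pipe(task_pipe[w]);
            pipe(result_pipe[w]);
            if (fork() == 0)
                worker(task_pipe[w][0], result_pipe[w][1]);
        }
        int next_task = 0;
        for (int round = 0; round < NROUNDS; round++) {
            /* send(new_task(), out_w): one task per worker, in a fixed order */
            for (int w = 0; w < NWORKERS; w++) {
                int task = next_task++;
                write(task_pipe[w][1], &task, sizeof task);
            }
            /* result = wait(in_w); store(result): again in a fixed order */
            for (int w = 0; w < NWORKERS; w++) {
                int result;
                read(result_pipe[w][0], &result, sizeof result);
                printf("worker %d returned %d\n", w, result);
            }
        }
        for (int w = 0; w < NWORKERS; w++) {           /* tell workers to exit */
            int stop = -1;
            write(task_pipe[w][1], &stop, sizeof stop);
        }
        while (wait(NULL) > 0)
            ;
        return 0;
    }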

  32. Nondeterministic Network (For Contrast) (diagram: Workers A and B share common task and result channels guarded by mutex locks)
      Master:
          while (true) {
              result = receive(in);
              store(result);
              send(new_task(), out);
          }
      Worker:
          while (true) {
              task = receive(in);
              result = process(task);
              send(result, out);
          }
      20 September 2012 Amittai Aviram | Yale University CS 70

  33. Working Copies Determinism (diagram) ● Fork: each thread (A, B) gets its own copy of the shared memory state ● While running, B reads “old” values; A's writes and B's writes stay in their respective copies ● Join: merge changes back into shared memory ● Conflicting writes → ERROR! 20 September 2012 Amittai Aviram | Yale University CS 71

  34-42. Fork/merge animation (diagram sequence) ● The parent thread holds a working copy of shared state ● FORK: the parent's reference is hidden and the master thread and threads 1 ... n-1 each receive a private working copy ● The threads compute against their own copies ● JOIN: each thread's copy is merged back ● The merged result is released as the parent thread's working copy 20 September 2012 Amittai Aviram | Yale University CS 72-80

  43. DOMP API ● Supports most OpenMP constructs ● Parallel blocks ● Work sharing ● Simple (scalar-type) reductions ● Excludes OpenMP's few nondeterministic constructs ● atomic, critical, flush ● Extends OpenMP with a generalized reduction 20 September 2012 Amittai Aviram | Yale University CS 81

  44. SEQUENTIAL Example
      // Multiply an n x m matrix A by an m x p matrix B
      // to get an n x p matrix C.
      void matrixMultiply(int n, int m, int p,
                          double ** A, double ** B, double ** C) {
          for (int i = 0; i < n; i++)
              for (int j = 0; j < p; j++) {
                  C[i][j] = 0.0;
                  for (int k = 0; k < m; k++)
                      C[i][j] += A[i][k] * B[k][j];
              }
      }
      20 September 2012 Amittai Aviram | Yale University CS 82

  45. OpenMP Example (the parallel for pragma creates new threads and distributes the work; the end of the loop joins the threads to the parent)
      // Multiply an n x m matrix A by an m x p matrix B
      // to get an n x p matrix C.
      void matrixMultiply(int n, int m, int p,
                          double ** A, double ** B, double ** C) {
          #pragma omp parallel for
          for (int i = 0; i < n; i++)
              for (int j = 0; j < p; j++) {
                  C[i][j] = 0.0;
                  for (int k = 0; k < m; k++)
                      C[i][j] += A[i][k] * B[k][j];
              }
      }
      20 September 2012 Amittai Aviram | Yale University CS 83

  46. DOMP Example (same source; the pragma now creates new threads, distributes the work, and gives each thread a copy of shared state; at the end of the loop the copies of shared variables are merged into the parent's state and the threads are joined to the parent)
      // Multiply an n x m matrix A by an m x p matrix B
      // to get an n x p matrix C.
      void matrixMultiply(int n, int m, int p,
                          double ** A, double ** B, double ** C) {
          #pragma omp parallel for
          for (int i = 0; i < n; i++)
              for (int j = 0; j < p; j++) {
                  C[i][j] = 0.0;
                  for (int k = 0; k < m; k++)
                      C[i][j] += A[i][k] * B[k][j];
              }
      }
      20 September 2012 Amittai Aviram | Yale University CS 84

  47. Extended Reduction ● OpenMP's reduction is limited ● Scalar types (no pointers!) ● Arithmetic, logical, or bitwise operations ● Benchmark programmers used nondeterministic synchronization to compensate 20 September 2012 Amittai Aviram | Yale University CS 85

  48. Typical Workaround In NPB-OMP EP (vector sum): do 155 i = 0, nq - 1 !$omp atomic q(i) = q(i) + qq(i) 155 continue 20 September 2012 Amittai Aviram | Yale University CS 86

  49. Typical Workaround In NPB-OMP EP (vector sum): do 155 i = 0, nq - 1 !$omp atomic q(i) = q(i) + qq(i) 155 continue → nondeterministic programming model; unpredictable evaluation order 20 September 2012 Amittai Aviram | Yale University CS 87

  50. DOMP Reduction API ● Binary operation op ● Arbitrary, user-defined ● Associative but not necessarily commutative ● Identity object idty ● Defined in contiguous memory ● Reduction variable object var ● Also defined in contiguous memory ● Size in bytes of idty and var 20 September 2012 Amittai Aviram | Yale University CS 88

  51. DOMP Reduction API ● Binary operation op ● Associative but not necessarily commutative ● Identity object idty ● Defined in contiguous memory ● Reduction variable object var ● Also defined in contiguous memory ● Size in bytes of idty and var void domp_xreduction(void*(*op)(void*,void*), void** var, void* idty, size_t size); 20 September 2012 Amittai Aviram | Yale University CS 89

  52. Why the Identity Object? ● DOMP preserves OpenMP's guaranteed sequential-parallel equivalence semantics ● Each thread runs op on the rhs and idty ● At merge time, each merging thread (“up-buddy”) runs op on its own and the other thread's (the “down-buddy's”) version of var ● The master thread runs op on the original var and the cumulative var from merges (a runnable simulation follows) 20 September 2012 Amittai Aviram | Yale University CS 90
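
A runnable plain-C sketch (a simulation, not DOMP's actual runtime code) of this scheme, using 2x2 matrix multiplication as op: associative but not commutative. Each simulated thread starts from the identity, folds in its share of values, partial results are merged pairwise in thread order, and the master finally folds the combined result into the original variable, reproducing the sequential left-to-right product.

    #include <stdio.h>

    typedef struct { double m[2][2]; } Mat;

    static Mat mat_mul(Mat a, Mat b) {            /* op: a * b, order matters */
        Mat c = {{{0}}};
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                for (int k = 0; k < 2; k++)
                    c.m[i][j] += a.m[i][k] * b.m[k][j];
        return c;
    }

    int main(void) {
        const Mat idty = {{{1, 0}, {0, 1}}};      /* identity object */
        Mat var = {{{2, 1}, {0, 1}}};             /* original reduction variable */
        Mat v[8];                                 /* values contributed in the loop */
        for (int i = 0; i < 8; i++) {
            Mat t = {{{1, i + 1}, {0, 1}}};
            v[i] = t;
        }

        /* Sequential reference: ((var * v0) * v1) * ... * v7 */
        Mat seq = var;
        for (int i = 0; i < 8; i++)
            seq = mat_mul(seq, v[i]);

        /* Simulated DOMP-style reduction with 4 "threads", 2 values each. */
        enum { T = 4 };
        Mat part[T];
        for (int t = 0; t < T; t++) {
            part[t] = idty;                       /* each thread starts from idty */
            for (int i = 2 * t; i < 2 * t + 2; i++)
                part[t] = mat_mul(part[t], v[i]);
        }
        /* Pairwise merge in thread order: each "up-buddy" folds in its "down-buddy". */
        for (int stride = 1; stride < T; stride *= 2)
            for (int t = 0; t + stride < T; t += 2 * stride)
                part[t] = mat_mul(part[t], part[t + stride]);
        /* Master folds the merged result into the original variable. */
        Mat par = mat_mul(var, part[0]);

        printf("sequential: %g %g / %g %g\n", seq.m[0][0], seq.m[0][1], seq.m[1][0], seq.m[1][1]);
        printf("parallel  : %g %g / %g %g\n", par.m[0][0], par.m[0][1], par.m[1][0], par.m[1][1]);
        return 0;
    }

Because op is applied in the same left-to-right order as the sequential loop would apply it, commutativity is never needed; only associativity and the identity object are.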

  53. DOMP Replacement In NPB-OMP EP (vector sum), the atomic loop
          do 155 i = 0, nq - 1
      !$omp atomic
             q(i) = q(i) + qq(i)
      155  continue
      becomes
          call xreduction_add(q_ptr, nq)
      with the C wrapper
          void xreduction_add_(void ** input, int * nq_val) {
              nq = *nq_val;
              init_idty();
              domp_xreduction(&add_, input, (void *)idty, nq * sizeof(double));
          }
      20 September 2012 Amittai Aviram | Yale University CS 91
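
The wrapper above relies on helpers that the slide does not show; the definitions below are hypothetical sketches of what they might look like, assuming the operator combines its second argument into its first and returns it, and that the identity for a vector sum is an all-zero vector of the same length.

    #include <stdio.h>
    #include <stdlib.h>

    static int nq;                     /* vector length, set by the wrapper */
    static double *idty;               /* identity object: nq zeros */

    /* Hypothetical init_idty(): allocate the all-zero identity vector. */
    static void init_idty(void) {
        idty = calloc(nq, sizeof(double));
    }

    /* Hypothetical add_(): element-wise add rhs into lhs and return lhs. */
    static void *add_(void *lhs, void *rhs) {
        double *a = lhs, *b = rhs;
        for (int i = 0; i < nq; i++)
            a[i] += b[i];
        return lhs;
    }

    int main(void) {                   /* tiny self-test of the helpers */
        nq = 4;
        init_idty();
        double q[4] = {1, 2, 3, 4};
        add_(idty, q);                 /* idty now equals q: the identity acts as zero */
        for (int i = 0; i < nq; i++)
            printf("%g ", idty[i]);
        printf("\n");
        free(idty);
        return 0;
    }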

  54. Desirable Future Extensions ● Pipeline ● Task Queue or Task Object 20 September 2012 Amittai Aviram | Yale University CS 92

  55. Desirable Future Extensions ● Pipeline ● Task Queue or Task Object. Proposed pipeline syntax:
      #pragma omp sections pipeline
      {
          while (more_work()) {
              #pragma omp section
              { do_step_a(); }
              #pragma omp section
              { do_step_b(); }
              /* ... */
              #pragma omp section
              { do_step_n(); }
          }
      }
      20 September 2012 Amittai Aviram | Yale University CS 93

  56. Outline ● The Big Picture √ ● Background √ ● Analysis √ ● Design and Semantics √ ● Implementation ● Evaluation ● Conclusion 20 September 2012 Amittai Aviram | Yale University CS 94

  57. Outline ● The Big Picture √ ● Background √ ● Analysis √ ● Design and Semantics √ ● Implementation ● Evaluation ● Conclusion 20 September 2012 Amittai Aviram | Yale University CS 95

  58. Stats ● 8 files in libgomp ● ~ 5600 LOC ● Changes in gcc/omp-low.c and *.def files ● To support deterministic simple reductions 20 September 2012 Amittai Aviram | Yale University CS 96

  59. Naive Merge Loop
      for each data segment seg in (stack, heap, bss)
          for each byte b in seg
              writer = WRITER_NONE
              for each thread t
                  if (seg[t][b] ≠ reference_copy[b])
                      if (writer ≠ WRITER_NONE)
                          race_condition_exception()
                      writer = t
              seg[MASTER][b] = seg[writer][b]
      20 September 2012 Amittai Aviram | Yale University CS 97
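
A runnable C sketch (ordinary user-level code, not DOMP's actual libgomp implementation) of the same byte-wise merge: each thread's copy of a segment is diffed against the reference copy taken at the fork, a byte changed by more than one thread raises a conflict, and otherwise the single writer's value wins.

    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 3
    #define SEGSIZE  16

    static void merge_segment(unsigned char copies[NTHREADS][SEGSIZE],
                              const unsigned char *reference,
                              unsigned char *merged) {
        for (int b = 0; b < SEGSIZE; b++) {
            int writer = -1;                       /* WRITER_NONE */
            for (int t = 0; t < NTHREADS; t++) {
                if (copies[t][b] != reference[b]) {
                    if (writer != -1) {
                        fprintf(stderr, "race condition at byte %d\n", b);
                        exit(EXIT_FAILURE);
                    }
                    writer = t;
                }
            }
            merged[b] = (writer == -1) ? reference[b] : copies[writer][b];
        }
    }

    int main(void) {
        unsigned char reference[SEGSIZE] = {0};
        unsigned char copies[NTHREADS][SEGSIZE] = {{0}};
        unsigned char merged[SEGSIZE];

        copies[0][2] = 7;          /* thread 0 wrote byte 2 */
        copies[2][9] = 5;          /* thread 2 wrote byte 9: disjoint, so it merges */

        merge_segment(copies, reference, merged);
        printf("merged[2] = %d, merged[9] = %d\n", merged[2], merged[9]);
        return 0;
    }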

  60. Improvements ● Copy on write (page granularity) ● Merge or copy pages only as needed ● Parallel merge (binary tree) ● Thread pool 20 September 2012 Amittai Aviram | Yale University CS 98

  61. Binary Tree Merge 20 September 2012 Amittai Aviram | Yale University CS 99

  62. Binary Tree Merge 20 September 2012 Amittai Aviram | Yale University CS 100
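
A minimal sketch (an illustration, not the libgomp code) of the binary-tree merge schedule: in each round every surviving copy absorbs its neighbour one stride away, so t per-thread copies are merged in about log2(t) rounds, and the merges within a round are independent and can run in parallel. The combine() here just sums per-thread counters as a stand-in for the real page-level memory merge.

    #include <stdio.h>

    #define T 8

    static void combine(long *up, const long *down) {    /* up-buddy absorbs down-buddy */
        *up += *down;
    }

    int main(void) {
        long copy[T];
        for (int t = 0; t < T; t++)
            copy[t] = t + 1;                             /* each thread's contribution */

        for (int stride = 1; stride < T; stride *= 2) {
            /* All merges within a round are independent, so they can be parallel. */
            #pragma omp parallel for
            for (int t = 0; t < T; t += 2 * stride)
                if (t + stride < T)
                    combine(&copy[t], &copy[t + stride]);
        }
        printf("merged total = %ld\n", copy[0]);         /* 1 + 2 + ... + 8 = 36 */
        return 0;
    }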
