Deterministic OpenMP: Amittai Aviram, Dissertation Defense
Deterministic OpenMP
Amittai Aviram, Dissertation Defense
Department of Computer Science, Yale University
20 September 2012
Committee: Bryan Ford, Yale University (Advisor); Zhong Shao, Yale University; Ramakrishna Gummadi, Yale
Outline
● The Big Picture √
● Background √
● Analysis
● Design and Semantics
● Implementation
● Evaluation
● Conclusion

20 September 2012 Amittai Aviram | Yale University CS
How easily could real programs conform to DOMP's deterministic programming model?
Method
● Used three parallel benchmark suites: SPLASH2, NPB-OMP, PARSEC
● 35 benchmarks in total
● Hand-counted instances of synchronization constructs
● Recorded instances of deterministic constructs
● Classified and recorded instances of nondeterministic constructs by their use
Deterministic Constructs
● Fork/join
● Barrier
● OpenMP work-sharing constructs
  ● Loop
  ● Master
  ● (Sections)
  ● (Task)
Nondeterministic Constructs
● Mutex lock/unlock
● Condition variable wait/broadcast
● (Semaphore wait/post)
● OpenMP critical
● OpenMP atomic
● (OpenMP flush)
Use in Idioms

long ProcessId;
/* Get unique ProcessId */
LOCK(Global->CountLock);
ProcessId = Global->current_id++;
UNLOCK(Global->CountLock);

barnes (SPLASH2): a work-sharing idiom
Idioms
● Work sharing
● Reduction
● Pipeline
● Task queue
● Legacy
  ● Obsolete: making I/O or heap allocation thread-safe
● Nondeterministic
  ● Load balancing, random simulated interaction, ...
Work Sharing

LOOP (n iterations) split across t threads:
  Thread 0:   iterations 0 ... n/t-1
  Thread 1:   iterations n/t ... 2n/t-1
  Thread 2:   iterations 2n/t ... 3n/t-1
  ...
  Thread t-1: iterations (t-1)n/t ... n-1
"Data parallelism" (cf. OpenMP loop work-sharing construct)

Tasks A, B, C, D, each assigned to its own thread (Thread 0 ... Thread 3)
"Task parallelism" (cf. OpenMP sections and task work-sharing constructs)
Reduction

Combine X with v0 ... v7 under *:
((((((((X * v0) * v1) * v2) * v3) * v4) * v5) * v6) * v7)
Reduction

Combine X with v0 ... v7 under *:
((((((((X * v0) * v1) * v2) * v3) * v4) * v5) * v6) * v7)

Pthreads (low-level threading) has no reduction construct. OpenMP's reduction construct allows only scalar types and simple operations.
Pipeline (diagram)
Task Queue (diagram)
Idioms
● Work sharing
● Reduction
● Pipeline
● Task queue
● Legacy
  ● Obsolete: making I/O or heap allocation thread-safe
● Nondeterministic
  ● Load balancing, random simulated interaction, ...

The first four (work sharing, reduction, pipeline, and task queue) are the DETERMINISTIC IDIOMS.
SPLASH2 (totals over barnes, cholesky, fft, fmm, lu, ocean, radiosity, radix, raytrace, volrend, water-nsquared, water-spatial)

Deterministic constructs:
  fork/join            20    7%
  barrier             126   46%
  work sharing          0    0%
  reduction             0    0%
Deterministic idioms:
  work sharing         23    8%
  reduction            21    8%
  pipeline              5    2%
  task queue            9    3%
Other:
  legacy               28   10%
  nondeterministic     43   16%
NPB-OMP

                      BT  CG  DC  EP  FT  IS  LU  MG  SP  UA  TOTAL
Deterministic constructs:
  fork/join           12   7   1   3   8   7  12  11  13  60    134  25%
  barrier              -   8   -   -   -   1   4   -   -   -     13   2%
  work sharing        37  20   -   1   8  11  71  16  38  78    280  52%
  reduction            -   6   -   1   1   -   3   2   -   4     17   3%
Deterministic idioms:
  work sharing         -   -   -   -   -   -   -   -   -   -      0   0%
  reduction            2   -   1   1   -   1   2   -   2  80     89  17%
  pipeline             -   -   -   -   -   -   5   -   -   -      5   1%
  task queue           -   -   -   -   -   -   -   -   -   -      0   0%
Other:
  legacy               -   -   -   -   -   -   -   -   -   -      0   0%
  nondeterministic     -   -   -   -   -   -   -   -   -   -      0   0%
Grand total: 538
PARSEC (totals over blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264)

Deterministic constructs:
  fork/join            48   23%
  barrier              52   25%
  work sharing         28   14%
  reduction             0    0%
Deterministic idioms:
  work sharing          3    1%
  reduction             7    3%
  pipeline             21   10%
  task queue           25   12%
Other:
  legacy                0    0%
  nondeterministic     21   10%
Aggregate (all three suites)
● Fork/join: 17.87%
● Barrier: 14.79%
● Work sharing constructs: 32.77%
● Reduction constructs: 1.81%
● Work sharing idioms: 2.77%
● Reduction idioms: 11.70%
● Pipeline idioms: 3.30%
● Task queue idioms: 3.62%
● Legacy: 2.98%
● Nondeterministic: 8.40%
OpenMP Benchmarks (all NPB-OMP plus PARSEC blackscholes, bodytrack, and freqmine)
● Fork/join: 25.21%
● Barrier: 2.21%
● Work sharing: 52.47%
● Simple reductions: 2.90%
● Reduction idioms: 16.35%
● Pipeline idioms: 0.85%
Nondeterministic Synchronization (uses, by idiom)
● Work sharing idioms: 8.44%
● Reduction idioms: 35.71%
● Pipeline idioms: 10.06%
● Task queue idioms: 11.04%
● Legacy: 9.09%
● Nondeterministic: 25.65%
Conclusions
● A deterministic parallel programming model is compatible with many real programs
● Generalized reductions can increase the number of compatible programs further
Outline
● The Big Picture √
● Background √
● Analysis √
● Design and Semantics
● Implementation
● Evaluation
● Conclusion
Outline
● The Big Picture √
● Background √
● Analysis √
● Design and Semantics
  ● Extended Reduction
● Implementation
● Evaluation
● Conclusion
Foundations
● Workspace consistency
  ● Memory consistency model
  ● Naturally deterministic synchronization
● Working Copies Determinism
  ● Programming model
  ● Based on workspace consistency
"Parallel Swap" Example

Initially x := 42, y := 33.
After a barrier, Thread 0 runs x := y while Thread 1 runs y := x.
Intended result: (x, y) := (y, x).
Under nondeterministic shared memory, either thread may see the other's write first, yielding x = y = 33 or x = y = 42 instead of a swap.
Memory Consistency Model: Communication Events
● Acquire
  ● Acquires access to a location in shared memory
  ● Involves a read
● Release
  ● Enables access to a location in shared memory for other threads
  ● Involves a write
Workspace Consistency (WoDet '11)
● Pair each release with a determinate acquire
● Delay visibility of updates until the next synchronization event
WC "Parallel Swap" (diagram): at the barrier, each thread's release rel(i,j) is paired with the other thread's determinate acquire acq(i,j), so each thread reads the value the other released before the barrier.
WC Fork/Join (diagram): at FORK, Thread 0's releases rel(1,0), rel(2,0), rel(3,0) pair with determinate acquires acq(0,0), acq(0,1), acq(0,2) as Threads 1-3 start; all threads compute; at JOIN, each child's release pairs with an acquire in Thread 0 before the child exits.
WC Barrier (diagram): a barrier is expressed as a JOIN of all threads into Thread 0 immediately followed by a FORK, each step built from the same paired release/acquire events as fork/join.
Kahn Process Network

Master (tasks out, results in, over dedicated channels out1/in1 and out2/in2):

while (true) {
    send(new_task(), out_1);
    send(next_task(), out_2);
    result = wait(in_1);
    store(result);
    result = wait(in_2);
    store(result);
}

Workers A and B (each with its own in/out channel pair):

while (true) {
    task = receive(in);
    result = process(task);
    send(result, out);
}
Nondeterministic Network (For Contrast)

Master:

while (true) {
    result = receive(in);
    store(result);
    send(new_task(), out);
}

Workers A and B:

while (true) {
    task = receive(in);
    result = process(task);
    send(result, out);
}

Tasks and results travel over common channels guarded by mutex locks, so which worker receives which task is nondeterministic.
Working Copies Determinism
● Fork: copy shared-memory state; Threads A and B each get a working copy
● While forked, B reads "old" values and never sees A's writes
● Join: merge changes back into shared memory
● Conflicting writes → ERROR!
Working-copy life cycle (diagram):
● The parent thread holds a working copy; at FORK, its reference is hidden and the copy is duplicated
● Each child (master, thread 1, ..., thread n-1) receives its own working copy
● At JOIN, the children's copies are merged back together
● The merged working copy is released to the parent thread, which resumes
DOMP API
● Supports most OpenMP constructs
  ● Parallel blocks
  ● Work sharing
  ● Simple (scalar-type) reductions
● Excludes OpenMP's few nondeterministic constructs
  ● atomic, critical, flush
● Extends OpenMP with a generalized reduction
SEQUENTIAL Example

// Multiply an n x m matrix A by an m x p matrix B
// to get an n x p matrix C.
void matrixMultiply(int n, int m, int p,
                    double **A, double **B, double **C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < m; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}
OpenMP Example

// Multiply an n x m matrix A by an m x p matrix B
// to get an n x p matrix C.
void matrixMultiply(int n, int m, int p,
                    double **A, double **B, double **C)
{
    #pragma omp parallel for  // creates new threads, distributes work
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < m; k++)
                C[i][j] += A[i][k] * B[k][j];
        }                     // joins threads to parent
}
DOMP Example

// Multiply an n x m matrix A by an m x p matrix B
// to get an n x p matrix C.
void matrixMultiply(int n, int m, int p,
                    double **A, double **B, double **C)
{
    #pragma omp parallel for  // creates new threads, distributes work
                              // + copies of shared state
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < m; k++)
                C[i][j] += A[i][k] * B[k][j];
        }                     // merges copies of shared vars into
                              // parent's state and joins threads to parent
}
Extended Reduction
● OpenMP's reduction is limited
  ● Scalar types (no pointers!)
  ● Arithmetic, logical, or bitwise operations
● Benchmark programmers used nondeterministic synchronization to compensate
Typical Workaround

In NPB-OMP EP (vector sum):

      do 155 i = 0, nq - 1
!$omp atomic
         q(i) = q(i) + qq(i)
 155  continue

Nondeterministic programming model; unpredictable evaluation order.
DOMP Reduction API
● Binary operation op
  ● Arbitrary, user-defined
  ● Associative but not necessarily commutative
● Identity object idty
  ● Defined in contiguous memory
● Reduction variable object var
  ● Also defined in contiguous memory
● Size in bytes of idty and var

void domp_xreduction(void *(*op)(void *, void *),
                     void **var, void *idty, size_t size);
Why the Identity Object?
● DOMP preserves OpenMP's guaranteed sequential-parallel equivalence semantics
● Each thread runs op on the right-hand side and idty
● At merge time, each merging thread (the "up-buddy") runs op on its own and the other thread's (the "down-buddy's") version of var
● The master thread runs op on the original var and the cumulative var from the merges
DOMP Replacement

In NPB-OMP EP (vector sum), the atomic loop

      do 155 i = 0, nq - 1
!$omp atomic
         q(i) = q(i) + qq(i)
 155  continue

becomes

      call xreduction_add(q_ptr, nq)

-------------------------------------------

void xreduction_add_(void **input, int *nq_val)
{
    nq = *nq_val;
    init_idty();
    domp_xreduction(&add_, input, (void *)idty,
                    nq * sizeof(double));
}
Desirable Future Extensions
● Pipeline
● Task queue or task object

A possible pipeline extension:

#pragma omp sections pipeline
{
    while (more_work()) {
        #pragma omp section
        { do_step_a(); }
        #pragma omp section
        { do_step_b(); }
        /* ... */
        #pragma omp section
        { do_step_n(); }
    }
}
Outline
● The Big Picture √
● Background √
● Analysis √
● Design and Semantics √
● Implementation
● Evaluation
● Conclusion
Stats
● 8 files in libgomp
● ~5600 LOC
● Changes in gcc/omp-low.c and *.def files
  ● To support deterministic simple reductions
Naive Merge Loop

for each data segment seg in (stack, heap, bss)
    for each byte b in seg
        writer = WRITER_NONE
        for each thread t
            if (seg[t][b] ≠ reference_copy[b])
                if (writer ≠ WRITER_NONE)
                    race_condition_exception()
                writer = t
        if (writer ≠ WRITER_NONE)
            seg[MASTER][b] = seg[writer][b]
Improvements
● Copy-on-write (page granularity)
● Merge or copy pages only as needed
● Parallel merge (binary tree)
● Thread pool
Binary Tree Merge (diagram)