Synchronization-Free Parallelism

Today
• SPMD and OpenMP programming models
• Synchronization-free affine partitioning algorithm
• Deriving primitive affine transformations

CS553 Lecture Synchronization-Free Parallelism 1

Two Parallel Programming Models
• SPMD
  – single program, multiple data
  – the program checks which processor it is running on and executes a subset of the iterations based on that

    MPI_Init(&argc, &argv);
    // p is the processor id (rank)
    MPI_Comm_rank(MPI_COMM_WORLD, &p);

• OpenMP
  – shared-memory, thread-based parallelism
  – pragmas indicate that a loop is fully parallel (the combined "parallel for" directive both creates the threads and divides the iterations; a bare "omp for" only splits work inside an existing parallel region)

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
    }
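On the SPMD side, the rank that MPI_Comm_rank returns decides which iterations a process runs. As a sketch, the following C computes each process's contiguous share of n iterations under a block distribution. The function names block_range and spmd_body are hypothetical, and block distribution is only one common choice, not something the lecture prescribes. Because the rank is just a parameter, the code can be exercised without an MPI installation by calling spmd_body once per simulated rank.

```c
#include <assert.h>

/* Hypothetical helper: half-open range [*lo, *hi) of the iterations
   0..n-1 that process `rank` out of `nprocs` executes (block distribution). */
void block_range(int n, int rank, int nprocs, int *lo, int *hi) {
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    int base = n / nprocs;   /* every process gets at least this many */
    int extra = n % nprocs;  /* the first `extra` processes get one more */
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

/* Every process runs this same code (SPMD) but touches only its own
   iterations. The loop body is a placeholder. */
void spmd_body(double *a, int n, int rank, int nprocs) {
    int lo, hi;
    block_range(n, rank, nprocs, &lo, &hi);
    for (int i = lo; i < hi; i++)
        a[i] = 2.0 * i;
}
```

In a real MPI program, rank and nprocs would come from MPI_Comm_rank(MPI_COMM_WORLD, &rank) and MPI_Comm_size(MPI_COMM_WORLD, &nprocs), and each process would call spmd_body exactly once with its own rank.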

Diagonal Partitioning Example
• Example

    do i = 1, 6
      do j = 1, 5
        A(i,j) = A(i-1,j-1)+1
      enddo
    enddo

  [Figure: iteration space of the loop, axes j and i]
• Goal
  – Determine an affine space partitioning that results in no synchronization needed between processors.

Space-Partition Constraints
• Accesses sharing a dependence should be mapped to the same processor
  write: $A(i, j)$    read: $A(i'-1, j'-1)$
  – loop bounds: $1 \le i, i' \le 6$, $1 \le j, j' \le 5$
  – equality constraints on dependence
  – equality constraints on space partition
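Instantiated for this example, the constraints can be written out and solved as follows. This is a sketch: $C_{11}$, $C_{12}$ and $c_1$ are our own names for the unknown partition row and constant. Because the same statement makes both accesses, they share one mapping. The choice $C_{11} = 1$ is one of the possible independent solutions.

```latex
\begin{aligned}
&\text{loop bounds: } 1 \le i,\, i' \le 6, \qquad 1 \le j,\, j' \le 5 \\
&\text{dependence (same element of } A\text{): } i = i' - 1, \qquad j = j' - 1 \\
&\text{space partition: } C_{11}\, i + C_{12}\, j + c_1 = C_{11}\, i' + C_{12}\, j' + c_1 \\
&\Rightarrow\; C_{11}(i - i') + C_{12}(j - j') = 0
 \;\Rightarrow\; -C_{11} - C_{12} = 0
 \;\Rightarrow\; C_{12} = -C_{11} \\
&\text{choose } C_{11} = 1:\quad p = i - j + c_1 \\
&\min_{1 \le i \le 6,\ 1 \le j \le 5} (i - j) = 1 - 5 = -4
 \;\Rightarrow\; c_1 = 4,\quad 0 \le p = i - j + 4 \le 9
\end{aligned}
```

This gives the mapping $p = i - j + 4$ used in the generated code, with partitions numbered 0 through 9.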

Solving the Space-Partition Constraints
• Ad-hoc approach
  – Reduce the number of unknowns
  – Simplify
  – Determine independent solutions for the space-partition matrix
  – Find the constant terms (we would like the minimum of the mapping to be non-negative)

Generate Simple Code
• Algorithm 11.45: generate code that executes the partitions of a program sequentially
  – for each statement, project out all loop index variables from the system with the original loop bounds and space-partition constraints
  – use the FM-based code generation algorithm to determine bounds over partitions
    – not needed for example
  – union the partition bounds over all statements
    – not needed for example
  – insert the space-partition predicate before each statement

    do p = 0, 9
      do i = 1, 6
        do j = 1, 5
          if (i-j+4 == p) A(i,j) = A(i-1,j-1)+1
        enddo
      enddo
    enddo
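The projection step of Algorithm 11.45 relies on Fourier-Motzkin (FM) elimination. A minimal C sketch follows. It assumes integer coefficients and uses the hypothetical names Ineq, example_system, fm_eliminate and p_bounds, none of which come from the lecture or textbook. It removes no redundant rows, and since FM computes the rational projection, the bounds it yields for general integer systems can be loose. Here it projects j and then i out of the loop bounds plus the constraint p = i - j + 4, which leaves bounds on p alone.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define NVARS 3     /* variable order: p, i, j */
#define MAXINEQ 64  /* capacity of every inequality buffer */

/* One inequality: a[0]*p + a[1]*i + a[2]*j + c >= 0 (hypothetical encoding). */
typedef struct { int a[NVARS]; int c; } Ineq;

/* Loop bounds plus p == i - j + 4, written as two opposite inequalities. */
int example_system(Ineq *s) {
    const Ineq rows[] = {
        {{ 0,  1,  0}, -1},  /* i - 1 >= 0          */
        {{ 0, -1,  0},  6},  /* 6 - i >= 0          */
        {{ 0,  0,  1}, -1},  /* j - 1 >= 0          */
        {{ 0,  0, -1},  5},  /* 5 - j >= 0          */
        {{ 1, -1,  1}, -4},  /* p - i + j - 4 >= 0  */
        {{-1,  1, -1},  4},  /* -p + i - j + 4 >= 0 */
    };
    memcpy(s, rows, sizeof rows);
    return (int)(sizeof rows / sizeof rows[0]);
}

/* Eliminate variable k: copy rows that do not mention k, then add one
   nonnegative combination per (lower bound, upper bound) pair so that the
   coefficient of k cancels. Returns the number of rows written to out. */
int fm_eliminate(const Ineq *in, int n, int k, Ineq *out) {
    int m = 0;
    for (int x = 0; x < n; x++)
        if (in[x].a[k] == 0) { assert(m < MAXINEQ); out[m++] = in[x]; }
    for (int lo = 0; lo < n; lo++) {
        if (in[lo].a[k] <= 0) continue;              /* not a lower bound on k */
        for (int up = 0; up < n; up++) {
            if (in[up].a[k] >= 0) continue;          /* not an upper bound on k */
            int wl = -in[up].a[k], wu = in[lo].a[k]; /* positive multipliers */
            Ineq r;
            for (int v = 0; v < NVARS; v++)
                r.a[v] = wl * in[lo].a[v] + wu * in[up].a[v];
            r.c = wl * in[lo].c + wu * in[up].c;
            assert(r.a[k] == 0 && m < MAXINEQ);
            out[m++] = r;
        }
    }
    return m;
}

/* Floor and ceiling of x / d for d > 0 (C division truncates toward 0). */
static int floor_div(int x, int d) { return x / d - ((x % d != 0) && (x < 0)); }
static int ceil_div(int x, int d)  { return x / d + ((x % d != 0) && (x > 0)); }

/* With only p left, each row reads a*p + c >= 0. Returns 0 if a constant
   row is violated (empty system), else 1 with the tightest bounds on p. */
int p_bounds(const Ineq *s, int n, int *lo, int *hi) {
    *lo = INT_MIN;
    *hi = INT_MAX;
    for (int x = 0; x < n; x++) {
        int a = s[x].a[0], c = s[x].c;
        assert(s[x].a[1] == 0 && s[x].a[2] == 0);
        if (a > 0) {
            int b = ceil_div(-c, a);   /* p >= -c / a */
            if (b > *lo) *lo = b;
        } else if (a < 0) {
            int b = floor_div(c, -a);  /* p <= c / (-a) */
            if (b < *hi) *hi = b;
        } else if (c < 0) {
            return 0;
        }
    }
    return 1;
}
```

Calling fm_eliminate with k = 2 (j) and then k = 1 (i) on example_system's rows, followed by p_bounds, yields 0 ≤ p ≤ 9, the bounds of the outer p loop.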

Eliminate Empty Iterations
• Apply the FM-based code generation algorithm to the resulting iteration space
  – for each statement
    – use the unioned p bounds, the statement iteration space and the space-partition constraints
    – determine new bounds for the statement iteration space
  – union the iteration space for all statements in the same loop
    – for example, we did this when determining the bounds on p

    do p = 0, 9
      do i = max(1,-3+p), min(6,p+1)
        do j = max(1,i+4-p), min(5,i+4-p)
          if (i-j+4 == p) A(i,j) = A(i-1,j-1)+1
        enddo
      enddo
    enddo

Eliminate Tests from Innermost Loops
• General approach: apply the following repeatedly
  – select an inner loop containing statements with different bounds
  – split the loop using a condition that places a statement in only one of the splits
  – generate code for the split iteration spaces
• In this example the i bounds guarantee 1 <= i+4-p <= 5, so the j loop runs only for j = i+4-p; the predicate is therefore always true and is simply dropped

    do p = 0, 9
      do i = max(1,-3+p), min(6,p+1)
        do j = max(1,i+4-p), min(5,i+4-p)
          A(i,j) = A(i-1,j-1)+1
        enddo
      enddo
    enddo
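The final, test-free code can be checked against the original loop nest by running both. The following is a small C harness. It assumes a 0-based C array with an extra row and column 0 standing in for the Fortran elements A(0,·) and A(·,0) that the first iterations read. The names Grid, init_grid, original and partitioned are hypothetical. The two versions should agree because each dependence A(i-1,j-1) → A(i,j) stays on one diagonal p, and within a partition i still increases.

```c
#include <assert.h>
#include <string.h>

/* C has no built-in min/max for int, so define them here. */
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Rows 0..6 and columns 0..5; row 0 and column 0 hold initial values only. */
typedef int Grid[7][6];

/* Deterministic starting values so both versions begin identically. */
void init_grid(Grid A) {
    for (int i = 0; i < 7; i++)
        for (int j = 0; j < 6; j++)
            A[i][j] = 10 * i + j;
}

/* The original sequential loop nest. */
void original(Grid A) {
    for (int i = 1; i <= 6; i++)
        for (int j = 1; j <= 5; j++)
            A[i][j] = A[i - 1][j - 1] + 1;
}

/* Partitions run one after another with the tightened bounds and no
   predicate. Returns how many statement instances executed. */
int partitioned(Grid A) {
    int count = 0;
    for (int p = 0; p <= 9; p++)
        for (int i = MAX(1, -3 + p); i <= MIN(6, p + 1); i++)
            for (int j = MAX(1, i + 4 - p); j <= MIN(5, i + 4 - p); j++) {
                assert(i - j + 4 == p);  /* the dropped test always held */
                A[i][j] = A[i - 1][j - 1] + 1;
                count++;
            }
    return count;
}
```

Running original on one grid and partitioned on another, both from init_grid, should execute 30 statement instances (6 × 5) in the partitioned version and leave the two grids identical.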

Using the Two Programming Models
• SPMD and MPI: each process executes the partition whose number equals its rank, so this version assumes exactly 10 processes (ranks 0 through 9). C has no built-in min/max for int, so these are assumed to be macros or helper functions.

    MPI_Comm_rank(MPI_COMM_WORLD, &p);
    for (i = max(1,-3+p); i <= min(6,p+1); i++)
        for (j = max(1,i+4-p); j <= min(5,i+4-p); j++)
            A[i][j] = A[i-1][j-1]+1;

• OpenMP: i and j are declared inside the loops so that each thread gets its own private copies (only the parallel loop variable p is made private automatically)

    #pragma omp parallel for
    for (int p = 0; p <= 9; p++)
        for (int i = max(1,-3+p); i <= min(6,p+1); i++)
            for (int j = max(1,i+4-p); j <= min(5,i+4-p); j++)
                A[i][j] = A[i-1][j-1]+1;

Derive Re-indexing by using Space Partition Constraints
• Source Code

    for (i=1; i<=N; i++) {
        Y[i] = Z[i];    /* s1 */
        X[i] = Y[i-1];  /* s2 */
    }

• Transformed Code

    if (N>=1) X[1]=Y[0];
    for (p=1; p<=N-1; p++) {
        Y[p] = Z[p];
        X[p+1]=Y[p];
    }
    if (N>=1) Y[N] = Z[N];
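One way to confirm that the transformed code computes the same result as the source is to run both from the same inputs. The space-partition constraint puts s1 at iteration i in partition p = i and s2 at iteration i' in partition p = i' - 1, so partition p holds the producer and the consumer of Y[p]. The C sketch below checks this for several N. The names MAXN, source_loop, reindexed_loop and same_result, and the test data, are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAXN 16  /* arrays hold indices 0..MAXN-1, so N may go up to MAXN-1 */

/* Source loop: s2 in iteration i reads Y[i-1], written by s1 in iteration i-1. */
void source_loop(int n, int X[], int Y[], const int Z[]) {
    for (int i = 1; i <= n; i++) {
        Y[i] = Z[i];      /* s1 */
        X[i] = Y[i - 1];  /* s2 */
    }
}

/* Re-indexed loop: s1 from iteration p and s2 from iteration p+1 share
   partition p, so Y[p] is produced and consumed in the same iteration. */
void reindexed_loop(int n, int X[], int Y[], const int Z[]) {
    if (n >= 1) X[1] = Y[0];          /* partition 0 holds only s2 */
    for (int p = 1; p <= n - 1; p++) {
        Y[p] = Z[p];
        X[p + 1] = Y[p];
    }
    if (n >= 1) Y[n] = Z[n];          /* partition n holds only s1 */
}

/* Runs both versions from identical inputs; returns 1 if X and Y match. */
int same_result(int n) {
    int X1[MAXN], Y1[MAXN], X2[MAXN], Y2[MAXN], Z[MAXN];
    assert(n >= 0 && n < MAXN);
    for (int k = 0; k < MAXN; k++) {
        X1[k] = X2[k] = -1;
        Y1[k] = Y2[k] = 100 + k;
        Z[k] = 7 * k;
    }
    source_loop(n, X1, Y1, Z);
    reindexed_loop(n, X2, Y2, Z);
    return memcmp(X1, X2, sizeof X1) == 0 && memcmp(Y1, Y2, sizeof Y1) == 0;
}
```

Because no iteration of the re-indexed loop reads a value written by another iteration, its p loop carries no dependence.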

Concepts
• Two parallel programming models
  – SPMD
  – OpenMP
• Deriving a synchronization-free affine partitioning
  – setting up the space-partition constraints (keep iterations involved in a dependence on the same processor)
  – solving the space-partition constraints (linear algebra)
  – eliminating empty iterations (Fourier-Motzkin)
  – eliminating tests from inner loops (more Fourier-Motzkin)
  – using the above to derive primitive affine transformations

Next Time
• Lecture
  – Tiling!
• Suggested Exercises
  – Be able to derive the synchronization-free affine partitioning for Example 11.41 in the book.
  – Show how the other primitive affine transformations are derived.
