Programming Distributed Memory Systems Using OpenMP
Rudolf Eigenmann, Ayon Basumallik, Seung-Jai Min
School of Electrical and Computer Engineering, Purdue University
http://www.ece.purdue.edu/ParaMount
Is OpenMP a useful programming model for distributed systems?

OpenMP is a parallel programming model that assumes a shared address space:

  #pragma omp parallel for
  for (i=1; i<n; i++) { a[i] = b[i]; }

Why is it difficult to implement OpenMP for distributed processors? The compiler or runtime system will need to
- partition and place data onto the distributed memories
- send/receive messages to orchestrate remote data accesses
HPF (High Performance Fortran) was a large-scale effort to do so - without success.

So, why should we try (again)? OpenMP is an easier (higher-productivity?) programming model. It allows programs to be incrementally parallelized starting from the serial versions, and it relieves the programmer of the task of managing the movement of logically shared data.

R. Eigenmann, Purdue HIPS 2007 2
Two Translation Approaches
- Use a Software Distributed Shared Memory System
- Translate OpenMP directly to MPI
Approach 1: Compiling OpenMP for Software Distributed Shared Memory
Inter-procedural Shared Data Analysis

  SUBROUTINE SUB0
  INTEGER DELTAT
  CALL DCDTZ(DELTAT,…)
  CALL DUDTZ(DELTAT,…)
  END

  SUBROUTINE DCDTZ(A, B, C)
  INTEGER A,B,C
  C$OMP PARALLEL
  C$OMP+PRIVATE (B, C)
        A = …
        CALL CCRANK
        …
  C$OMP END PARALLEL
  END

  SUBROUTINE DUDTZ(X, Y, Z)
  INTEGER X,Y,Z
  C$OMP PARALLEL
  C$OMP+REDUCTION(+:X)
        X = X + …
  C$OMP END PARALLEL
  END

  SUBROUTINE CCRANK()
  …
  beta = 1 - alpha
  …
  END
Access Pattern Analysis

  DO istep = 1, itmax, 1
  !$OMP PARALLEL DO
    … = u(i, j, k)..
    rsd(i, j, k) = rsd(i, j, k)..
  !$OMP END PARALLEL DO
  !$OMP PARALLEL DO
    … = u(i, j, k)..
    rsd(i, j, k) = rsd(i, j, k)..
  !$OMP END PARALLEL DO
  !$OMP PARALLEL DO
    … = u(i, j, k)..
    rsd(i, j, k) = ...
  !$OMP END PARALLEL DO
    CALL RHS()
  ENDDO

  SUBROUTINE RHS()
  !$OMP PARALLEL DO
    u(i, j, k) = …
  !$OMP END PARALLEL DO
  !$OMP PARALLEL DO
    rsd(i, j, k) = …
  !$OMP END PARALLEL DO
  !$OMP PARALLEL DO
    rsd(i, j, k) = …
  !$OMP END PARALLEL DO
  !$OMP PARALLEL DO
    u(i, j, k) = rsd(i, j, k)
  !$OMP END PARALLEL DO
  END
=> Data Distribution-Aware Optimization

  DO istep = 1, itmax, 1
  !$OMP PARALLEL DO
    … = u(i, j, k)..
    rsd(i, j, k) = rsd(i, j, k)..
  !$OMP END PARALLEL DO
  !$OMP PARALLEL DO
    … = u(i, j, k)..
    rsd(i, j, k) = rsd(i, j, k)..
  !$OMP END PARALLEL DO
  !$OMP PARALLEL DO
    … = u(i, j, k)..
    rsd(i, j, k) = ...
  !$OMP END PARALLEL DO
    CALL RHS()
  ENDDO

  SUBROUTINE RHS()
  !$OMP PARALLEL DO
    u(i, j, k) = …
  !$OMP END PARALLEL DO
  !$OMP PARALLEL DO
    rsd(i, j, k) = …
  !$OMP END PARALLEL DO
  !$OMP PARALLEL DO
    rsd(i, j, k) = …
  !$OMP END PARALLEL DO
  !$OMP PARALLEL DO
    u(i, j, k) = rsd(i, j, k)
  !$OMP END PARALLEL DO
  END
Adding Redundant Computation to Eliminate Communication

OpenMP Program:

  DO k = 1, z
  !$OMP PARALLEL DO
    DO j = 1, N, 1
      flux(m, j) = u(3, i, j, k) + …
    ENDDO
  !$OMP PARALLEL DO
    DO j = 1, N, 1
      DO m = 1, 5, 1
        rsd(m, i, j, k) = … + flux(m, j+1) - flux(m, j-1)
      ENDDO
    ENDDO
  ENDDO

S-DSM Program:

  init00 = (N/proc_num)*(pid-1)…
  limit00 = (N/proc_num)*pid…
  DO k = 1, z
    DO j = init00, limit00, 1
      flux(m, j) = u(3, i, j, k) + …
    ENDDO
    CALL TMK_BARRIER(0)
    DO j = init00, limit00, 1
      DO m = 1, 5, 1
        rsd(m, i, j, k) = … + flux(m, j+1) - flux(m, j-1)
      ENDDO
    ENDDO
  ENDDO

Optimized S-DSM Code:

  init00 = (N/proc_num)*(pid-1)…
  limit00 = (N/proc_num)*pid…
  new_init = init00 - 1
  new_limit = limit00 + 1
  DO k = 1, z
    DO j = new_init, new_limit, 1
      flux(m, j) = u(3, i, j, k) + …
    ENDDO
    CALL TMK_BARRIER(0)
    DO j = init00, limit00, 1
      DO m = 1, 5, 1
        rsd(m, i, j, k) = … + flux(m, j+1) - flux(m, j-1)
      ENDDO
    ENDDO
  ENDDO
Access Privatization
Example from equake (SPEC OMPM2001)

Before - SHARED VARS:

  If (master) {
    shared->ARCHnodes = …..
    shared->ARCHduration = …
    ...
  }
  /* Parallel Region */
  N = shared->ARCHnodes ;
  iter = shared->ARCHduration;
  …...

After - PRIVATE READ-ONLY VARIABLES:

  // Done by all nodes
  {
    ARCHnodes = …..
    ARCHduration = …
    ...
  }
  /* Parallel Region */
  N = ARCHnodes ;
  iter = ARCHduration ;
  …...
Optimized Performance of OMPM2001 Benchmarks
[Chart: SPEC OMP2001M performance on 1, 2, 4, and 8 processors for swim, mgrid, applu, equake, art, and wupwise; baseline performance vs. optimized performance.]
A Key Question: How Close Are We to MPI Performance?
[Chart: SPEC OMP2001 performance on 1, 2, 4, and 8 processors for wupwise, mgrid, applu, and swim; baseline performance vs. optimized performance vs. MPI performance.]
Towards Adaptive Optimization
A combined compiler-runtime scheme:
- The compiler identifies repetitive access patterns
- The runtime system learns the actual remote addresses and sends data early
Ideal program characteristics:
- Inner, parallel loops nested inside an outer, serial loop
- Data addresses are invariant, or form a linear sequence, with respect to the outer loop
- Communication points at barriers
Current Best Performance of OpenMP for S-DSM
[Chart: performance on 1, 2, 4, and 8 processors for wupwise, swim, applu, SpMul, and CG; baseline (no opt.) vs. locality opt vs. locality opt + comp/run opt.]
Approach 2: Translating OpenMP Directly to MPI
- Baseline translation
- Overlapping computation and communication for irregular accesses
Baseline Translation of OpenMP to MPI: Execution Model
- SPMD model
- Serial regions are replicated on all processes
- Iterations of parallel for loops are distributed (using static block scheduling)
- Shared data is allocated on all nodes
- There is no concept of "owner" - only producers and consumers of shared data
- At the end of a parallel loop, producers communicate shared data to "potential" future consumers
- Array section analysis is used for summarizing array accesses
Baseline Translation
Translation steps:
1. Identify all shared data
2. Create annotations for accesses to shared data (use regular section descriptors to summarize array accesses)
3. Use interprocedural data flow analysis to identify potential consumers; incorporate OpenMP relaxed consistency specifications
4. Create message sets to communicate data between producers and consumers
Message Set Generation
For every write, determine all future reads:

  V1: <write,A,1,l1(p),u1(p)>
  …
  <read,A,1,l2(p),u2(p)>
  <write,A,1,l3(p),u3(p)>
  <read,A,1,l5(p),u5(p)>
  …
  <read,A,1,l4(p),u4(p)>

The message set at RSD vertex V1, for array A from process p to process q, is computed as:

  SApq = elements of A with subscripts in the set
           {[l1(p),u1(p)] ∩ [l2(q),u2(q)]}
         U {[l1(p),u1(p)] ∩ [l4(q),u4(q)]}
         U {[l1(p),u1(p)] ∩ ([l5(q),u5(q)] - [l3(p),u3(p)])}
Baseline Translation of Irregular Accesses
Irregular access - A[B[i]], A[f(i)]
- Reads: the whole array is assumed to be accessed
- Writes: inspect at runtime, communicate at the end of the parallel loop
We can often do better than "conservative": monotonic array values => sharpen access regions
Optimizations Based on Collective Communication
- Recognition of reduction idioms: translate to MPI_Reduce / MPI_Allreduce functions
- Casting sends/receives in terms of all-to-all calls: beneficial where the producer-consumer relationship is many-to-many and there is insufficient distance between producers and consumers
Performance of the Baseline OpenMP to MPI Translation
Platform II - Sixteen IBM SP-2 WinterHawk-II nodes connected by a high-performance switch.
We Can Do More for Irregular Applications
- Subscripts of accesses to shared arrays are not always analyzable at compile time:

  L1: #pragma omp parallel for
      for (i=0; i<10; i++)
          A[i] = ...

  L2: #pragma omp parallel for
      for (j=0; j<20; j++)
          B[j] = A[ C[j] ] + ...

- Baseline OpenMP-to-MPI translation: conservatively estimate that each process accesses the entire array
- Try to deduce properties such as monotonicity for the irregular subscript to refine the estimate
- Still, there may be redundant communication
- Runtime tests (inspection) are needed to resolve accesses
[Figure: elements of array A produced by process 1 and by process 2; subscript array C (1, 3, 5, 0, 2 ….. 2, 4, 8, 1, 2 ...) determines the accesses made on process 1 and process 2.]