Lecture 12: SIMD machines & data parallelism, dependency analysis for automatic vectorizing and parallelizing of serial programs
Part 1

SIMD: Single Instruction Multiple Data
Parallelism through simultaneous operations on different data
Fine grain parallelism
• Systolic arrays
• Parallel SIMD machines
  – 10k++ processors
• Vector/Pipeline units

Systolic Array
• Network of "processors", memory around
  – Performance by doing all computations before storing the data back
• Often hardware implementations solving one problem
  – Special topologies
  – Extends the FPU's instructions
  – Smart memory
[Figure labels: Host, Controller, Memory]

SIMD Machine
• Front End
  – Normal von Neumann
  – Runs the application program
• Processor array
  – Synchronous
  – The same operation at the same time, or idle
  – Small memory/processor
  – I/O
• Example
  – ILLIAC IV, IBM GF 11, Maspar, CM200 (Bellman 16k)

Data Parallel Programming
• Idea: update the elements of an array at the same time
• Divides the work between the programmer and the compiler
• The programmer solves the problem in their own model
  – Concentrates on structure and concepts at a high level
  – Collective operations on large data structures
  – Keeps data in large arrays with mapping information
• The compiler maps the program onto a physical machine
  – Fills in all the details (gladly receives hints from the user)
  – Optimizes computations and communications

Building Blocks in Data Parallel Programming
• The user controls the placing of data on processors
  – Minimize communication; keep all processors busy
• Operations on whole arrays
  – Apply one operation on each element in the array in parallel
• Methods to access parts of an array
• Operations can operate on these parts
  – Example: element < 0 ⇒ element := 1
• Reduction operations on arrays
  – Produce a result from a combination of many array elements: sum, max, min, ...
• Shift operations along the axes of multidimensional arrays
• Scan operations
  – Prefix/suffix operations
• Generalized communication
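The building blocks listed above can be imitated sequentially in plain Python. The sketch below is only an illustration and not code from the lecture: Python lists stand in for parallel arrays, and the helper names `masked_update` and `shift` are made up for this example.

```python
from functools import reduce
from itertools import accumulate
import operator


def masked_update(values, predicate, new_value):
    """Replace every element for which predicate holds (element < 0 => element := 1)."""
    return [new_value if predicate(v) else v for v in values]


def shift(values, offset, fill=0):
    """Shift a 1-D array by offset positions; vacated positions receive fill."""
    n = len(values)
    return [values[i - offset] if 0 <= i - offset < n else fill for i in range(n)]


data = [3, -1, 4, -1, 5]

doubled = [2 * v for v in data]                    # one operation on every element
clamped = masked_update(data, lambda v: v < 0, 1)  # operation on a selected part
total = reduce(operator.add, data)                 # reduction: sum
largest = reduce(max, data)                        # reduction: max
shifted = shift(data, 1)                           # shift one position along the axis
prefix = list(accumulate(data))                    # scan: running (prefix) sums
```

On a SIMD machine each of these would be a single collective operation over all array positions; the list comprehensions here only reproduce the results, not the parallel execution.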
C*
• Supports broadcast, reduction and interprocessor communication
• shape defines the number of elements and their organization
    shape [16384]employees;      /* 1-D */
    shape [512][512]image;       /* 2-D */
• Left-indexing: indexing that refers to parallel variables
  – The 1st dimension is axis 0, the 2nd is axis 1, etc.
    int:employees employee_id;
  – [2]employee_id refers to the 3rd element in employee_id

C*
• Parallel variables have a type and a shape
    shape [16384]employees;
    struct date {
        int month;
        int day;
        int year;
    };
    struct date:employees birthday;
• Each element in the parallel variable birthday contains a date
• birthday.month specifies all 16384 month fields in the parallel variable birthday

C*
• Overloading
  – x = y + z (adds y and z in each position in the shape)
• New operations
  – a, b scalar or parallel
  – a <? b: min of two variables
  – a >? b: max of two variables
• Selection of shape (with)
    shape [16384]numbers;
    int:numbers x, y, z;
    with (numbers)
        x = y + z;

C* - Parallel Operations: setting the context
• where
  – Limits the area where the operation is performed
    with (numbers)
        where (z != 0)    /* sets active positions */
            x = y / z;
        else              /* reverses active positions */
            x = y;
• everywhere
  – All positions active, independently of earlier context

C* - Communication
• Grid communication
  – pcoord (~mynode) gives my index along an axis in the shape
• Example: send the value of source to the element dest that is one position higher up
    [pcoord(0) + 1]dest = source;
• Dot (.) is sometimes used instead of pcoord
    [. + 1]dest = source;
    [. + 1][. - 2]dest = source;

Compute Pi in C*
$\pi \approx \frac{1}{N} \sum_{i=0}^{N-1} \frac{4}{1 + x_i^2}$, where $x_i = (i + 1/2)/N$

    #define N 400000
    shape [N]chunk;
    double:chunk x;

    main()
    {
        double sum;
        double width;

        width = 1.0 / N;
        with (chunk)                            /* in parallel */
        {
            x = (pcoord(0) + 0.5) * width;
            sum = (+= (4.0 / (1.0 + x * x)));   /* sum reduction */
        }
        sum = sum * width;
        printf("Estimate of Pi = %14.12f\n", sum);
    }
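As a cross-check of the Pi formula, here is a minimal sequential sketch in Python (not C*, and not from the lecture). The function name `estimate_pi` is hypothetical; the loop index `i` plays the role of `pcoord(0)`, and `math.fsum` stands in for the `+=` reduction over all positions.

```python
import math


def estimate_pi(n):
    """Midpoint-rule estimate of pi: (1/n) * sum of 4 / (1 + x_i^2), x_i = (i + 1/2) / n."""
    width = 1.0 / n
    # One x value per "position", as in the parallel variable x of shape chunk.
    x = [(i + 0.5) * width for i in range(n)]
    # Reduction over all positions, corresponding to the += reduction in C*.
    total = math.fsum(4.0 / (1.0 + xi * xi) for xi in x)
    return total * width


if __name__ == "__main__":
    print("Estimate of Pi = %14.12f" % estimate_pi(400000))
```

The computation of each `x` value is independent of all others, which is why the C* version can assign every position of `x` in one parallel step and leave only the reduction as a collective operation.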
Compute Partial Sums in Array (C*)

    #define N 1024
    shape [N]ArrayShape;
    int:ArrayShape x;
    int i;

    main()
    {
        with (ArrayShape)                               /* select shape */
            for (i = 1; i <= log2(N); i++)
                where (pcoord(0) >= pow(2, i-1))        /* active positions */
                    x += [pcoord(0) - pow(2, i-1)]x;    /* left indexing */
    }

High Performance Fortran
• Data parallel language (many similarities to CM FORTRAN)
• For SIMD and MIMD (NUMA) machines
• Based on F90 (F77)
  – Array operations
  – User defined data types
  – Recursion and dynamic memory allocation
  – Pointers
  – Control of data distribution
  – Parallel constructs
• Data mapping directives
• FORALL statements and constructs
• INDEPENDENT directive, etc.
[Figure: compilation flow with the labels HPF, F77 + Mess. Pass, SPMD, Exe-file]

The PROCESSORS directive
• Declares an abstract processor arrangement on which data is mapped
• Each element of this arrangement corresponds to a node on the physical machine
• The declarations are often parametrized with the intrinsic function NUMBER_OF_PROCESSORS
    !HPF$ PROCESSORS p(NUMBER_OF_PROCESSORS()/2,2)
  – The !HPF$ prefix makes the directive a Fortran comment

The DISTRIBUTE directive
• Controls the mapping of data onto processors
• BLOCK distribution
  – Each processor stores a consecutive block of the array
    REAL a(16)
    !HPF$ PROCESSORS p(4)
    !HPF$ DISTRIBUTE a(BLOCK) ONTO p

    P1   P2   P3   P4
     1    5    9   13
     2    6   10   14
     3    7   11   15
     4    8   12   16

• BLOCK,BLOCK distribution
  – For multidimensional arrays, separate blocking in each dimension
    REAL a(7,7)
    !HPF$ PROCESSORS p(2,2)
    !HPF$ DISTRIBUTE a(BLOCK, BLOCK) ONTO p

The DISTRIBUTE directive
• CYCLIC distribution
    REAL a(16)
    !HPF$ PROCESSORS p(4)
    !HPF$ DISTRIBUTE a(CYCLIC) ONTO p

    P1   P2   P3   P4
     1    2    3    4
     5    6    7    8
     9   10   11   12
    13   14   15   16

• CYCLIC,CYCLIC distribution
    REAL a(7,7)
    !HPF$ PROCESSORS p(2,2)
    !HPF$ DISTRIBUTE a(CYCLIC, CYCLIC) ONTO p

The DISTRIBUTE directive
• CYCLIC,BLOCK distribution
  – It is not necessary to have the same distribution in all dimensions
    REAL a(7,7)
    !HPF$ PROCESSORS p(2,2)
    !HPF$ DISTRIBUTE a(CYCLIC, BLOCK) ONTO p
    !HPF$ DISTRIBUTE a(BLOCK, CYCLIC) ONTO p
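The BLOCK and CYCLIC mappings can be written down as small index formulas. The Python sketch below is an illustration, not part of HPF or of the lecture: the function names `block_owner`, `cyclic_owner`, `layout` and `owner_2d` are hypothetical, indices are 1-based as in Fortran, and the block size is assumed to be the ceiling of n/p.

```python
def block_owner(i, n, p):
    """Processor (1-based) that owns element i of an n-element array under BLOCK."""
    block = -(-n // p)  # ceiling of n / p: assumed block size
    return (i - 1) // block + 1


def cyclic_owner(i, p):
    """Processor (1-based) that owns element i under CYCLIC (round robin)."""
    return (i - 1) % p + 1


def layout(n, p, kind):
    """Map each processor number to the list of elements it stores."""
    result = {proc: [] for proc in range(1, p + 1)}
    for i in range(1, n + 1):
        proc = block_owner(i, n, p) if kind == "BLOCK" else cyclic_owner(i, p)
        result[proc].append(i)
    return result


def owner_2d(i, j, n, p, kinds):
    """Coordinates in a p x p arrangement owning a(i, j) of an n x n array.

    kinds is a pair such as ("BLOCK", "CYCLIC"), one entry per dimension.
    """
    def one(index, kind):
        return block_owner(index, n, p) if kind == "BLOCK" else cyclic_owner(index, p)

    return (one(i, kinds[0]), one(j, kinds[1]))


block_map = layout(16, 4, "BLOCK")    # P1 holds 1..4, P2 holds 5..8, ...
cyclic_map = layout(16, 4, "CYCLIC")  # P1 holds 1, 5, 9, 13, ...
```

For the 7 x 7 arrays on `p(2,2)`, `owner_2d(5, 2, 7, 2, ("BLOCK", "CYCLIC"))` treats each dimension independently, which is the point of the mixed distributions: the row index is blocked while the column index is dealt out round robin.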
The ALIGN directive
• Describes mapping relations between interacting objects
• Both objects are allocated on the same processor
    REAL a(6), b(6)
    !HPF$ ALIGN a(I) WITH b(I)

    a(1)  a(2)  a(3)  a(4)  a(5)  a(6)
    b(1)  b(2)  b(3)  b(4)  b(5)  b(6)

    REAL a(4,4), b(4,10)
    !HPF$ ALIGN a(I,J) WITH b(I, 2*J+1)

    a(I,J)   J=1      J=2      J=3      J=4
    I=1      b(1,3)   b(1,5)   b(1,7)   b(1,9)
    I=2      b(2,3)   b(2,5)   b(2,7)   b(2,9)
    I=3      b(3,3)   b(3,5)   b(3,7)   b(3,9)
    I=4      b(4,3)   b(4,5)   b(4,7)   b(4,9)

Example: Simple Matrix Multiplication
[Figure: matrices C, A, B]

    PROGRAM ABmult
    INTEGER, PARAMETER :: N = 100
    INTEGER, DIMENSION (N,N) :: A, B, C
    INTEGER :: i, j
    !HPF$ PROCESSORS SQ(2,2)
    !HPF$ DISTRIBUTE C(BLOCK,BLOCK) ONTO SQ
    !HPF$ ALIGN A(i,*) WITH C(i,*)
    ! replicate copies of row A(i,*)
    ! onto processors which compute C(i,j)
    !HPF$ ALIGN B(*,j) WITH C(*,j)
    ! replicate copies of column B(*,j)
    ! onto processors which compute C(i,j)
    A = 1
    B = 2
    C = 0
    DO i = 1, N
      DO j = 1, N
        ! All the work is local due to ALIGNs
        C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
      END DO
    END DO
    END

The FORALL statement
• Generalization of array assignment and masked array assignment (NOT a loop)
• Single statement FORALL
  – FORALL (index, mask) forall-assignment
  – Equivalent to array assignment in F90
  – For every index, check the mask
  – Compute the right hand side for unmasked values
  – Carry out the assignments to the left hand side
• Multiple statement FORALL semantics
  – FORALL (index, mask) forall-body-list END FORALL
  – forall-body can be FORALL, WHERE, or ordinary forall-assignments
  – Abbreviation of a series of single statement FORALLs

The INDEPENDENT directive
• States that no iteration affects any other iteration in any way
  – Is used to give the compiler extra information about the execution of a DO or FORALL
• Applied on DO: states that there are no loop-carried dependencies
• Applied on FORALL: states that no index points to an address used by any other object

    !HPF$ INDEPENDENT
    DO I = 1, N
      A(INDX(I)) = B(I)
    END DO

The INDEPENDENT directive
Without the directive:

    FORALL (I=1:3)
      L1(I) = R1(I)
      L2(I) = R2(I)
    END FORALL

With the directive:

    !HPF$ INDEPENDENT
    FORALL (I=1:3)
      L1(I) = R1(I)
      L2(I) = R2(I)
    END FORALL

Assume that R1(3) and R2(1) take longer time due to communication.
[Figure: execution timelines for both versions, with Sync points between the evaluation and assignment steps and the label "Time gained"]

Game of LIFE

    INTEGER LIFE(64, 64), NCOUNT(64, 64)
    !HPF$ ALIGN LIFE WITH NCOUNT
    !HPF$ DISTRIBUTE LIFE(BLOCK, BLOCK)
    ..... INIT LIFE .....
    NCOUNT = 0
    DO M = 1, NUMBER_OF_GENERATIONS
      FORALL (I=2:63, J=2:63)
        NCOUNT(I,J) = SUM(LIFE(I-1:I+1,J-1:J+1)) - LIFE(I,J)
      END FORALL
      ! Create next generation
      WHERE ((LIFE.EQ.0).AND.(NCOUNT.EQ.3))
        LIFE = 1
      END WHERE
      WHERE ((LIFE.EQ.1).AND.(NCOUNT.NE.2).AND.(NCOUNT.NE.3))
        LIFE = 0
      END WHERE
    END DO
    END
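One generation of the Game of LIFE program can be traced with a sequential Python sketch. This is an illustration under stated assumptions, not the lecture's code: the function name `life_step` is hypothetical, indices are 0-based, and the neighbour count is computed for interior cells only, mirroring `FORALL (I=2:63, J=2:63)` with `NCOUNT` left at 0 on the border.

```python
def life_step(life):
    """Return the next generation of a 0/1 grid given as a list of lists."""
    rows, cols = len(life), len(life[0])

    # Neighbour counts for interior cells only; border counts stay 0.
    ncount = [[0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            block = sum(life[a][b] for a in (i - 1, i, i + 1) for b in (j - 1, j, j + 1))
            ncount[i][j] = block - life[i][j]

    # All counts are complete before any cell changes, as in the FORALL/WHERE version.
    nxt = [row[:] for row in life]
    for i in range(rows):
        for j in range(cols):
            if life[i][j] == 0 and ncount[i][j] == 3:
                nxt[i][j] = 1      # birth
            elif life[i][j] == 1 and ncount[i][j] not in (2, 3):
                nxt[i][j] = 0      # death by isolation or overcrowding
    return nxt


# A horizontal "blinker" turns vertical after one generation and back after two.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
after_one = life_step(blinker)
after_two = life_step(after_one)
```

Separating the counting phase from the update phase is what the FORALL semantics guarantee: every right hand side is evaluated before any assignment takes place, so no cell sees a partially updated neighbourhood.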
