Distributed Memory Programming

Wolfgang Schreiner
Research Institute for Symbolic Computation (RISC-Linz)
Johannes Kepler University, A-4040 Linz, Austria
Wolfgang.Schreiner@risc.uni-linz.ac.at
http://www.risc.uni-linz.ac.at/people/schreine
SIMD Mesh Matrix Multiplication

Single Instruction, Multiple Data:
• n² processors,
• 3n time.

Algorithm: see the next slide.
SIMD Mesh Matrix Multiplication

1. Precondition the array:
   • shift row i by i − 1 elements west,
   • shift column j by j − 1 elements north.
2. Multiply and add. On processor ⟨i, j⟩:
   c = Σ_k a_ik · b_kj

• Inverted dimensions:
  – matrix: ↓ i, → j;
  – processor array: ↓ iyproc, → ixproc.
• n shift and n arithmetic operations.
• n² processors.

Maspar program: see slide. (A sequential C simulation of the algorithm follows below.)
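The Maspar program itself is not reproduced in this extract. As an illustration only, here is a minimal sequential C simulation of the mesh algorithm (staggering plus n multiply-and-shift rounds); the array names and the use of modular index arithmetic in place of actual mesh shifts are assumptions, not the original program:

#include <stdio.h>
#define N 4

int main(void)
{
  int A[N][N], B[N][N], C[N][N] = {{0}};
  int S[N][N], T[N][N];          /* staggered copies of A and B */
  int i, j, k;

  for (i = 0; i < N; i++)        /* example input: B = identity */
    for (j = 0; j < N; j++) {
      A[i][j] = i + j;
      B[i][j] = (i == j);
    }

  /* 1. Precondition: shift row i of A west by i, column j of B north by j */
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      S[i][j] = A[i][(j + i) % N];
      T[i][j] = B[(i + j) % N][j];
    }

  /* 2. n rounds: multiply and add, then shift A west and B north by one */
  for (k = 0; k < N; k++) {
    int S2[N][N], T2[N][N];
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        C[i][j] += S[i][j] * T[i][j];
        S2[i][j] = S[i][(j + 1) % N];
        T2[i][j] = T[(i + 1) % N][j];
      }
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        S[i][j] = S2[i][j];
        T[i][j] = T2[i][j];
      }
  }

  printf("C[1][2] = %d\n", C[1][2]);  /* A * identity: expect A[1][2] = 3 */
  return 0;
}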
SIMD Cube Matrix Multiplication

Cube of n³ processors.

[Figure: processor cube with dimensions nxproc, nyproc, nzproc, compass directions N/S/W and U/D, and faces labeled A, B, C]

Idea:
• Map A(i, j) to all P(j, i, k).
• Map B(i, j) to all P(i, k, j).
SIMD Cube Matrix Multiplication

Multiplication and addition:
• Each processor P_ijk computes a single product:
  c_ijk = a_ik · b_kj
• The products along each bar in the x-direction are summed up; P_0ij obtains:
  C_ij = Σ_k c_ijk

[Figure: cube cell combining A(i,k) and B(k,j) into C(i,j)]
SIMD Cube Matrix Multiplication

Maspar program:

int A[N][N], B[N][N], C[N][N];   /* singular (front-end) arrays */
plural int a, b, c;              /* one copy per processor */

a = A[iyproc][ixproc];           /* distribute A and B over the cube */
b = B[ixproc][izproc];
c = a*b;                         /* each processor: one product */
for (i = 0; i < N-1; i++)        /* N-1 steps: sum along the x-axis */
  if (ixproc > 0) c = xnetE[1].c;   /* pass partial sums westwards */
  else c += xnetE[1].c;             /* ixproc == 0 accumulates */
if (ixproc == 0) C[iyproc][izproc] = c;

• O(n³) processors,
• O(n) time.
SIMD Cube Matrix Multiplication

Tree-like summation:

plural int x, d;
...
x = ixproc; d = 1;
while (d < N) {
  if (x % 2 != 0) break;   /* odd nodes drop out, keep their c */
  c += xnetE[d].c;         /* even nodes fetch from distance d east */
  x /= 2; d *= 2;          /* double the distance each round */
}
if (ixproc == 0) C[iyproc][izproc] = c;

• O(log n) time,
• O(n³) processors.

Long-distance communication required!
SIMD Hypercube Matrix Multiplication

[Figure: hypercubes of dimension d = 0, 1, 2, 3, 4 with nodes labeled by d-bit binary indices]

• d-dimensional hypercube ⇒ processors indexed with d bits.
• p1 and p2 differ in i bits ⇒ the shortest path between p1 and p2 has length i.
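The distance property is easy to state in code; a small C helper (not from the slides) that counts the differing bits of two processor indices:

/* Length of the shortest path between hypercube processors p1 and p2:
   the Hamming distance, i.e. the number of differing index bits. */
int hypercube_distance(int p1, int p2)
{
  int diff = p1 ^ p2, count = 0;
  while (diff != 0) {
    count += diff & 1;
    diff >>= 1;
  }
  return count;
}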
SIMD Hypercube Matrix Multiplication

Mapping a cube of side length n onto a hypercube of dimension d:
• hypercube of n³ = 2^d processors ⇒ d = 3s (for some s);
• e.g. 64 processors ⇒ n = 4, d = 6, s = 2.

[Figure: hypercube index bits d5 d4 d3 d2 d1 d0 formed from the cube coordinates x, y, z]

• Embedding algorithm:
  – write the cube indices in binary form (s bits each),
  – concatenate the indices (3s = d bits).
• Neighbor processors in the cube remain neighbors in the hypercube.
• Any cube algorithm can be executed with the same efficiency on the hypercube.
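A sketch of the embedding in C, assuming s-bit cube coordinates x, y, z (the function name and argument order are illustrative, not from the slides):

/* Embed cube coordinates (x, y, z), each s bits wide, into a hypercube
   index of d = 3s bits by concatenating the binary indices. */
int cube_to_hypercube(int x, int y, int z, int s)
{
  return (x << (2 * s)) | (y << s) | z;
}

For example, with 64 processors (s = 2), (x, y, z) = (3, 1, 2) yields the index 11 01 10 in binary, i.e. processor 54.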
SIMD Hypercube Matrix Multiplication

Tree summation in the hypercube (r_k/s_k: the k-th receiver/sender pair of a step):

Processor  000  001  010  011  100  101  110  111
Step 1     r0   s0   r1   s1   r2   s2   r3   s3
Step 2     r0        s0        r1        s1
Step 3     r0                  s0

• Each processor receives values from neighboring processors only.
• Only short-distance communication is required.

The cube algorithm can be more efficient on the hypercube!
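A minimal sequential C simulation of this communication pattern (not from the slides): in step d = 1, 2, 4, ..., every processor whose index is a multiple of 2d receives from the neighbor obtained by flipping bit log₂ d.

#define P 8   /* number of processors, a power of two */

/* Tree summation over a hypercube: after the loop, value[0] holds
   the sum of all entries; every transfer is between direct neighbors. */
void hypercube_sum(int value[P])
{
  int d, i;
  for (d = 1; d < P; d *= 2)        /* steps 1, 2, 3, ... */
    for (i = 0; i < P; i += 2 * d)
      value[i] += value[i + d];     /* i receives from neighbor i + d */
}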
Row/Column-Oriented Matrix Multiplication

1. Load row A_i on each processor P_i.
2. For all P_i do:
     for j = 0 to N−1
       Receive B_j from root
       C_ij = A_i * B_j
3. Collect the C_i.

Broadcasting of each B_j ⇒ Step 2 takes O(N log N) time.
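The slides give only pseudocode; as one possible realization, here is a hedged MPI sketch of Step 2 (MPI is not used in the lecture; the function name and the row-major layout of B are assumptions):

#include <mpi.h>

/* Illustrative MPI sketch of Step 2: process i owns row A_i; the root
   broadcasts one column B_j per iteration and every process
   accumulates C_ij = A_i * B_j. */
void multiply_rows(const double *A_i,  /* my row, length N */
                   const double *B,    /* N*N, row-major, root only */
                   double *C_i,        /* my result row, length N */
                   int N, int root)
{
  int rank, j, k;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  double B_j[N];
  for (j = 0; j < N; j++) {
    if (rank == root)                  /* extract column j of B */
      for (k = 0; k < N; k++)
        B_j[k] = B[k * N + j];
    MPI_Bcast(B_j, N, MPI_DOUBLE, root, MPI_COMM_WORLD);
    C_i[j] = 0.0;                      /* inner product A_i . B_j */
    for (k = 0; k < N; k++)
      C_i[j] += A_i[k] * B_j[k];
  }
}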
Ring Algorithm

See Quinn, Figure 7-15.

Change the order of multiplication by using a ring of processors.

1. Load A_i and B_i on each processor P_i.
2. For all P_i do:
     p = (i+1) mod N
     j = i
     for k = 0 to N−1 do
       C_ij = A_i * B_j
       j = (j+1) mod N
       Receive B_j from P_p
3. Collect the C_i.

Point-to-point communication ⇒ Step 2 takes O(N) time.
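Again as an illustration only, a hedged MPI sketch of Step 2 (MPI and the name ring_multiply are not from the lecture): the B blocks rotate around the ring, flowing from each process to its predecessor, so every process sees each B_j exactly once.

#include <mpi.h>

/* Process i holds row A_i and a buffer that initially contains B_i. */
void ring_multiply(const double *A_i, double *B_buf, double *C_i, int N)
{
  int i, k, m;
  MPI_Comm_rank(MPI_COMM_WORLD, &i);   /* assume N processes */
  int j = i;
  for (k = 0; k < N; k++) {
    C_i[j] = 0.0;                      /* C_ij = A_i * B_j */
    for (m = 0; m < N; m++)
      C_i[j] += A_i[m] * B_buf[m];
    j = (j + 1) % N;
    /* pass my block to the predecessor, receive from the successor */
    MPI_Sendrecv_replace(B_buf, N, MPI_DOUBLE,
                         (i - 1 + N) % N, 0,   /* destination */
                         (i + 1) % N, 0,       /* source */
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
}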
Hypercube Algorithm

Problem: how to embed the ring into a hypercube?

• Simple solution H(i) = i:
  – ring processor i is mapped to hypercube processor H(i);
  – massive non-neighbor communication!
• How to preserve neighbor-to-neighbor communication? (See Quinn, Figure 5-13.)
• Requirements for H(i):
  – H must be a 1-to-1 mapping,
  – H(i) and H(i+1) must differ in 1 bit,
  – H(0) and H(N−1) must differ in 1 bit.

Can we construct such a function H?
Ring Successor

Assume H is given.
• Given: hypercube processor number i.
• Wanted: the "ring successor" S(i):

  S(i) = H(0),            if H⁻¹(i) = N − 1,
  S(i) = H(H⁻¹(i) + 1),   otherwise.

(A C sketch of S follows after the Gray code functions below.)

The same technique works for embedding a 2-D mesh into a hypercube (see Quinn, Figure 5-14).
Gray Codes

Recursive construction.

• 1-bit Gray code G¹:

  i   G¹(i)
  0   0
  1   1

• n-bit Gray code Gⁿ:

  i          Gⁿ(i)              i         Gⁿ(i)
  0          0 Gⁿ⁻¹(0)          2ⁿ−1      1 Gⁿ⁻¹(0)
  1          0 Gⁿ⁻¹(1)          2ⁿ−2      1 Gⁿ⁻¹(1)
  ...        ...                ...       ...
  2ⁿ⁻¹−1     0 Gⁿ⁻¹(2ⁿ⁻¹−1)     2ⁿ⁻¹      1 Gⁿ⁻¹(2ⁿ⁻¹−1)

• The required properties are preserved by the construction!

H(i) = G(i) = i xor ⌊i/2⌋
Gray Code Computation

C functions.

• Gray code:

int G(int i)
{
  return(i ^ (i/2));        /* i xor floor(i/2) */
}

• Inverse Gray code:

int G_inv(int i)
{
  int answer, mask;
  answer = i;
  mask = answer/2;
  while (mask > 0) {        /* answer = i ^ (i>>1) ^ (i>>2) ^ ... */
    answer = answer ^ mask;
    mask = mask / 2;
  }
  return(answer);
}
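With G and G_inv in place, the ring successor from the earlier slide can be written directly; a small sketch (the name ring_succ and the parameter N are illustrative):

/* Ring successor of hypercube processor i, for a ring of N = 2^d
   processors embedded via the Gray code: S(i) = G(G_inv(i) + 1),
   wrapping around to G(0) = 0 after the last ring position. */
int ring_succ(int i, int N)
{
  int r = G_inv(i);         /* position of i in the ring */
  return (r == N - 1) ? G(0) : G(r + 1);
}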
Block-Oriented Algorithm

A = | A11 A12 |      B = | B11 B12 |
    | A21 A22 |          | B21 B22 |

C = | C11 C12 | = | A11 B11 + A12 B21   A11 B12 + A12 B22 |
    | C21 C22 |   | A21 B11 + A22 B21   A21 B12 + A22 B22 |

• Use the block-oriented distribution introduced for shared memory multiprocessors:
  block-matrix multiplication is analogous to scalar matrix multiplication (see the sketch below).
• Use the staggering technique introduced for the 2D SIMD mesh:
  rotation along rows and columns.
• Perform the SIMD matrix multiplication algorithm on whole submatrices:
  submatrices are multiplied and shifted.
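To illustrate that block-matrix multiplication mirrors the scalar algorithm, here is a small sequential C sketch (the grid size Q and all names are assumptions): the triple loop over block indices (I, J, K) is the scalar triple loop with multiply-add replaced by submatrix multiply-add.

#define N 8
#define Q 2           /* Q x Q grid of blocks */
#define S (N / Q)     /* block size */

/* C = A * B, computed block-wise: C_IJ = sum over K of A_IK * B_KJ. */
void block_multiply(const double A[N][N], const double B[N][N],
                    double C[N][N])
{
  for (int I = 0; I < Q; I++)
    for (int J = 0; J < Q; J++) {
      for (int i = 0; i < S; i++)         /* clear block C_IJ */
        for (int j = 0; j < S; j++)
          C[I*S + i][J*S + j] = 0.0;
      for (int K = 0; K < Q; K++)         /* C_IJ += A_IK * B_KJ */
        for (int i = 0; i < S; i++)
          for (int j = 0; j < S; j++)
            for (int k = 0; k < S; k++)
              C[I*S + i][J*S + j] +=
                  A[I*S + i][K*S + k] * B[K*S + k][J*S + j];
    }
}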
Analysis of Algorithm

n × n matrices, p processors.

• Row/column-oriented:
  – computation per iteration: n²/p · n/p = n³/p²,
  – communication per iteration: 2(λ + βn²/p),
  – p iterations.
• Block-oriented (staggering ignored):
  – computation per iteration: n²/p · n/√p = n³/p^(3/2),
  – communication per iteration: 4(λ + βn²/p),
  – √p − 1 iterations.
• Comparison of the total communication costs:

  2p(λ + βn²/p) > 4(√p − 1)(λ + βn²/p)
  2λp + 2βn² > 4λ(√p − 1) + 4β(√p − 1)n²/p

  Comparing the λ-terms and the β-terms separately, this reduces to
  1. p > 2(√p − 1),
  2. 1 > 2(√p − 1)/p.
  Both hold for all p ≥ 1; e.g. for p = 16 the row/column-oriented
  algorithm pays 2p = 32 message startups, the block-oriented one
  only 4(√p − 1) = 12.

Even when the staggering is included, the block-oriented algorithm performs better for larger p!