CS 140: Numerical Examples on Shared Memory with Cilk++
• Matrix-matrix multiplication
• Matrix-vector multiplication
• Hyperobjects
Thanks to Charles E. Leiserson for some of these slides.
Work and Span (Recap)
T_P = execution time on P processors
T_1 = work
T_∞ = span*
Speedup on P processors: T_1 / T_P
Parallelism: T_1 / T_∞
*Also called critical-path length or computational depth.
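An illustrative numeric example (mine, not from the slides): if T_1 = 10^6 and T_∞ = 10^3, the parallelism is T_1/T_∞ = 10^3. Since the speedup T_1/T_P can exceed neither P nor T_1/T_∞, near-linear speedup is possible only while P stays well below the parallelism; on P = 100 processors, for instance, the speedup is at most 100.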
Cilk Loops: Divide and Conquer

// Vector addition
cilk_for (int i=0; i<n; ++i) {
    A[i] += B[i];
}

The cilk_for loop is implemented by divide and conquer, with grain size G.
Work: T_1 = Θ(n)
Span: T_∞ = Θ(lg n)
Parallelism: T_1/T_∞ = Θ(n / lg n)
Assume that G = Θ(1).
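A sketch of the divide-and-conquer expansion that a cilk_for loop effectively compiles into (illustrative code under the stated assumption G = Θ(1); the names vadd_recur and GRAINSIZE are mine, not the runtime's):

const int GRAINSIZE = 1;   // the grain size G

template <typename T>
void vadd_recur(T *A, T *B, int lo, int hi) {
    if (hi - lo <= GRAINSIZE) {
        for (int i = lo; i < hi; ++i)   // serial leaf of at most G iterations
            A[i] += B[i];
    } else {
        int mid = lo + (hi - lo) / 2;   // split the iteration range in half
        cilk_spawn vadd_recur(A, B, lo, mid);
        vadd_recur(A, B, mid, hi);
        cilk_sync;
    }
}
// vadd_recur(A, B, 0, n) performs the same updates as the cilk_for above.

With G = Θ(1) the recursion tree has Θ(n) leaves and depth Θ(lg n), which is where the work and span bounds above come from.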
Square-Matrix Multiplication

C = A · B, where A, B, and C are n × n matrices:

c_ij = Σ_{k=1}^{n} a_ik · b_kj

Assume for simplicity that n = 2^k.
Parallelizing Matrix Multiply

cilk_for (int i=0; i<n; ++i) {
    cilk_for (int j=0; j<n; ++j) {
        for (int k=0; k<n; ++k) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Work: T_1 = Θ(n^3)
Span: T_∞ = Θ(n)  (the serial inner loop contributes Θ(n) span, dominating the Θ(lg n) of each cilk_for)
Parallelism: T_1/T_∞ = Θ(n^2)
For 1000 × 1000 matrices, parallelism ≈ (10^3)^2 = 10^6.
Recursive Matrix Multiplication

Divide and conquer: partition each n × n matrix into four n/2 × n/2 blocks.

[C11 C12]   [A11 A12]   [B11 B12]
[C21 C22] = [A21 A22] · [B21 B22]

            [A11·B11  A11·B12]   [A12·B21  A12·B22]
          = [A21·B11  A21·B12] + [A22·B21  A22·B22]

8 multiplications of n/2 × n/2 matrices.
1 addition of n × n matrices.
D&C Matrix Multiplication

template <typename T>
void MMult(T *C, T *A, T *B, int n) {
    T *D = new T[n*n];              // temporary for the second set of products
    // base case & partition matrices: n is the row/column length of the
    // matrices; the submatrices (C11, A11, ...) are determined by index
    // calculation; coarsen the base case for efficiency
    cilk_spawn MMult(C11, A11, B11, n/2);
    cilk_spawn MMult(C12, A11, B12, n/2);
    cilk_spawn MMult(C22, A21, B12, n/2);
    cilk_spawn MMult(C21, A21, B11, n/2);
    cilk_spawn MMult(D11, A12, B21, n/2);
    cilk_spawn MMult(D12, A12, B22, n/2);
    cilk_spawn MMult(D22, A22, B22, n/2);
    MMult(D21, A22, B21, n/2);
    cilk_sync;
    MAdd(C, D, n);                  // C += D;
    delete[] D;                     // free the temporary
}
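The slide's code leaves the submatrix setup implicit ("determine submatrices by index calculation"). A minimal sketch of one way to do it, assuming row-major storage and an explicit leading dimension ld carried by each recursive call; the helper name Partition and the ld parameter are assumptions of this sketch, not part of the slide's code:

template <typename T>
void Partition(T *M, int n, int ld,
               T *&M11, T *&M12, T *&M21, T *&M22) {
    // Pointers to the four n/2 x n/2 blocks of the n x n block starting at M,
    // where element (i, j) of a block lives at block[i*ld + j].
    M11 = M;                      // rows [0, n/2),  columns [0, n/2)
    M12 = M + n/2;                // rows [0, n/2),  columns [n/2, n)
    M21 = M + (n/2) * ld;         // rows [n/2, n),  columns [0, n/2)
    M22 = M21 + n/2;              // rows [n/2, n),  columns [n/2, n)
}

With this representation MMult and MAdd would also take ld, and the base case would be coarsened: below some threshold, fall back to a serial triple loop rather than spawning 1 × 1 multiplications.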
Matrix Addition

The MAdd called after the cilk_sync in MMult:

template <typename T>
void MAdd(T *C, T *D, int n) {
    cilk_for (int i=0; i<n; ++i) {
        cilk_for (int j=0; j<n; ++j) {
            C[n*i+j] += D[n*i+j];
        }
    }
}
Analysis of Matrix Addition

template <typename T>
void MAdd(T *C, T *D, int n) {
    cilk_for (int i=0; i<n; ++i) {
        cilk_for (int j=0; j<n; ++j) {
            C[n*i+j] += D[n*i+j];
        }
    }
}

Work: A_1(n) = Θ(n^2)
Span: A_∞(n) = Θ(lg n)
Nested cilk_for statements have the same Θ(lg n) span.
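Spelling out that span bound: each cilk_for over n iterations contributes Θ(lg n) of span, the Θ(1) body adds nothing asymptotically, and nested spans compose additively along a path, so

A_∞(n) = Θ(lg n) + Θ(lg n) + Θ(1) = Θ(lg n).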
Work of Matrix Multiplication

template <typename T>
void MMult(T *C, T *A, T *B, int n) {
    T *D = new T[n*n];
    // base case & partition matrices
    cilk_spawn MMult(C11, A11, B11, n/2);
    cilk_spawn MMult(C12, A11, B12, n/2);
    ⋮
    cilk_spawn MMult(D22, A22, B22, n/2);
    MMult(D21, A22, B21, n/2);
    cilk_sync;
    MAdd(C, D, n);  // C += D;
}

Work: M_1(n) = 8 M_1(n/2) + A_1(n) + Θ(1)
            = 8 M_1(n/2) + Θ(n^2)
            = Θ(n^3)
[Master method, Case 1: n^(log_b a) = n^(log_2 8) = n^3, f(n) = Θ(n^2)]
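For reference, the two master-method cases invoked on this slide and the next, for recurrences of the form T(n) = a·T(n/b) + f(n) (a standard statement, not from the slides):
∙ Case 1: if f(n) = O(n^(log_b a − ε)) for some ε > 0, then T(n) = Θ(n^(log_b a)).
∙ Case 2: if f(n) = Θ(n^(log_b a) · lg^k n) for some constant k ≥ 0, then T(n) = Θ(n^(log_b a) · lg^(k+1) n).
Here a = 8, b = 2, n^(log_2 8) = n^3, and f(n) = Θ(n^2) is polynomially smaller, so Case 1 gives M_1(n) = Θ(n^3).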
Span of Matrix Multiplication

template <typename T>
void MMult(T *C, T *A, T *B, int n) {
    T *D = new T[n*n];
    // base case & partition matrices
    cilk_spawn MMult(C11, A11, B11, n/2);
    cilk_spawn MMult(C12, A11, B12, n/2);
    ⋮
    cilk_spawn MMult(D22, A22, B22, n/2);
    MMult(D21, A22, B21, n/2);
    cilk_sync;
    MAdd(C, D, n);  // C += D;
}

The eight spawned multiplications run in parallel, so only the maximum of their spans counts.

Span: M_∞(n) = M_∞(n/2) + A_∞(n) + Θ(1)
            = M_∞(n/2) + Θ(lg n)
            = Θ(lg^2 n)
[Master method, Case 2: n^(log_b a) = n^(log_2 1) = 1, f(n) = Θ(n^(log_b a) · lg^1 n)]
Parallelism of Matrix Multiply

Work: M_1(n) = Θ(n^3)
Span: M_∞(n) = Θ(lg^2 n)
Parallelism: M_1(n) / M_∞(n) = Θ(n^3 / lg^2 n)

For 1000 × 1000 matrices, parallelism ≈ (10^3)^3 / 10^2 = 10^7.
Stack Temporaries

template <typename T>
void MMult(T *C, T *A, T *B, int n) {
    T *D = new T[n*n];   // the temporary
    // base case & partition matrices
    cilk_spawn MMult(C11, A11, B11, n/2);
    ⋮
    MMult(D21, A22, B21, n/2);
    cilk_sync;
    MAdd(C, D, n);  // C += D;
}

IDEA: Since minimizing storage tends to yield higher performance, trade off parallelism for less storage.
No-Temp Matrix Multiplication

// C += A*B;
template <typename T>
void MMult2(T *C, T *A, T *B, int n) {
    // base case & partition matrices
    cilk_spawn MMult2(C11, A11, B11, n/2);
    cilk_spawn MMult2(C12, A11, B12, n/2);
    cilk_spawn MMult2(C22, A21, B12, n/2);
    MMult2(C21, A21, B11, n/2);
    cilk_sync;
    cilk_spawn MMult2(C11, A12, B21, n/2);
    cilk_spawn MMult2(C12, A12, B22, n/2);
    cilk_spawn MMult2(C22, A22, B22, n/2);
    MMult2(C21, A22, B21, n/2);
    cilk_sync;
}

Saves space, but at what expense?
Work of No-Temp Multiply

// C += A*B;
template <typename T>
void MMult2(T *C, T *A, T *B, int n) {
    // base case & partition matrices
    cilk_spawn MMult2(C11, A11, B11, n/2);
    ⋮
    MMult2(C21, A22, B21, n/2);
    cilk_sync;
}

Work: M_1(n) = 8 M_1(n/2) + Θ(1)
            = Θ(n^3)
[Master method, Case 1: n^(log_b a) = n^(log_2 8) = n^3, f(n) = Θ(1)]
Span of No-Temp Multiply

// C += A*B;
template <typename T>
void MMult2(T *C, T *A, T *B, int n) {
    // base case & partition matrices
    cilk_spawn MMult2(C11, A11, B11, n/2);   // phase 1: max over these
    cilk_spawn MMult2(C12, A11, B12, n/2);   //   four parallel calls
    cilk_spawn MMult2(C22, A21, B12, n/2);
    MMult2(C21, A21, B11, n/2);
    cilk_sync;
    cilk_spawn MMult2(C11, A12, B21, n/2);   // phase 2: max over these
    cilk_spawn MMult2(C12, A12, B22, n/2);   //   four parallel calls
    cilk_spawn MMult2(C22, A22, B22, n/2);
    MMult2(C21, A22, B21, n/2);
    cilk_sync;
}

Within each phase only the maximum span of the four parallel calls counts; the two phases execute in sequence, so their spans add.

Span: M_∞(n) = 2 M_∞(n/2) + Θ(1)
            = Θ(n)
[Master method, Case 1: n^(log_b a) = n^(log_2 2) = n, f(n) = Θ(1)]
Parallelism of No-Temp Multiply

Work: M_1(n) = Θ(n^3)
Span: M_∞(n) = Θ(n)
Parallelism: M_1(n) / M_∞(n) = Θ(n^2)

For 1000 × 1000 matrices, parallelism ≈ (10^3)^2 = 10^6.
Faster in practice! The parallelism is lower but still ample, and avoiding the n × n temporary yields higher performance, as anticipated on the Stack Temporaries slide.
How General Was That?

∙ Matrices are often rectangular.
∙ Even when they are square, the dimensions are rarely a power of two.

[Figure: C (m × n) = A (m × k) · B (k × n)]

Which dimension should we split?
General Matrix Multiplication

template <typename T>
void MMult3(T *A, T *B, T *C,
            int i0, int i1, int j0, int j1, int k0, int k1) {
    int di = i1 - i0;
    int dj = j1 - j0;
    int dk = k1 - k0;
    if (di >= dj && di >= dk && di >= THRESHOLD) {
        // split m if it is the largest dimension
        int mi = i0 + di / 2;
        MMult3(A, B, C, i0, mi, j0, j1, k0, k1);
        MMult3(A, B, C, mi, i1, j0, j1, k0, k1);
    } else if (dj >= dk && dj >= THRESHOLD) {
        // split n if it is the largest dimension
        int mj = j0 + dj / 2;
        MMult3(A, B, C, i0, i1, j0, mj, k0, k1);
        MMult3(A, B, C, i0, i1, mj, j1, k0, k1);
    } else if (dk >= THRESHOLD) {
        // split k if it is the largest dimension
        int mk = k0 + dk / 2;
        MMult3(A, B, C, i0, i1, j0, j1, k0, mk);
        MMult3(A, B, C, i0, i1, j0, j1, mk, k1);
    } else {
        // iterative (triple-nested loop) multiply
        for (int i = i0; i < i1; ++i)
            for (int j = j0; j < j1; ++j)
                for (int k = k0; k < k1; ++k)
                    C[i][j] += A[i][k] * B[k][j];
    }
}
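A usage sketch (my call, not from the slides), assuming C is m × n, A is m × k, B is k × n, C has been initialized (e.g., to zero), and THRESHOLD is defined; MMult3 then accumulates C += A·B over the full index ranges:

// [i0, i1) ranges over the m rows of A and C,
// [j0, j1) over the n columns of B and C,
// [k0, k1) over the shared dimension.
MMult3(A, B, C, 0, m, 0, n, 0, k);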
Parallelizing General MMult

template <typename T>
void MMult3(T *A, T *B, T *C,
            int i0, int i1, int j0, int j1, int k0, int k1) {
    int di = i1 - i0;
    int dj = j1 - j0;
    int dk = k1 - k0;
    if (di >= dj && di >= dk && di >= THRESHOLD) {
        int mi = i0 + di / 2;
        cilk_spawn MMult3(A, B, C, i0, mi, j0, j1, k0, k1);
        MMult3(A, B, C, mi, i1, j0, j1, k0, k1);
    } else if (dj >= dk && dj >= THRESHOLD) {
        int mj = j0 + dj / 2;
        cilk_spawn MMult3(A, B, C, i0, i1, j0, mj, k0, k1);
        MMult3(A, B, C, i0, i1, mj, j1, k0, k1);
    } else if (dk >= THRESHOLD) {
        int mk = k0 + dk / 2;
        // Unsafe to spawn here unless we use a temporary!
        MMult3(A, B, C, i0, i1, j0, j1, k0, mk);
        MMult3(A, B, C, i0, i1, j0, j1, mk, k1);
    } else {
        // iterative (triple-nested loop) multiply
        for (int i = i0; i < i1; ++i)
            for (int j = j0; j < j1; ++j)
                for (int k = k0; k < k1; ++k)
                    C[i][j] += A[i][k] * B[k][j];
    }
    // implicit cilk_sync at the end of the function
}
Split m

No races, safe to spawn! The two subproblems update disjoint halves of the rows of C.

[Figure: C (m × n) = A (m × k) · B (k × n), with A and C each split horizontally into top and bottom halves.]
Split n

No races, safe to spawn! The two subproblems update disjoint halves of the columns of C.

[Figure: C (m × n) = A (m × k) · B (k × n), with B and C each split vertically into left and right halves.]
Split k

Data races, unsafe to spawn! Both subproblems accumulate into the same block of C.

[Figure: C (m × n) = A (m × k) · B (k × n), with A split vertically and B split horizontally along k; both halves contribute to all of C.]
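As the Parallelizing General MMult slide hints, the k split could be spawned if the second half accumulated into a scratch block that is added back after the sync. A rough sketch of that alternative; the helper MMult3Into (which would accumulate into a separate di × dj buffer D rather than into C) is hypothetical and not part of the slides:

template <typename T>
void MMult3SplitK(T *A, T *B, T *C,
                  int i0, int i1, int j0, int j1, int k0, int k1) {
    int di = i1 - i0, dj = j1 - j0;
    int mk = k0 + (k1 - k0) / 2;
    T *D = new T[di * dj]();       // zero-initialized di x dj scratch block
    cilk_spawn MMult3(A, B, C, i0, i1, j0, j1, k0, mk);  // first k-half into C
    MMult3Into(D, A, B, i0, i1, j0, j1, mk, k1);         // hypothetical: second k-half into D
    cilk_sync;
    cilk_for (int i = 0; i < di; ++i)                    // C += D, in parallel
        for (int j = 0; j < dj; ++j)
            C[i0 + i][j0 + j] += D[i * dj + j];
    delete[] D;
}

This restores parallelism across the k dimension at the cost of extra memory and an extra addition pass: the same storage-versus-parallelism trade-off seen with MMult versus MMult2.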