Matrix reuse and data locality in parallel $y = Az$ and $z = A^T x$

Exploiting Matrix Reuse and Data Locality in Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Architectures

Kadir Akbudak² (speaker), Ozan Karsavuran¹, Cevdet Aykanat¹
sites.google.com/site/kadircs
kadir.cs@gmail.com
¹ Bilkent University, Turkey
² KAUST, KSA

SIAM Workshop on Combinatorial Scientific Computing (CSC), Albuquerque, NM, USA, October 10-12, 2016

O. Karsavuran, K. Akbudak, and C. Aykanat, Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Architectures, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 27(6), pp. 1713-1726, 2016, available at ieeexplore.ieee.org/document/7152923/

Outline

- Introduction: $y = AA^T x$
- Open problems & Related work
- Parallel SpAA^T based on 1D partitioning of $A$ and $A^T$ matrices
  - Quality criteria for efficient parallelization of SpAA^T
  - Proposed SpAA^T algorithms
- Experiments
- References

Introduction: $y = AA^T x$

Thread-level parallelization of $y = AA^T x$ (SpAA^T)

$y = AA^T x$ is computed as two Sparse Matrix-Vector Multiplies (SpMV):
- $z = A^T x$: Sparse Matrix-Transpose-Vector Multiply (SpA^T), and then
- $y = Az$: Sparse Matrix-Vector Multiply (SpA)

Goal: thread-level parallelization of repeated and consecutive SpA and SpA^T operations that involve the same sparse matrix $A$.

Examples:
- Linear Programming (LP) problems via interior point methods
- nonsymmetric systems via Bi-CG, CGNE, Lanczos Bi-orthogonalization
- least squares problem via LSQR
- linear feasibility problem via Surrogate Constraints method
- Krylov-based balancing algorithms used as preconditioners for sparse eigensolvers
- web page ranking via HITS algorithm
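
To make the two-step computation above concrete, here is a minimal serial sketch in Python. It assumes $A$ is held once in CSR format and used for both products; the function names (`spmv_transpose`, `spmv`, `spaat`) and the tiny example matrix are hypothetical and do not come from the talk.

```python
def spmv_transpose(row_ptr, col_idx, vals, x, n_cols):
    """z = A^T x, computed from the CSR arrays of A (no explicit A^T)."""
    z = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            # Nonzero a_ij contributes a_ij * x_i to z_j.
            z[col_idx[p]] += vals[p] * x[i]
    return z


def spmv(row_ptr, col_idx, vals, z):
    """y = A z, computed from the same CSR arrays of A."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[p] * z[col_idx[p]]
        y[i] = s
    return y


def spaat(row_ptr, col_idx, vals, x, n_cols):
    """y = A (A^T x) as SpA^T followed by SpA."""
    z = spmv_transpose(row_ptr, col_idx, vals, x, n_cols)
    return spmv(row_ptr, col_idx, vals, z)


# Hypothetical 2 x 3 example: A = [[1, 0, 2], [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
x = [1.0, 1.0]

z = spmv_transpose(row_ptr, col_idx, vals, x, 3)
y = spaat(row_ptr, col_idx, vals, x, 3)
```

Both kernels traverse the same nonzeros, which is the reuse opportunity the talk targets; in this sketch the two traversals are still separate passes over the matrix.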

Open problems & Related work

Open problems
- Utilize the opportunity of reusing $A$-matrix nonzeros?
- Obtain close performance for both $z = A^T x$ and $y = Az$ at the same time?
  - Single storage of $A$ for both $z = A^T x$ and $y = Az$
  - Storage of $A^T$ for $z = A^T x$ and a separate storage of $A$ for $y = Az$

Related work
- Optimized Sparse Kernel Interface (OSKI), Berkeley
  - Serial
  - Each row/column is reused.
- Compressed Sparse Blocks (CSB) by Buluç et al. [10]
  - Parallel
  - Same data structure for both SpA and SpA^T operations without any performance degradation
  - Two phase, i.e., SpA and SpA^T are not performed simultaneously

Parallel SpAA^T based on 1D partitioning of $A$ and $A^T$ matrices

Thread-level baseline parallelization of SpAA^T

[Figure: Four baseline SpAA^T algorithms for computing $y = Az$ after $z = A^T x$ by four threads. Panels: RRp (Row-Row parallel), CCp (Column-Column parallel), CRp (Column-Row parallel), RCp (Row-Column parallel). $A$ and $A^T$ are split into row slices $R_k$ or column slices $C_k$, with vectors split conformably into $x_k$, $y_k$, $z_k$. Yellow scale tone: exclusive accesses by a single thread. Red color: concurrent accesses by multiple threads.]
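
As one illustration of the baselines in the figure, the following is a minimal Python sketch of a CRp-style computation: each thread owns a contiguous column slice $C_k$ of $A$, computes its own $z_k = C_k^T x$ with exclusive writes, then immediately reuses the same column and $z$ entry to update the shared $y$. The CSC layout, the contiguous slicing, the function name `crp_spaat`, and the use of a lock as a stand-in for atomic updates are all assumptions made for this sketch, not details taken from the talk.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def crp_spaat(col_ptr, row_idx, vals, x, n_rows, num_threads):
    """CRp-style sketch: z = A^T x then y = A z, with A stored in CSC."""
    n_cols = len(col_ptr) - 1
    z = [0.0] * n_cols
    y = [0.0] * n_rows
    y_lock = threading.Lock()  # stand-in for atomic updates on y
    # Contiguous column slices C_1, ..., C_K (an assumption of this sketch).
    bounds = [n_cols * k // num_threads for k in range(num_threads + 1)]

    def work(k):
        for j in range(bounds[k], bounds[k + 1]):
            lo, hi = col_ptr[j], col_ptr[j + 1]
            # Row-parallel z = A^T x: z_j is written by this thread only.
            s = 0.0
            for p in range(lo, hi):
                s += vals[p] * x[row_idx[p]]
            z[j] = s
            # Column-parallel y = A z: reuse column j and z_j right away.
            # y is shared by all threads, so the update is guarded.
            with y_lock:
                for p in range(lo, hi):
                    y[row_idx[p]] += vals[p] * s

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(work, range(num_threads)))
    return z, y


# Hypothetical 2 x 3 example in CSC: A = [[1, 0, 2], [0, 3, 0]]
col_ptr = [0, 1, 2, 3]
row_idx = [0, 1, 0]
vals = [1.0, 3.0, 2.0]
z, y = crp_spaat(col_ptr, row_idx, vals, [1.0, 1.0], n_rows=2, num_threads=2)
```

The guarded update on `y` is where the concurrent accesses marked in red in the figure would occur; the writes to `z` need no protection because each entry belongs to exactly one slice.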

Contributions

- Identify five quality criteria (QC) that have an impact on the performance of parallel SpAA^T
- Singly-bordered block-diagonal (SB) form based methods: sbCRp and sbRCp
  - For sbCRp (SB-based Column-Row parallel algorithm), we permute matrix $A$ into a rowwise SB form, which induces a columnwise SB form of matrix $A^T$:

\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_B \end{bmatrix}
=
\begin{bmatrix}
A_{11} & & & \\
& A_{22} & & \\
& & A_{33} & \\
& & & A_{44} \\
A_{B1} & A_{B2} & A_{B3} & A_{B4}
\end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix}
\]

  - For sbRCp (SB-based Row-Column parallel algorithm), we permute matrix $A$ into a columnwise SB form, which induces a rowwise SB form of matrix $A^T$:

\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}
=
\begin{bmatrix}
A_{11} & & & & A_{1B} \\
& A_{22} & & & A_{2B} \\
& & A_{33} & & A_{3B} \\
& & & A_{44} & A_{4B}
\end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \\ z_B \end{bmatrix}
\]

  - In both forms, matrix $A$ is partitioned into four and the submatrices are processed by four threads.
- Achieve (a) ($z$-vector reuse) and (b) ($A$-matrix reuse).
- Objective of minimizing the size of the row/column border in the SB form of $A$ ≈ achieving QC (c), (d), and (e) in sbCRp/sbRCp.

Quality criteria for efficient parallelization of SpAA^T

| Quality criteria | RRp | CRp | RCp | sbCRp | sbRCp |
|---|---|---|---|---|---|
| (a) Reusing $z$-vector entries generated in $z = A^T x$ and then read in $y = Az$ | × | ✓ | × | ✓ | ✓(1) |
| (b) Reusing matrix nonzeros (together with their indices) in $z = A^T x$ and $y = Az$ | × | ✓ | × | ✓ | ✓(2) |
| (c) Exploiting temporal locality in reading input vector entries in row-parallel SpMVs | ×(3) | ×(3) | ×(3) | ✓ | ✓ |
| (d) Exploiting temporal locality in updating output vector entries in column-parallel SpMVs | − | ×(3) | ×(3) | ✓ | ✓ |
| (e) Minimizing the number of concurrent writes performed by different threads in column-parallel SpMVs | ✓ | × | × | ✓ | ✓ |

Legend:
- ✓ : satisfied
- × : not satisfied
- − : not applicable
- ✓(1) : satisfied except $z_B$ border subvectors
- ✓(2) : satisfied except $A_{kB}$ border submatrices
- ×(3) : may be satisfied through row/column reordering

[Figure: CRp (Column-Row parallel). $A = [C_1\ C_2\ C_3\ C_4]$ is used for $y = Az$, and the row slices $C_k^T$ of $A^T$ give $z_k = C_k^T x$ for $k = 1, \dots, 4$.]

Proposed SpAA^T algorithms

- Maintaining balance on the number of nonzeros at each slice
  - Reducing parallel time under arbitrary task scheduling
- Reducing border size
  - Reducing # of cache misses due to loss of temporal locality:
    $\lambda(c_j) = |\{A_k : c_j \text{ has at least one nonzero in } A_k,\ k \in \{1, \dots, K\}\}|$
  - Reducing # of concurrent writes:
    $\lambda(r_i) = |\{A_k : r_i \text{ has at least one nonzero in } A_k,\ k \in \{1, \dots, K\}\}|$

[Figure: A 6 × 8 example matrix $A$ (rows $r_1, \dots, r_6$, columns $c_1, \dots, c_8$, input entries $x_1, \dots, x_8$) partitioned into three, with the submatrices processed by three threads. With row order $r_1, r_2, r_3, r_4, r_5, r_6$: $\sum \lambda(c_j) = 3 + 5$. With row order $r_1, r_4, r_2, r_3, r_5, r_6$: $\sum \lambda(c_j) = 3 + 3$.]
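
The connectivity $\lambda(c_j)$ defined above can be computed directly from the sparsity pattern and a row-to-slice assignment. The Python sketch below does this for a small made-up pattern (not the 6 × 8 matrix of the slide); the function name `column_connectivity` and the input layout (one list of column indices per row, plus a `part` array giving each row's slice) are assumptions of this example.

```python
def column_connectivity(rows, part, n_cols):
    """Return [lambda(c_0), ..., lambda(c_{n_cols-1})].

    rows[i] lists the column indices of the nonzeros in row i, and
    part[i] is the index k of the row slice A_k that row i belongs to.
    lambda(c_j) counts the distinct slices with a nonzero in column j.
    """
    touched = [set() for _ in range(n_cols)]
    for i, cols in enumerate(rows):
        for j in cols:
            touched[j].add(part[i])
    return [len(s) for s in touched]


# Hypothetical 4 x 4 pattern: rows 0 and 2 use columns {0, 1},
# rows 1 and 3 use columns {2, 3}.
rows = [[0, 1], [2, 3], [0, 1], [2, 3]]

# Grouping rows as {0, 1} and {2, 3}: every column is shared by both slices.
lam_bad = column_connectivity(rows, [0, 0, 1, 1], 4)

# Grouping rows as {0, 2} and {1, 3}: every column is private to one slice.
lam_good = column_connectivity(rows, [0, 1, 0, 1], 4)

# Columns with lambda > 1 are the ones read by more than one thread.
shared_bad = [j for j, lam in enumerate(lam_bad) if lam > 1]
shared_good = [j for j, lam in enumerate(lam_good) if lam > 1]
```

The same routine applied to the transpose pattern (lists of row indices per column, with a column-to-slice assignment) would give $\lambda(r_i)$.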

Merits of Singly-Bordered Block Diagonal (SB) Form on CRp

| | CRp | sbCRp (SB form) |
|---|---|---|
| Concurrent accesses | Whole $x$ and $y$ vectors | Only $x_B$ and $y_B$ subvectors |

[Figure: CRp with $A = [C_1\ C_2\ C_3\ C_4]$ and $z_k = C_k^T x$, next to sbCRp with $A$ in rowwise SB form (diagonal blocks $A_{kk}$, border row blocks $A_{Bk}$) and $A^T$ in the induced columnwise SB form (diagonal blocks $A_{kk}^T$, border column blocks $A_{Bk}^T$).]

- Exploits temporal locality in reading $x$-vector entries in row-parallel $z = A^T x$
- Exploits temporal locality in updating $y$-vector entries in column-parallel $y = Az$
- Minimizing border size in the SB form corresponds to minimizing the number of concurrent writes by different threads in column-parallel $y = Az$

sbCRp: SB-based Column-Row parallel

Require: $A_{kk}$ and $A_{Bk}$ matrices; $x$, $y$, and $z$ vectors
    for k ← 1 to K in parallel do
        $z_k ← A_{kk}^T x_k$
        $z_k ← z_k + A_{Bk}^T x_B$        ⊲ these two steps form $z_k ← C_k^T x$
        $y_k ← A_{kk} z_k$
        $y_B ← y_B + A_{Bk} z_k$          ⊲ Concurrent writes; these two steps form $y ← C_k z_k$
    end for

[Figure: Singly-bordered block-diagonal (SB) form for sbCRp. $A$ is in rowwise SB form with diagonal blocks $A_{11}, \dots, A_{44}$ and border row blocks $A_{B1}, \dots, A_{B4}$; $A^T$ is in the induced columnwise SB form; $x$ and $y$ are split into $x_1, \dots, x_4, x_B$ and $y_1, \dots, y_4, y_B$, and $z$ into $z_1, \dots, z_4$.]

sbRCp: SB-based Row-Column parallel

Require: $A_{kk}$ and $A_{kB}$ matrices; $x$, $y$, and $z$ vectors
    for k ← 1 to K in parallel do
        $z_k ← A_{kk}^T x_k$
        $z_B ← z_B + A_{kB}^T x_k$        ⊲ Concurrent writes; these two steps form $z ← R_k^T x_k$
        $y_k ← A_{kk} z_k$
    end for
    for k ← 1 to K in parallel do
        $y_k ← y_k + A_{kB} z_B$          ⊲ together with $y_k ← A_{kk} z_k$, forms $y_k ← R_k z$
    end for

[Figure: Singly-bordered block-diagonal (SB) form for sbRCp. $A$ is in columnwise SB form with diagonal blocks $A_{11}, \dots, A_{44}$ and border column blocks $A_{1B}, \dots, A_{4B}$; $A^T$ is in the induced rowwise SB form; $x$ and $y$ are split into $x_1, \dots, x_4$ and $y_1, \dots, y_4$, and $z$ into $z_1, \dots, z_4, z_B$.]
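
The sbCRp steps above can be traced with a small serial simulation. The Python sketch below runs the $K$ loop iterations one after another, so each iteration stands for the work of one thread. The blocks are stored as dense nested lists purely for readability (an assumption of this sketch; a real implementation would use a sparse format), and the helper names `matvec`, `matvec_t`, and `sb_crp` are hypothetical.

```python
def matvec(M, v):
    """Dense M v, with M given as a non-empty list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def matvec_t(M, v):
    """Dense M^T v without forming the transpose."""
    out = [0.0] * len(M[0])
    for i, row in enumerate(M):
        for j, a in enumerate(row):
            out[j] += a * v[i]
    return out


def sb_crp(A_diag, A_border, x_parts, x_B):
    """Serial simulation of the sbCRp loop over k = 1..K.

    A_diag[k] is A_kk, A_border[k] is A_Bk, x_parts[k] is x_k.
    Returns the z_k subvectors, the y_k subvectors, and y_B.
    """
    y_B = [0.0] * len(x_B)
    z_parts, y_parts = [], []
    for A_kk, A_Bk, x_k in zip(A_diag, A_border, x_parts):
        # z_k <- A_kk^T x_k, then z_k <- z_k + A_Bk^T x_B
        z_k = [a + b for a, b in zip(matvec_t(A_kk, x_k), matvec_t(A_Bk, x_B))]
        # y_k <- A_kk z_k (private to this thread)
        y_k = matvec(A_kk, z_k)
        # y_B <- y_B + A_Bk z_k (a concurrent write in the parallel version)
        y_B = [a + b for a, b in zip(y_B, matvec(A_Bk, z_k))]
        z_parts.append(z_k)
        y_parts.append(y_k)
    return z_parts, y_parts, y_B


# Hypothetical example with K = 2: A = [[1, 0], [0, 2], [1, 1]],
# where the last row is the border, and x = [1, 1, 1].
A_diag = [[[1.0]], [[2.0]]]
A_border = [[[1.0]], [[1.0]]]
z_parts, y_parts, y_B = sb_crp(A_diag, A_border, [[1.0], [1.0]], [1.0])
```

In the example, $z = A^T x = (2, 3)$ and $y = Az = (2, 6, 5)$, where the last entry of $y$ is the border part $y_B$. Only the `y_B` update would need synchronization if the iterations ran on separate threads.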