Partitioning sparse matrices for parallel preconditioned iterative methods

Bora Uçar, Emory University, Atlanta, GA
Joint work with Prof. C. Aykanat, Bilkent University, Ankara, Turkey
Iterative methods

• Used for solving linear systems $Ax = b$
  – usually $A$ is sparse

  while not converged do
      computations
      check convergence

• Involves
  – linear vector operations: $x \leftarrow x + \alpha y$, i.e., $x_i = x_i + \alpha y_i$
  – inner products: $\alpha = \langle x, y \rangle = \sum_i x_i y_i$
  – sparse matrix-vector multiplies (SpMxV):
      $y = Ax$ with $y_i = \langle A_{i,*}, x \rangle$
      $y = A^T x$ with $y_i = \langle A^T_{i,*}, x \rangle$
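For concreteness, a minimal NumPy sketch of the three kernels (the CSR layout and the names spmxv/axpy/dot are illustrative, not from the talk; the loop-based SpMxV is for exposition only):

```python
import numpy as np

def spmxv(indptr, indices, data, x):
    """SpMxV y = Ax for A in CSR form: y_i = <A_{i,*}, x>."""
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

def axpy(alpha, x, y):
    """Linear vector operation: x_i = x_i + alpha * y_i."""
    return x + alpha * y

def dot(x, y):
    """Inner product alpha = <x, y> = sum_i x_i * y_i."""
    return float(np.dot(x, y))
```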
Preconditioned iterative methods

• Transform $Ax = b$ into another system that is easier to solve
• A preconditioner is a matrix that performs the desired transformation
• Focus: approximate inverse preconditioners
• A right approximate inverse $M$ provides $AM \approx I$
• Instead of solving $Ax = b$, use right preconditioning: solve $AMy = b$ and then set $x = My$
Parallelizing iterative methods

• Avoid communicating vector entries for linear vector operations and inner products
• Inner products require communication
  – regular communication
  – its cost stays the same as the problem size grows
  – there are cost-optimal algorithms to perform these communications
• Efficiently parallelize the SpMxV operations
• Efficiently parallelize the application of the preconditioner
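Under a conformal vector partition, the linear vector operations need no communication and each inner product needs a single reduction. A sketch with mpi4py (assumed available; x_local and y_local are each process's NumPy pieces of the vectors):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def parallel_axpy(alpha, x_local, y_local):
    # Communication-free: every entry of the result is held locally.
    return x_local + alpha * y_local

def parallel_dot(x_local, y_local):
    # One allreduce; this regular collective is why the communication
    # cost does not grow with the problem size.
    return comm.allreduce(float(np.dot(x_local, y_local)), op=MPI.SUM)
```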
Preconditioned iterative methods

• Applying approximate inverse preconditioners
  – additional SpMxV operations with $M$
  – never form the matrix $AM$; perform successive SpMxVs
• Parallelizing a full step requires efficient SpMxVs with both $A$ and $M$
  – partition $A$ and $M$ simultaneously
• What has been done?
  – a bipartite graph model (Hendrickson and Kolda, SISC 00)
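A small SciPy sketch of the point that $AM$ is never formed: applying the preconditioned operator to a vector is just two successive SpMxVs (the matrix sizes and densities here are arbitrary illustrations):

```python
import numpy as np
import scipy.sparse as sp

A = sp.random(16, 16, density=0.2, format="csr", random_state=0)
M = sp.random(16, 16, density=0.2, format="csr", random_state=1)
y = np.random.default_rng(2).standard_normal(16)

z = M @ y   # SpMxV with the approximate inverse M
w = A @ z   # SpMxV with A: w = (AM) y, without ever computing AM
```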
Row-parallel y = Ax

• Rows of A (and hence y) and x are partitioned

[Figure: a matrix partitioned rowwise among processors P1-P4, with x partitioned conformably into x1-x4; nonzeros in the off-diagonal blocks need remote x entries.]

1. Expand the x vector (sends/receives)
2. Compute with the diagonal blocks
3. Receive x entries and compute with the off-diagonal blocks
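A sketch of the three phases with mpi4py; the local blocks A_diag/A_offd (sparse, multipliable with @) and the send/receive index maps are assumed to be precomputed from the partition, and all names are illustrative:

```python
import numpy as np
from mpi4py import MPI

def row_parallel_spmxv(A_diag, A_offd, x_local, send_map, recv_map, comm):
    # 1. Expand x: post sends of the local x entries other processes need,
    #    so communication overlaps the local computation below.
    reqs = [comm.isend(x_local[idx], dest=p) for p, idx in send_map.items()]
    # 2. Compute with the diagonal block, which uses only local x entries.
    y_local = A_diag @ x_local
    # 3. Receive remote x entries and compute with the off-diagonal block.
    x_ext = np.zeros(A_offd.shape[1])
    for p, pos in recv_map.items():
        x_ext[pos] = comm.recv(source=p)
    y_local += A_offd @ x_ext
    for r in reqs:
        r.wait()
    return y_local
```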
Row-parallel y = Ax: communication requirements

[Figure: the same rowwise-partitioned matrix as on the previous slide.]

• Total volume: the number of nonzero column segments in the off-diagonal blocks (13 in the figure)
• Total number of messages: the number of nonzero off-diagonal blocks (9 in the figure)
• Per processor: the above two metrics, confined within a column stripe

Total volume and total number of messages were addressed previously (Çatalyürek and Aykanat, IEEE TPDS 99; Uçar and Aykanat, SISC 04; Vastenhouw and Bisseling, SIREV 05).
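A sketch that computes both metrics directly from their definitions, given the CSR pattern of A and a rowwise partition vector part (x is assumed partitioned conformably with the rows; all symbols are illustrative):

```python
def comm_metrics(indptr, indices, part):
    """Total volume = #nonzero column segments in off-diagonal blocks;
    total #messages = #nonzero off-diagonal blocks."""
    segments = set()  # (row part, column j): one x_j sent to that part
    blocks = set()    # (sender part, receiver part): one message
    n = len(indptr) - 1
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if part[i] != part[j]:          # nonzero in an off-diagonal block
                segments.add((part[i], j))
                blocks.add((part[j], part[i]))
    return len(segments), len(blocks)
```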
Minimize volume in row-parallel y = Ax: revisiting 1D hypergraph models

• Three entities to partition: y, rows of A, and x
  – three types of vertices: $y_i$, $r_i$, and $x_j$
• $y_i$ is computed by a single $r_i$
  – connect $y_i$ and $r_i$ (an edge, i.e., a two-pin hyperedge)
• $x_j$ is a data source; the $r_i$'s where $a_{ij} \neq 0$ need $x_j$
  – connect $x_j$ and all such $r_i$ (definitely a hyperedge)
Minimize volume in row-parallel y = Ax: revisiting 1D hypergraph models

[Figure: the general hypergraph model for 1D rowwise partitioning; combining $y_i$ and $r_i$ enforces the owner-computes rule.]

• Partition the vertices into K parts (partition the data among K processors)
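A sketch of building this model from the CSR pattern, with $y_i$ and $r_i$ already combined into vertex i (and, for square matrices, $x_i$ placed with vertex i as well, an assumption of this sketch); net j connects every row that needs $x_j$:

```python
def column_net_hypergraph(indptr, indices, n_cols):
    nets = [set() for _ in range(n_cols)]   # net j = {rows i with a_ij != 0}
    n_rows = len(indptr) - 1
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            nets[indices[k]].add(i)
    # vertex weight = #nonzeros of row i, its share of the SpMxV work
    weights = [indptr[i + 1] - indptr[i] for i in range(n_rows)]
    return nets, weights
```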
Hypergraph partitioning

• Partition the vertices of a hypergraph into two or more parts such that:
  – $\sum_i (\mathrm{con}(n_i) - 1)$ is minimized (total volume), where $\mathrm{con}(n_i)$ is the number of parts connected by hyperedge $n_i$
  – a balance criterion among the part weights is maintained (load balance)
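A sketch that evaluates a given K-way partition against both criteria (nets and weights as in the previous sketch; part[v] gives the part of vertex v):

```python
def cutsize_and_imbalance(nets, weights, part, K):
    cut = 0
    for net in nets:
        con = len({part[v] for v in net})   # parts connected by this net
        cut += con - 1                      # connectivity-1 metric
    loads = [0.0] * K
    for v, w in enumerate(weights):
        loads[part[v]] += w
    avg = sum(loads) / K
    return cut, max(loads) / avg - 1.0      # (total volume, load imbalance)
```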
Column-parallel y = Ax: communication requirements

[Figure: a matrix partitioned columnwise among processors P1-P4, with y partitioned conformably into y1-y4.]

• Total volume: the number of nonzero row segments in the off-diagonal blocks (13 in the figure)
• Total number of messages: the number of nonzero off-diagonal blocks (9 in the figure)
• Per processor: the above two metrics, confined within a row stripe

Total volume and total number of messages were addressed previously (Çatalyürek and Aykanat, IEEE TPDS 99; Uçar and Aykanat, SISC 04; Vastenhouw and Bisseling, SIREV 05).
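A sketch of the dual scheme with mpi4py: each process multiplies with its own columns, then partial y results for externally owned rows are folded (summed) at their owners. The block and map names are illustrative and assumed precomputed from the partition:

```python
from mpi4py import MPI

def col_parallel_spmxv(A_diag, A_offd, x_local, send_rows, recv_rows, comm):
    y_local = A_diag @ x_local   # contributions to locally owned y entries
    y_part = A_offd @ x_local    # partial results owned by other processes
    reqs = [comm.isend(y_part[rows], dest=p) for p, rows in send_rows.items()]
    for p, rows in recv_rows.items():
        y_local[rows] += comm.recv(source=p)   # fold incoming partials
    for r in reqs:
        r.wait()
    return y_local
```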
Preconditioned iterative methods

• Linear vector operations and inner products need no extra communication as long as all vectors in a single operation have the same partition
• Partition A and M simultaneously
• A blend of dependencies and interactions among matrices and vectors
  – different methods impose different partitioning requirements
• Figure out the partitioning requirements by analyzing the linear vector operations and inner products
Preconditioned BiCG-STAB

$p_i = r_{i-1} + \beta\,(p_{i-1} - \omega\, v_{i-1})$      ← $p$, $r$, $v$ should be partitioned conformably
$\hat{p} = M p_i$
$v_i = A \hat{p}$
$s = r_{i-1} - \alpha\, v_i$      ← $s$ should be with $r_{i-1}$ and $v_i$
$\hat{s} = M s$
$t = A \hat{s}$
$\omega = \langle t, s \rangle / \langle t, t \rangle$      ← $t$ should be with $s$
$x_i = x_{i-1} + \alpha\,\hat{p} + \omega\,\hat{s}$      ← $x$ should be with $\hat{p}$ and $\hat{s}$
$r_i = s - \omega\, t$
Preconditioned BiCG-STAB

• $p$, $r$, $v$, $s$, $t$, and $x$ should be partitioned conformably
• What remains?
  – $\hat{p} = M p_i$ and $v_i = A \hat{p}$; $\hat{s} = M s$ and $t = A \hat{s}$
  – columns of $M$ and rows of $A$ should be conformal
  – rows of $M$ and columns of $A$ should be conformal
• Hence the requirement $PAQ^T$, $QMP^T$
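A sequential sketch of the iteration above, making the two M- and two A-SpMxVs per step explicit (dense NumPy stands in for the parallel kernels; the tolerance and iteration cap are illustrative):

```python
import numpy as np

def bicgstab(A, M, b, x, max_it=100, tol=1e-8):
    r = b - A @ x
    r_hat = r.copy()
    rho = alpha = omega = 1.0
    v = np.zeros_like(b)
    p = np.zeros_like(b)
    for _ in range(max_it):
        rho_new = r_hat @ r
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)   # p, r, v conformal
        p_hat = M @ p                    # SpMxV with M
        v = A @ p_hat                    # SpMxV with A
        alpha = rho_new / (r_hat @ v)
        s = r - alpha * v                # s with r and v
        s_hat = M @ s                    # SpMxV with M
        t = A @ s_hat                    # SpMxV with A
        omega = (t @ s) / (t @ t)        # t with s
        x = x + alpha * p_hat + omega * s_hat
        r = s - omega * t
        rho = rho_new
        if np.linalg.norm(r) < tol:
            break
    return x
```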
Partitioning requirements

  BiCG-STAB   $PAQ^T$, $QMP^T$
  TFQMR       $PAP^T$ and $PM_1 M_2 P^T$
  GMRES       $PAP^T$ and $PMP^T$
  CGNE        $PAQ$ and $PMP^T$

• "and" means there is a synchronization point between the SpMxVs
  – load balance each SpMxV individually
Model for simultaneous partitioning

• We use the previously proposed models
  – define operators to build composite models

[Figure: the rowwise model for y = Ax and the columnwise model for w = Mz.]
Combining hypergraph models

• Vertex amalgamation: combine vertices of the individual hypergraphs, and connect the composite vertex to the hyperedges of the individual vertices
• Vertex weighting: define multiple weights; individual vertex weights are not added up

Never amalgamate hyperedges of the individual hypergraphs!
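A sketch of the two operations on hypergraphs stored as (nets, weights); note that the nets of the two models are concatenated, never merged:

```python
def amalgamate(nets1, w1, nets2, w2, pairs):
    """pairs: list of (u, v) -- vertex u of the first hypergraph is merged
    with vertex v of the second; composite ids follow the order of pairs.
    Assumes every vertex of each hypergraph is matched in some pair."""
    comp1 = {u: c for c, (u, _) in enumerate(pairs)}
    comp2 = {v: c for c, (_, v) in enumerate(pairs)}
    nets = [{comp1[u] for u in net} for net in nets1]    # nets of model 1
    nets += [{comp2[v] for v in net} for net in nets2]   # kept separate
    # vertex weighting: keep the two weights as components, never add them
    weights = [(w1[u], w2[v]) for u, v in pairs]
    return nets, weights
```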
Combining guideline

1. Determine the partitioning requirements
2. Decide on the partitioning dimensions
   • generate the rowwise model for the matrices to be partitioned rowwise
   • generate the columnwise model for the matrices to be partitioned columnwise
3. Apply the vertex operations
   • to impose an identical partition on two vertices, amalgamate them
   • if the applications of the matrices are interleaved with synchronization, apply vertex weighting
Combining example

1. BiCG-STAB requires $PAQ^T$, $QMP^T$
2. $A$ rowwise (y = Ax), $M$ columnwise (w = Mz)
Combining example (cont'd)

3(i). $AQ^T$, $QM$: amalgamate the vertices for columns of $A$ with those for rows of $M$ (y = Ax, w = Mz)
Combining example (cont'd)

3(ii). $PA$, $MP^T$: amalgamate the vertices for rows of $A$ with those for columns of $M$ (y = Ax, w = Mz)
Remarks on composite models

• Partitioning the composite hypergraphs
  – balances the computational loads of the processors
  – minimizes the total communication volume in a full step of the preconditioned iterative methods
• Assumption: $A$ and $M$, or their sparsity patterns, are available
Experiments: set-up

• Sparse nonsymmetric square matrices from the Univ. of Florida sparse matrix collection
• SPAI by Grote and Huckle (SISC 97)
• AINV by Benzi and Tůma (SISC 98)
• PaToH by Çatalyürek and Aykanat (TPDS 99)
Experiments: comparison

With respect to partitioning A and applying the same partition to M (SPAI experiments), percent gain in total volume over ten different matrices:

            CC                RR
            32-way   64-way   32-way   64-way
  min        7        8        6        8
  max       31       34       36       36
  average   20       20       20       20