Partitioning sparse matrices for parallel preconditioned iterative methods


1. Partitioning sparse matrices for parallel preconditioned iterative methods
Bora Uçar, Emory University, Atlanta, GA
Joint work with Prof. C. Aykanat, Bilkent University, Ankara, Turkey

2. Iterative methods
• Used for solving linear systems Ax = b; usually A is sparse
• Structure: while not converged, do computations and check convergence
• Involves
  – linear vector operations: x = x + αy, i.e., x_i = x_i + α·y_i
  – inner products: α = ⟨x, y⟩ = Σ_i x_i·y_i
  – sparse matrix-vector multiplies (SpMxV):
      y = Ax, i.e., y_i = ⟨A_i, x⟩
      y = A^T x, i.e., y_i = ⟨(A^T)_i, x⟩
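To make the three kernel types concrete, a minimal Python/SciPy sketch (the random test matrix and the scalar alpha are illustrative, not from the talk):

```python
import numpy as np
import scipy.sparse as sp

# Minimal sketch of the three kernel types (the random test matrix and
# the scalar alpha are illustrative, not from the talk).
n = 8
A = sp.random(n, n, density=0.4, format="csr")
x = np.ones(n)
y = np.arange(n, dtype=float)
alpha = 0.5

x = x + alpha * y      # linear vector operation: x_i = x_i + alpha * y_i
alpha = np.dot(x, y)   # inner product: alpha = sum_i x_i * y_i
y = A @ x              # SpMxV: y_i = <A_i, x>
y = A.T @ x            # SpMxV with the transpose: y_i = <(A^T)_i, x>
```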

3. Preconditioned iterative methods
• Transform Ax = b into another system that is easier to solve
• The preconditioner is a matrix that does the desired transformation
• Focus: approximate inverse preconditioners
• A right approximate inverse M provides AM ≈ I
• Instead of solving Ax = b directly, use right preconditioning: solve AMy = b and then set x = My (sketched below)
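A minimal sketch of this right-preconditioned solve, assuming SciPy and a crude Jacobi-style approximate inverse (the test matrix, M, and sizes are all illustrative); the LinearOperator applies v ↦ A(Mv) without ever forming the product AM:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, bicgstab

# Illustrative test problem: a sparse matrix with a strengthened diagonal.
n = 100
A = sp.random(n, n, density=0.05, format="csr") + 4 * sp.eye(n, format="csr")
M = sp.diags(1.0 / A.diagonal())   # crude approximate inverse (Jacobi-style)
b = np.ones(n)

# Apply v -> A(Mv) as two SpMxVs; AM is never formed explicitly.
AM = LinearOperator((n, n), matvec=lambda v: A @ (M @ v))
y, info = bicgstab(AM, b)          # solve AMy = b
x = M @ y                          # recover x = My
```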

4. Parallelizing iterative methods
• Avoid communicating vector entries for linear vector operations and inner products
• Inner products require communication
  – regular communication; its cost remains the same as the problem size increases
  – there are cost-optimal algorithms to perform these communications
• Efficiently parallelize the SpMxV operations
• Efficiently parallelize the application of the preconditioner

5. Preconditioned iterative methods
• Applying approximate inverse preconditioners means additional SpMxV operations with M
  – never form the matrix AM; perform the SpMxVs separately
• Parallelizing a full step requires efficient SpMxV with both A and M
  – partition A and M simultaneously
• What has been done? A bipartite graph model (Hendrickson and Kolda, SISC 00)

6. Row-parallel y = Ax
• The rows of A (and hence y) and the entries of x are partitioned among processors P1–P4
[Figure: a sparse matrix partitioned rowwise into four stripes; x1–x4 partition the columns, y1–y4 the row stripes; nonzeros fall in diagonal and off-diagonal blocks]
The algorithm (sketched in code below):
1. Expand the x vector (sends/receives)
2. Compute with the diagonal blocks
3. Receive x entries and compute with the off-diagonal blocks
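A serial Python sketch of these three phases, with step 1's sends/receives simulated by reading the other parts' x entries directly (the partition and the test matrix are illustrative):

```python
import numpy as np
import scipy.sparse as sp

# Serial sketch of row-parallel y = Ax. Each "processor" owns a stripe
# of rows of A and the conformal pieces of x and y.
n, K = 16, 4
A = sp.random(n, n, density=0.3, format="csr")
x = np.arange(n, dtype=float)
parts = np.array_split(np.arange(n), K)   # contiguous rowwise partition

y = np.zeros(n)
for p, rows in enumerate(parts):
    stripe = A[rows, :]
    # step 2: compute with the diagonal block using locally owned x
    y[rows] += stripe[:, parts[p]] @ x[parts[p]]
    # steps 1 and 3: "receive" remote x entries, then compute with the
    # off-diagonal blocks
    for q, qcols in enumerate(parts):
        if q != p:
            y[rows] += stripe[:, qcols] @ x[qcols]

assert np.allclose(y, A @ x)              # matches the serial product
```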

7. Row-parallel y = Ax: communication requirements
[Figure: the same rowwise-partitioned matrix]
• Total volume: the number of nonzero column segments in the off-diagonal blocks (13 in the figure)
• Total number of messages: the number of nonzero off-diagonal blocks (9 in the figure)
• Per processor: the above two quantities confined within a column stripe
• Total volume and number of messages were addressed previously (Çatalyürek and Aykanat, IEEE TPDS 99; Uçar and Aykanat, SISC 04; Vastenhouw and Bisseling, SIREV 05)
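Both metrics can be read off the off-diagonal blocks directly; a sketch, assuming a conformal rowwise partition given as an owner array (the function name is illustrative):

```python
import numpy as np
import scipy.sparse as sp

# Sketch: communication requirements of a conformal rowwise partition,
# where owner[i] is the part owning row i, x_i, and y_i. Volume counts
# nonzero column segments in off-diagonal blocks; the message count
# counts nonzero off-diagonal blocks.
def rowwise_comm_stats(A, owner):
    A = sp.csc_matrix(A)
    volume, msgs = 0, set()
    for j in range(A.shape[1]):
        rows = A.indices[A.indptr[j]:A.indptr[j + 1]]
        needers = set(owner[rows]) - {owner[j]}      # parts needing x_j
        volume += len(needers)                       # one word of x_j each
        msgs.update((owner[j], p) for p in needers)  # sender-receiver pairs
    return volume, len(msgs)

owner = np.repeat(np.arange(4), 4)                   # 16 rows, 4 parts
print(rowwise_comm_stats(sp.random(16, 16, density=0.3), owner))
```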

8. Minimize volume in row-parallel y = Ax: revisiting 1D hypergraph models
• Three entities to partition: y, the rows of A, and x
  – three types of vertices: y_i, r_i, and x_j
• y_i is computed by a single r_i
  – connect y_i and r_i (an edge, or two-pin hyperedge)
• x_j is a data source; every r_i with a_ij ≠ 0 needs x_j
  – connect x_j and all such r_i (definitely a hyperedge)
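A sketch of building these hyperedges from a sparse matrix, with y_i and r_i already combined per the owner-computes rule of the next slide, and x_j placed with row j (the conformal, symmetric-partition case; the function name is illustrative):

```python
import scipy.sparse as sp

# Sketch of the column-net model: one vertex per row, one net per
# column; net j connects x_j's vertex and every row i with a_ij != 0.
def column_net_hypergraph(A):
    A = sp.csc_matrix(A)
    nets = []
    for j in range(A.shape[1]):
        pins = set(A.indices[A.indptr[j]:A.indptr[j + 1]])
        pins.add(j)               # the data source x_j lives with row j
        nets.append(sorted(pins))
    return nets                    # nets[j] = pins of hyperedge n_j
```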

9. Minimize volume in row-parallel y = Ax: revisiting 1D hypergraph models
• Combine y_i and r_i: the owner-computes rule
• This gives the general hypergraph model for 1D rowwise partitioning
• Partition the vertices into K parts (partitioning the data among K processors)

10. Hypergraph partitioning
• Partition the vertices of a hypergraph into two or more parts such that:
  – Σ_i (con(n_i) − 1) is minimized (total volume), where con(n_i) is the number of parts connected by hyperedge n_i
  – a balance criterion among the part weights is maintained (load balance)
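A sketch of this objective on the net lists built above (part maps each vertex to its part; names are illustrative):

```python
# Sketch of the connectivity-1 cutsize: part[v] is the part of vertex v,
# nets are pin lists as built by column_net_hypergraph above.
def cutsize(nets, part):
    total = 0
    for pins in nets:
        con = len({part[v] for v in pins})  # parts connected by this net
        total += con - 1                    # each extra part costs one word
    return total
```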

11. Column-parallel y = Ax: communication requirements
[Figure: a sparse matrix partitioned columnwise into four stripes; x1–x4 partition the column stripes, y1–y4 the rows]
• Total volume: the number of nonzero row segments in the off-diagonal blocks (13 in the figure)
• Total number of messages: the number of nonzero off-diagonal blocks (9 in the figure)
• Per processor: the above two quantities confined within a row stripe
• Total volume and number of messages were addressed previously (Çatalyürek and Aykanat, IEEE TPDS 99; Uçar and Aykanat, SISC 04; Vastenhouw and Bisseling, SIREV 05)

12. Preconditioned iterative methods
• Linear vector operations and inner-product computations work out as long as all vectors in a single operation have the same partition
• Partition A and M simultaneously
• A blend of dependencies and interactions among matrices and vectors
  – different methods have different partitioning requirements
• Figure out the partitioning requirements by analyzing the linear vector operations and inner products

13. Preconditioned BiCG-STAB (one iteration)

  p_i = r_{i−1} + β_i (p_{i−1} − ω_{i−1} v_{i−1})   ← p, r, v should be partitioned conformably
  p̂ = M p_i
  v_i = A p̂
  s = r_{i−1} − α_i v_i                             ← s should be with r and v
  ŝ = M s
  t = A ŝ                                           ← t should be with s
  ω_i = ⟨t, s⟩ / ⟨t, t⟩
  x_i = x_{i−1} + α_i p_i + ω_i s                   ← x should be with p and s
  r_i = s − ω_i t
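For orientation, a Python sketch of one common right-preconditioned BiCG-STAB formulation (van der Vorst's); note this variant updates x with p̂ and ŝ, and the shadow vector, tolerance, and iteration cap are assumptions, not from the slide. Each iteration performs exactly two SpMxVs with M and two with A:

```python
import numpy as np

# Sketch of right-preconditioned BiCG-STAB (one common formulation).
def bicgstab_right(A, M, b, x=None, tol=1e-8, maxiter=500):
    x = np.zeros_like(b) if x is None else x
    r = b - A @ x
    r0 = r.copy()                     # shadow residual (an assumption)
    rho = alpha = omega = 1.0
    v = p = np.zeros_like(b)
    for _ in range(maxiter):
        rho, rho_old = np.dot(r0, r), rho
        beta = (rho / rho_old) * (alpha / omega)
        p = r + beta * (p - omega * v)
        p_hat = M @ p                 # SpMxV with M
        v = A @ p_hat                 # SpMxV with A
        alpha = rho / np.dot(r0, v)
        s = r - alpha * v
        s_hat = M @ s                 # SpMxV with M
        t = A @ s_hat                 # SpMxV with A
        omega = np.dot(t, s) / np.dot(t, t)
        x = x + alpha * p_hat + omega * s_hat
        r = s - omega * t
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x

# usage with the illustrative A, M, b from the earlier sketch:
# x = bicgstab_right(A, M, b)
```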

14. Preconditioned BiCG-STAB
• p, r, v, s, t, and x should be partitioned conformably
• What remains?
  – p̂ = M p and v = A p̂: the columns of M and the rows of A should be conformal
  – ŝ = M s and t = A ŝ: the rows of M and the columns of A should be conformal
• Hence partition A as PAQ^T and M as QMP^T

15. Partitioning requirements

  BiCG-STAB   PAQ^T QMP^T
  TFQMR       PAP^T and PM_1M_2P^T
  GMRES       PAP^T and PMP^T
  CGNE        PAQ and PMP^T

• "and" means there is a synchronization point between the SpMxV's
  – load balance each SpMxV individually

16. Model for simultaneous partitioning
• We use the previously proposed models
  – and define operators on them to build composite models
[Figure: the rowwise model for y = Ax and the columnwise model for w = Mz]

17. Combining hypergraph models
• Vertex amalgamation: combine vertices of the individual hypergraphs, and connect the composite vertex to the hyperedges of the individual vertices
• Vertex weighting: define multiple weights; the individual vertex weights are not added up
• Never amalgamate hyperedges of the individual hypergraphs!
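A sketch of the two vertex operations on a simple data layout (a hypergraph as a (nets, weights) pair over a shared vertex id space; the layout and function name are illustrative). Note the hyperedges of the individual hypergraphs are kept as-is, never merged:

```python
# Sketch: amalgamate vertex i of h1 with vertex i of h2.
def amalgamate(h1, h2):
    nets1, w1 = h1
    nets2, w2 = h2
    nets = nets1 + nets2          # composite vertex i inherits the
                                  # hyperedges of both original vertices
    weights = list(zip(w1, w2))   # vertex weighting: keep two weights per
                                  # vertex, never add them up
    return nets, weights          # hyperedges themselves are untouched
```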

18. Combining guideline
1. Determine the partitioning requirements
2. Decide on the partitioning dimensions
   • generate the rowwise model for the matrices to be partitioned rowwise
   • generate the columnwise model for the matrices to be partitioned columnwise
3. Apply vertex operations
   • to impose an identical partition on two vertices, amalgamate them
   • if the applications of the matrices are interleaved with synchronization, apply vertex weighting

19. Combining example
• BiCG-STAB requires PAQ^T QMP^T (step 1: determine requirements)
• A rowwise (y = Ax), M columnwise (w = Mz) (step 2: choose dimensions)
[Figure: the rowwise model of A and the columnwise model of M]

20. Combining example (cont'd)
• AQ^T QM: amalgamate columns of A and rows of M (y = Ax, w = Mz) (step 3)
[Figure: the composite model after this amalgamation]

21. Combining example (cont'd)
• PAMP^T: amalgamate rows of A and columns of M (y = Ax, w = Mz) (step 3)
[Figure: the final composite model]

22. Remarks on composite models
• Partitioning the composite hypergraphs
  – balances the computational loads of the processors
  – minimizes the total communication volume in a full step of the preconditioned iterative methods
• Assumption: A and M (or their sparsity patterns) are available

23. Experiments: setup
• Sparse nonsymmetric square matrices from the University of Florida sparse matrix collection
• SPAI by Grote and Huckle (SISC 97)
• AINV by Benzi and Tůma (SISC 98)
• PaToH by Çatalyürek and Aykanat (TPDS 99)

24. Experiments: comparison
With respect to partitioning A alone and applying the same partition to M (SPAI experiments, ten different matrices), percent gain in total volume:

               CC                 RR
           32-way  64-way    32-way  64-way
  min         7       8         6       8
  max        31      34        36      36
  average    20      20        20      20
