

1. Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication using Compressed Sparse Blocks
Aydın Buluç (UCSB), Jeremy T. Fineman (MIT), Matteo Frigo (Cilk Arts), John R. Gilbert (UCSB), Charles E. Leiserson (MIT & Cilk Arts)

2. Sparse Matrix-Dense Vector Multiplication (SpMV)
y = Ax and y = Aᵀx, where A is an n-by-n sparse matrix with nnz << n² nonzeros.
Applications:
• Iterative methods for solving linear systems: Krylov subspace methods based on Lanczos biorthogonalization, e.g., biconjugate gradients (BCG) and quasi-minimal residual (QMR).
• Graph analysis: betweenness centrality computation.

3. The Landscape: Where does our work fit?
Two common axes of SpMV optimization:
• Matrix-specific optimizations (permutations, index/value compression, register blocking)
• Hardware-specific optimizations (prefetching, TLB blocking, vectorization)
Our contribution sits on a different plane of focus: plenty of parallelism (for any nonzero distribution), and equally fast y = Ax and y = Aᵀx (simultaneously).

4. Theoretical and Experimental: Main Results
Our parallel algorithms for y = Ax and y = Aᵀx using the new compressed sparse blocks (CSB) layout have
• Θ(nnz) work and O(√n lg n) span,
• yielding Θ(nnz / (√n lg n)) parallelism.
[Figure: MFlops/sec vs. number of processors (1-8) for our CSB algorithms, the serial naïve CSR code, and Star-P (CSR + blockrow distribution).]

5. Compressed Sparse Rows (CSR): A Standard Layout
Dense collection of "sparse rows": row pointers (rowptr), column indices (colind), and values (data) for an n × n matrix with nnz nonzeros.
[Figure: example rowptr/colind/data arrays for a small matrix.]
• Stores entries in row-major order.
• Uses n lg nnz + nnz lg n bits of index data.
• Reading rows in parallel is easy, but reading columns is hard.
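A minimal sketch of the CSR layout and the serial SpMV kernel y = Ax it supports; the struct and field names below are illustrative assumptions for exposition, not the authors' code.

```cpp
#include <cstddef>
#include <vector>

// Illustrative CSR container; names are assumptions, not the authors' code.
struct CSR {
    std::size_t n;                    // matrix is n-by-n
    std::vector<std::size_t> rowptr;  // length n+1; row i owns [rowptr[i], rowptr[i+1])
    std::vector<std::size_t> colind;  // length nnz; column index of each nonzero
    std::vector<double> data;         // length nnz; value of each nonzero
};

// Serial y = A*x in CSR. Rows are independent, so this outer loop parallelizes easily.
void csr_spmv(const CSR& A, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < A.n; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            sum += A.data[k] * x[A.colind[k]];
        y[i] = sum;
    }
}
```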

6. Parallelizing SpMV_T is hard using the standard CSR format

CSR_SPMV_T(A, x, y)
  for i ← 0 to n-1 do
    for k ← A.rowptr[i] to A.rowptr[i+1]-1 do
      y[A.colind[k]] ← y[A.colind[k]] + A.data[k] · x[i]

1. Parallelize the outer loop? × Race conditions on vector y.
   a. Locking on y is not scalable.
   b. Using p copies of y is work-inefficient.
2. Parallelize the inner loop? × Span is Θ(n), so parallelism is at most Θ(nnz/n).
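For concreteness, a plain C++ rendering of the CSR_SPMV_T kernel above (reusing the illustrative CSR struct from the previous sketch); the comments mark where naive outer-loop parallelization breaks down.

```cpp
// Serial y = A^T * x in CSR (same illustrative CSR struct as above).
// Each nonzero A(i,j) scatters A(i,j) * x[i] into y[j].
void csr_spmv_t(const CSR& A, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < A.n; ++i) {
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k) {
            // Different rows i can scatter into the same y[A.colind[k]], so
            // running outer-loop iterations concurrently would race here.
            y[A.colind[k]] += A.data[k] * x[i];
        }
    }
}
```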

7. Compressed Sparse Blocks (CSB)
Dense collection of "sparse blocks": an n × n matrix with nnz nonzeros is partitioned into β × β blocks; a block-pointer array delimits the nonzeros of each block, and each nonzero stores only its row and column offsets (rowind, colind) within its block.
[Figure: example block pointers and rowind/colind arrays for a small matrix with β = 4.]
• Store blocks in row-major order.
• Store entries within a block in Z-Morton order.
• For β = √n, matches CSR on storage.
Reading blockrows or blockcolumns in parallel is now easy.
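A minimal illustrative container for the CSB layout; the field names and index widths are assumptions chosen for exposition, not the authors' implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative CSB container (field names and widths are assumptions).
struct CSB {
    std::size_t n;     // matrix is n-by-n
    std::size_t beta;  // block dimension; blocks form an (n/beta) x (n/beta) grid
    // blkptr[b] .. blkptr[b+1] delimit the nonzeros of block b, with blocks
    // numbered in row-major order over the block grid.
    std::vector<std::size_t> blkptr;
    // Per-nonzero row/column offsets *within* the owning block (each < beta),
    // so they need only lg(beta) bits apiece; 16-bit here for simplicity.
    std::vector<std::uint16_t> rowind;
    std::vector<std::uint16_t> colind;
    std::vector<double> data;  // nonzero values, in Z-Morton order inside each block
};
```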

8. CSB Matrix-Vector Multiplication
Our algorithm uses three levels of parallelism:
1) Multiply each blockrow in parallel, each writing to a disjoint output subvector.
2) If a blockrow is "dense," parallelize the blockrow multiplication.
3) If a single block is dense, parallelize the block multiplication.
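A sketch of the first level of parallelism, reusing the illustrative CSB struct above. The paper's code uses Cilk++; std::async is a stand-in here, and csb_blockrowv is written serially as a placeholder for the blockrow routine described on the next slide.

```cpp
#include <future>
#include <vector>

// Serial blockrow multiply (stand-in for the divide-and-conquer version of the
// next slide): walk every block of blockrow `br`, accumulating into the
// disjoint output slice y[br*beta .. (br+1)*beta).
void csb_blockrowv(const CSB& A, std::size_t br,
                   const std::vector<double>& x, std::vector<double>& y) {
    std::size_t nbc = A.n / A.beta;                    // blocks per blockrow
    for (std::size_t bc = 0; bc < nbc; ++bc) {
        std::size_t b = br * nbc + bc;                 // block id (row-major)
        for (std::size_t k = A.blkptr[b]; k < A.blkptr[b + 1]; ++k)
            y[br * A.beta + A.rowind[k]] += A.data[k] * x[bc * A.beta + A.colind[k]];
    }
}

// Level 1 of parallelism: each blockrow writes a disjoint slice of y, so the
// loop below is race-free. (The paper uses Cilk++; std::async is a stand-in.)
void csb_spmv(const CSB& A, const std::vector<double>& x, std::vector<double>& y) {
    std::size_t nbr = A.n / A.beta;
    std::vector<std::future<void>> tasks;
    for (std::size_t br = 0; br < nbr; ++br)
        tasks.push_back(std::async(std::launch::async, csb_blockrowv,
                                   std::cref(A), br, std::cref(x), std::ref(y)));
    for (auto& t : tasks) t.get();
}
```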

9. Blockrow-Vector Multiplication
Recurse in parallel until the pieces have few enough nonzeros, then sum the results.
• Divide-and-conquer based on the nonzero count, not on spatial position.
• Allocation & accumulation costs of temporary vectors are amortized.
• Lemma: For β = Θ(√n), our parallel blockrow-vector multiplication has Θ(r) work and O(√n lg n) span on a blockrow containing r nonzeros.
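A hedged sketch of the divide-and-conquer idea on the illustrative CSB struct: split the blockrow's blocks so each half holds roughly half the nonzeros, process the halves in parallel (the right half into a temporary), then sum. The grain size and helper names are assumptions, and the real routine differs in details.

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Hypothetical recursive blockrow multiply over block columns [bclo, bchi) of
// blockrow `br`, accumulating into a length-beta slice `yrow`. The split point
// is chosen by nonzero count (not spatially): each half gets roughly half the
// nonzeros, and the right half writes a temporary that is summed back in.
void blockrow_recurse(const CSB& A, std::size_t br, std::size_t bclo, std::size_t bchi,
                      const std::vector<double>& x, std::vector<double>& yrow) {
    std::size_t nbc = A.n / A.beta;                  // blocks per blockrow
    std::size_t first = br * nbc;                    // id of the blockrow's first block
    std::size_t lo = A.blkptr[first + bclo], hi = A.blkptr[first + bchi];
    if (bchi - bclo == 1 || hi - lo <= 4096) {       // base case (assumed grain size)
        for (std::size_t bc = bclo; bc < bchi; ++bc)
            for (std::size_t k = A.blkptr[first + bc]; k < A.blkptr[first + bc + 1]; ++k)
                yrow[A.rowind[k]] += A.data[k] * x[bc * A.beta + A.colind[k]];
        return;
    }
    // Binary-search for the block boundary nearest the median nonzero.
    std::size_t target = lo + (hi - lo) / 2;
    std::size_t bcmid = std::upper_bound(A.blkptr.begin() + first + bclo + 1,
                                         A.blkptr.begin() + first + bchi,
                                         target) - (A.blkptr.begin() + first);
    if (bcmid >= bchi) bcmid = bchi - 1;             // keep both halves non-empty
    std::vector<double> tmp(A.beta, 0.0);            // temporary for the right half
    auto left = std::async(std::launch::async, blockrow_recurse, std::cref(A), br,
                           bclo, bcmid, std::cref(x), std::ref(yrow));
    blockrow_recurse(A, br, bcmid, bchi, x, tmp);
    left.get();
    for (std::size_t i = 0; i < A.beta; ++i) yrow[i] += tmp[i];  // sum the results
}
```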

10. Block-Vector Multiplication
For any (sub)block, first perform A00 and A11 in parallel, then A01 and A10 in parallel. Updates on y are race-free, since A00 and A11 write disjoint halves of the output subvector, as do A01 and A10.
• With Z-Morton ordering, spatial division into quadrants takes O(lg dim) time on a dim × dim (sub)block, using three binary searches.
• Lemma: For β = Θ(√n), our parallel block-vector multiplication has Θ(r) work and O(√n) span on a block with r nonzeros.
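A sketch of the quadrant split on a block whose nonzeros are stored in Z-Morton (interleaved-bit) order within the illustrative CSB struct; the bit-interleaving encoding and helper names are assumptions chosen to show why three binary searches suffice.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Interleave the bits of a (row, col) offset pair into a Z-Morton code, with
// the row bit the more significant of each pair. Assumed encoding for this sketch.
static std::uint32_t morton(std::uint16_t r, std::uint16_t c) {
    std::uint32_t z = 0;
    for (int b = 0; b < 16; ++b)
        z |= ((std::uint32_t(r >> b) & 1u) << (2 * b + 1)) |
             ((std::uint32_t(c >> b) & 1u) << (2 * b));
    return z;
}

// Nonzeros [lo, hi) of a dim x dim (sub)block, sorted by Morton code and with
// `base` the Morton code of the (sub)block's top-left corner (0 for a full block),
// split into the four quadrants A00 | A01 | A10 | A11 via three binary searches.
struct QuadSplit { std::size_t q1, q2, q3; };
QuadSplit split_quadrants(const CSB& A, std::size_t lo, std::size_t hi,
                          std::uint32_t base, std::uint32_t dim) {
    auto first_geq = [&](std::uint32_t code) {   // first k in [lo, hi) with morton >= code
        std::size_t l = lo, h = hi;
        while (l < h) {
            std::size_t m = l + (h - l) / 2;
            if (morton(A.rowind[m], A.colind[m]) < code) l = m + 1; else h = m;
        }
        return l;
    };
    std::uint32_t quad = (dim / 2) * (dim / 2);  // Morton codes per quadrant
    return { first_geq(base + quad), first_geq(base + 2 * quad), first_geq(base + 3 * quad) };
}
```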


13. Main Theorem and The Choice of β
Theorem: Our parallel matrix-vector multiplication has Θ(n²/β² + nnz) work and O(β lg(n/β) + n/β) span on an n × n CSB matrix containing nnz nonzeros in β × β blocks.
For β = Θ(√n), this yields a parallelism of Θ(nnz / (√n lg n)).
On our test matrices, parallelism ranges from 186 to 3498.
[Figure: sensitivity to β in theory and in practice.]
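As a rough sanity check on the parallelism bound, with hypothetical round numbers (not one of the paper's test matrices):

```latex
\[
  \text{parallelism} \;=\; \Theta\!\left(\frac{nnz}{\sqrt{n}\,\lg n}\right)
  \;\approx\; \frac{10^{8}}{10^{3}\cdot 20} \;=\; 5000
  \qquad \text{for } n = 10^{6},\ nnz = 10^{8}.
\]
```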

14. Ax and Aᵀx perform equally well
[Figure: MFlops/sec for the matrix-vector product and the matrix-transpose-vector product on 1, 2, 4, and 8 processors; 4-socket dual-core 2.2 GHz AMD Opteron 8214.]
Most test matrices had similar performance when multiplying by the matrix and its transpose.

15. Test Matrices and Performance Overview
[Figure: MFlops/sec vs. number of processors (1-8) for CSB_SpMV, CSB_SpMV_T, Star-P_SpMV, Star-P_SpMV_T, serial CSR_SpMV, and serial CSR_SpMV_T.]

16. Reality Check and Related Work
FACT: Sparse matrix-dense vector multiplication (and its transpose) is bandwidth limited.
This work is motivated by multicore/manycore architectures, where parallelism and memory bandwidth are the key resources.
• Previous work mostly focused on reducing communication volume in distributed memory, often by using graph or hypergraph partitioning [Catalyurek & Aykanat '99].
• Great optimization work for SpMV on multicores by Williams et al. '09, but without parallelism guarantees or multiplication with the transpose (SpMV_T).
• Blocking for sparse matrices is not new, but it has mostly targeted cache performance, not parallelism [Im et al. '04, Nishtala et al. '07].

17. Good Speedup Until Bandwidth Limit
• Slowed down the processors (by artificially introducing extra instructions) for this test, to hide memory issues.
• Shows the algorithm scales well given sufficient memory bandwidth.
[Figure: speedup vs. number of processors, 1-16.]
Ran on the smallest (and one of the most irregular) of the test matrices.

18. All about Bandwidth: Harpertown vs. Nehalem
• Intel Xeon X5460 (Harpertown) @ 3.16 GHz: dual-socket, quad-core, front-side bus @ 1333 MHz.
• Intel Core i7 920 (Nehalem) @ 2.66 GHz: single-socket, quad-core, QuickPath interconnect + Hyper-Threading.

19. Conclusions & Future Work
• CSB allows for efficient multiplication of a sparse matrix and its transpose by a dense vector.
Future Work:
• Does CSB work well with other computations? Sparse LU decomposition? Sparse matrix-matrix multiplication?
• For a symmetric matrix, we need only store the upper triangle. Can we multiply with one read of it (i.e., using half the bandwidth)?
Code (in C++ and Cilk++) is available from: http://gauss.cs.ucsb.edu/~aydin/software.html

20. Thank You!

21. CSB Space Usage
Lemma: For β = √n, CSB uses n lg nnz + nnz lg n bits of index data, matching CSR.
Proof:
• There are √n × √n blocks, hence n block pointers. Each block pointer uses lg nnz bits, for n lg nnz bits total.
• Each row (or column) offset within a block requires lg β = ½ lg n bits, so rowind and colind each take (nnz/2) lg n bits, for nnz lg n bits total.
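The bit count from the proof, written out for β = √n:

```latex
\[
  \underbrace{n \lg nnz}_{\text{block pointers}}
  \;+\;
  \underbrace{2 \cdot nnz \cdot \lg \beta}_{\text{rowind} + \text{colind}}
  \;=\; n \lg nnz + 2 \cdot nnz \cdot \tfrac{1}{2}\lg n
  \;=\; n \lg nnz + nnz \lg n .
\]
```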
