CSR SpMV with guaranteed workload balance



  1. CSR SpMV with guaranteed workload balance: Merge-based Parallel Decomposition. NVIDIA Research. Duane Merrill (dumerrill@nvidia.com), Michael Garland (mgarland@nvidia.com). January 26, 2019. D. Merrill and M. Garland, "Merge-Based Parallel Sparse Matrix-Vector Multiplication," SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, 2016, pp. 678-689. doi: 10.1109/SC.2016.57

  2. My soapbox
       1. Algorithmic parallel decomposition matters too
            • Versus delegation of scheduling entirely to compiler/runtime
       2. Workload imbalance in sparse applications
            • The biggest killer of machine utilization
            • Performance response for arbitrary inputs: reliable vs. capricious “face-planting”
       3. Standard data formats
            • Performance portability
       4. Evaluation methodology
            • Avoid overfitting by benchmarking on thousands to millions of datasets, not tens of datasets

  3. PERFORMANCE (IN)CONSISTENCY. [Figure: “Faceplant”] “Consistency is far better than rare moments of greatness” - Scott Ginsberg

  4. SPARSE MATRIX-VECTOR MULTIPLICATION. Lots of available parallelism.

         sparse matrix A           dense vector x      dense vector y
         1.0   --  1.0   --             1.0            (1.0)(1.0) + (1.0)(1.0)
          --   --   --   --      *      1.0       =    0.0
          --   --  3.0  3.0             1.0            (3.0)(1.0) + (3.0)(1.0)
         4.0  4.0  4.0  4.0             1.0            (4.0)(1.0) + (4.0)(1.0) + (4.0)(1.0) + (4.0)(1.0)
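The product on slide 4 can be reproduced with a few lines of sequential code. This is a minimal sketch, not the authors' implementation: it assumes the CSR arrays (`values`, `column_indices`, `row_offsets`) that the deck gives for this same matrix, and the function name `csr_spmv` is hypothetical.

```python
def csr_spmv(values, column_indices, row_offsets, x):
    """Sequential CSR SpMV: one dot product per row of A."""
    num_rows = len(row_offsets) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        total = 0.0
        # Nonzeros of this row live in values[row_offsets[row]:row_offsets[row + 1]].
        for k in range(row_offsets[row], row_offsets[row + 1]):
            total += values[k] * x[column_indices[k]]
        y[row] = total
    return y


# CSR encoding of the 4x4 example matrix A from the slides.
values = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0]
column_indices = [0, 2, 2, 3, 0, 1, 2, 3]
row_offsets = [0, 2, 2, 4, 8]
x = [1.0, 1.0, 1.0, 1.0]

y = csr_spmv(values, column_indices, row_offsets, x)
print(y)  # [2.0, 0.0, 6.0, 16.0]
```

Every row's dot product is independent of the others, which is the "lots of available parallelism" the slide refers to.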

  5. CSR PARALLEL DECOMPOSITION. Option (a): row-based
       values:          1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0
       column indices:  0 2 2 3 0 1 2 3
       row offsets:     0 2 2 4 8
       [Figure: matrix A, each row assigned to one of p0, p1, p2, p3]

  6. CSR PARALLEL DECOMPOSITION. Option (a): row-based. Annotation: imbalance!
       values:          1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0
       column indices:  0 2 2 3 0 1 2 3
       row offsets:     0 2 2 4 8
       [Figure: matrix A, each row assigned to one of p0, p1, p2, p3; p0 to p3 also marked over the CSR arrays]
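The imbalance flagged on slide 6 can be made concrete by counting how many nonzeros each of p0 to p3 receives when rows are dealt out evenly. The sketch below is illustrative only; the helper name `row_based_work` and the contiguous, ceiling-sized row blocks are assumptions, not something the slides specify.

```python
def row_based_work(row_offsets, num_procs):
    """Nonzeros handled by each processor when rows are split into equal blocks."""
    num_rows = len(row_offsets) - 1
    rows_per_proc = -(-num_rows // num_procs)  # ceiling division
    work = []
    for p in range(num_procs):
        first = min(p * rows_per_proc, num_rows)
        last = min(first + rows_per_proc, num_rows)
        # Nonzeros in rows [first, last) are a contiguous slice of the CSR arrays.
        work.append(row_offsets[last] - row_offsets[first])
    return work


row_offsets = [0, 2, 2, 4, 8]  # example matrix from the slides
work = row_based_work(row_offsets, 4)
print(work)  # [2, 0, 2, 4]
```

With one row each, p1 has nothing to do while p3 processes half of all nonzeros.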

  7. CSR PARALLEL DECOMPOSITION. Option (b): nonzero splitting
       values:          1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0
       column indices:  0 2 2 3 0 1 2 3
       row offsets:     0 2 2 4 8
       [Figure: matrix A; p0 to p3 marked over the CSR arrays]

  8. CSR PARALLEL DECOMPOSITION. Option (b): nonzero splitting. Annotation: imbalance!
       values:          1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0
       column indices:  0 2 2 3 0 1 2 3
       row offsets:     0 2 2 4 8
       [Figure: matrix A with p0 to p3 marked beside its rows; p0 to p3 also marked over the CSR arrays]
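Slide 8 does not spell out where the imbalance comes from. One plausible reading, stated here as an assumption, is that an equal share of nonzeros says nothing about how many row boundaries a processor must walk, which matters when many rows are empty. The sketch below counts both quantities per processor; `nonzero_split_rows` and the second matrix are hypothetical illustrations.

```python
from bisect import bisect_right


def nonzero_split_rows(row_offsets, num_procs):
    """Per processor: (nonzeros owned, row end-offsets falling in its nonzero range)."""
    nnz = row_offsets[-1]
    row_ends = row_offsets[1:]
    per_proc = -(-nnz // num_procs)  # ceiling division
    result = []
    for p in range(num_procs):
        lo = min(p * per_proc, nnz)
        hi = min(lo + per_proc, nnz)
        # Rows whose end offset lies in (lo, hi]; p0 also takes leading empty rows.
        first = bisect_right(row_ends, lo) if p > 0 else 0
        last = bisect_right(row_ends, hi)
        result.append((hi - lo, last - first))
    return result


slide_example = nonzero_split_rows([0, 2, 2, 4, 8], 4)
print(slide_example)  # [(2, 2), (2, 1), (2, 0), (2, 1)]

# Hypothetical matrix (not from the slides): 8 nonzeros, six empty rows in the middle.
many_empty = nonzero_split_rows([0, 4, 4, 4, 4, 4, 4, 4, 8], 4)
print(many_empty)  # [(2, 0), (2, 7), (2, 0), (2, 1)]
```

In both cases every processor owns exactly two nonzeros, yet in the second matrix one processor closes seven rows while two close none.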

  9. CSR PARALLEL DECOMPOSITION. Option (c): logical merger
       values:          1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0
       column indices:  0 2 2 3 0 1 2 3
       row_offsets:     0 2 2 4
       [Figure: matrix A and the CSR arrays, partitioned among p0, p1, p2, p3]
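The logical merger can be sketched as follows, following the merge-path idea of the cited SC '16 paper: the row end-offsets and the nonzero indices 0..nnz-1 are treated as two sorted lists, and each processor receives an equal share of their combined length. The starting coordinate of each share is found by a binary search along a diagonal of the merge grid. The function names and the Python rendering are assumptions for illustration, not the authors' code.

```python
def merge_path_search(diagonal, row_ends, nnz):
    """Return (row, nonzero) where the merge path crosses the given diagonal."""
    lo = max(diagonal - nnz, 0)
    hi = min(diagonal, len(row_ends))
    while lo < hi:
        pivot = (lo + hi) // 2
        # The nonzero index list is just 0, 1, 2, ..., so its element at
        # position (diagonal - pivot - 1) equals that position.
        if row_ends[pivot] <= diagonal - pivot - 1:
            lo = pivot + 1
        else:
            hi = pivot
    return lo, diagonal - lo


def merge_path_partition(row_offsets, num_procs):
    """Start coordinates for each processor, plus the final end coordinate."""
    row_ends = row_offsets[1:]
    nnz = row_offsets[-1]
    total = len(row_ends) + nnz  # rows + nonzeros = merge items
    per_proc = -(-total // num_procs)  # ceiling division
    return [merge_path_search(min(p * per_proc, total), row_ends, nnz)
            for p in range(num_procs + 1)]


coords = merge_path_partition([0, 2, 2, 4, 8], 4)
print(coords)  # [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8)]
```

For the example matrix there are 4 rows plus 8 nonzeros, so each of the four processors gets exactly 3 merge items, whatever the row lengths are.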

  10. IMBALANCE: CsrMV WITH ~35M NON-ZEROS

                                           thermomech_dK                cnr-2000              ASIC_320k
                                           (temperature deformation)    (Web connectivity)    (circuit simulation)
        Row-length coeff. of variation     0.10                         2.1                   61.4
        2x Xeon E5-2690 (24-cores each)
          Intel MKL (DP-GFLOPs)            17.9                         13.4                  11.8

        Annotation: 35% slower

  11. IMBALANCE: CsrMV WITH ~35M NON-ZEROS

                                           thermomech_dK                cnr-2000              ASIC_320k
                                           (temperature deformation)    (Web connectivity)    (circuit simulation)
        Row-length coeff. of variation     0.10                         2.1                   61.4
        2x Xeon E5-2690 (24-cores each)
          Intel MKL (DP-GFLOPs)            17.9                         13.4                  11.8
        NVIDIA K40M
          cuSPARSE (DP-GFLOPs)             12.4                         5.9                   0.12

        Annotation: 100x faceplant

  12. IMBALANCE: CsrMV WITH ~35M NON-ZEROS

                                           thermomech_dK                cnr-2000              ASIC_320k
                                           (temperature deformation)    (Web connectivity)    (circuit simulation)
        Row-length coeff. of variation     0.10                         2.1                   61.4
        2x Xeon E5-2690 (24-cores each)
          Intel MKL (DP-GFLOPs)            17.9                         13.4                  11.8
          Merge-based (DP-GFLOPs)          21.2                         22.8                  23.2
        NVIDIA K40M
          cuSPARSE (DP-GFLOPs)             12.4                         5.9                   0.12
          Merge-based (DP-GFLOPs)          15.5                         16.7                  14.1
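The row-length coefficient of variation in these tables is the standard deviation of the per-row nonzero counts divided by their mean. The slides do not say whether the population or sample standard deviation was used; the sketch below assumes the population form, and `row_length_cv` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def row_length_cv(row_offsets):
    """Coefficient of variation of row lengths, read from CSR row offsets."""
    lengths = [row_offsets[i + 1] - row_offsets[i]
               for i in range(len(row_offsets) - 1)]
    # Assumption: population standard deviation (pstdev), not the sample form.
    return pstdev(lengths) / mean(lengths)


# Small example matrix from the earlier slides: row lengths are 2, 0, 2, 4.
cv = row_length_cv([0, 2, 2, 4, 8])
print(round(cv, 3))  # 0.707
```

A value near 0 means rows are almost equally long; the tables report 0.10 for thermomech_dK and 61.4 for ASIC_320k.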

  13. GPU CsrMV PERFORMANCE LANDSCAPE. The entire Florida Sparse Matrix Collection (4.2K datasets, NVIDIA K40M). [Figure: two panels, “cuSPARSE CsrMV” and “Merge-based CsrMV”; x-axis: matrices by size; y-axis: runtime (ms), log scale from 0.001 to 1000; annotation: “Highly correlated with problem size!”]

  14. CPU CsrMV PERFORMANCE LANDSCAPE. The entire Florida Sparse Matrix Collection (4.2K datasets, 2x Intel Xeon E5-2690). [Figure: two panels, “MKL CsrMV” and “Merge-based CsrMV”; x-axis: matrices by size; y-axis: runtime (ms), log scale from 0.001 to 1000; annotation: “Highly correlated with problem size!”]

  15. CSRMV VISUALIZATION AS 2D “MERGE-PATH”

  16. CsrMV visualization as 2D “merge-path”
        ▪ Decision path runs from top-left to bottom-right
            Each step advances the pointer to the bigger item
            Breaks ties by always preferring the element from row_offsets
        ▪ Moves down when advancing within ℕ (Ax dot products)
            Action: accumulate nonzero dp-values
        ▪ Moves right when advancing within row_offsets
            Action: flush and reset accumulator
        [Figure: grid with row_offsets (0 2 2 4 8) across the top and ℕ (0 to 7) down the side, one Ax product per index: 0 (1.0)(1.0), 1 (1.0)(1.0), 2 (3.0)(1.0), 3 (3.0)(1.0), 4 (4.0)(1.0), 5 (4.0)(1.0), 6 (4.0)(1.0), 7 (4.0)(1.0); path labelled “start” at top-left and “end” at bottom-right; highlighted row_offsets entry: 2]

  17. CsrMV visualization as 2D “merge-path”. [Figure: same grid and legend as slide 16; highlighted row_offsets entry: 2; (+1.0) accumulated at index 0]

  18. CsrMV visualization as 2D “merge-path”. [Figure: same grid and legend as slide 16; highlighted row_offsets entry: 2; (+1.0) accumulated at indices 0 and 1]

  19. CsrMV visualization as 2D “merge-path”. [Figure: same grid and legend as slide 16; highlighted row_offsets entry: 2; (+1.0) accumulated at indices 0 and 1; value 2.0 flushed]

  20. CsrMV visualization as 2D “merge-path”. [Figure: same grid and legend as slide 16; highlighted row_offsets entry: 4; value 0.0 flushed]

  21. CsrMV visualization as 2D “merge-path”. [Figure: same grid and legend as slide 16; highlighted row_offsets entry: 4; (+3.0) accumulated at index 2]

  22. CsrMV visualization as 2D “merge-path”. [Figure: same grid and legend as slide 16; highlighted row_offsets entry: 4; (+3.0) accumulated at indices 2 and 3]

  23. CsrMV visualization as 2D “merge-path”. [Figure: same grid and legend as slide 16; highlighted row_offsets entry: 8; (+3.0) accumulated at indices 2 and 3; value 6.0 flushed]
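The walk shown across slides 16 to 23 can be traced in code. This is a sequential sketch of the traversal only, under the reading that the path moves down while the next nonzero index is below the current row's end offset and otherwise moves right, so that ties go to row_offsets. The function name `merge_path_spmv` and the recorded `trace` list are illustrative assumptions.

```python
def merge_path_spmv(values, column_indices, row_offsets, x):
    """Walk the merge path once, returning y and the list of moves taken."""
    row_ends = row_offsets[1:]
    y = [0.0] * len(row_ends)
    row, k, acc = 0, 0, 0.0
    trace = []
    while row < len(row_ends):
        if k < row_ends[row]:
            # Move down: consume one nonzero and accumulate its product.
            acc += values[k] * x[column_indices[k]]
            k += 1
            trace.append("down")
        else:
            # Move right: the row is complete, so flush and reset the accumulator.
            y[row] = acc
            acc = 0.0
            row += 1
            trace.append("right")
    return y, trace


values = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0]
column_indices = [0, 2, 2, 3, 0, 1, 2, 3]
row_offsets = [0, 2, 2, 4, 8]
x = [1.0, 1.0, 1.0, 1.0]

y, trace = merge_path_spmv(values, column_indices, row_offsets, x)
print(y)           # [2.0, 0.0, 6.0, 16.0]
print(trace[:7])   # ['down', 'down', 'right', 'right', 'down', 'down', 'right']
```

The first seven moves correspond to slides 17 to 23: two accumulations, a flush of 2.0, a flush of 0.0 for the empty row, two more accumulations, and a flush of 6.0. The full path has 12 moves, one per row plus one per nonzero, which is the quantity a parallel version would divide evenly.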
