CSR SpMV with guaranteed workload balance
Merge-based Parallel Decomposition

NVIDIA Research
Duane Merrill (dumerrill@nvidia.com)
Michael Garland (mgarland@nvidia.com)
January 26, 2019

D. Merrill and M. Garland, "Merge-Based Parallel Sparse Matrix-Vector Multiplication," SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, 2016, pp. 678-689. doi: 10.1109/SC.2016.57
My soapbox

1. Algorithmic parallel decomposition matters too
   • Versus delegation of scheduling entirely to compiler/runtime
2. Workload imbalance in sparse applications
   • The biggest killer of machine utilization
   • Performance response for arbitrary inputs: reliable vs. capricious “face-planting”
3. Standard data formats
   • Performance portability
4. Evaluation methodology
   • Avoid overfitting by benchmarking on 1Ks-1Ms of datasets, not 10s of datasets
PERFORMANCE (IN)CONSISTENCY

[Figure annotated “Faceplant”]

“Consistency is far better than rare moments of greatness” - Scott Ginsberg
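SPARSE MATRIX-VECTOR MULTIPLICATION

Lots of available parallelism

\[
\underbrace{\begin{bmatrix}
1.0 & \text{--} & 1.0 & \text{--} \\
\text{--} & \text{--} & \text{--} & \text{--} \\
\text{--} & \text{--} & 3.0 & 3.0 \\
4.0 & 4.0 & 4.0 & 4.0
\end{bmatrix}}_{\text{sparse matrix } A}
\ast
\underbrace{\begin{bmatrix} 1.0 \\ 1.0 \\ 1.0 \\ 1.0 \end{bmatrix}}_{\text{dense vector } x}
=
\underbrace{\begin{bmatrix}
(1.0)(1.0) + (1.0)(1.0) \\
0.0 \\
(3.0)(1.0) + (3.0)(1.0) \\
(4.0)(1.0) + (4.0)(1.0) + (4.0)(1.0) + (4.0)(1.0)
\end{bmatrix}}_{\text{dense vector } y}
\]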
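As a point of reference, here is a minimal sequential sketch of y = A * x over the CSR encoding of the example matrix. The function name `csr_spmv` is a hypothetical helper chosen for illustration, not something from the talk.

```python
# CSR encoding of the 4x4 example matrix; x is the all-ones dense vector.
row_offsets = [0, 2, 2, 4, 8]
column_indices = [0, 2, 2, 3, 0, 1, 2, 3]
values = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0]
x = [1.0, 1.0, 1.0, 1.0]


def csr_spmv(row_offsets, column_indices, values, x):
    """Sequential y = A * x for a CSR matrix (hypothetical helper)."""
    num_rows = len(row_offsets) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        total = 0.0
        # Nonzeros of this row occupy [row_offsets[row], row_offsets[row + 1]).
        for k in range(row_offsets[row], row_offsets[row + 1]):
            total += values[k] * x[column_indices[k]]
        y[row] = total
    return y


y = csr_spmv(row_offsets, column_indices, values, x)
print(y)  # [2.0, 0.0, 6.0, 16.0]
```

Every row's dot product is independent of the others, which is where the "lots of available parallelism" comes from.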
CSR PARALLEL DECOMPOSITION

Option (a): row-based

    values          1.0  1.0  3.0  3.0  4.0  4.0  4.0  4.0
    column indices  0    2    2    3    0    1    2    3
    row offsets     0    2    2    4    8

    A                        assigned to
    1.0  --   1.0  --        p0
    --   --   --   --        p1
    --   --   3.0  3.0       p2
    4.0  4.0  4.0  4.0       p3

With one row each, p0..p3 process 2, 0, 2, and 4 nonzeros respectively: imbalance!
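The imbalance of the row-based option can be counted directly from the row offsets. The sketch below assumes rows are handed out in equal contiguous chunks; `row_based_work` is a hypothetical name used only for illustration.

```python
row_offsets = [0, 2, 2, 4, 8]


def row_based_work(row_offsets, num_threads):
    """Nonzeros handled per thread when rows are split into equal chunks."""
    num_rows = len(row_offsets) - 1
    rows_per_thread = -(-num_rows // num_threads)  # ceiling division
    work = []
    for t in range(num_threads):
        first = min(t * rows_per_thread, num_rows)
        last = min(first + rows_per_thread, num_rows)
        # Nonzeros in rows [first, last) are a difference of row offsets.
        work.append(row_offsets[last] - row_offsets[first])
    return work


work = row_based_work(row_offsets, 4)
print(work)  # [2, 0, 2, 4]
```

The slowest thread does four units of work while another does none, so the decomposition is only as fast as the longest row allows.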
CSR PARALLEL DECOMPOSITION

Option (b): nonzero splitting

                    p0        p1        p2        p3
    values          1.0  1.0  3.0  3.0  4.0  4.0  4.0  4.0
    column indices  0    2    2    3    0    1    2    3
    row offsets     0    2    2    4    8

    A
    1.0  --   1.0  --
    --   --   --   --
    --   --   3.0  3.0
    4.0  4.0  4.0  4.0

The nonzeros divide evenly (two per p), but the row boundaries that fall inside each share do not: imbalance!
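A small sketch of nonzero splitting, assuming equal contiguous shares of nonzeros per thread. It counts how many rows end inside each share; the helper name `nonzero_split` and the convention of counting a row where it ends are illustrative assumptions.

```python
from bisect import bisect_right

row_offsets = [0, 2, 2, 4, 8]


def nonzero_split(row_offsets, num_threads):
    """Return (nonzeros, rows finished) per thread for an even nonzero split."""
    row_ends = row_offsets[1:]
    num_nonzeros = row_offsets[-1]
    per_thread = -(-num_nonzeros // num_threads)  # ceiling division
    plan = []
    for t in range(num_threads):
        nz_begin = min(t * per_thread, num_nonzeros)
        nz_end = min(nz_begin + per_thread, num_nonzeros)
        # Rows whose end offset lies in (nz_begin, nz_end]; thread 0 also
        # picks up any leading empty rows whose end offset is 0.
        lo = 0 if t == 0 else bisect_right(row_ends, nz_begin)
        hi = bisect_right(row_ends, nz_end)
        plan.append((nz_end - nz_begin, hi - lo))
    return plan


plan = nonzero_split(row_offsets, 4)
print(plan)  # [(2, 2), (2, 1), (2, 0), (2, 1)]
```

Each thread gets two nonzeros, but the number of rows it must finish varies from zero to two. With long runs of empty rows, that second number can grow without bound for a single thread.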
CSR PARALLEL DECOMPOSITION

Option (c): logical merger

                    p0        p1        p2        p3
    values          1.0  1.0  3.0  3.0  4.0  4.0  4.0  4.0
    column indices  0    2    2    3    0    1    2    3
    row_offsets     0    2    2    4

    A (as drawn on this slide)
    1.0  --   1.0  --
    --   --   --   --
    --   --   1.0  1.0
    1.0  1.0  1.0  1.0
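One way to sketch the logical merger is to treat the row-end offsets and the nonzero indices 0, 1, 2, ... as two sorted lists, then split their merged sequence into equal-length pieces with a binary search along each diagonal. This follows the search described in the cited SC '16 paper as best I can reconstruct it; the Python names (`merge_path_search`, `partition`) are illustrative assumptions.

```python
row_offsets = [0, 2, 2, 4, 8]
row_end_offsets = row_offsets[1:]  # one end offset per row
num_nonzeros = row_offsets[-1]


def merge_path_search(diagonal, row_end_offsets, num_nonzeros):
    """Return (row index, nonzero index) where the path crosses `diagonal`."""
    lo = max(diagonal - num_nonzeros, 0)
    hi = min(diagonal, len(row_end_offsets))
    while lo < hi:
        pivot = (lo + hi) // 2
        # The nonzero list is 0, 1, 2, ..., so its element at index i is i.
        if row_end_offsets[pivot] <= diagonal - pivot - 1:
            lo = pivot + 1
        else:
            hi = pivot
    return lo, diagonal - lo


def partition(row_end_offsets, num_nonzeros, num_threads):
    """Starting coordinates for each thread, plus the final end coordinate."""
    total = len(row_end_offsets) + num_nonzeros  # rows + nonzeros
    per_thread = -(-total // num_threads)  # ceiling division
    return [
        merge_path_search(min(t * per_thread, total), row_end_offsets, num_nonzeros)
        for t in range(num_threads + 1)
    ]


coords = partition(row_end_offsets, num_nonzeros, 4)
print(coords)  # [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8)]
```

Each of the four pieces covers exactly three items (rows plus nonzeros), regardless of how the nonzeros are distributed across rows.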
IMBALANCE: CsrMV WITH ~35M NON-ZEROS

                                         thermomech_dK              cnr-2000            ASIC_320k
                                         (temperature deformation)  (Web connectivity)  (circuit simulation)

    Row-length coeff. of variation       0.10                       2.1                 61.4

    2x Intel Xeon E5-2690 (24-cores each)
      MKL (DP-GFLOPs)                    17.9                       13.4                11.8
      Merge-based (DP-GFLOPs)            21.2                       22.8                23.2

    NVIDIA K40M
      cuSPARSE (DP-GFLOPs)               12.4                       5.9                 0.12
      Merge-based (DP-GFLOPs)            15.5                       16.7                14.1

MKL: 35% slower on ASIC_320k than on thermomech_dK.
cuSPARSE: 100x faceplant on ASIC_320k.
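The row-length coefficient of variation in the table is the standard deviation of the row lengths divided by their mean. The sketch below computes it for the small 4x4 example from the row offsets. It assumes the population standard deviation, since the slide does not say which variant was used.

```python
from statistics import mean, pstdev

row_offsets = [0, 2, 2, 4, 8]


def row_length_cv(row_offsets):
    """Coefficient of variation of CSR row lengths (population std / mean)."""
    lengths = [
        row_offsets[i + 1] - row_offsets[i] for i in range(len(row_offsets) - 1)
    ]
    return pstdev(lengths) / mean(lengths)


cv = row_length_cv(row_offsets)
print(round(cv, 3))  # 0.707
```

A value near zero means rows are nearly uniform in length, while large values such as the 61.4 reported for ASIC_320k indicate a few very long rows among many short ones.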
GPU CsrMV PERFORMANCE LANDSCAPE

The entire Florida Sparse Matrix Collection (4.2K datasets, NVIDIA K40M)

[Figure: two panels, cuSPARSE CsrMV and Merge-based CsrMV. x-axis: matrices by size; y-axis: runtime (ms), log scale from 0.001 to 1000. Annotation: "Highly correlated with problem size!"]
CPU CsrMV PERFORMANCE LANDSCAPE

The entire Florida Sparse Matrix Collection (4.2K datasets, 2x Intel Xeon E5-2690)

[Figure: two panels, MKL CsrMV and Merge-based CsrMV. x-axis: matrices by size; y-axis: runtime (ms), log scale from 0.001 to 1000. Annotation: "Highly correlated with problem size!"]
CSRMV VISUALIZATION AS 2D “MERGE-PATH”
CsrMV visualization as 2D “merge-path”

[Figure: 2D grid with row_offsets (0 2 2 4 8) along the top and the Ax dot products (ℕ) down the side; the path runs from "start" at the top-left to "end" at the bottom-right]

    Ax dot products (ℕ)
    0  (1.0)(1.0)
    1  (1.0)(1.0)
    2  (3.0)(1.0)
    3  (3.0)(1.0)
    4  (4.0)(1.0)
    5  (4.0)(1.0)
    6  (4.0)(1.0)
    7  (4.0)(1.0)

▪ Decision path runs from top-left to bottom-right
    Each step consumes the smaller of the two current items and advances that list's pointer
    Breaks ties by always preferring the element from row_offsets
▪ Moves down when advancing within the Ax dot products (ℕ)
    Action: accumulate nonzero dp-values
▪ Moves right when advancing within row_offsets
    Action: flush and reset accumulator

Animation frames, in order:

    down   nonzero 0       accumulate +1.0
    down   nonzero 1       accumulate +1.0
    right  row offset 2    flush 2.0
    right  row offset 2    flush 0.0
    down   nonzero 2       accumulate +3.0
    down   nonzero 3       accumulate +3.0
    right  row offset 4    flush 6.0
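The walk along the path can be sketched sequentially as follows. This is a minimal illustration of the down/right rule on the example matrix, assuming the path is compared against row-end offsets (row_offsets without its leading 0). `merge_path_walk` is a hypothetical name, and a parallel version would give each thread its own segment of the path.

```python
row_offsets = [0, 2, 2, 4, 8]
column_indices = [0, 2, 2, 3, 0, 1, 2, 3]
values = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0]
x = [1.0, 1.0, 1.0, 1.0]


def merge_path_walk(row_offsets, column_indices, values, x):
    """Walk the whole merge path sequentially; return (y, list of moves)."""
    row_end_offsets = row_offsets[1:]
    num_rows = len(row_end_offsets)
    y = [0.0] * num_rows
    moves = []
    row, nz, acc = 0, 0, 0.0
    while row < num_rows:
        if nz < row_end_offsets[row]:
            # Nonzero index is the smaller item: move down and accumulate.
            acc += values[nz] * x[column_indices[nz]]
            nz += 1
            moves.append("down")
        else:
            # Tie or smaller row offset: move right, flush, reset accumulator.
            y[row] = acc
            acc = 0.0
            row += 1
            moves.append("right")
    return y, moves


y, moves = merge_path_walk(row_offsets, column_indices, values, x)
print(y)           # [2.0, 0.0, 6.0, 16.0]
print(len(moves))  # 12
```

The path length is always the number of rows plus the number of nonzeros (4 + 8 = 12 here), which is what makes equal-length segments a balanced unit of work.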