SpArch: Efficient Architecture for Sparse Matrix Multiplication
Zhekai Zhang*¹, Hanrui Wang*¹, Song Han¹, William J. Dally²
¹Massachusetts Institute of Technology   ²Stanford University / Nvidia   (*Equal contributions)
sparch.mit.edu
Accelerate Sparse Matrix Multiplication
Two motivating workloads: graph computing (very large matrix dimension, extremely low density) and compressed neural networks (pruned weight matrices, also highly sparse).
CPUs and GPUs are Slow and Under-Utilized for SpMM (double precision)

                                   MKL on Intel    cuSPARSE on   CUSP on     Armadillo on
                                   Core i7-5930K   TITAN Xp      TITAN Xp    Arm Cortex-A53
Average SpMM GFLOPS
on 20 benchmarks [1,2]                 0.560           0.595       0.631       0.00813
Theoretical GFLOPS [3,4,5]             289             343         343         5.47
Utilization                            0.194%          0.173%      0.184%      0.149%

[1] Leskovec, Jure, and Rok Sosič. "SNAP: A general-purpose network analysis and graph-mining library." ACM Transactions on Intelligent Systems and Technology (TIST) 8.1 (2016): 1.
[2] Davis, Timothy A., and Yifan Hu. "The University of Florida Sparse Matrix Collection." ACM Transactions on Mathematical Software (TOMS) 38.1 (2011): 1.
[3] https://www.pugetsystems.com/labs/hpc/Linpack-performance-Haswell-E-Core-i7-5960X-and-5930K-594/
[4] https://www.techpowerup.com/gpu-specs/titan-x-pascal.c2863
[5] http://web.eece.maine.edu/~vweaver/group/machines.html
Challenges
• Super-large matrices, but limited on-chip memory
• Ultra-sparse matrices: low operational intensity, limited by memory bandwidth
Background: Outer Product
Multiply phase: each intermediate partial matrix is the outer product of one column of the left matrix $B$ and the corresponding row of the right matrix $C$: $Q_i = B_{:,i} \times C_{i,:}$.
Merge phase: the result is the sum of all partial matrices: $D = BC = \sum_i Q_i$.
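To make the two phases concrete, here is a minimal pure-Python sketch of outer-product SpGEMM under the formulation above; the dictionary-based sparse representation and the function name are illustrative choices of mine, not SpArch's actual data structures.

```python
# A minimal sketch of outer-product SpGEMM: the multiply phase forms one
# partial matrix Q_i per column of B / row of C, the merge phase sums them.
# The {(row, col): value} dictionaries are chosen for clarity, not performance.

def outer_product_spgemm(B_cols, C_rows):
    """B_cols[i] = sparse column i of B as {row: value};
       C_rows[i] = sparse row i of C as {col: value}."""
    # Multiply phase: Q_i = B[:, i] x C[i, :]
    partials = []
    for i in B_cols:
        Q_i = {(r, c): bv * cv
               for r, bv in B_cols[i].items()
               for c, cv in C_rows.get(i, {}).items()}
        partials.append(Q_i)

    # Merge phase: D = sum_i Q_i  (this is where the DRAM traffic problem lives)
    D = {}
    for Q_i in partials:
        for pos, v in Q_i.items():
            D[pos] = D.get(pos, 0.0) + v
    return D

# Example: column 0 of B is [1, 2]^T, row 0 of C is [3, 4]
B_cols = {0: {0: 1.0, 1: 2.0}}
C_rows = {0: {0: 3.0, 1: 4.0}}
print(outer_product_spgemm(B_cols, C_rows))  # {(0, 0): 3.0, (0, 1): 4.0, (1, 0): 6.0, (1, 1): 8.0}
```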
Background: Outer Product
Perfect input reuse: each input matrix is read from DRAM only once (dataflow: DRAM → multiplier array → DRAM in the multiply phase, then DRAM → merger → DRAM in the merge phase).
Background: Outer Product
Poor output reuse: the intermediate partial matrices must be stored to DRAM after the multiply phase and loaded back for the merge phase.
DRAM Access of the Intermediate Matrix in the Baseline Implementation
• Baseline implementation: OuterSPACE [Pal et al.], row-wise output stationary.
• Each intermediate (partial) matrix goes through one round of store to and load from DRAM.
[Plot] Distribution of DRAM accesses: # loads and stores vs. partial-matrix size (#non-zeros).
Pal, Subhankar, et al. "OuterSPACE: An outer product based sparse matrix multiplication accelerator." HPCA, IEEE, 2018.
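As a rough illustration of why this hurts (a back-of-the-envelope model of my own, with an assumed element size; none of these numbers come from the talk): every non-zero of every partial matrix crosses the DRAM interface twice, once stored and once loaded.

```python
# Back-of-the-envelope DRAM traffic for the baseline's partial matrices.
# ELEM_BYTES is an assumption (8-byte value + 4-byte index), not a number
# taken from OuterSPACE or SpArch.
ELEM_BYTES = 12

def baseline_partial_traffic(nnz_per_partial):
    """nnz_per_partial: list with the non-zero count of each partial matrix."""
    total_nnz = sum(nnz_per_partial)
    return 2 * total_nnz * ELEM_BYTES   # one store + one load per non-zero

# e.g. 100,000 partial matrices with ~2,000 non-zeros each -> ~4.8 GB of traffic
print(baseline_partial_traffic([2_000] * 100_000) / 1e9, "GB")
```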
Key Idea: Reduce Both Input and Partial-Matrix DRAM Access
Algorithm: outer product
• Technique 1: Pipelined Multiply and Merge
• Technique 2: Matrix Condensing
• Technique 3: Huffman Tree Scheduler
• Technique 4: Row Prefetcher
Multiply and Merge
Merging two partial-product rows is an element-wise add of two sparse vectors:
Index:            0  1    2  3    4    5    6  7    8  9     10   11  12   13   14  15
MatA:             0  0.1  0  0.5  0.2  0    0  0.3  0  0     0    0   0    1.2  0   0
MatB:             0  0    0  0.6  0    1.3  0  1.2  0  -0.8  2.2  0   1.1  0    0   0
Element-wise Add: 0  0.1  0  1.1  0.2  1.3  0  1.5  0  -0.8  2.2  0   1.1  1.2  0   0
Merge Phase
The same two rows in sparse (index, value) form:
MatA: (1, 0.1) (3, 0.5) (4, 0.2) (7, 0.3) (13, 1.2)
MatB: (3, 0.6) (5, 1.3) (7, 1.2) (9, -0.8) (10, 2.2) (12, 1.1)
Merge Phase
A sequential merge keeps two pointers, ptr0 over MatA's indices [1, 3, 4, 7, ...] and ptr1 over MatB's indices [3, 5, 7, 9, ...], and repeatedly emits the smaller index while advancing its pointer, so the merged index stream grows step by step: [1] → [1, 3, 3] → [1, 3, 3, 4] → [1, 3, 3, 4, 5] → [1, 3, 3, 4, 5, 7, 7] → ... (equal indices end up adjacent and their values are added). One element is produced per step: serialized in time.
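A small Python sketch of that sequential baseline, assuming the (index, value) representation from the previous slide; the variable names and structure are mine.

```python
# Sequential two-pointer merge of two sorted (index, value) rows,
# adding values that share an index. One loop iteration per output element.

def sequential_merge(A, B):
    out, i, j = [], 0, 0
    while i < len(A) or j < len(B):
        # Take the smaller index (A wins ties; the matching B element follows
        # immediately afterwards and gets added below).
        if j >= len(B) or (i < len(A) and A[i][0] <= B[j][0]):
            idx, val = A[i]; i += 1
        else:
            idx, val = B[j]; j += 1
        if out and out[-1][0] == idx:
            out[-1] = (idx, out[-1][1] + val)   # element-wise add of a duplicate index
        else:
            out.append((idx, val))
    return out

MatA = [(1, 0.1), (3, 0.5), (4, 0.2), (7, 0.3), (13, 1.2)]
MatB = [(3, 0.6), (5, 1.3), (7, 1.2), (9, -0.8), (10, 2.2), (12, 1.1)]
print(sequential_merge(MatA, MatB))
# -> (1, 0.1), (3, 1.1), (4, 0.2), (5, 1.3), (7, 1.5), (9, -0.8), ...  (up to float rounding)
```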
Technique 1: Pipelined Multiply and Merge
SpArch instead merges in parallel with a comparator array. MatA's sorted indices ([−∞, 1, 3, 4, 7, +∞], padded with sentinels) label one side of the array and MatB's sorted indices ([−∞, 3, 5, 7, 9, +∞]) label the other; each cell holds one comparison result (< or ≥). These comparison results directly give every element its position in the merged output, values with equal indices are added, and the merged row (indices 1, 3, 4, 5, 7, 9 with values 0.1, 1.1, 0.2, 1.3, 1.5, −0.8) is obtained in one clock cycle: parallel in space instead of serialized in time.
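The comparator-array idea can be emulated in software to check it: every element's slot in the merged output follows directly from its pairwise comparisons against the other operand, so all slots can be resolved independently (in hardware, in a single cycle). The sketch below is an illustrative emulation, not SpArch's RTL; the split into a position stage and an adder stage is my own framing of what the slide shows.

```python
# Software emulation of the comparator-array merger.

def comparator_positions(a_idx, b_idx):
    """Each element's slot in the merged stream comes only from pairwise
    comparisons, exactly the information a comparator array produces at once."""
    pos_a = [i + sum(1 for b in b_idx if b < a) for i, a in enumerate(a_idx)]
    pos_b = [j + sum(1 for a in a_idx if a <= b) for j, b in enumerate(b_idx)]
    return pos_a, pos_b

def merge_rows(A, B):
    """Merge two sorted (index, value) rows; values with equal indices are added."""
    pos_a, pos_b = comparator_positions([i for i, _ in A], [i for i, _ in B])
    merged = [None] * (len(A) + len(B))
    for p, (idx, val) in zip(pos_a, A):
        merged[p] = (idx, val)
    for p, (idx, val) in zip(pos_b, B):
        merged[p] = (idx, val)

    # Adder stage: collapse neighbouring entries that share an index.
    out = []
    for idx, val in merged:
        if out and out[-1][0] == idx:
            out[-1] = (idx, out[-1][1] + val)
        else:
            out.append((idx, val))
    return out

MatA = [(1, 0.1), (3, 0.5), (4, 0.2), (7, 0.3), (13, 1.2)]
MatB = [(3, 0.6), (5, 1.3), (7, 1.2), (9, -0.8), (10, 2.2), (12, 1.1)]
print(merge_rows(MatA, MatB))
# -> (1, 0.1), (3, 1.1), (4, 0.2), (5, 1.3), (7, 1.5), (9, -0.8), (10, 2.2), (12, 1.1), (13, 1.2)
```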
Technique 1: Pipelined Multiply and Merge
Putting it together: the multiplier array reads the input matrices from DRAM and emits several sorted streams of partial products (streams A, B, C, D in the figure, each a sequence of output coordinates); a merge tree built from these comparator-array mergers combines the streams on the fly, pipelining the multiply and merge phases.
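A software analogue of that merge tree: pairwise mergers combine the sorted partial-product streams from the multiplier lanes level by level until one reduced stream remains. It reuses merge_rows() from the sketch above; the tree shape and the example streams are illustrative only.

```python
# Level-by-level merge tree over sorted (index, value) streams.

def merge_tree(streams):
    """streams: list of sorted (index, value) lists, one per multiplier lane."""
    while len(streams) > 1:
        next_level = []
        for k in range(0, len(streams), 2):
            if k + 1 < len(streams):
                next_level.append(merge_rows(streams[k], streams[k + 1]))
            else:
                next_level.append(streams[k])   # odd stream passes through
        streams = next_level
    return streams[0] if streams else []

# e.g. four lanes, each already sorted by output coordinate
lanes = [[(2, 0.4)], [(2, 0.2), (8, 1.0)], [(1, 0.1)], [(1, 0.3), (7, 0.2)]]
print(merge_tree(lanes))   # [(1, 0.4), (2, 0.6), (7, 0.2), (8, 1.0)]
```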
Technique 1: Pipelined Multiply and Merge
Ideally, the multiplier array streams partial products straight into the merger, so partial matrices are never stored in DRAM.
Technique 1: Pipelined Multiply and Merge
However, there are still too many rounds: there can be up to 2^18 partial matrices, while the merger is limited to 64-way merging, so merged partial results must still make round trips through DRAM.
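A back-of-the-envelope illustration of why the round count matters (my own model, not a figure from the talk): with a k-way merger, merging N partial matrices takes ceil(log_k N) passes, and every pass before the last spills merged partial results back to DRAM.

```python
# Number of merge passes needed with a limited-way merger.

def merge_rounds(n_partial, ways=64):
    rounds = 0
    while n_partial > 1:
        n_partial = -(-n_partial // ways)   # ceiling division: one pass of k-way merging
        rounds += 1
    return max(rounds, 1)

print(merge_rounds(2**18))   # 3 passes: two of them spill partial sums to DRAM
print(merge_rounds(64))      # 1 pass: everything fits in a single on-chip merge
```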
Technique 1: Pipelined Multiply and Merge
[Plots] Distribution of DRAM accesses (# loads and stores vs. partial-matrix size in #non-zeros), OuterSPACE baseline vs. after pipelining; and breakdown of memory access (0 GB to 8 GB: read left matrix, read right matrix, R/W intermediate results, write final result) for the baseline (OuterSPACE) vs. pipelined multiply and merge.
Technique 2: Matrix Condensing
Technique 2: Matrix Condensing
[Figure] The right matrix C stored in CSR, and the condensed left matrix B′ stored in CSR.
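For reference, a tiny sketch of the CSR layout the slide mentions (the example matrix is made up): a row-pointer array, a column-index array, and a value array.

```python
# CSR (compressed sparse row) layout, as used for the right matrix C and the
# condensed left matrix B'. The 3x4 example matrix is made up:
#
#   [[0, 5, 0, 0],
#    [1, 0, 0, 2],
#    [0, 0, 0, 0]]
indptr  = [0, 1, 3, 3]        # row r's non-zeros live at positions indptr[r]:indptr[r+1]
indices = [1, 0, 3]           # column index of each non-zero
data    = [5.0, 1.0, 2.0]     # value of each non-zero

def row(r):
    """Iterate the (col, value) pairs of row r."""
    return list(zip(indices[indptr[r]:indptr[r + 1]], data[indptr[r]:indptr[r + 1]]))

print(row(1))   # [(0, 1.0), (3, 2.0)]
```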
Technique 2: Matrix Condensing
Before condensing: up to 2^18 partial matrices. After condensing: only 2^1 to 2^14 partial matrices.
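A sketch of what condensing does to the left matrix, assuming a simple row-wise sparse representation (the helper and data are made up for illustration): packing each row's non-zeros to the leftmost slots shrinks the number of condensed columns, and hence the number of partial matrices, to the maximum non-zeros per row, while each element keeps its original column index so it still selects the correct row of the right matrix during the multiply phase.

```python
# Matrix condensing on a CSR-like left matrix.

def condense(rows):
    """rows[r] = sorted list of (original_col, value) for row r of the left matrix.
    Returns condensed columns: cond[j] = list of (row, original_col, value),
    i.e. the j-th non-zero of every row."""
    max_nnz = max((len(nz) for nz in rows.values()), default=0)
    cond = [[] for _ in range(max_nnz)]
    for r, nz in rows.items():
        for j, (c, v) in enumerate(nz):
            cond[j].append((r, c, v))
    return cond

# Four rows whose non-zeros are scattered over 8 original columns...
left = {0: [(1, 0.1), (6, 0.4)],
        1: [(3, 0.5)],
        2: [(0, 0.2), (5, 0.3), (7, 1.2)],
        3: [(2, 0.7)]}
cond = condense(left)
print(len(cond))   # ...condense into just 3 columns, i.e. only 3 partial matrices.
# Each entry still remembers its original column c, which picks row c of the right matrix.
```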
Technique 2: Matrix Condensing
Fewer rounds: with far fewer partial matrices, the multiplier array and merger need far fewer passes through DRAM.
Technique 2: Matrix Condensing
[Plots] Distribution of DRAM accesses (# loads and stores vs. partial-matrix size in #non-zeros), after pipelining vs. after matrix condensing; and breakdown of memory access (0 GB to 8 GB: read left matrix, read right matrix, R/W intermediate results, write final result) for the baseline (OuterSPACE), pipelined multiply and merge, and matrix condensing; the matrix-condensing bar is annotated "5x less".