SpArch: Efficient Architecture for Sparse Matrix Multiplication


  1. sparch.mit.edu. SpArch: Efficient Architecture for Sparse Matrix Multiplication. Zhekai Zhang*1, Hanrui Wang*1, Song Han1, William J. Dally2. 1 Massachusetts Institute of Technology, 2 Stanford University / Nvidia. *Equal contributions.

  2. Accelerate Sparse Matrix Multiplication. Two motivating workloads: graph computing, whose adjacency matrices are huge and extremely sparse, and compressed (pruned) neural networks with sparse weight matrices. [Figure: an example matrix from each domain, annotated with its dimension and sparsity.]

  3. CPUs and GPUs are Slow and Under-Utilized for SpMM

     Library / Platform                          | Avg. SpMM GFLOPS (20 benchmarks) [1,2] | Theoretical GFLOPS [3,4,5] | Utilization
     MKL on Intel Core i7-5930K (double prec.)   | 0.560                                  | 289                        | 0.194%
     cuSPARSE on TITAN Xp                        | 0.595                                  | 343                        | 0.173%
     CUSP on TITAN Xp                            | 0.631                                  | 343                        | 0.184%
     Armadillo on Arm Cortex-A53                 | 0.00813                                | 5.47                       | 0.149%

     [1] Leskovec, Jure, and Rok Sosič. "SNAP: A general-purpose network analysis and graph-mining library." ACM Transactions on Intelligent Systems and Technology (TIST) 8.1 (2016): 1.
     [2] Davis, Timothy A., and Yifan Hu. "The University of Florida sparse matrix collection." ACM Transactions on Mathematical Software (TOMS) 38.1 (2011): 1.
     [3] https://www.pugetsystems.com/labs/hpc/Linpack-performance-Haswell-E-Core-i7-5960X-and-5930K-594/
     [4] https://www.techpowerup.com/gpu-specs/titan-x-pascal.c2863
     [5] http://web.eece.maine.edu/~vweaver/group/machines.html
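As a sanity check, the utilization column is simply the measured average divided by the theoretical peak; a few lines of Python make the arithmetic explicit (all numbers taken from the table above):

```python
# Utilization = average SpMM GFLOPS / theoretical peak GFLOPS (values from the table).
platforms = {
    "MKL on Core i7-5930K (double)": (0.560, 289),
    "cuSPARSE on TITAN Xp":          (0.595, 343),
    "CUSP on TITAN Xp":              (0.631, 343),
    "Armadillo on Arm Cortex-A53":   (0.00813, 5.47),
}
for name, (measured, peak) in platforms.items():
    print(f"{name}: {measured / peak:.3%} utilization")   # e.g. 0.194% for MKL
```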

  4-5. Challenges
     • Super-large matrices, but limited on-chip memory.
     • Ultra-sparse matrices, hence low operational intensity, but limited memory bandwidth.
     [Figure: the example matrices from slide 2 with their dimensions and densities.]

  6. Background: Outer Product. Multiply phase: each column of the left matrix B is multiplied with the matching row of the right matrix C to produce an intermediate partial matrix, Q_i = B[:, i] × C[i, :]. Merge phase: the result matrix is the sum of all partial matrices, D = B C = Σ_i Q_i.
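The decomposition into partial matrices is easy to check numerically. A minimal NumPy sketch (a dense toy example; the accelerator of course operates on sparse storage formats):

```python
import numpy as np

# Outer-product formulation: D = B C = sum_i B[:, i] (outer) C[i, :].
# Each outer product is one intermediate partial matrix Q_i.
rng = np.random.default_rng(0)
B = rng.random((4, 3)) * (rng.random((4, 3)) < 0.4)   # sparse-ish left matrix
C = rng.random((3, 5)) * (rng.random((3, 5)) < 0.4)   # sparse-ish right matrix

partials = [np.outer(B[:, i], C[i, :]) for i in range(B.shape[1])]  # multiply phase
D = sum(partials)                                                   # merge phase
assert np.allclose(D, B @ C)
```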

  7. Background: Outer Product. Perfect input reuse: each input matrix is read from DRAM only once. [Figure: multiply phase (DRAM feeding the multiplier array) and merge phase (merger exchanging data with DRAM).]

  8-12. Background: Outer Product. [Animation frames of the same multiply-and-merge dataflow figure.]

  13. Background: Outer Product. Bad output reuse: the intermediate partial matrices must be stored to DRAM and loaded back for merging. [Figure: the same dataflow, highlighting the partial-matrix traffic between the multiplier array, DRAM, and the merger.]

  14. DRAM Access of the Intermediate Matrix in the Baseline Implementation
     • Baseline implementation: OuterSPACE.
     • Row-wise, output stationary.
     • Each intermediate matrix goes through one round of store and load.
     [Plot: distribution of DRAM accesses, number of loads and stores vs. partial-matrix size (number of non-zeros).]
     Pal, Subhankar, et al. "OuterSPACE: An outer product based sparse matrix multiplication accelerator." HPCA, IEEE, 2018.
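A rough traffic model of that one-round store-and-load behaviour helps explain why the intermediate matrix dominates DRAM access; the per-non-zero byte count and the partial-matrix sizes below are illustrative assumptions, not numbers from the paper:

```python
# Toy model: every partial matrix is written to DRAM once after the multiply
# phase and read back once during the merge phase (one round of store and load).
BYTES_PER_NNZ = 12   # assumed: 8-byte value + 4-byte index, purely illustrative

def intermediate_traffic(partial_nnzs):
    """Total DRAM bytes spent storing and re-loading the intermediate results."""
    return sum(2 * nnz * BYTES_PER_NNZ for nnz in partial_nnzs)   # 1 store + 1 load

# Hypothetical distribution: 10,000 partial matrices of 5,000 non-zeros each.
print(f"{intermediate_traffic([5_000] * 10_000) / 1e9:.1f} GB of intermediate R/W")
```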

  15. Key idea: reduce DRAM access for both the input matrices and the partial matrices.
     Algorithm: Outer Product
     Technique 1: Pipelined Multiply and Merge
     Technique 2: Matrix Condensing
     Technique 3: Row Prefetcher
     Technique 4: Huffman Tree Scheduler

  16. Multiply and Merge: element-wise addition of two sparse vectors.
     Index             0   1    2   3    4    5    6   7    8   9     10   11  12   13   14  15
     MatA              0   0.1  0   0.5  0.2  0    0   0.3  0   0     0    0   0    1.2  0   0
     MatB              0   0    0   0.6  0    1.3  0   1.2  0   -0.8  2.2  0   1.1  0    0   0
     Element-wise Add  0   0.1  0   1.1  0.2  1.3  0   1.5  0   -0.8  2.2  0   1.1  1.2  0   0
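The last row is literally the element-wise sum of the two dense vectors; a two-line NumPy check reproduces it:

```python
import numpy as np

mat_a = np.array([0, 0.1, 0, 0.5, 0.2, 0,   0, 0.3, 0, 0,    0,   0, 0,   1.2, 0, 0])
mat_b = np.array([0, 0,   0, 0.6, 0,   1.3, 0, 1.2, 0, -0.8, 2.2, 0, 1.1, 0,   0, 0])
print(mat_a + mat_b)   # matches the "Element-wise Add" row of the table
```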

  17. Merge Phase: the same two vectors in sparse (index, value) form.
     MatA: (1, 0.1) (3, 0.5) (4, 0.2) (7, 0.3) (13, 1.2)
     MatB: (3, 0.6) (5, 1.3) (7, 1.2) (9, -0.8) (10, 2.2) (12, 1.1)
     [Table: the dense index/value view from slide 16.]

  18. Merge Phase: serial merge with two pointers.
     MatA: (1, 0.1) (3, 0.5) (4, 0.2) (7, 0.3) (13, 1.2), with ptr0 walking the index list [1, 3, 4, 7, ...]
     MatB: (3, 0.6) (5, 1.3) (7, 1.2) (9, -0.8) (10, 2.2) (12, 1.1), with ptr1 walking the index list [3, 5, 7, 9, ...]

  19-23. Merge Phase: one merge step per cycle. The two pointers advance and the merged index list grows one entry at a time: [1] → [1, 3, 3] → [1, 3, 3, 4] → [1, 3, 3, 4, 5] → [1, 3, 3, 4, 5, 7, 7] ...
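The pointer-chasing merge animated in slides 18-23 is the classic two-pointer merge of two index-sorted (index, value) lists, emitting one output per step; the sketch below also folds in the later "add values of same indices" step (plain Python, for illustration only):

```python
def merge_sparse(a, b):
    """Merge two index-sorted lists of (index, value) pairs, one output per step,
    adding the values of entries that share the same index."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            out.append(a[i]); i += 1
        elif a[i][0] > b[j][0]:
            out.append(b[j]); j += 1
        else:                                   # same index: add the two values
            out.append((a[i][0], a[i][1] + b[j][1])); i += 1; j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out

mat_a = [(1, 0.1), (3, 0.5), (4, 0.2), (7, 0.3), (13, 1.2)]
mat_b = [(3, 0.6), (5, 1.3), (7, 1.2), (9, -0.8), (10, 2.2), (12, 1.1)]
print(merge_sparse(mat_a, mat_b))   # indices 1, 3, 4, 5, 7, 9, 10, 12, 13
```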

  24-25. Technique 1: Pipelined Multiply and Merge. Parallelized in space instead of serialized in time: the index lists of MatA (1, 3, 4, 7) and MatB (3, 5, 7, 9), padded with −∞ and +∞, are placed on the two sides of a comparator array.

  26. Technique 1: Pipelined Multiply and Merge. Every MatA index is compared with every MatB index at once, and the </≥ outcomes fill the comparator array; from these outcomes each element's slot in the merged sequence is known directly. Values with the same index are then added, giving indices 1, 3, 4, 5, 7, 9 with values 0.1, 1.1, 0.2, 1.3, 1.5, -0.8.

  27. Technique 1: Pipelined Multiply and Merge. Add the values of equal indices and get the merged result in one clock cycle: indices 1, 3, 4, 5, 7, 9 with values 0.1, 1.1, 0.2, 1.3, 1.5, -0.8.
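In software terms, the comparator array computes every element's slot in the merged sequence in a single parallel step: a slot is the element's own position plus the number of entries from the other list that must precede it, and both counts can be read straight off the comparison matrix. The NumPy sketch below is my reading of that figure, not the actual RTL:

```python
import numpy as np

# Index lists and values from the slides (boundary padding omitted for clarity).
a_idx = np.array([1, 3, 4, 7]);  a_val = np.array([0.1, 0.5, 0.2, 0.3])
b_idx = np.array([3, 5, 7, 9]);  b_val = np.array([0.6, 1.3, 1.2, -0.8])

cmp = a_idx[:, None] <= b_idx[None, :]       # the array of "<" / ">=" outcomes

# Output slot = own rank + number of opposite-side entries that come before it.
pos_a = np.arange(len(a_idx)) + np.sum(~cmp, axis=1)   # B entries strictly smaller
pos_b = np.arange(len(b_idx)) + np.sum(cmp, axis=0)    # A entries smaller or equal

merged_idx = np.empty(len(a_idx) + len(b_idx), dtype=int)
merged_val = np.empty(len(a_idx) + len(b_idx))
merged_idx[pos_a], merged_val[pos_a] = a_idx, a_val
merged_idx[pos_b], merged_val[pos_b] = b_idx, b_val
print(merged_idx)   # [1 3 3 4 5 7 7 9]; a final stage adds the values of equal indices
```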

  28. Technique 1: Pipelined Multiply and Merge. The multiplier array produces several sorted partial-product streams (A to D below); a merge tree of comparator-array mergers combines them, and only the final merged stream is written to DRAM.
     A: (24)(26)(31)(52)(54)(56)(57)(58)(73)(75)
     B: (22)(28)(42)(44)(46)(47)(48)
     C: (11)(13)(15)(21)(23)(25)(41)(43)(45)
     D: (12)(14)(16)(17)(18)(32)(34)(36)(37)(38)(72)
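A simple software model of the merge tree: pair up the sorted streams and merge each pair, level by level, until a single sorted stream remains. The streams A to D are taken from the slide and treated here as plain integer tags just so they sort; heapq.merge stands in for one comparator-array merger:

```python
from heapq import merge as merge_pair   # software stand-in for one hardware merger

# Sorted streams produced by the multipliers (two-digit index tags from the slide).
streams = [
    [24, 26, 31, 52, 54, 56, 57, 58, 73, 75],       # A
    [22, 28, 42, 44, 46, 47, 48],                   # B
    [11, 13, 15, 21, 23, 25, 41, 43, 45],           # C
    [12, 14, 16, 17, 18, 32, 34, 36, 37, 38, 72],   # D
]

# One tree level per iteration: merge streams pairwise, halving their number.
while len(streams) > 1:
    streams = [list(merge_pair(*streams[i:i + 2]))
               for i in range(0, len(streams), 2)]

print(streams[0])   # a single fully sorted stream, ready to be written out
```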

  29. Technique 1: Pipelined Multiply and Merge. Ideally, partial matrices are never stored in DRAM. [Figure: the multiplier array feeding the merger directly, with DRAM only at the input and output.]

  30. Technique 1: Pipelined Multiply and Merge. However, there are too many rounds: the number of partial matrices, one per column of the left matrix, is far larger than the 64-way merging the merge tree supports, so intermediate results still spill to DRAM between rounds.
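Why the 64-way limit forces multiple rounds: with many more partial matrices than merger inputs, the merge has to run as a hierarchy of 64-way rounds, and everything produced before the last round goes through DRAM. A toy calculation (the partial-matrix count is an assumed example, not a figure from the paper):

```python
import math

MERGE_WAYS = 64   # the merge tree combines at most 64 partial matrices per round

def merge_rounds(num_partial_matrices):
    """Number of 64-way rounds needed to reduce all partial matrices to one result."""
    return max(1, math.ceil(math.log(num_partial_matrices, MERGE_WAYS)))

# Assumed example: one partial matrix per column of a 1,000,000-column left matrix.
print(merge_rounds(1_000_000))   # -> 4 rounds; rounds 1-3 spill their output to DRAM
```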

  31. Technique 1: Pipelined Multiply and Merge.
     [Plots: distribution of DRAM accesses (number of loads and stores vs. partial-matrix size) for the OuterSPACE baseline and after pipelining.]
     [Bar chart: breakdown of memory access (read left matrix, read right matrix, read/write intermediate results, write final result) on a 0 to 8 GB scale, for the OuterSPACE baseline vs. pipelined multiply and merge.]

  32. Technique 2: Matrix Condensing

  33. Technique 2: Matrix Condensing. The non-zeros of the left matrix are condensed to the leftmost columns, so one condensed column stands in for many original columns; each stored element keeps its original column index so the matching row of the right matrix can still be fetched. [Figure: right matrix C (CSR) and condensed matrix B′ (CSR).]
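A simplified software model of matrix condensing, assuming the left matrix is given as a list of CSR-style rows of (original_column, value) pairs: push each row's non-zeros to the leftmost slots, so condensed column j simply holds the j-th non-zero of every row together with its original column index (which the row prefetcher still needs in order to pick the matching row of the right matrix):

```python
# Left matrix as CSR-style rows of (original_column, value) pairs, sorted by column.
left_rows = [
    [(2, 1.0), (7, 3.0)],            # row 0
    [(0, 2.0)],                      # row 1
    [(1, 4.0), (5, 5.0), (9, 6.0)],  # row 2
]

# Condensed column j collects the j-th non-zero of every row that has one.
num_condensed_cols = max(len(row) for row in left_rows)
condensed_cols = [[] for _ in range(num_condensed_cols)]
for row_id, row in enumerate(left_rows):
    for j, (orig_col, val) in enumerate(row):
        condensed_cols[j].append((row_id, orig_col, val))

# 3 partial matrices (one per condensed column) instead of 6 (one per non-empty
# original column); orig_col tells the prefetcher which right-matrix row to fetch.
for j, col in enumerate(condensed_cols):
    print(f"condensed column {j}: {col}")
```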

  34. Technique 2: Matrix Condensing. Before condensing: up to one partial matrix per column of the left matrix. After condensing: only as many partial matrices as the longest row has non-zeros, which is orders of magnitude fewer.

  35. Technique 2: Matrix Condensing. Fewer merge rounds. [Figure: multiplier array, merger, and DRAM, with far fewer partial-matrix round trips.]

  36. Technique 2: Matrix Condensing.
     [Plots: distribution of DRAM accesses (number of loads and stores vs. partial-matrix size), after pipelining vs. after matrix condensing.]
     [Bar chart: breakdown of memory access (read left matrix, read right matrix, read/write intermediate results, write final result) on a 0 to 8 GB scale for the OuterSPACE baseline, pipelined multiply and merge, and matrix condensing; the matrix-condensing bar is marked "5x less".]
