  1. High-performance and Memory-saving Sparse General Matrix-Matrix Multiplication for Pascal GPU
     Yusuke Nagasaka, Akira Nukada, Satoshi Matsuoka
     Tokyo Institute of Technology

  2. Sparse General Matrix-Matrix Multiplication (SpGEMM)
     ■ Numerical applications, graph processing
       – AMG method, graph clustering
     ■ Low performance
       – The non-zero pattern of the output matrix is unknown before execution
         ■ Intermediate products must be accumulated into one non-zero element
         ■ Hard to manage memory allocation
     (Figure: SpGEMM example; input and output matrices stored in CSR format with value, column, and row-pointer arrays)
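
For reference, a minimal sketch (not from the slides; the struct and function names are illustrative) of the CSR layout and of the row-wise formulation of C = A * B that the deck builds on. The map-based accumulator in the inner loop plays exactly the role that SPA, and later the proposed hash table, take over:

    #include <map>
    #include <vector>

    // Hypothetical CSR container; field names are illustrative.
    struct CSR {
        int nrows = 0, nnz = 0;
        std::vector<int>   rowPtr;   // size nrows + 1
        std::vector<int>   colIdx;   // size nnz
        std::vector<float> val;      // size nnz
    };

    // Row-wise SpGEMM for one row i of C: every non-zero a_ik of A selects
    // row k of B, and the products a_ik * b_kj must be accumulated by
    // output column index j.
    void spgemm_row(const CSR& A, const CSR& B, int i, std::map<int, float>& acc) {
        for (int p = A.rowPtr[i]; p < A.rowPtr[i + 1]; ++p) {
            int   k = A.colIdx[p];
            float a = A.val[p];
            for (int q = B.rowPtr[k]; q < B.rowPtr[k + 1]; ++q)
                acc[B.colIdx[q]] += a * B.val[q];   // accumulate intermediate products
        }
    }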

  3. Accumulation of intermediate products
     Sparse Accumulator (SPA) [Gilbert, SIAM1992]
     (Figure: input matrices in sparse format and the SPA's value, bit-flag, and index arrays for one output row)

  4. Accumulation of intermediate products
     Sparse Accumulator (SPA) [Gilbert, SIAM1992]
     (Figure: continued animation; intermediate products are added into the SPA's value array, bit flags are set, and column indices are recorded)

  5. Accumulation of intermediate products
     Sparse Accumulator (SPA) [Gilbert, SIAM1992]
     (Figure: continued animation; the accumulated values and index list form the 0th row of the output)

  6. Accumulation of intermediate products
     Sparse Accumulator (SPA) [Gilbert, SIAM1992]
     ■ Pro: efficient accumulation of intermediate products; lookup cost is O(1)
     ■ Con: requires O(#columns) memory per thread
     (Figure: SPA state for the 0th row of the output)
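
A hedged sketch of the SPA idea described above (host-side C++; names are illustrative, not the authors' code): a dense value array and a bit-flag array of length #columns give O(1) accumulation, while an index list remembers which columns were touched. The dense arrays are what make a per-thread SPA too large for GPU shared memory:

    #include <vector>

    struct SPA {
        std::vector<float> val;    // dense accumulator, size = #columns of B
        std::vector<char>  flag;   // 1 if the column already holds a partial sum
        std::vector<int>   idx;    // column indices touched so far (unsorted)
        explicit SPA(int ncols) : val(ncols, 0.0f), flag(ncols, 0) {}

        // O(1) lookup and accumulate: the SPA's strength.
        void add(int col, float prod) {
            if (!flag[col]) { flag[col] = 1; idx.push_back(col); }
            val[col] += prod;
        }
        // The dense val/flag arrays cost O(#columns) memory per accumulator,
        // which is the weakness noted on slide 6.
    };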

  7. Memory Allocation of Output Matrix
     ■ The non-zero pattern of the output is unknown before execution
       – Cannot allocate the exact memory space for the output beforehand
     ■ Two ways to allocate the output
       – 1-phase: allocate a sufficiently large memory space for the output
       – 2-phase: count the #non-zeros of the output, then allocate memory for the output

                  Computation cost   Memory usage   Libraries
       1-phase    Low                Large          CUSP, BHSPARSE
       2-phase    High               Small          cuSPARSE
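
A minimal sketch of the 2-phase flow on the GPU (the two kernels in the comments are hypothetical placeholders; only the allocation pattern is the point): a symbolic pass counts non-zeros per output row, an exclusive prefix sum turns the counts into CSR row pointers, and only then is the output allocated exactly:

    #include <cuda_runtime.h>
    #include <thrust/device_ptr.h>
    #include <thrust/scan.h>

    // d_crpt must hold nrowsC + 1 ints; the symbolic pass fills the first
    // nrowsC entries with the per-row non-zero counts of C.
    void spgemm_two_phase(int nrowsC, int *d_crpt, int **d_ccol, float **d_cval)
    {
        // Phase 1 (symbolic): count_nnz_kernel<<<grid, block>>>(..., d_crpt);

        // Exclusive scan converts per-row counts into CSR row pointers.
        thrust::device_ptr<int> p(d_crpt);
        thrust::exclusive_scan(p, p + nrowsC + 1, p);

        int nnzC = 0;
        cudaMemcpy(&nnzC, d_crpt + nrowsC, sizeof(int), cudaMemcpyDeviceToHost);

        // Allocate exactly nnzC entries: small memory usage, at the cost of a second pass.
        cudaMalloc(d_ccol, nnzC * sizeof(int));
        cudaMalloc(d_cval, nnzC * sizeof(float));

        // Phase 2 (numeric): numeric_kernel<<<grid, block>>>(..., d_crpt, *d_ccol, *d_cval);
    }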

  8. SpGEMM on GPU
     ■ Massive parallelism
       – Simple row/column-based parallelization causes load imbalance
         ■ Computation cost differs greatly from row to row (or column to column)
     ■ Difficulty of memory management
       – Small global memory
         ■ Up to 16GB (P100 GPU)
       – Hierarchical memory
         ■ Shared memory (fast, but only 64KB per SM on P100)

  9. Contribution
     ■ We propose a GPU-optimized fast SpGEMM algorithm with low memory usage
       – Efficiently manage the column indices of the output matrix and accumulate intermediate products with a hash table
         ■ Utilize the GPU's shared memory for the hash table
       – Form row groups by the number of non-zero elements or intermediate products to improve load balance
       – Evaluate SpGEMM performance on the University of Florida Sparse Matrix Collection
         ■ Speedups of up to 4.3x in single precision and 4.4x in double precision
         ■ Memory usage is reduced by 14.7% in single precision and 10.9% in double precision

  10. Related work (1)
      ■ ESC Algorithm [Bell, SIAM2012]
        – Expansion: generate the list of all intermediate products
        – Sorting by column and row indices
        – Contraction: accumulate intermediate products
        – Each part can be executed with high parallelism
          ■ Overall performance is low since ESC requires large memory access and also large memory space
      ■ BHSPARSE [Liu, IPDPS2014]
        – For irregular matrices
        – Binning by the number of intermediate products per row
          ■ Switches the accumulation algorithm by bin: heap method, bitonic ESC method, merge path
        – Better load balance

  11. Related work (2)
      ■ Balanced Hash [Anh, ICS'16]
        – Improves load balance
          ■ Worklist: pairs of indices for the computation of intermediate products
            – The worklist is stored in global memory
        – Improves the accumulation process
          ■ Uses a hash table
            – Fixed-size hash table on shared memory
              ■ Wastes shared memory when the number of non-zeros is small
            – When a hash collision occurs, the products are added to a queue
              ■ Store the calculated elements of the table to memory, refresh the table, and then process the products in the queue
              ■ Repeat until the queue becomes empty
              ■ Additional memory usage and memory accesses for the queue

  12. Proposed Algorithm: Key Points
      ■ Two-phase execution
        – Steps (1)-(4): count the #non-zero elements of the output matrix
        – Steps (6)-(7): calculate the output matrix
        – Minimizes memory usage
      Algorithm outline:
      (1) Count #intermediate products
      (2) Divide the rows into groups by #intermediate products
      (3) Count #non-zero elements
      (4) Set row pointers of the output matrix
      (5) Allocate memory for the output matrix
      (6) Divide the rows into groups by #non-zero elements
      (7) Compute the output matrix
          a. Calculate values and column indices on the hash table
          b. Shrink the hash table
          c. Store to memory with sorting

  13. Proposed Algorithm: Key Points
      ■ Utilize a hash table as the accumulator
        – Allocated on fast shared memory
      ■ Divide the rows into groups by #intermediate products or #non-zero elements
        – Improves load balance through appropriate thread assignment
        – Better utilization of shared memory by adjusting the hash table size per group

  14. Proposed Algorithm: Count #intermediate products / Grouping
      ■ Rows are divided into several groups by #intermediate products or #non-zero elements
        – Improves load balance
        – Utilizes shared memory
        – #intermediate products is an upper bound on #non-zero elements
      ■ The cost of counting #intermediate products is relatively small
      (The algorithm steps (1)-(7) from slide 12 are repeated alongside.)
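
The counting in step (1) can be sketched as below (illustrative CUDA, not the authors' kernel; array names follow a generic CSR layout): one thread per row of A sums the lengths of the rows of B selected by that row's non-zeros, giving the number of intermediate products used for grouping:

    // arpt/acol: CSR row pointers and column indices of A
    // brpt:      CSR row pointers of B
    // nprod[i]:  number of intermediate products of row i of C
    __global__ void count_intermediate_products(int nrowsA,
                                                const int *arpt, const int *acol,
                                                const int *brpt, int *nprod)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nrowsA) return;
        int total = 0;
        for (int p = arpt[i]; p < arpt[i + 1]; ++p) {
            int k = acol[p];
            total += brpt[k + 1] - brpt[k];   // length of the selected row of B
        }
        nprod[i] = total;   // rows are later binned into groups by this value
    }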

  15. Proposed Algorithm: Count #Non-zero Elements / Compute the Output
      ■ Two-way thread assignment and memory access to the input matrices for load balance
        – Appropriate thread assignment for both dense rows and sparse rows
      ■ Column indices of the output matrix are managed by a hash table
        – Tables are kept in shared memory
      ■ One CUDA kernel for each group
        – To execute the groups concurrently, each kernel is assigned to a different CUDA stream
      (The algorithm steps (1)-(7) from slide 12 are repeated alongside.)
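
A sketch of the per-group launch mentioned above (illustrative host code; the kernel in the comment is a placeholder): each row group gets its own CUDA stream so that kernels for different groups can overlap:

    #include <cuda_runtime.h>

    void launch_group_kernels(int num_groups, const int *rows_in_group)
    {
        cudaStream_t streams[8];                      // assume at most 8 groups
        for (int g = 0; g < num_groups; ++g) cudaStreamCreate(&streams[g]);

        for (int g = 0; g < num_groups; ++g) {
            if (rows_in_group[g] == 0) continue;      // skip empty groups
            // hash_numeric_kernel<TABLE_SIZE_FOR_GROUP(g)>
            //     <<<rows_in_group[g], THREADS_FOR_GROUP(g), 0, streams[g]>>>(...);
        }

        for (int g = 0; g < num_groups; ++g) {
            cudaStreamSynchronize(streams[g]);
            cudaStreamDestroy(streams[g]);
        }
    }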

  16. Proposed Algorithm: Two-way Thread Assignment -1-
      ■ PWARP/ROW: partial warp / row
        – A partial warp (pwarp) is a bundle of 4 threads
        – 1 pwarp for each row of matrix A, and 1 thread for each non-zero element of A and the corresponding row of B
        – Selected for the groups with sparser rows
      (Figure: a pwarp of 4 threads mapped onto one row of A)
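
The mapping can be sketched as follows (illustrative CUDA; the hash-table insertion is omitted and only the thread-to-work assignment from the slide is shown). Launch with at least nrowsA * 4 threads:

    #define PWARP 4   // size of a partial warp

    __global__ void pwarp_row_map(int nrowsA, const int *arpt, const int *acol,
                                  const int *brpt, int *work)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int row  = tid / PWARP;      // one pwarp (4 threads) per row of A
        int lane = tid % PWARP;      // position of this thread inside its pwarp
        if (row >= nrowsA) return;

        int n = 0;
        // Each thread takes every PWARP-th non-zero of A's row ...
        for (int p = arpt[row] + lane; p < arpt[row + 1]; p += PWARP) {
            int k = acol[p];
            // ... and walks the corresponding row of B; in the real kernel each
            // visited entry of B is inserted into the row's hash table.
            n += brpt[k + 1] - brpt[k];
        }
        atomicAdd(&work[row], n);    // in practice reduced inside the pwarp
    }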

  17. Proposed Algorithm: Two-way Thread Assignment -2-
      ■ TB/ROW: thread block / row
        – Assign 1 thread block (TB) to each row of matrix A, 1 warp to each non-zero element of A, and 1 thread to each non-zero element of B
        – Selected for the groups with denser rows
      (Figure: a thread block mapped onto one row of A, with warps assigned to A's non-zeros)
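
A matching sketch of the TB/ROW mapping (illustrative CUDA, hash insertion again omitted). Launch with one thread block per row of A, i.e. gridDim.x = nrowsA:

    __global__ void tb_row_map(const int *arpt, const int *acol,
                               const int *brpt, int *work)
    {
        int row     = blockIdx.x;            // one thread block per row of A
        int warp_id = threadIdx.x / 32;      // warp index inside the block
        int lane    = threadIdx.x % 32;
        int nwarps  = blockDim.x / 32;

        int n = 0;
        // Each warp takes every nwarps-th non-zero of A's row ...
        for (int p = arpt[row] + warp_id; p < arpt[row + 1]; p += nwarps) {
            int k = acol[p];
            // ... and its 32 lanes stride over the corresponding row of B;
            // each visited entry becomes a hash-table operation in the real kernel.
            for (int q = brpt[k] + lane; q < brpt[k + 1]; q += 32)
                n += 1;
        }
        atomicAdd(&work[row], n);            // in practice reduced warp/block-wise
    }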

  18. Proposed Algorithm: Hash Table
      ■ The key is the column index of B
        – If the entry is empty, add the element
          ■ compare-and-swap
          ■ Each thread counts the number of non-zero elements
        – Linear probing
          ■ When a hash collision occurs, the algorithm tries the next entry
      (Figure: hash table for the 0th row; hash(1) = 0 and hash(2) = 0 collide, so the colliding key is placed in the next entry)
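
A sketch of the insertion described above (illustrative CUDA; the modulo hash and the EMPTY sentinel are assumptions, while compare-and-swap and linear probing are exactly what the slide names). The table is an int array in shared memory initialized to EMPTY; the return value feeds each thread's non-zero count in the symbolic phase, and in the numeric phase a matching value slot would additionally be updated with atomicAdd:

    #define EMPTY (-1)   // sentinel for an unused hash-table entry (assumption)

    // Insert the column index `key` of B into the row's shared-memory table.
    // Returns 1 if the key was newly inserted (a new non-zero), 0 otherwise.
    __device__ int hash_insert_count(int *table, int table_size, int key)
    {
        int pos = key % table_size;               // simple modulo hash (assumption)
        while (true) {
            int old = atomicCAS(&table[pos], EMPTY, key);
            if (old == EMPTY) return 1;           // empty entry claimed: new column index
            if (old == key)   return 0;           // already present: accumulate later
            pos = (pos + 1) % table_size;         // linear probing: try the next entry
        }
    }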

  19. Proposed Algorithm: Count #non-zero elements
      ■ Accumulate the per-thread counts of non-zero elements for each row
        – PWARP/ROW: use warp shuffle
        – TB/ROW: accumulate with warp shuffle at warp level, then accumulate the per-warp sums using shared memory
      (The algorithm steps (1)-(7) from slide 12 are repeated alongside.)
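
The reductions above can be sketched as follows (illustrative CUDA; __shfl_down_sync is the current form of the warp shuffle, while Pascal-era code would have used __shfl_down):

    // Warp-level sum of per-thread counts (PWARP/ROW, and the warp level of TB/ROW).
    __device__ int warp_reduce_sum(int v)
    {
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        return v;   // lane 0 of the warp holds the warp's total
    }

    // Block-level sum for TB/ROW: warp totals pass through shared memory
    // and the first warp reduces them again.
    __device__ int block_reduce_sum(int v)
    {
        __shared__ int warp_sums[32];
        int lane = threadIdx.x % 32, wid = threadIdx.x / 32;
        v = warp_reduce_sum(v);
        if (lane == 0) warp_sums[wid] = v;
        __syncthreads();
        int nwarps = (blockDim.x + 31) / 32;
        v = (threadIdx.x < nwarps) ? warp_sums[lane] : 0;
        if (wid == 0) v = warp_reduce_sum(v);
        return v;   // thread 0 of the block holds the row's total
    }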
