High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures


  1. High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures
     Yusuke Nagasaka†, Satoshi Matsuoka†, Ariful Azad‡, Aydın Buluç‡
     †Tokyo Institute of Technology / RIKEN Center for Computational Science, ‡Lawrence Berkeley National Laboratory

  2. Sparse General Matrix-Matrix Multiplication (SpGEMM)
     ■ Key kernel in graph processing and numerical applications
       – Markov clustering, betweenness centrality, triangle counting, ...
       – Preconditioner for linear solvers: the AMG (Algebraic Multigrid) method, where SpGEMM is a time-consuming part
     [Figure: example of SpGEMM with two sparse input matrices and the resulting output matrix]
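
     Since all of the accumulators discussed in the following slides work row by row, it may help to state the row-wise (Gustavson-style) formulation explicitly. This identity is standard; it is implied by the figure but not written out on the slide:

         % Row-wise formulation of C = A B for sparse inputs:
         % row i of C is a combination of the rows of B selected by
         % the nonzero columns k of row i of A.
         C(i,:) = \sum_{k \,:\, A(i,k) \neq 0} A(i,k)\, B(k,:)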

  3. Accumulation of intermediate products: Sparse Accumulator (SPA) [Gilbert, SIAM 1992]
     [Figure: SPA data structures (dense value array, bit-flag array, index list) set up for one output row; input matrices shown in sparse format]

  4. Accumulation of intermediate products: Sparse Accumulator (SPA) [Gilbert, SIAM 1992]
     [Figure: intermediate products (ah, ai, bk, bl) scattered and added into the SPA value array, with bit flags and index list updated]

  5. Accumulation of intermediate products: Sparse Accumulator (SPA) [Gilbert, SIAM 1992]
     [Figure: accumulated values and column indices gathered from the SPA into row 0 of the output matrix]

  6. Accumulation of intermediate products: Sparse Accumulator (SPA) [Gilbert, SIAM 1992]
     ■ Pro: efficient accumulation of intermediate products; lookup cost is O(1)
     ■ Con: requires O(#columns) memory per thread
     [Figure: SPA example, as on the previous slides]
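
     A minimal sketch of SPA-based, row-wise SpGEMM over CSR inputs may make the figure concrete. The CSR struct and field names are assumptions for illustration; the slides do not show the actual code:

         #include <vector>

         // Minimal CSR container (hypothetical names, not the paper's code).
         struct CSR {
             int rows = 0, cols = 0;
             std::vector<int>    rowptr;   // size rows+1
             std::vector<int>    colidx;   // size nnz
             std::vector<double> val;      // size nnz
         };

         // Row-wise SpGEMM with a SPA: dense value array + bit flag + list of
         // touched columns.  Each lookup/update is O(1); the SPA itself costs
         // O(#columns of B) memory.
         CSR spgemm_spa(const CSR& A, const CSR& B) {
             CSR C;
             C.rows = A.rows; C.cols = B.cols;
             C.rowptr.assign(A.rows + 1, 0);

             std::vector<double> value(B.cols, 0.0);   // SPA values
             std::vector<char>   flag(B.cols, 0);      // SPA bit flags
             std::vector<int>    index;                // columns touched in this row

             for (int i = 0; i < A.rows; ++i) {
                 index.clear();
                 for (int jj = A.rowptr[i]; jj < A.rowptr[i + 1]; ++jj) {
                     int k = A.colidx[jj];
                     double a = A.val[jj];
                     for (int kk = B.rowptr[k]; kk < B.rowptr[k + 1]; ++kk) {
                         int j = B.colidx[kk];
                         if (!flag[j]) { flag[j] = 1; index.push_back(j); }
                         value[j] += a * B.val[kk];    // accumulate intermediate product
                     }
                 }
                 // Gather the accumulated row into C and reset only the used SPA entries.
                 for (int j : index) {
                     C.colidx.push_back(j);
                     C.val.push_back(value[j]);
                     value[j] = 0.0; flag[j] = 0;
                 }
                 C.rowptr[i + 1] = static_cast<int>(C.colidx.size());
             }
             return C;
         }

     Note that the gather step emits column indices in the order they were first touched, so the output row is unsorted unless an extra sorting step is added; this connects to the sorted/unsorted distinction in the later slides.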

  7. Existing approaches for SpGEMM
     ■ Several sequential and parallel SpGEMM algorithms
       – Also packaged in software/libraries

       Algorithm (Library)   Accumulator   Sortedness (Input/Output)
       MKL                   -             Any / Select
       MKL-inspector         -             Any / Unsorted
       KokkosKernels         HashMap       Any / Unsorted
       Heap                  Heap          Sorted / Sorted
       Hash                  Hash Table    Any / Select

  8. Existing approaches for SpGEMM
     ■ Several sequential and parallel SpGEMM algorithms
       – Also packaged in software/libraries (see the table on the previous slide)
     ■ Questions:
       (a) What is the best algorithm/implementation for a problem at hand?
       (b) What is the best algorithm/implementation for the architecture to be used in solving the problem?

  9. Contribution
     ■ We characterize, optimize, and evaluate existing SpGEMM algorithms for real-world applications on modern multi-core and many-core architectures
       – Characterizing the performance of SpGEMM on shared-memory platforms
         ■ Intel Haswell and Intel KNL architectures
         ■ Identify bottlenecks and mitigate them
       – Evaluation including several use cases
         ■ A^2, square x tall-skinny, L*U for triangle counting
       – Showing the impact of keeping the output unsorted
       – A recipe for selecting the best-performing algorithm for a specific application scenario

  10. Benchmark for SpGEMM: thread scheduling cost
     ■ Evaluates the scheduling cost on Haswell and KNL architectures
       – OpenMP: static, dynamic and guided
     ■ Scheduling cost hurts SpGEMM performance
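
     A minimal sketch of the kind of micro-benchmark implied here (the slides do not show the actual benchmark code): time a near-empty parallel loop under different OpenMP schedules, so that scheduling overhead dominates.

         #include <omp.h>
         #include <cstdio>

         int main() {
             const int n = 1 << 20;
             double sum = 0.0;

             double t0 = omp_get_wtime();
             #pragma omp parallel for schedule(dynamic, 1) reduction(+ : sum)
             for (int i = 0; i < n; ++i) sum += i * 1e-9;   // tiny body: overhead dominates
             double t_dynamic = omp_get_wtime() - t0;

             t0 = omp_get_wtime();
             #pragma omp parallel for schedule(static) reduction(+ : sum)
             for (int i = 0; i < n; ++i) sum += i * 1e-9;
             double t_static = omp_get_wtime() - t0;

             std::printf("sum=%g  dynamic,1: %.3f ms  static: %.3f ms\n",
                         sum, 1e3 * t_dynamic, 1e3 * t_static);
             return 0;
         }

     Built with an OpenMP-enabled compiler (e.g. icpc -qopenmp), the gap between the two timings is one way to expose the per-chunk scheduling cost the slide refers to.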

  11. Benchmark for SpGEMM: memory allocation/deallocation cost
     ■ Identifies that allocation/deallocation of a large memory space is expensive
     ■ Parallel memory allocation scheme (see the sketch below)
       – Each thread independently allocates/deallocates memory and accesses only its own memory space
       – For SpGEMM, we can reduce the deallocation cost
     [Figure: parallel memory allocation scheme and deallocation cost]
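
     A minimal sketch of a per-thread allocation scheme in this spirit (the structure and names are assumptions, not the paper's code): each thread allocates its scratch buffers once, reuses them across all of its rows, and frees them only when it finishes.

         #include <omp.h>
         #include <vector>

         // Assumed per-thread scratch space for an accumulator: allocated once per
         // thread, reused for every row that thread processes, freed once at the end.
         void process_rows(int nrows, int ncols) {
             #pragma omp parallel
             {
                 // Thread-local buffers: no shared-allocator traffic, no per-row malloc/free.
                 std::vector<double> value(ncols, 0.0);
                 std::vector<char>   flag(ncols, 0);
                 std::vector<int>    touched;
                 touched.reserve(ncols);

                 #pragma omp for schedule(static)
                 for (int i = 0; i < nrows; ++i) {
                     // ... accumulate row i into value/flag/touched, emit it, then reset
                     // only the entries listed in 'touched' instead of reallocating.
                     touched.clear();
                 }
             }   // buffers are deallocated here, once per thread
         }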

  12. Benchmark for SpGEMM: impact of MCDRAM
     ■ MCDRAM provides high memory bandwidth
       – Obviously improves the STREAM benchmark
       – Performance of stanza-like memory access is unclear
         ■ Small blocks of consecutive elements
         ■ Access to rows of B in SpGEMM
     ■ It is hard to get the benefits of MCDRAM on very sparse matrices in SpGEMM
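
     For context only: when KNL is run in flat mode, one common way to place a specific array (for example, the values of B) in MCDRAM is the memkind/hbwmalloc interface. The evaluation below uses MCDRAM in cache mode, so this sketch is illustrative of the placement idea, not the paper's setup; it assumes the memkind library is installed and linked with -lmemkind.

         #include <hbwmalloc.h>   // memkind's high-bandwidth-memory allocator
         #include <cstdio>
         #include <cstdlib>

         int main() {
             const size_t nnz = size_t(1) << 24;
             bool in_hbm = true;

             // Try to place a (hypothetical) value array in MCDRAM; fall back to DDR.
             double* bval = static_cast<double*>(hbw_malloc(nnz * sizeof(double)));
             if (bval == nullptr) {
                 in_hbm = false;
                 bval = static_cast<double*>(std::malloc(nnz * sizeof(double)));
             }
             if (bval == nullptr) return 1;

             // ... fill the array and stream through it from the MCDRAM-backed buffer ...

             if (in_hbm) hbw_free(bval); else std::free(bval);
             return 0;
         }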

  13. Architecture-specific optimization: thread scheduling
     ■ Good load balance with static scheduling
       – Assigning work to threads by FLOP
       – Work assignment can be executed efficiently in parallel (see the sketch below)
         ■ Count the required FLOP of each row
         ■ Prefix sum to get the total FLOP of the SpGEMM
         ■ Assign rows to threads (e.g., for 3 threads, average FLOP = 11/3)
     [Figure: example input matrices with per-row FLOP counts (4, 1, 2, 4) and their prefix sum (4, 5, 7, 11)]
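
     A minimal sketch of this load-balancing step over CSR arrays. The splitting of rows by a lower-bound search on the prefix sum is an assumption about one reasonable implementation, not code taken from the slides:

         #include <omp.h>
         #include <vector>
         #include <algorithm>

         // Per-row FLOP of C = A*B in CSR form: flop(i) = sum over the nonzero
         // columns k of A's row i of nnz(B(k,:)).  Returns an (nrows+1)-long prefix sum.
         std::vector<long long> flop_prefix_sum(int nrows,
                                                const int* A_rowptr, const int* A_colidx,
                                                const int* B_rowptr) {
             std::vector<long long> ps(nrows + 1, 0);
             #pragma omp parallel for schedule(static)
             for (int i = 0; i < nrows; ++i) {
                 long long f = 0;
                 for (int jj = A_rowptr[i]; jj < A_rowptr[i + 1]; ++jj)
                     f += B_rowptr[A_colidx[jj] + 1] - B_rowptr[A_colidx[jj]];
                 ps[i + 1] = f;
             }
             for (int i = 0; i < nrows; ++i) ps[i + 1] += ps[i];   // serial prefix sum
             return ps;
         }

         // Assign contiguous row ranges so each thread gets roughly total/nthreads FLOP.
         // Thread t processes rows [offsets[t], offsets[t+1]).
         std::vector<int> partition_rows(const std::vector<long long>& ps, int nthreads) {
             std::vector<int> offsets(nthreads + 1, 0);
             const long long total = ps.back();
             for (int t = 1; t < nthreads; ++t) {
                 long long target = (total * t) / nthreads;
                 offsets[t] = static_cast<int>(
                     std::lower_bound(ps.begin(), ps.end(), target) - ps.begin());
             }
             offsets[nthreads] = static_cast<int>(ps.size()) - 1;   // = number of rows
             return offsets;
         }

     With the example from the figure (per-row FLOP 4, 1, 2, 4 and 3 threads), the prefix sum is 4, 5, 7, 11 and the targets are near 11/3 and 22/3, which is how the rows get grouped.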

  14. Architecture-specific optimization: accumulators for the symbolic and numeric phases
     ■ Optimizing algorithms for Intel architectures
     ■ Heap [Azad, 2016]
       – Priority queue indexed by column indices
       – Requires logarithmic time to extract elements
       – Space efficient: O(nnz(a_i*)), giving better cache utilization
     ■ Hash [Nagasaka, 2016] (see the sketch below)
       – Uses a hash table as the accumulator, based on GPU work
       – Low memory usage and high performance
       – Each thread allocates the hash table once and reuses it
       – Extended to HashVector to exploit wide vector registers
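
     A minimal sketch of a hash-based accumulator for one output row. The table sizing, the multiplicative hash constant, and linear probing are assumptions for illustration; the paper's actual implementation is not reproduced on the slides:

         #include <vector>

         // Accumulate one output row in a linear-probing hash table keyed by column
         // index.  Entries with key == -1 are empty; the table is sized to a power of
         // two at least twice the upper bound on this row's distinct columns.
         struct HashRow {
             std::vector<int>    keys;
             std::vector<double> vals;
             unsigned mask = 0;

             explicit HashRow(int upper_bound_nnz) {
                 int size = 16;
                 while (size < 2 * upper_bound_nnz) size <<= 1;   // load factor <= 0.5
                 keys.assign(size, -1);
                 vals.assign(size, 0.0);
                 mask = static_cast<unsigned>(size - 1);
             }

             void accumulate(int col, double v) {
                 unsigned h = (static_cast<unsigned>(col) * 107u) & mask;  // assumed hash
                 while (true) {
                     if (keys[h] == col) { vals[h] += v; return; }         // hit: add in
                     if (keys[h] == -1)  { keys[h] = col; vals[h] = v; return; }  // insert
                     h = (h + 1) & mask;                                   // linear probing
                 }
             }
         };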

  15. Architecture-specific optimization: HashVector
     ■ Utilizes the 256-bit and 512-bit wide vector registers of Intel architectures for hash probing (see the sketch below)
       – Reduces the number of probes caused by hash collisions
       – Requires a few more instructions for each check
         ■ Degrades performance when collisions in Hash are rare
     (a) Hash: 1) check the entry; 2) if the hash collides, check the next entry; 3) if the entry is empty, add the element
     (b) HashVector: 1) check multiple entries with a vector register; 2) if the element is not found and the row has an empty entry, add the element
     [Figure legend: element to be added, non-empty entry, empty entry]
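
     A minimal AVX-512 sketch of the vectorized probe, assuming 32-bit column keys, 16-entry buckets, and -1 as the empty marker; this illustrates the idea behind HashVector rather than reproducing the paper's code (compile with AVX-512F support, e.g. -xMIC-AVX512 on icpc):

         #include <immintrin.h>
         #include <cstdint>

         // Probe a 16-entry bucket of 32-bit keys in one shot.  Returns the matching
         // slot, else the first empty slot (key == -1), else -1 if the bucket is full.
         inline int probe16(const int32_t* bucket, int32_t key) {
             const __m512i vkeys  = _mm512_loadu_si512(bucket);        // 16 stored keys
             const __m512i vquery = _mm512_set1_epi32(key);            // broadcast query
             const __m512i vempty = _mm512_set1_epi32(-1);             // empty marker

             __mmask16 hit = _mm512_cmpeq_epi32_mask(vkeys, vquery);   // slots matching key
             if (hit) return static_cast<int>(_tzcnt_u32(hit));        // first match

             __mmask16 empty = _mm512_cmpeq_epi32_mask(vkeys, vempty); // free slots
             if (empty) return static_cast<int>(_tzcnt_u32(empty));    // first empty slot

             return -1;                                                // full, no match
         }

     The caller then either accumulates into the returned slot (if its key matches) or inserts there (if it was empty); a scalar Hash probe would instead walk the same 16 entries one by one.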

  16. Performance Evaluation

  17. Matrix data
     ■ Synthetic matrices
       – R-MAT, the recursive matrix generator
       – Two different non-zero patterns of synthetic matrices
         ■ ER: Erdős–Rényi random graphs
         ■ G500: graphs with power-law degree distributions, used for the Graph500 benchmark
       – Scale-n matrix: 2^n-by-2^n
       – Edge factor: the average number of non-zero elements per row of the matrix
     ■ SuiteSparse Matrix Collection
       – 26 sparse matrices used in several past works

  18. Evaluation environment
     ■ Cori system @NERSC
       – Haswell cluster
         ■ Intel Xeon Processor E5-2698 v3
         ■ 128GB DDR4 memory
       – KNL cluster
         ■ Intel Xeon Phi Processor 7250
           – 68 cores
           – 32KB/core L1 cache, 1MB/tile L2 cache
           – 16GB MCDRAM (quadrant, cache mode)
         ■ 96GB DDR4 memory
       – OS: SuSE Linux Enterprise Server 12 SP3
       – Intel C++ Compiler (icpc) ver. 18.0.0
         ■ -g -O3 -qopenmp

  19. Benefit of performance optimization: scheduling and memory allocation
     ■ Good load balance with static scheduling
     ■ For larger matrices, the parallel memory allocation scheme keeps performance high
     [Figure: A^2 of G500 matrices with edge factor = 16]

  20. Benefit of performance optimization: use of MCDRAM
     ■ Benefit of MCDRAM, especially on denser matrices

  21. Performance evaluation: A^2, scaling with density (KNL, ER)
     ■ Scale = 16
     ■ Different performance trends
       – Performance of MKL degrades with increasing density

  22. Performance evaluation: A^2, scaling with density (KNL, ER)
     ■ Performance gain from keeping the output unsorted

  23. Performance evaluation: A^2, scaling with density (KNL, G500)
     ■ Denser inputs do not simply bring a performance gain
       – Different from the ER matrices

  24. Performance evaluation: A^2, scaling with density (Haswell)
     ■ HashVector achieves much higher performance

  25. Performance evaluation: A^2, scaling with input size (KNL, ER)
     ■ Edge factor = 16
     ■ Hash and HashVector show good performance for any input size
