swSpTRSV: A Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture
Xinliang Wang (1,3), Weifeng Liu (2), Wei Xue (1,3), Li Wu (1,3)

Outline
1. Background
2. Sunway architecture
3. Sparse Level Tile layout
4. Producer-Consumer Pairing method
5. Experiment
6. Conclusion
clarencewxl@gmail.com

Sparse Triangular Solve

Example: Lx = b
Compute a solution vector x from the sparse linear system, where L is a square lower triangular sparse matrix and b is the right-hand-side vector.

System:
  1*x_0 = a
  1*x_1 = b
  2*x_1 + 1*x_2 = c
  3*x_0 + 1*x_3 = d

In matrix form:

\[
\underbrace{\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 2 & 1 & 0 \\ 3 & 0 & 0 & 1 \end{bmatrix}}_{L}
\underbrace{\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}}_{x}
=
\underbrace{\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}}_{b}
\]

  L: 4x4, nnzL = 6, known
  x: 4x1, dense, unknown
  b: 4x1, dense, known

Solution:
  x_0 = a
  x_1 = b
  x_2 = c - 2b
  x_3 = d - 3a

Use case:
• In direct methods for solving a sparse linear system Ax = b, A is first decomposed into LU, and the system is then solved as LUx = b. This is done by calling two sparse triangular solves, Ly = b and Ux = y.
• In iterative solvers, an incomplete LU preconditioner uses sparse triangular solves in a similar way.
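The two-solve use case can be illustrated with a small dense sketch. It is only a minimal illustration of the Ly = b, Ux = y sequence, assuming dense row-major storage and nonzero diagonals; the function names `forward_subst`, `backward_subst` and `lu_solve` are made up for this example and are not part of swSpTRSV.

```c
#include <assert.h>
#include <stddef.h>

/* Forward substitution: solve L*y = b for a dense lower triangular L
 * stored row-major as an n x n array. */
void forward_subst(size_t n, const double *L, const double *b, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = b[i];
        for (size_t j = 0; j < i; j++)
            sum -= L[i * n + j] * y[j];
        assert(L[i * n + i] != 0.0);   /* assumption: nonsingular L */
        y[i] = sum / L[i * n + i];
    }
}

/* Backward substitution: solve U*x = y for a dense upper triangular U. */
void backward_subst(size_t n, const double *U, const double *y, double *x)
{
    for (size_t i = n; i-- > 0; ) {
        double sum = y[i];
        for (size_t j = i + 1; j < n; j++)
            sum -= U[i * n + j] * x[j];
        assert(U[i * n + i] != 0.0);   /* assumption: nonsingular U */
        x[i] = sum / U[i * n + i];
    }
}

/* Solve (L*U) x = b with two triangular solves: L*y = b, then U*x = y. */
void lu_solve(size_t n, const double *L, const double *U,
              const double *b, double *y, double *x)
{
    forward_subst(n, L, b, y);
    backward_subst(n, U, y, x);
}
```

For example, with L = [[1,0],[3,1]], U = [[1,1],[0,2]] and b = (3, 13), the intermediate vector is y = (3, 4) and the solution is x = (1, 2).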

Single core: Sequential method
A sequential method based on CSC layout, solving Lx = b.
[Figure: example with rows labelled 0 to 15; the graphic was not recovered]
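A sequential CSC-based solve might look like the sketch below. It assumes that each column stores its diagonal entry first and that the diagonal is nonzero; the function name `sptrsv_csc` and the array names are illustrative, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Solve L*x = b for a lower triangular L in CSC format.
 * Assumption: the first stored entry of column j is the diagonal L(j,j). */
void sptrsv_csc(size_t n, const size_t *col_ptr, const size_t *row_idx,
                const double *val, const double *b, double *x)
{
    for (size_t i = 0; i < n; i++)
        x[i] = b[i];                       /* x starts as the right-hand side */

    for (size_t j = 0; j < n; j++) {
        size_t diag = col_ptr[j];
        assert(diag < col_ptr[j + 1] && row_idx[diag] == j);
        x[j] /= val[diag];                 /* x[j] is now final */
        /* Subtract column j's contribution from the rows below it. */
        for (size_t k = diag + 1; k < col_ptr[j + 1]; k++)
            x[row_idx[k]] -= val[k] * x[j];
    }
}
```

For the 4x4 example system, the CSC arrays are col_ptr = {0, 2, 4, 5, 6}, row_idx = {0, 3, 1, 2, 2, 3} and val = {1, 3, 1, 2, 1, 1}. With (a, b, c, d) = (1, 2, 7, 5) the solve gives x = (1, 2, 3, 2), matching x_2 = c - 2b and x_3 = d - 3a.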

A few cores: Level-set method
[Figure: rows 0 to 15 of the example grouped into Level 0, Level 1, Level 2 and Level 3, with the rows of each level assigned to Thread 0, Thread 1 and Thread 2]
Parallel within each level, sequential between levels.
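One common way to build such levels is to give every row a level one greater than the deepest row it depends on. The sketch below assumes CSR storage with the diagonal present in every row; `compute_levels` and `sptrsv_levelset` are hypothetical names, and the inner loop over rows is written sequentially where a real implementation would distribute the rows of one level over threads.

```c
#include <assert.h>
#include <stddef.h>

/* level[i] = 0 if row i depends only on itself, otherwise
 * 1 + max(level[j]) over all off-diagonal entries L(i,j). Returns the level count. */
size_t compute_levels(size_t n, const size_t *row_ptr, const size_t *col_idx,
                      size_t *level)
{
    size_t nlevels = 0;
    for (size_t i = 0; i < n; i++) {
        size_t lv = 0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            size_t j = col_idx[k];
            if (j < i && level[j] + 1 > lv)
                lv = level[j] + 1;
        }
        level[i] = lv;
        if (lv + 1 > nlevels)
            nlevels = lv + 1;
    }
    return nlevels;
}

/* Solve L*x = b level by level. Rows that share a level are independent. */
void sptrsv_levelset(size_t n, const size_t *row_ptr, const size_t *col_idx,
                     const double *val, const double *b,
                     const size_t *level, size_t nlevels, double *x)
{
    for (size_t lv = 0; lv < nlevels; lv++) {      /* sequential between levels */
        /* This scan is for brevity; bucketing rows by level avoids it. */
        for (size_t i = 0; i < n; i++) {           /* parallelizable within a level */
            if (level[i] != lv)
                continue;
            double sum = b[i], diag = 0.0;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                if (col_idx[k] == i)
                    diag = val[k];
                else
                    sum -= val[k] * x[col_idx[k]];
            }
            assert(diag != 0.0);                   /* assumption: diagonal stored */
            x[i] = sum / diag;
        }
    }
}
```

For the 4x4 example system, rows 0 and 1 land in level 0 and rows 2 and 3 in level 1, so two synchronization steps are enough.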

More cores: P2P method (CPU/MIC)
[Figure: dependency graph with Level 0, Level 1, Level 2 and Level 3]
• No full synchronization
• Only synchronize between Thread 0 and Thread 2
Park J, et al. Sparsifying synchronization for high-performance shared-memory sparse triangular solver. International Supercomputing Conference. Springer, Cham, 2014: 124-140.

More cores: Sync-free method (GPU)
[Figure: dependency graph with Level 0, Level 1, Level 2 and Level 3]
• Thread 0 and Thread 2 modify the same value by atomic operations.
Liu W, et al. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves. European Conference on Parallel Processing. Springer International Publishing, 2016: 617-630.
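The dependency-counting idea behind level-free methods can be emulated on a single thread. The sketch below is not the GPU kernel from the cited paper: it is a sequential stand-in in which a ready queue replaces the busy-waiting threads, and the two updates marked in the comments are the ones that would need atomic operations when several threads touch the same entry. It assumes CSC storage with the diagonal stored first in each column; the names `sptrsv_dependency_counting`, `in_degree` and `left_sum` are chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Solve L*x = b by tracking, for every row, how many off-diagonal
 * dependencies are still unresolved. Returns the number of solved unknowns
 * (equal to n on success, 0 if allocation fails). */
size_t sptrsv_dependency_counting(size_t n, const size_t *col_ptr,
                                  const size_t *row_idx, const double *val,
                                  const double *b, double *x)
{
    size_t *in_degree = malloc(n * sizeof *in_degree);
    double *left_sum = malloc(n * sizeof *left_sum);
    size_t *ready = malloc(n * sizeof *ready);
    size_t head = 0, tail = 0;

    if (!in_degree || !left_sum || !ready) {
        free(in_degree); free(left_sum); free(ready);
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        in_degree[i] = 0;
        left_sum[i] = 0.0;
    }
    /* Count the off-diagonal entries of each row. */
    for (size_t j = 0; j < n; j++)
        for (size_t k = col_ptr[j] + 1; k < col_ptr[j + 1]; k++)
            in_degree[row_idx[k]]++;
    for (size_t i = 0; i < n; i++)
        if (in_degree[i] == 0)
            ready[tail++] = i;

    while (head < tail) {
        size_t j = ready[head++];
        size_t diag = col_ptr[j];
        assert(row_idx[diag] == j);              /* assumption: diagonal first */
        x[j] = (b[j] - left_sum[j]) / val[diag];
        for (size_t k = diag + 1; k < col_ptr[j + 1]; k++) {
            size_t r = row_idx[k];
            left_sum[r] += val[k] * x[j];        /* atomic add in a parallel version */
            if (--in_degree[r] == 0)             /* atomic decrement in a parallel version */
                ready[tail++] = r;
        }
    }
    free(in_degree); free(left_sum); free(ready);
    return head;
}
```

On the 4x4 example system, rows 0 and 1 start with no pending dependencies; solving them releases rows 3 and 2 respectively.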

Background
Problem: Sparse Triangular Solve
[Figure: example with rows labelled 0 to 15]
Architecture: Sunway Processor

Sunway TaihuLight: Overview

Entire system
  Peak performance:        125 PFlops
  Linpack performance:     93 PFlops (74.4% of peak)
  Total memory:            1310.72 TB
  Total memory bandwidth:  5591.45 TB/s
  Number of nodes:         40,960
  Number of cores:         10,649,600
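The headline figures are mutually consistent, as a quick check shows. The per-node core count is not stated on this slide; it is assumed here from the SW26010 organization (four core groups, each with one MPE and an 8*8 CPE mesh), with one processor per node.

\[
\begin{aligned}
\text{cores per node} &= 4 \times (1 + 8 \times 8) = 260 \\
\text{total cores} &= 40{,}960 \times 260 = 10{,}649{,}600 \\
\text{Linpack efficiency} &= 93 / 125 = 74.4\% \\
\text{memory per node} &= 1310.72\ \text{TB} / 40{,}960 = 32\ \text{GB}
\end{aligned}
\]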

SW26010 Processor
[Figure: SW26010 block diagram. Four core groups (Core Group 0 to Core Group 3) are connected by a NoC. Each core group has an MPE, an 8*8 CPE mesh, a PPU, and an iMC attached to its own memory. Labels inside the CPE mesh: computing core, registers, SPM/LDM, row communication bus, column communication bus, control network, data transfer network, Transfer Agent (TA). Levels marked on the diagram: memory level, LDM level, register level, computing level.]
Bandwidths annotated on the diagram:
• Direct Memory Access (DMA): 22.6 GB/s
• Global Load/Store (Gload/Gstore): 1.5 GB/s

Register Communication
[Figure: CPEs labelled with Put, Get R (row) and Get C (column)]
Primitives: putr / getr (row), putc / getc (column)

P2P test:

    // Even-numbered CPEs keep sending to CPE id + 1;
    // odd-numbered CPEs keep receiving.
    if (id % 2 == 0) {
        while (1)
            putr(data, id + 1);
    } else {
        while (1)
            getr(&data);
    }

Latency: less than 11 cycles
Integrated bandwidth: 637 GB/s
Xu, Zhigeng, James Lin, and Satoshi Matsuoka. "Benchmarking SW26010 Many-Core Processor." Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE, 2017.

SW26010 Processor
• Manual cache system (SPM)
• Direct memory access (DMA)
• Limited register communication

Mismatch between SpTRSV and Sunway
• Manual cache system
• Direct memory access
• Register communication

Manual cache system:
• Branch code is needed to check whether each access hits or misses the cache
• The cost of the branch is high:
  - It costs a lot even on a cache hit
  - It hurts the instruction pipeline
  - It makes prefetching difficult
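The branch overhead can be made concrete with a toy software-managed cache. The sketch below is a generic direct-mapped cache, not the Sunway SPM interface: `memcpy` stands in for a DMA transfer, the sizes `LINE_ELEMS` and `NUM_LINES` are arbitrary, and all names are hypothetical. It assumes the backing array length is a multiple of `LINE_ELEMS`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_ELEMS 4   /* elements per cache line (arbitrary for this sketch) */
#define NUM_LINES  8   /* number of cache lines (arbitrary for this sketch) */

typedef struct {
    double data[NUM_LINES][LINE_ELEMS];  /* would live in scratchpad memory */
    size_t tag[NUM_LINES];
    int    valid[NUM_LINES];
    size_t hits, misses;
} soft_cache;

void soft_cache_init(soft_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Read main_mem[index] through the software cache. */
double soft_cache_read(soft_cache *c, const double *main_mem, size_t index)
{
    size_t block = index / LINE_ELEMS;
    size_t line = block % NUM_LINES;

    /* This check runs on every access, so a hit still pays for the branch. */
    if (!c->valid[line] || c->tag[line] != block) {
        /* Miss: fetch the whole line; memcpy stands in for a DMA get. */
        memcpy(c->data[line], main_mem + block * LINE_ELEMS,
               sizeof c->data[line]);
        c->tag[line] = block;
        c->valid[line] = 1;
        c->misses++;
    } else {
        c->hits++;
    }
    return c->data[line][index % LINE_ELEMS];
}
```

Because SpTRSV reads x at indices given by the sparsity pattern, which line is needed next is data dependent, and that is why the check cannot simply be hoisted out of the inner loop.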

Mismatch between SpTRSV and Sunway

Limitation of register communication:
• Communication can only happen within the same row or the same column of the CPE mesh
• Communication cycle + random communication size ≈ deadlock
[Figure: CPE (0,0), CPE (0,1), CPE (1,0) and CPE (1,1), with a communication cycle among them]
Lin H, et al. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores. Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 2017: 635-645.
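The row-or-column restriction means a message between CPEs that share neither a row nor a column needs a relay. The sketch below models only that reachability rule; it does not use the real register communication primitives. The type `cpe_id`, the functions `can_send_directly` and `route`, and the choice of routing along the row first are assumptions of this illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int row;
    int col;
} cpe_id;

/* Register communication reaches only CPEs in the same row or column. */
bool can_send_directly(cpe_id src, cpe_id dst)
{
    return src.row == dst.row || src.col == dst.col;
}

/* Returns the number of hops from src to dst:
 * 0 if they are the same CPE, 1 if a direct transfer is possible,
 * 2 if a relay is needed, in which case *relay is set to the CPE
 * in src's row and dst's column (row bus first, then column bus). */
int route(cpe_id src, cpe_id dst, cpe_id *relay)
{
    if (src.row == dst.row && src.col == dst.col)
        return 0;
    if (can_send_directly(src, dst))
        return 1;
    relay->row = src.row;
    relay->col = dst.col;
    return 2;
}
```

With this rule, CPE (0,0) reaches CPE (1,1) in two hops through CPE (0,1). Relays of this kind are what allow communication cycles to form among the four CPEs.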