
swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture
Xinliang Wang (1,3), Weifeng Liu (2), Wei Xue (1,3), Li Wu (1,3)

Outline
1. Background
2. Sunway architecture
3. Sparse Level Tile layout
4. Producer-Consumer Pairing method
5. Experiment
6. Conclusion
clarencewxl@gmail.com

Sparse Triangular Solve
Example: Lx = b
Compute a solution vector x from the sparse linear system Lx = b, where L is a square lower triangular sparse matrix and b is the right-hand-side vector.

System:
  1*x0 = a
  1*x1 = b
  2*x1 + 1*x2 = c
  3*x0 + 1*x3 = d

In matrix form, with L (4x4, nnzL = 6, known), x (4x1, dense, unknown) and b (4x1, dense, known):

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 2 & 1 & 0 \\ 3 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}

Solution:
  x0 = a
  x1 = b
  x2 = c - 2b
  x3 = d - 3a

Use case:
In direct methods for solving a sparse linear system Ax = b, A is first decomposed into LU, and LUx = b is then solved by calling two sparse triangular solves, Ly = b and Ux = y.
In iterative solvers, the incomplete LU preconditioner uses sparse triangular solves in a similar way.
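The worked example can be checked with a few lines of forward substitution. This is a minimal dense sketch, not the solver from the talk. The function name `forward_substitution` and the numeric values chosen for a, b, c and d are assumptions made only for illustration.

```python
def forward_substitution(L, rhs):
    """Solve L x = rhs for a dense lower triangular matrix L (list of rows)."""
    n = len(rhs)
    x = [0.0] * n
    for i in range(n):
        # Subtract the contributions of the already computed unknowns x[0..i-1].
        s = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (rhs[i] - s) / L[i][i]
    return x


# The 4x4 example matrix from the slide (nnzL = 6).
L = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 2, 1, 0],
    [3, 0, 0, 1],
]

# Illustrative values for the right-hand side entries a, b, c, d.
a, b, c, d = 5.0, 7.0, 20.0, 30.0
x = forward_substitution(L, [a, b, c, d])
print(x)  # x0 = a, x1 = b, x2 = c - 2b, x3 = d - 3a
```

With these values the result agrees with the closed-form solution on the slide: x2 = 20 - 2*7 and x3 = 30 - 3*5.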

Single core: Sequential method
[Figure: rows 0 to 15 of an example sparse lower triangular system Lx = b]
A sequential method based on CSC layout.
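A sequential solve over a CSC (compressed sparse column) layout can be sketched as follows. This is an assumed, generic formulation rather than the code behind the slide: the names `sptrsv_csc` and `left_sum` are hypothetical, and the sketch assumes that the diagonal entry is stored first in every column.

```python
def sptrsv_csc(n, col_ptr, row_idx, val, b):
    """Sequential sparse lower triangular solve on a CSC matrix.

    Assumes row indices are sorted within each column, so the diagonal
    entry is the first stored entry of the column.
    """
    x = [0.0] * n
    left_sum = [0.0] * n  # accumulated contributions of solved unknowns per row
    for j in range(n):
        start, end = col_ptr[j], col_ptr[j + 1]
        x[j] = (b[j] - left_sum[j]) / val[start]
        # Scatter x[j] to every row below the diagonal in column j.
        for p in range(start + 1, end):
            left_sum[row_idx[p]] += val[p] * x[j]
    return x


# CSC form of the 4x4 example matrix used earlier in the deck.
col_ptr = [0, 2, 4, 5, 6]
row_idx = [0, 3, 1, 2, 2, 3]
val = [1.0, 3.0, 1.0, 2.0, 1.0, 1.0]

x = sptrsv_csc(4, col_ptr, row_idx, val, [5.0, 7.0, 20.0, 30.0])
print(x)
```

The column-oriented loop is why CSC suits this method: once x[j] is known, all of its uses sit contiguously in column j.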

A few cores: Level-set method
[Figure: rows 0 to 15 of the example matrix grouped into Level 0 to Level 3, with the rows of each level shared among Thread 0, Thread 1 and Thread 2]
Parallel within each level, sequential between levels.
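The level-set idea can be sketched in two steps: group the rows into levels, then solve level by level. The code below is a minimal sketch under assumptions: it uses a CSR layout and the hypothetical names `build_levels` and `solve_by_levels`, and it runs the rows of a level in a plain loop where a real implementation would hand them to threads and place a barrier after each level.

```python
def build_levels(n, row_ptr, col_idx):
    """Group rows so that every row depends only on rows in earlier levels."""
    level_of = [0] * n
    for i in range(n):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[p]
            if j != i:
                level_of[i] = max(level_of[i], level_of[j] + 1)
    levels = [[] for _ in range(max(level_of) + 1)]
    for i in range(n):
        levels[level_of[i]].append(i)
    return levels


def solve_by_levels(levels, row_ptr, col_idx, val, b):
    x = [0.0] * len(b)
    for rows in levels:      # sequential between levels
        for i in rows:       # rows of one level are independent (parallelizable)
            s, diag = 0.0, None
            for p in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[p]
                if j == i:
                    diag = val[p]
                else:
                    s += val[p] * x[j]
            x[i] = (b[i] - s) / diag
    return x


# CSR form of the 4x4 example matrix used earlier in the deck.
row_ptr = [0, 1, 2, 4, 6]
col_idx = [0, 1, 1, 2, 0, 3]
val = [1.0, 1.0, 2.0, 1.0, 3.0, 1.0]

levels = build_levels(4, row_ptr, col_idx)
x = solve_by_levels(levels, row_ptr, col_idx, val, [5.0, 7.0, 20.0, 30.0])
print(levels, x)
```

For the small example, rows 0 and 1 form the first level and rows 2 and 3 the second, so only one synchronization point separates them.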

More cores: P2P method (CPU/MIC)
[Figure: dependency graph of the example, Level 0 to Level 3]
• No full synchronization
• Only synchronize between Thread 0 and Thread 2
Park J, et al. Sparsifying synchronization for high-performance shared-memory sparse triangular solver[C]. International Supercomputing Conference. Springer, Cham, 2014: 124-140.
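The contrast with a full barrier can be illustrated with per-row flags: a thread waits only for the specific unknowns it needs. The sketch below is an assumption-laden illustration, not the algorithm of Park et al., which additionally prunes redundant dependencies. The name `sptrsv_p2p`, the round-robin row assignment and the use of `threading.Event` as the flag are all choices made here.

```python
import threading


def sptrsv_p2p(n, row_ptr, col_idx, val, b, num_threads=2):
    """CSR lower triangular solve with point-to-point waits instead of barriers."""
    x = [0.0] * n
    done = [threading.Event() for _ in range(n)]  # done[j] is set once x[j] is final

    def worker(tid):
        # Rows are assigned round-robin and processed in increasing order.
        for i in range(tid, n, num_threads):
            s, diag = 0.0, None
            for p in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[p]
                if j == i:
                    diag = val[p]
                else:
                    done[j].wait()  # wait only for the producer of x[j]
                    s += val[p] * x[j]
            x[i] = (b[i] - s) / diag
            done[i].set()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x


# CSR form of the 4x4 example matrix used earlier in the deck.
row_ptr = [0, 1, 2, 4, 6]
col_idx = [0, 1, 1, 2, 0, 3]
val = [1.0, 1.0, 2.0, 1.0, 3.0, 1.0]

x = sptrsv_p2p(4, row_ptr, col_idx, val, [5.0, 7.0, 20.0, 30.0])
print(x)
```

Because every row depends only on lower-numbered rows and each thread walks its rows in increasing order, the waits cannot form a cycle.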

More cores: Sync-free method (GPU)
[Figure: dependency graph of the example, Level 0 to Level 3]
• Thread 0 and Thread 2 modify the same value by atomic operations.
Liu W, et al. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves[C]. European Conference on Parallel Processing. Springer International Publishing, 2016: 617-630.
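The dependency counting behind a synchronization-free solve can be simulated on a single thread. The sketch below is only a sequential stand-in under assumptions: on a GPU each column would be handled by its own thread group that busy-waits on its counter, and the two updates marked in the comments would be atomic operations. The names `sptrsv_syncfree_sim`, `pending` and `left_sum` are hypothetical.

```python
from collections import deque


def sptrsv_syncfree_sim(n, col_ptr, row_idx, val, b):
    """Sequential simulation of counter-driven SpTRSV on a CSC matrix.

    Assumes the diagonal entry is stored first in every column.
    """
    # pending[i] = number of off-diagonal entries in row i still to be applied.
    pending = [0] * n
    for j in range(n):
        for p in range(col_ptr[j] + 1, col_ptr[j + 1]):
            pending[row_idx[p]] += 1

    left_sum = [0.0] * n
    x = [0.0] * n
    ready = deque(i for i in range(n) if pending[i] == 0)
    while ready:
        j = ready.popleft()
        x[j] = (b[j] - left_sum[j]) / val[col_ptr[j]]
        for p in range(col_ptr[j] + 1, col_ptr[j + 1]):
            i = row_idx[p]
            left_sum[i] += val[p] * x[j]  # atomic add in a parallel version
            pending[i] -= 1               # atomic decrement in a parallel version
            if pending[i] == 0:
                ready.append(i)
    return x


# CSC form of the 4x4 example matrix used earlier in the deck.
col_ptr = [0, 2, 4, 5, 6]
row_idx = [0, 3, 1, 2, 2, 3]
val = [1.0, 3.0, 1.0, 2.0, 1.0, 1.0]

x = sptrsv_syncfree_sim(4, col_ptr, row_idx, val, [5.0, 7.0, 20.0, 30.0])
print(x)
```

No level information is needed here: a row becomes solvable as soon as its counter reaches zero, whatever order the columns finish in.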

Background
Problem: Sparse Triangular Solve
Architecture: Sunway Processor
[Figure: the 16-row example matrix (rows 0 to 15) next to the Sunway processor]

Sunway TaihuLight: Overview
Entire System
Peak Performance          125 PFlops
Linpack Performance       93 PFlops / 74.4%
Total Memory              1310.72 TB
Total Memory Bandwidth    5591.45 TB/s
# nodes                   40,960
# cores                   10,649,600
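The system figures are consistent with each other, which a short calculation can confirm. The sketch assumes that each node holds one SW26010 processor with four core groups, each made of one MPE and an 8*8 CPE mesh, and that TB is converted to GB with a factor of 1000. The variable names are illustrative.

```python
nodes = 40_960
core_groups_per_node = 4        # assumption: one SW26010 per node
cpes_per_core_group = 8 * 8     # 8*8 CPE mesh
mpes_per_core_group = 1

cores_per_node = core_groups_per_node * (cpes_per_core_group + mpes_per_core_group)
total_cores = nodes * cores_per_node

linpack_efficiency = 93 / 125   # Linpack PFlops over peak PFlops

# Per-node shares of the system totals (decimal units assumed).
memory_per_node_gb = 1310.72 * 1000 / nodes
bandwidth_per_node_gbs = 5591.45 * 1000 / nodes

print(cores_per_node, total_cores)
print(f"{linpack_efficiency:.1%}")
print(f"{memory_per_node_gb:.1f} GB, {bandwidth_per_node_gbs:.1f} GB/s per node")
```

Under these assumptions the core count and the 74.4% Linpack efficiency in the table are reproduced exactly.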

SW26010 Processor
[Figure: block diagram of the SW26010 processor. Four core groups (Core Group 0 to Core Group 3) are connected by a NoC; each core group contains an MPE, an 8*8 CPE mesh, a PPU and an iMC attached to its own memory. Inside the CPE mesh, the computing cores are linked by a row communication bus and a column communication bus, together with a data transfer network, a control network and a Transfer Agent (TA). Each computing core spans a memory level, an LDM level (SPM/LDM), a register level and a computing level.]
Direct Memory Access (DMA): 22.6 GB/s
Global Load/Store (Gload/Gstore): 1.5 GB/s

Register Communication
[Figure: CPEs exchanging data through Put, Get R and Get C buffers; putr → getr along a row, putc → getc along a column]

    // P2P Test
    if (id % 2 == 0)
        while (1) putr(data, id + 1);
    else
        while (1) getr(&data);

Latency: less than 11 cycles
Integrated bandwidth: 637 GB/s
Xu, Zhigeng, James Lin, and Satoshi Matsuoka. "Benchmarking SW26010 Many-Core Processor." Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE, 2017.

SW26010 Processor
• Manual cache system (SPM)
• Direct memory access (DMA)
• Limited register communication

Mismatch between SpTRSV and Sunway
• Manual cache system
  • Branch code is needed to check whether each access hits or misses the cache
  • The cost of the branch is high
  • Costly even on a cache hit
  • Hurts the instruction pipeline
  • Difficult to prefetch
• Direct memory access
• Register communication
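The cost described above comes from the tag check that a software-managed cache must run on every access. The sketch below is a toy direct-mapped cache written for illustration only: the class `SoftwareCache`, its parameters and the list slice that stands in for a DMA transfer are assumptions, not the Sunway runtime.

```python
class SoftwareCache:
    """Toy direct-mapped, software-managed cache over a flat memory list."""

    def __init__(self, memory, num_lines=4, line_size=4):
        self.memory = memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines
        self.lines = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def load(self, addr):
        block = addr // self.line_size
        slot = block % self.num_lines
        # This branch is executed on every access, even when the data is cached.
        if self.tags[slot] != block:
            self.misses += 1
            start = block * self.line_size
            # Stand-in for a DMA transfer from main memory into the scratchpad.
            self.lines[slot] = self.memory[start:start + self.line_size]
            self.tags[slot] = block
        else:
            self.hits += 1
        return self.lines[slot][addr % self.line_size]


memory = list(range(64))

regular = SoftwareCache(memory)
regular_values = [regular.load(a) for a in range(8)]          # contiguous accesses

irregular = SoftwareCache(memory)
irregular_values = [irregular.load(a) for a in [0, 17, 33, 1, 49, 2]]  # scattered

print(regular.hits, regular.misses)
print(irregular.hits, irregular.misses)
```

Contiguous accesses mostly hit, while the scattered pattern, which is closer to what SpTRSV generates, misses on every access in this toy setup and still pays the branch each time.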

Mismatch between SpTRSV and Sunway
• Manual cache system
• Direct memory access
• Register communication
Limitation of register communication: it can only happen between CPEs in the same row or the same column.
[Figure: CPE (0,0), CPE (0,1), CPE (1,0) and CPE (1,1) forming a communication cycle]
Communication cycle + random communication size ≈ deadlock
Lin H, et al. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores[C]. Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 2017: 635-645.
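The row-or-column restriction means that a message between two CPEs on a diagonal needs an intermediate CPE. The sketch below only models that constraint on (row, column) coordinates. The helper names `can_communicate` and `relay_route`, and the choice to relay along the source row first, are assumptions for illustration and do not address the deadlock issue itself.

```python
def can_communicate(src, dst):
    """Register communication is possible only within one row or one column."""
    return src != dst and (src[0] == dst[0] or src[1] == dst[1])


def relay_route(src, dst):
    """Return the CPEs a message visits, relaying once if needed."""
    if src == dst:
        return [src]
    if can_communicate(src, dst):
        return [src, dst]
    # Move along the source row first, then along the destination column.
    relay = (src[0], dst[1])
    return [src, relay, dst]


print(can_communicate((0, 0), (0, 1)))  # same row
print(can_communicate((0, 0), (1, 1)))  # different row and column
print(relay_route((0, 0), (1, 1)))
```

Relaying like this is what creates chains of dependent transfers between CPEs, and chains that close into a cycle are the deadlock risk named on the slide.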
