
Matrix reuse and data locality in parallel y = Az and z = A^T x

Exploiting Matrix Reuse and Data Locality in Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Architectures

Kadir Akbudak (2), Ozan Karsavuran (1) (speaker), Cevdet Aykanat (1)
sites.google.com/site/kadircs
kadir.cs@gmail.com

(1) Bilkent University, Turkey
(2) KAUST, KSA

SIAM Workshop on Combinatorial Scientific Computing (CSC), Albuquerque, NM, USA, October 10-12, 2016

O. Karsavuran, K. Akbudak, and C. Aykanat, Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Architectures, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 27(6), pp. 1713-1726, 2016, available at ieeexplore.ieee.org/document/7152923/

1 Introduction: y = AA^T x
2 Open problems & Related work
3 Parallel SpAA^T based on 1D partitioning of A and A^T matrices
    Quality criteria for efficient parallelization of SpAA^T
    Proposed SpAA^T algorithms
    Experiments
4 References

Introduction: y = AA^T x

Thread-level parallelization of y = AA^T x (SpAA^T)

y = AA^T x is computed as two Sparse Matrix-Vector Multiplies (SpMV):
- z = A^T x: Sparse Matrix-Transpose-Vector Multiply (SpA^T), and then
- y = Az: Sparse Matrix-Vector Multiply (SpA)

Thread-level parallelization of repeated and consecutive SpA and SpA^T that involve the same sparse matrix A

Examples:
- Linear Programming (LP) problems via interior point methods
- Nonsymmetric systems via Bi-CG, CGNE, Lanczos bi-orthogonalization
- Least squares problem via LSQR
- Linear feasibility problem via Surrogate Constraints method
- Krylov-based balancing algorithms used as preconditioners for sparse eigensolvers
- Web page ranking via HITS algorithm
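The two-step computation above can be sketched serially as follows. This is a minimal illustration, not the authors' implementation: it assumes A is stored once in CSR form, and the function names (`spmv_csr`, `spmv_transpose_csr`, `sp_aat`) and the small 2x3 example matrix are hypothetical.

```python
def spmv_transpose_csr(indptr, indices, data, x, ncols):
    """z = A^T x, computed from the CSR storage of A by scattering row i into z."""
    z = [0.0] * ncols
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            z[indices[p]] += data[p] * x[i]
    return z


def spmv_csr(indptr, indices, data, v):
    """y = A v, computed row by row from the CSR storage of A."""
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            y[i] += data[p] * v[indices[p]]
    return y


def sp_aat(indptr, indices, data, x, ncols):
    """y = A A^T x as two consecutive SpMVs: z = A^T x, then y = A z."""
    z = spmv_transpose_csr(indptr, indices, data, x, ncols)
    y = spmv_csr(indptr, indices, data, z)
    return y, z


# Hypothetical example: A = [[1, 0, 2], [0, 3, 0]] in CSR form.
indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
y, z = sp_aat(indptr, indices, data, [1.0, 1.0], ncols=3)
print(y, z)
```

In this two-phase form every nonzero of A is read twice, once per SpMV, which is the cost the reuse-oriented schemes in the following slides try to avoid.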

Open problems & Related work

Open problems
- Utilize the opportunity of reusing A-matrix nonzeros?
- Obtain close performance for both z = A^T x and y = Az at the same time?
    - Single storage of A for both z = A^T x and y = Az
    - Storage of A^T for z = A^T x and a separate storage of A for y = Az

Related work
- Optimized Sparse Kernel Interface (OSKI), Berkeley
    - Serial
    - Each row/column is reused.
- Compressed Sparse Blocks (CSB) by Buluc et al. [10]
    - Parallel
    - Same data structure for both SpA and SpA^T operations without any performance degradation
    - Two phase, i.e., SpA and SpA^T are not performed simultaneously
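The idea of reusing A-matrix nonzeros can be sketched as a single serial pass over the columns of A: each column c_j is read once, used to form z_j = c_j^T x, and immediately reused to update y with c_j z_j. This is only an illustration of the reuse idea under the assumption that A is stored in CSC form; it is not the OSKI or CSB interface, and `fused_aat_csc` is a hypothetical name.

```python
def fused_aat_csc(colptr, rowind, vals, x):
    """Compute z = A^T x and y = A z in one sweep over the columns of A (CSC)."""
    nrows = len(x)                 # x has one entry per row of A
    ncols = len(colptr) - 1
    y = [0.0] * nrows
    z = [0.0] * ncols
    for j in range(ncols):
        start, end = colptr[j], colptr[j + 1]
        # First use of column j: inner product with x gives z_j.
        zj = 0.0
        for p in range(start, end):
            zj += vals[p] * x[rowind[p]]
        z[j] = zj
        # Second use of the same column while it is still in cache: y += c_j * z_j.
        for p in range(start, end):
            y[rowind[p]] += vals[p] * zj
    return y, z


# Hypothetical example: A = [[1, 0, 2], [0, 3, 0]] in CSC form.
colptr, rowind, vals = [0, 1, 2, 3], [0, 1, 0], [1.0, 3.0, 2.0]
y, z = fused_aat_csc(colptr, rowind, vals, [1.0, 1.0])
print(y, z)
```

Each z_j is consumed right after it is produced, so both the column's nonzeros and the z entry are reused. A parallel version over column slices would have several threads updating the same y entries, which is the concurrent-write issue the later slides address.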

Parallel SpAA^T based on 1D partitioning of A and A^T matrices

Thread-level baseline parallelization of SpAA^T

[Figure: Four baseline SpAA^T algorithms for computing y = Az after z = A^T x by four threads. Panels: RRp (Row-Row parallel), CCp (Column-Column parallel), CRp (Column-Row parallel), RCp (Row-Column parallel). A is partitioned into row slices R_1, ..., R_4 or column slices C_1, ..., C_4, and A^T is partitioned accordingly. Yellow scale tone: exclusive accesses by a single thread. Red color: concurrent accesses by multiple threads.]

Parallel SpAA^T based on 1D partitioning of A and A^T matrices

Contributions

- Identify five quality criteria (QC), which have impact on performance of parallel SpAA^T
- Singly-bordered block-diagonal (SB) form based methods: sbCRp and sbRCp
    - For sbCRp (SB-based Column-Row parallel algorithm), we permute matrix A into a rowwise SB form, which induces a columnwise SB form of matrix A^T
    - For sbRCp (SB-based Row-Column parallel algorithm), we permute matrix A into a columnwise SB form, which induces a rowwise SB form of matrix A^T
- Achieve (a) (z-vector reuse) and (b) (A-matrix reuse).
- Objectives of minimizing the size of the row/column border in the SB form of A ≈ achieve QC (c), (d), and (e) in sbCRp/sbRCp.

Matrix A partitioned into four and the submatrices are processed by four threads.

Rowwise SB form of A (sbCRp), with y = (y_1, y_2, y_3, y_4, y_B) and z = (z_1, z_2, z_3, z_4):

    [ A_11                   ]
    [       A_22             ]
    [             A_33       ]
    [                   A_44 ]
    [ A_B1  A_B2  A_B3  A_B4 ]

Columnwise SB form of A (sbRCp), with y = (y_1, y_2, y_3, y_4) and z = (z_1, z_2, z_3, z_4, z_B):

    [ A_11                    A_1B ]
    [       A_22              A_2B ]
    [             A_33        A_3B ]
    [                   A_44  A_4B ]

Parallel SpAA^T based on 1D partitioning of A and A^T matrices

Quality criteria for efficient parallelization of SpAA^T

Quality criteria:
(a) Reusing z-vector entries generated in z = A^T x and then read in y = Az
(b) Reusing matrix nonzeros (together with their indices) in z = A^T x and y = Az
(c) Exploiting temporal locality in reading input vector entries in row-parallel SpMVs
(d) Exploiting temporal locality in updating output vector entries in column-parallel SpMVs
(e) Minimizing the number of concurrent writes performed by different threads in column-parallel SpMVs

QC    RRp    CRp    RCp    sbCRp  sbRCp
(a)   ×      ✓      ×      ✓      ✓(1)
(b)   ×      ✓      ×      ✓      ✓(2)
(c)   ×(3)   ×(3)   ×(3)   ✓      ✓
(d)   −      ×(3)   ×(3)   ✓      ✓
(e)   ✓      ×      ×      ✓      ✓

✓: satisfied
✓(1): satisfied except z_B border subvectors
✓(2): satisfied except A_kB border submatrices
×: not satisfied
×(3): may be satisfied through row/column reordering
−: not applicable

[Figure: CRp (Column-Row parallel) with A partitioned into column slices C_1, ..., C_4 and A^T into row slices C_1^T, ..., C_4^T.]

Parallel SpAA^T based on 1D partitioning of A and A^T matrices

Proposed SpAA^T algorithms

- Maintaining balance on the number of nonzeros at each slice
    - Reducing parallel time under arbitrary task scheduling
- Reducing border size
    - Reducing # of cache misses due to loss of temporal locality:
      λ(c_j) = |{ A_k : c_j has at least one nonzero in A_k, k ∈ {1, ..., K} }|
    - Reducing # of concurrent writes:
      λ(r_i) = |{ A_k : r_i has at least one nonzero in A_k, k ∈ {1, ..., K} }|

[Figure: A 6 × 8 matrix A with columns c_1, ..., c_8 (input entries x_1, ..., x_8), partitioned into three; the submatrices are processed by three threads. Two row orderings are shown: r_1, r_2, r_3, r_4, r_5, r_6 with Σ λ(c_j) = 3 + 5, and r_1, r_4, r_2, r_3, r_5, r_6 with Σ λ(c_j) = 3 + 3.]
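The connectivity count λ(c_j) defined above can be computed directly from a nonzero list and a row-to-slice assignment. The sketch below is a hypothetical helper (`connectivity` and the 4 × 3 example pattern are not from the slides); swapping the roles of rows and columns gives λ(r_i) for a columnwise partition.

```python
def connectivity(nonzeros, part_of_row, ncols):
    """Return lambda(c_j) for every column j: the number of distinct row
    slices A_k in which column j has at least one nonzero."""
    slices_touched = [set() for _ in range(ncols)]
    for i, j in nonzeros:
        slices_touched[j].add(part_of_row[i])
    return [len(s) for s in slices_touched]


# Hypothetical 4 x 3 sparsity pattern given as (row, column) pairs;
# rows 0-1 form slice 0 and rows 2-3 form slice 1.
nonzeros = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (3, 2)]
part_of_row = [0, 0, 1, 1]

lam = connectivity(nonzeros, part_of_row, ncols=3)
# Columns with lambda > 1 are shared by more than one slice.
shared_cols = [j for j, count in enumerate(lam) if count > 1]
print(lam, shared_cols)
```

Columns with λ(c_j) > 1 are the ones whose input vector entries are read by more than one thread, so a reordering that lowers the sum of λ values reduces the sharing that the slide associates with cache misses.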

Proposed SpAA^T algorithms: Merits of Singly-Bordered Block Diagonal (SB) Form on CRp

                       CRp                      sbCRp
Concurrent accesses    Whole x and y vectors    Only x_B and y_B subvectors

SB form:
- Exploits temporal locality in reading x-vector entries in row-parallel z = A^T x
- Exploits temporal locality in updating y-vector entries in column-parallel y = Az
- Minimizing number of concurrent writes by different threads in column-parallel y = Az ≈ minimizing border size in the SB form

[Figure: CRp with A in column slices C_1, ..., C_4 next to sbCRp with A in rowwise SB form (diagonal blocks A_11, ..., A_44 and border blocks A_B1, ..., A_B4) and A^T in the induced columnwise SB form.]

Proposed SpAA^T algorithms

sbCRp: SB-based Column-Row parallel

Require: A_kk and A_Bk matrices; x, y, and z vectors
for k ← 1 to K in parallel do
    z_k ← A_kk^T x_k
    z_k ← z_k + A_Bk^T x_B
    y_k ← A_kk z_k
    y_B ← y_B + A_Bk z_k        ⊲ Concurrent writes
end for

The two z_k updates replace z_k ← C_k^T x of CRp; the y_k and y_B updates replace y ← C_k z_k of CRp.

[Figure: Singly-bordered block-diagonal (SB) form for sbCRp: A in rowwise SB form with y = (y_1, ..., y_4, y_B), A^T in columnwise SB form with x = (x_1, ..., x_4, x_B) and z = (z_1, ..., z_4).]

sbRCp: SB-based Row-Column parallel

Require: A_kk and A_kB matrices; x, y, and z vectors
for k ← 1 to K in parallel do
    z_k ← A_kk^T x_k
    z_B ← z_B + A_kB^T x_k      ⊲ Concurrent writes
    y_k ← A_kk z_k
end for
for k ← 1 to K in parallel do
    y_k ← y_k + A_kB z_B
end for

The z_k and z_B updates replace z ← R_k^T x_k of RCp; the two y_k updates replace y_k ← R_k z of RCp.

[Figure: Singly-bordered block-diagonal (SB) form for sbRCp: A in columnwise SB form with z = (z_1, ..., z_4, z_B), A^T in rowwise SB form with x = (x_1, ..., x_4).]
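The sbCRp loop above can be traced with a small serial sketch. Everything here is an illustrative assumption rather than the authors' code: blocks are dense Python lists instead of sparse storage, the `for k` loop runs sequentially where the algorithm uses one thread per block, and the names `sb_crp`, `matvec`, and `mat_t_vec` are hypothetical.

```python
def matvec(M, v):
    """Dense M v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def mat_t_vec(M, v):
    """Dense M^T v without forming the transpose."""
    out = [0.0] * len(M[0])
    for row, vi in zip(M, v):
        for j, a in enumerate(row):
            out[j] += a * vi
    return out


def sb_crp(A_diag, A_border, x_parts, x_B):
    """Serial trace of sbCRp on a rowwise SB form of A.

    A_diag[k] is A_kk and A_border[k] is A_Bk; x_parts[k] is x_k.
    """
    y_B = [0.0] * len(x_B)
    y_parts, z_parts = [], []
    for k in range(len(A_diag)):        # one task per thread in the parallel version
        z_k = mat_t_vec(A_diag[k], x_parts[k])                  # z_k <- A_kk^T x_k
        z_k = [a + b for a, b in zip(z_k, mat_t_vec(A_border[k], x_B))]  # + A_Bk^T x_B
        y_k = matvec(A_diag[k], z_k)                            # y_k <- A_kk z_k
        # In a threaded version this update of the shared y_B needs
        # atomics or a reduction (the "concurrent writes" step).
        y_B = [a + b for a, b in zip(y_B, matvec(A_border[k], z_k))]
        y_parts.append(y_k)
        z_parts.append(z_k)
    return y_parts, y_B, z_parts


# Hypothetical SB form with K = 2:
# A = [[1, 2, 0],
#      [0, 0, 3],
#      [0, 1, 1]]   (last row is the border)
A_diag = [[[1.0, 2.0]], [[3.0]]]
A_border = [[[0.0, 1.0]], [[1.0]]]
y_parts, y_B, z_parts = sb_crp(A_diag, A_border, [[1.0], [1.0]], [1.0])
print(y_parts, y_B, z_parts)
```

Only y_B is shared between iterations; each z_k and y_k is private to block k, which mirrors the slide's point that concurrent accesses are confined to the border subvectors.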
