The Sparse Matrix Vector Product on High-End GPUs
SIAM Conference on Parallel Processing for Scientific Computing (PP20)
February 12–15, 2020, Hyatt Regency Seattle, Seattle, Washington, U.S.
Hartwig Anzt, Terry Cojean, Yuhsiang M. Tsai


  1. The Sparse Matrix Vector Product on High-End GPUs
     SIAM Conference on Parallel Processing for Scientific Computing (PP20), February 12–15, 2020, Hyatt Regency Seattle, Seattle, Washington, U.S.
     Hartwig Anzt, Terry Cojean, Yuhsiang M. (Mike) Tsai, Steinbuch Centre for Computing (SCC)
     KIT – The Research University in the Helmholtz Association, www.kit.edu
     This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and by the Helmholtz Impuls- und Vernetzungsfonds VH-NG-1241.

  2. SpMV on GPUs – Moving away from the NVIDIA hegemony
     • In the past, NVIDIA GPUs dominated the GPGPU market.
     • We see an increasing adoption of AMD GPUs in leadership supercomputers:
       • Frontier at Oak Ridge National Laboratory (2021)
       • El Capitan at Lawrence Livermore National Laboratory (2023?)
     • AMD is heavily investing in the HIP software development ecosystem:
       • HIP programming is similar to CUDA programming;
       • HIP libraries are similar to cuBLAS, cuSPARSE, …
     • The race is on!
       • How can we prepare the Ginkgo sparse linear algebra library for cross-platform portability?
       • Are the CUDA-optimized kernels suitable for AMD GPUs?
       • How does the performance compare across different GPUs?
     Hartwig Anzt: The Sparse Matrix Vector Product on High-End GPUs, 02/13/2020

  3. Extend Ginkgo’s hardware scope to AMD GPUs
     https://github.com/ginkgo-project/ginkgo (part of the xSDK: https://xsdk.info/)
     • The library core contains the architecture-agnostic algorithm implementations: library infrastructure, iterative solvers, preconditioners, …
     • Runtime polymorphism selects the right kernel depending on the target architecture.
     • Architecture-specific kernels execute the algorithm on the target architecture:
       • OpenMP kernels: SpMV, solver kernels, preconditioner kernels, …
       • CUDA-GPU kernels: SpMV, solver kernels, preconditioner kernels, …
       • Reference kernels: SpMV, solver kernels, preconditioner kernels, …
     • The OpenMP and CUDA kernels are optimized architecture-specific kernels; the Reference kernels are sequential kernels used to check the correctness of the algorithm design and of the optimized kernels.

  4. Extend Ginkgo’s hardware scope to AMD GPUs
     • The library core, the runtime polymorphism, and the architecture-specific kernel design remain unchanged; a new HIP backend joins the existing ones.
     • Backends: OpenMP kernels, CUDA-GPU kernels, HIP-GPU kernels, and Reference kernels, each providing SpMV, solver kernels, preconditioner kernels, …
     • The Reference kernels are sequential kernels used to check the correctness of the algorithm design and of the optimized kernels.

  7. Extend Ginkgo’s hardware scope to AMD GPUs
     • To avoid code duplication, a common module contains the kernels shared between CUDA and HIP (instantiated upon parameter configuration).
     • The library core still contains the architecture-agnostic algorithm implementations; runtime polymorphism selects the right kernel depending on the target architecture; architecture-specific kernels execute the algorithm on the target architecture.
     • Backends: OpenMP kernels, CUDA-GPU kernels, HIP-GPU kernels, and Reference kernels, each providing SpMV, solver kernels, preconditioner kernels, …

  8. Extend Ginkgo’s hardware scope to AMD GPUs
     • Kernels shared between the CUDA and HIP backends (upon parameter setting) are relocated into the ``common’’ module.
     • New code is necessary for HIP-specific optimizations and for implementing functionality currently missing in the HIP ecosystem (e.g., cooperative groups).

  9. How does Ginkgo compare to the vendor libraries? – COO SpMV
     • Ginkgo vs. cuSPARSE on the NVIDIA V100
     • Ginkgo vs. hipSPARSE on the AMD Radeon VII
     Results and interactive performance explorer available at: https://ginkgo-project.github.io/gpe/

  10. How does Ginkgo compare to the vendor libraries? – CSR SpMV
      • Ginkgo vs. cuSPARSE on the NVIDIA V100
      • Ginkgo vs. hipSPARSE on the AMD Radeon VII
      Results and interactive performance explorer available at: https://ginkgo-project.github.io/gpe/

  11. How does Ginkgo compare to the vendor libraries? – ELL SpMV
      • Ginkgo vs. cuSPARSE on the NVIDIA V100
      • Ginkgo vs. hipSPARSE on the AMD Radeon VII
      Results and interactive performance explorer available at: https://ginkgo-project.github.io/gpe/

  12. How does Ginkgo compare to the vendor libraries? – hybrid SpMV
      • Ginkgo vs. cuSPARSE on the NVIDIA V100
      • Ginkgo vs. hipSPARSE on the AMD Radeon VII
      Results and interactive performance explorer available at: https://ginkgo-project.github.io/gpe/

  13. Performance Profile on AMD’s Radeon VII
      Results and interactive performance explorer available at: https://ginkgo-project.github.io/gpe/

  14. Performance Profile on NVIDIA’s V100
      Results and interactive performance explorer available at: https://ginkgo-project.github.io/gpe/

  15. Compiling HIP code for NVIDIA GPUs – comparison against native CUDA code
      • Native CUDA vs. HIP compiled for NVIDIA GPUs
      • Same kernel
      • All tests on an NVIDIA V100 (Summit)
      • We expect CUDA to be slightly faster

  16. Compiling HIP code for NVIDIA GPUs – comparison against native CUDA code
      • HIP faster than CUDA on an NVIDIA GPU? Outliers? Machine noise?

  17. Compiling HIP code for NVIDIA GPUs – comparison against native CUDA code
      • Outlier statistics on 100 runs of 20 repetitions each:
