Reducing Communication in Sparse Matrix Operations
2018 Blue Waters Symposium
Luke Olson, Department of Computer Science, University of Illinois at Urbana-Champaign
Collaborators on this allocation:
• Amanda Bienz, University of Illinois at Urbana-Champaign
• Bill Gropp, University of Illinois at Urbana-Champaign
• Andrew Reisner, University of Illinois at Urbana-Champaign
• Lukas Spies, University of Illinois at Urbana-Champaign
Sparse Matrix Operations
• Application areas: time stepping, linear systems, PCA / clustering
• Kernels: C ← A ∗ B, w ← A⁻¹ v, eigen-analysis, C ← R ∗ A ∗ Rᵀ, w ← A ∗ v
• Common building block: the sparse matrix-vector multiplication (SpMV)
(Figures: XPACC @ Illinois, MD Anderson, Fischer @ Illinois, QMCpack)
What is this talk about? (Why it matters)
Iterative method for solving Ax = b (a preconditioned conjugate gradient iteration):
while not converged:
  α ← ⟨r, z⟩ / ⟨Ap, p⟩        (dot products; CA algorithms, see Eller/Gropp)
  x ← x + α p
  r⁺ ← r − α Ap               (SpMV)
  z⁺ ← precond(r⁺)            (2…10…100 SpMVs)
  β ← ⟨r⁺, z⁺⟩ / ⟨r, z⟩
  p ← z⁺ + β p
• 10s, 100s, 1000s, … of SpMVs in a computation
• SpMV is a major kernel, but has limited efficiency and limited scalability
• Use machine layout (nodes) on Blue Waters to reduce communication
• Use consistent timings on Blue Waters to develop accurate performance models
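The sketch below is a minimal serial transcription of the loop above, assuming a SciPy CSR matrix A, a right-hand side b, and a user-supplied precond callable (all names are illustrative, not RAPtor's API). It is only meant to show that each iteration performs one SpMV, plus whatever SpMVs the preconditioner itself triggers.

```python
import numpy as np
import scipy.sparse as sp

def pcg(A, b, precond, tol=1e-8, maxiter=1000):
    """Preconditioned conjugate gradients: one SpMV (A @ p) per iteration."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for it in range(maxiter):
        Ap = A @ p                      # the SpMV kernel discussed in this talk
        alpha = rz / (Ap @ p)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            return x, it
        z = precond(r)                  # e.g. one AMG V-cycle (many more SpMVs)
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x, maxiter

# example: 1D Poisson with a (trivial) diagonal preconditioner
n = 1000
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)
x, its = pcg(A, b, precond=lambda r: r / A.diagonal())
```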
Anatomy of a Sparse Matrix-Vector (SpMV) product, w ← A ∗ v
• Solid blocks: on-process portion of A
• Patterned blocks: off-process portion of A (requires communication of the input vector)
Figure: data layout of w, v, and A across processes P0–P5, and where data is sent
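As a rough sketch of the per-process computation (not RAPtor's implementation), assume this process owns a contiguous block of rows of A and of entries of v, owns the global columns in col_range, and has already received the needed off-process vector entries in recv; all names here are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def local_spmv(A_local, col_range, v_local, recv):
    """One process's share of w = A @ v under a row-wise partition.

    A_local   : the rows of A owned by this process (CSR)
    col_range : (lo, hi) global column indices owned locally
    v_local   : locally owned entries of v
    recv      : {global column index -> value} received from other processes
    """
    lo, hi = col_range
    A_csc = A_local.tocsc()
    on_proc = A_csc[:, lo:hi]                        # solid block: local columns
    # columns outside [lo, hi) that hold nonzeros must be communicated
    needed = [j for j in range(A_local.shape[1])
              if (j < lo or j >= hi) and A_csc.indptr[j] != A_csc.indptr[j + 1]]
    off_proc = A_csc[:, needed]                      # patterned block
    v_ghost = np.array([recv[j] for j in needed])    # values received via MPI
    return on_proc @ v_local + off_proc @ v_ghost

# toy usage: this "process" owns rows/columns 0-1 of a 4x4 matrix
A_local = sp.csr_matrix(np.array([[2., -1., 0., 0.],
                                  [-1., 2., -1., 0.]]))
w_local = local_spmv(A_local, (0, 2), v_local=np.array([1., 2.]),
                     recv={2: 3.0})
```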
Cost of a Sparse Matrix-Vector (SpMV) product
• Basic SpMV: rows-per-process layout
• Modeling is difficult (more later)
Figure: % of time in communication for 500K, 100K, and 50K non-zeros per core (nlpkkt240), for the SpMV and an all-reduce; communication pattern by process ID
Case Study: Preconditioning (Algebraic Multigrid)
• AMG: Algebraic Multigrid iteratively whittles away at the error
• Series or hierarchy of successively smaller (and more dense) sparse matrices
• SpMV dominated: relaxation sweeps x ← x + ω A_ℓ r on each level
Hierarchy (average non-zeros per row grows as the matrices shrink):
• Level 0: A_0, nnz / n_rows = 26
• Level 1: A_1, nnz / n_rows = 30
• Level 2: A_2, nnz / n_rows = 64
• Level 3: A_3, nnz / n_rows = 66
…
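As an illustration only (using PyAMG rather than this project's code), the sketch below builds an AMG hierarchy for a 2D Poisson problem and prints how the average number of non-zeros per row grows as the level matrices shrink; the problem and solver choice are arbitrary.

```python
import pyamg

# Build an example AMG hierarchy and report, per level, the matrix size
# and the average number of non-zeros per row.
A = pyamg.gallery.poisson((500, 500), format='csr')
ml = pyamg.smoothed_aggregation_solver(A)

for lvl, level in enumerate(ml.levels):
    Al = level.A
    print(f"Level {lvl}: n_rows = {Al.shape[0]:8d}, "
          f"nnz/row = {Al.nnz / Al.shape[0]:.1f}")
```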
Case Study: Preconditioning (Algebraic Multigrid)
• MFEM discretization, linear elasticity
• 8192 cores, 512 nodes, 10k dof / core
• Smaller matrices == more communication
Figure: time (seconds) per level in the AMG hierarchy
Observation 1: message volume between processes
1. High volume of messages, high number of messages
2. Diminishing returns with more communicating cores
3. Off-node > on-node > on-socket
Figures: maximum number of messages and maximum message size (bytes) per AMG level
Observation 2: limits of communication
1. High volume of messages, high number of messages
2. Diminishing returns with more communicating cores
3. Off-node > on-node > on-socket
Max-rate model (see the sketch below):
  T = α + (ppn · s) / min(R_N, ppn · R_B)
where α is the latency, s the message size, ppn the number of processes per node, R_N the node injection bandwidth, and R_B the bandwidth between two processes.
Modeling MPI Communication Performance on SMP Nodes: Is it Time to Retire the Ping Pong Test, Gropp, Olson, Samfass, EuroMPI 2016.
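The max-rate model is simple to evaluate; the sketch below is a direct transcription of the formula, with made-up parameter values for illustration rather than measured Blue Waters numbers.

```python
def max_rate_time(s, ppn, alpha, R_N, R_B):
    """Max-rate model for inter-node communication time.

    s     : message size per process (bytes)
    ppn   : number of communicating processes per node
    alpha : latency (seconds)
    R_N   : node injection bandwidth (bytes/second)
    R_B   : bandwidth between two processes (bytes/second)
    """
    return alpha + (ppn * s) / min(R_N, ppn * R_B)

# illustrative (made-up) parameters, NOT measured Blue Waters values
t = max_rate_time(s=64 * 1024, ppn=16, alpha=2e-6, R_N=6e9, R_B=1e9)
```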
Observation 3: node locality
1. High volume of messages, high number of messages
2. Diminishing returns with more communicating cores
3. Off-node > on-node > on-socket
• Split messages into short, eager, and rendezvous protocols
• Partition into on-socket, on-node, and off-node (see the sketch below)
Figure: time (seconds) vs. number of bytes communicated for on-socket, on-node, network (PPN < 4), and network (PPN ≥ 4)
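A hedged sketch of the two classifications follows; the byte thresholds and the rank-to-node mapping are illustrative assumptions, not the actual Cray MPICH settings or Blue Waters topology.

```python
# Illustrative classification of a message by MPI protocol and by locality.
SHORT_LIMIT = 512          # assumed cutoff for "short" messages
EAGER_LIMIT = 8 * 1024     # assumed eager/rendezvous cutoff

def protocol(nbytes):
    if nbytes <= SHORT_LIMIT:
        return "short"
    if nbytes <= EAGER_LIMIT:
        return "eager"
    return "rendezvous"

def locality(src, dst, procs_per_socket, procs_per_node):
    """Classify a (src, dst) rank pair, assuming ranks are filled
    socket-by-socket and node-by-node."""
    if src // procs_per_node != dst // procs_per_node:
        return "off-node"
    if src // procs_per_socket != dst // procs_per_socket:
        return "on-node"
    return "on-socket"
```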
Anatomy of a node-level SpMV product
Figure: six processes (P0–P5) distributed across three nodes (N0–N2), with the linear system (w, v, A) distributed across the processes
Standard Communication
Figure: direct communication between a core on node n and process q on node m
Standard Communication
Figure: direct communication between process p on node n and a core on node m
New Algorithm: On-Node Communication
Figure: on-node communication on node n involving process p
New Algorithm: Off-Node Communication
Figure: off-node communication from process p on node n to process q on node m
Node-Aware Parallel (NAP) Matrix Operation
1. Redistribute initial values
2. Inter-node communication
3. Redistribute received values
4. On-node communication
5. Local computation with on-process, on-node, and off-node portions of the matrix
Note: step 4 and portions of step 5 can overlap with steps 1, 2, and 3 (a sketch of the node-level gathering follows)
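The mpi4py sketch below illustrates the node-aware pattern, not RAPtor's implementation: values are first gathered onto one representative rank per node, only the representatives exchange data between nodes, and the received data is then redistributed on-node. The fixed-size blocks and the leader-to-leader allgather are simplifying assumptions.

```python
from mpi4py import MPI
import numpy as np

world = MPI.COMM_WORLD

# Node-local communicator: ranks that share memory (i.e., are on the same node).
node = world.Split_type(MPI.COMM_TYPE_SHARED)
on_node_rank = node.Get_rank()

# One representative ("leader") per node; leaders form their own communicator.
leaders = world.Split(0 if on_node_rank == 0 else MPI.UNDEFINED, world.Get_rank())

local_values = np.full(4, world.Get_rank(), dtype='d')   # this rank's contribution

# Steps 1 and 4: gather on-node values onto the node leader (on-node communication).
gathered = node.gather(local_values, root=0)

if on_node_rank == 0:
    send_buf = np.concatenate(gathered)
    # Step 2: inter-node communication, leader-to-leader only
    # (here a toy exchange among all node leaders).
    recv_from_nodes = leaders.allgather(send_buf)
    recv_buf = np.concatenate(recv_from_nodes)
else:
    recv_buf = None

# Step 3: redistribute the received values to the ranks on this node.
recv_buf = node.bcast(recv_buf, root=0)

# Step 5: each rank now computes with its on-process, on-node, and
# off-node data (local SpMV omitted here).
```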
Case Study: Preconditioning (Algebraic Multigrid)
Maximum number of messages sent from any process on 16,384 processes
Figures: maximum off-node and on-node message counts per AMG level, reference SpMV vs. TAPSpMV
Case Study: Preconditioning (Algebraic Multigrid)
Maximum size of messages sent from any process on 16,384 processes
Figures: maximum off-node and on-node message size (bytes) per AMG level, reference SpMV vs. TAPSpMV
Case Study: Preconditioning (Algebraic Multigrid)
Figures: time (seconds) per AMG level (total time) and vs. number of processes (strong scaling), reference SpMV vs. TAPSpMV
Node aware sparse matrix-vector multiplication, Bienz, Gropp, Olson, in review at JPDC, 2018 (arXiv).
Cost analysis on Blue Waters
• Blue Waters provided a unique setting for two aspects:
1. Model MPI queueing times
2. Model network contention
• MPI Irecv message queue search is costly
• Identified a quadratic cost in the number of messages (see the sketch below)
Figure: time (seconds) vs. number of messages communicated, for message sizes from 16 bytes to 262144 bytes
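A hedged sketch of such a queue-search term follows: if each arriving message must be matched against the posted-receive queue, the total matching work grows roughly quadratically with the number of messages. The coefficients are placeholders, not measured Blue Waters values or the paper's fitted parameters.

```python
def queue_search_time(n_msgs, alpha=1e-6, beta=5e-9):
    """Model time spent searching the MPI message queues.

    n_msgs : number of messages communicated
    alpha  : per-message base cost (seconds)        -- assumed value
    beta   : per-queue-entry matching cost (seconds) -- assumed value
    """
    return alpha * n_msgs + beta * n_msgs * n_msgs
```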
Cost analysis on Blue Waters
• Blue Waters provided a unique setting for two aspects:
1. Model MPI queueing times
2. Model network contention
• Network contention is costly
• Identified a hop model (a sketch follows)
Figure: time (seconds) vs. number of messages communicated for groups G0–G3 and message sizes from 16 bytes to 262144 bytes
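A hedged sketch of a hop-based contention term: the cost of a message is assumed to grow with the number of network links (hops) it traverses, since each traversed link is shared with other traffic. The functional form and all coefficients are illustrative assumptions, not the published Blue Waters model.

```python
def contention_time(nbytes, n_hops, alpha=2e-6, inv_rate=2e-10, per_hop=5e-11):
    """Model point-to-point time with a per-hop contention penalty.

    nbytes   : message size in bytes
    n_hops   : number of network links the message traverses
    alpha    : latency (seconds)                  -- assumed value
    inv_rate : seconds per byte on an idle link   -- assumed value
    per_hop  : extra seconds per byte per hop     -- assumed value
    """
    return alpha + nbytes * (inv_rate + per_hop * n_hops)
```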
Cost analysis on Blue Waters
• Blue Waters provided a unique setting for two aspects:
1. Model MPI queueing times
2. Model network contention
Figure: measured time (seconds) per level in the AMG hierarchy vs. the modeled queue-search, max-rate, and contention contributions
Improving Performance Models for Irregular Point-to-Point Communication, Bienz, Gropp, Olson, in review at EuroMPI, 2018.
Summary and Ongoing Work
• Drop-in replacement for a range of sparse matrix operations (SpMV, SpMM, MIS(k), assembly operations, etc.)
• Blue Waters instrumental in testing at scale, reproducible outcomes, and accurate performance analysis
• (This) code base: https://github.com/lukeolson/raptor
• Structured code base: https://github.com/cedar-framework/cedar
This research is part of the Blue Waters sustained petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.