GiMMiK: Generating Bespoke Matrix Multiplication Kernels
F.D. Witherden, B.D. Wozniak, F.P. Russell, P.E. Vincent, P.H.J. Kelly
Department of Aeronautics & Department of Computer Science, Imperial College London
Motivation
• Computational fluid dynamics (CFD) is the bedrock of several high-tech industries.
• Practitioners desire unsteady, scale-resolving simulations in the vicinity of complex geometries.
• Not viable with current-generation CFD.
Motivation
• Our solution is PyFR, the Py being for Python.
• Runs on clusters of NVIDIA GPUs.
• Uses the flux reconstruction (FR) approach to solve the compressible Navier-Stokes equations on mixed unstructured grids in 2D/3D.
Motivation
• FR has a variety of desirable numerical properties:
• completely explicit;
• halo-type exchanges between elements;
• majority of operations can be cast as large matrix-matrix multiplications.
Motivation
• Runtime of PyFR is hence dominated by calls to GEMM.
[Chart: percentage of runtime spent in GEMM vs other operations for polynomial orders p = 2, 3, 4.]
• To speed up PyFR we therefore need to beat cuBLAS!
Motivation
• Have data at one set of points and want to interpolate it to another.
[Figure: the interpolation is effected by a constant matrix M.]
Motivation
• In a tensor-product element, points can align.
Motivation • Consider the two highlighted blue points. • These line up with the three interior points.
Motivation
• Hence, the entries in M for these two points depend only on some of the interior points.
• This introduces sparsity into M.
Putting the G in GEMM
• The G in GEMM stands for general: C := αAB + βC.
• But in the case of FR we know things BLAS doesn't.
What We Know: Shape
• Multiplications are of the block-by-panel variety: C (M × N) = A (M × K) B (K × N),
• where N ~ 10^5 and N ≫ (M, K).
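To make the shapes concrete, here is a minimal NumPy sketch; the sizes are illustrative assumptions, not values taken from PyFR:

    import numpy as np

    # Block-by-panel: a small constant operator A applied to a wide panel B.
    M, K, N = 24, 15, 100_000    # illustrative sizes; in practice N >> M, K
    A = np.random.rand(M, K)     # small FR operator, fixed at start-up
    B = np.random.rand(K, N)     # panel of state data, one column per point
    C = A @ B                    # result C is M x N

Since the N columns of B are independent, the multiplication parallelises naturally with one GPU thread per column.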
What We Know: Size
• Dimensions of A are quantised.
[Figure: example operator dimensions, including 3, 6, 64, 96, 343 and 1029.]
• Around ~100 different sizes occur in practice.
What We Know: Values
• Entries of A are constant.
[Figure: an example operator matrix with fixed numerical entries.]
What We Know: Sparsity
• A can sometimes exhibit sparsity.
[Figure: the same example matrix with its zero entries blanked out, highlighting the sparsity pattern.]
Interlude on cuSPARSE
• cuSPARSE provides cusparseDcsrmm.
• However, it consistently underperforms straight cuBLAS.
[Chart: relative throughput of cuSPARSE vs cuBLAS for A(150, 125) with 4% non-zero entries.]
Knowledge Exploitation
• Leveraging the size we can avoid inefficient cleanup code;
• leveraging the values we can save loads from memory;
• …and exploit any sparsity to reduce FLOPs.
Generating Kernels
• Given an A, generate at runtime a kernel for performing C := αAB + βC.
• Readily accomplished using Python and PyCUDA.
• We call our solution for this GiMMiK: Generator of Matrix Multiplication Kernels.
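As a rough sketch of how such a generator might look in Python with PyCUDA (the function name gen_mm_source and the kernel signature are illustrative assumptions, not GiMMiK's actual API; α = 1 and β = 0 are assumed for brevity):

    import numpy as np
    import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
    from pycuda.compiler import SourceModule

    def gen_mm_source(a, tol=1e-12):
        # Emit one fully unrolled statement per output row; entries of A
        # below tol are dropped, so sparsity reduces loads and FLOPs.
        body = []
        for i, row in enumerate(a):
            terms = [f'{v:.17g}*b[{j}*bstride + idx]'
                     for j, v in enumerate(row) if abs(v) > tol]
            body.append(f'        c[{i}*cstride + idx] = '
                        f'{" + ".join(terms) or "0.0"};')
        body = '\n'.join(body)

        return ('__global__ void mm(const double* __restrict__ b,\n'
                '                   double* __restrict__ c,\n'
                '                   int width, int bstride, int cstride)\n'
                '{\n'
                '    int idx = blockDim.x*blockIdx.x + threadIdx.x;\n'
                '    if (idx < width)\n'
                '    {\n'
                + body + '\n'
                '    }\n'
                '}\n')

    # The example operator from the next slide
    a = np.array([[0.0, 0.0, 0.59097691],
                  [0.63448574, 0.0, 0.0],
                  [0.0, 0.71191878, 0.95941663]])

    # JIT-compile the bespoke kernel
    mm = SourceModule(gen_mm_source(a)).get_function('mm')

Because the entries of A are baked in as literals, zero entries never generate any code at all, so sparsity translates directly into fewer loads and fewer FLOPs.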
GiMMiK In Action
• As an example take A as:

    A = [ 0.0         0.0         0.59097691 ]
        [ 0.63448574  0.0         0.0        ]
        [ 0.0         0.71191878  0.95941663 ]

• and α = 1 and β = 0.
GiMMiK In Action

__global__ void gimmik_mm(const double* __restrict__ b,
                          double* __restrict__ c,
                          const int width, const int bstride,
                          const int cstride)
{
    // One thread per column of B (and hence of C)
    int index = blockDim.x * blockIdx.x + threadIdx.x;

    if (index < width)
    {
        const double *b_local = b + index;
        double *c_local = c + index;

        // Load each required row of B exactly once
        const double subterm_0 = b_local[2 * bstride];
        const double subterm_1 = b_local[0 * bstride];
        const double subterm_2 = b_local[1 * bstride];

        // Fully unrolled dot products; zero entries of A are absent
        c_local[0 * cstride] = 0.5909769053580467 * subterm_0;
        c_local[1 * cstride] = 0.6344857400767476 * subterm_1;
        c_local[2 * cstride] = 0.9594166286064713 * subterm_0
                             + 0.7119187815275971 * subterm_2;
    }
}
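Two things are worth noting about the generated kernel above: the zero entries of A have vanished entirely, and the loads from b are hoisted into the subterm_* variables, so a value feeding more than one output row (here b_local[2 * bstride], which feeds rows 0 and 2) is read from memory only once. Each thread handles a single column of B, matching the N ≫ (M, K) block-by-panel shape.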
Benchmarks
• Average speedup over cuBLAS.
• Two cases: β = 0 and β ≠ 0.
[Chart: average speedups in single and double precision on a Tesla K40c and a GTX 780 Ti, ranging from 3.832× up to 63.300×.]
Performance Analysis: K40c
• Most sparse kernels are bandwidth bound.
• But 40% of peak FLOPs is possible for denser cases.
[Charts: % of peak memory bandwidth against sparsity, and % of peak FLOPs against size.]
Performance Analysis: GTX 780 Ti
• Speedup for dense matrices limited by FLOPs.
[Charts: % of peak memory bandwidth against sparsity, and % of peak FLOPs against size.]
Profiling: Register Pressure
[Chart: speedup against useful memory bandwidth [%] in single and double precision on a Tesla K40c and a GTX 780 Ti.]
Speedup in PyFR
• Runtime for a benchmark flow problem: 20 days with cuBLAS vs 11.5 days with GiMMiK.
Takeaway Messages
• GiMMiK can outperform cuBLAS when A is:
• small, on account of reduced overheads;
• or relatively sparse;
• especially true for fp64 on consumer-grade hardware.
Further Information • Journal paper under review in Comput. Phys. Commun.
Summary
• You can beat BLAS.
• Funded and supported by [sponsor logos].
• Any questions?
• E-mail: freddie.witherden08@imperial.ac.uk