Seminar on GPGPU Programming: Optimising Matrix Multiplications with CUDA Axel Eirola 28.01.2010
Table of Contents Introduction Multiplication with CPU Naive implementation IT++ Multiplication with CUDA Naive implementation Using Shared Memory Optimising Block Size CUBLAS Discussion
Introduction ◮ Matrix multiplication for square, uniformly random matrices ◮ C = AB where A, B, C ∈ R^(n×n) ◮ Synthetic benchmarking, since we do not know anything about the matrices ◮ In real-life problems we usually have information about the matrix: it can be symmetric or orthogonal, or have some other pattern which can be exploited in the computations
About the benchmarks ◮ Executed on miranda ◮ CPU code in C++, GPU code in CUDA ◮ Measurements are the average of 5 runs after one warm-up run ◮ Calculations performed in single-precision floating point ◮ Only the actual calculation is timed, no allocation or copying between host and device ◮ Matrices of sizes 100 × 100 (40 KB) to 4000 × 4000 (64 MB) were used
Naive CPU implementation ◮ Simple "by definition" implementation ◮ Loops through the elements of the output matrix C and calculates each element separately (see the sketch below) ◮ No multithreading, no smart fetching of elements from A and B
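A minimal sketch of what such a "by definition" triple loop looks like for row-major n × n matrices (function and variable names are illustrative, not the original benchmark code):

    // Naive C = A * B for n x n row-major matrices: one triple loop,
    // each output element computed independently and written once.
    void matmul_naive_cpu(const float* A, const float* B, float* C, int n) {
        for (int row = 0; row < n; ++row) {
            for (int col = 0; col < n; ++col) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += A[row * n + k] * B[k * n + col];
                C[row * n + col] = sum;
            }
        }
    }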
Benchmarks Figure: Naive CPU implementation (time in ms vs. matrix width, log-log scale; series: CPU Naive)
BLAS library IT++ ◮ A general-purpose linear algebra and signal processing library for C++ (usage sketched below) ◮ Utilizes underlying BLAS implementations ◮ Seems to do multithreading and smarter memory management ◮ Does not seem to use Strassen's (or any other sub-cubic) matrix multiplication algorithm
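For reference, multiplying two uniformly random matrices with IT++ is essentially a one-liner; this is a hedged sketch assuming the itpp::mat type and itpp::randu generator, not the original benchmark code:

    #include <itpp/itbase.h>

    int main() {
        const int n = 1000;
        // Uniformly random n x n matrices, as in the benchmarks.
        itpp::mat A = itpp::randu(n, n);
        itpp::mat B = itpp::randu(n, n);
        itpp::mat C = A * B;   // operator* dispatches to the underlying BLAS
        return 0;
    }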
Benchmarks Figure: IT++ library (time in ms vs. matrix width; series: CPU Naive, IT++)
Naive GPU implementation ◮ Trivial reimplementation of the naive CPU code in CUDA ◮ Replaces the loops with threading: one thread is created for each element of the output matrix C ◮ All data is retrieved from the global memory of the GPU (see the kernel sketch below)
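A minimal sketch of such a kernel, with one thread per output element and all operands read straight from global memory (names are illustrative, not the original code):

    // Naive CUDA kernel: thread (row, col) computes one element of C,
    // reading a full row of A and a full column of B from global memory.
    __global__ void matmul_naive(const float* A, const float* B, float* C, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];
            C[row * n + col] = sum;
        }
    }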
Benchmarks Figure: Naive GPU implementation (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive)
Speed it up with Shared Memory ◮ The naive GPU implementation only used global memory for accessing matrices A and B ◮ Since each element is accessed multiple times, it is faster to store the elements somewhere close, such as the SM (streaming multiprocessor) shared memory ◮ Give each thread block the responsibility to calculate one block of the output matrix C ◮ Store the data needed to calculate that block in shared memory (see the kernel sketch below)
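A sketch of the tiled approach described above, assuming n is a multiple of BLOCK_SIZE (illustrative, simplified; not the exact benchmark code):

    // Each thread block computes one BLOCK_SIZE x BLOCK_SIZE tile of C,
    // staging the matching tiles of A and B in shared memory so every
    // global-memory element is loaded only once per tile.
    #define BLOCK_SIZE 16

    __global__ void matmul_shared(const float* A, const float* B, float* C, int n) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;

        for (int m = 0; m < n / BLOCK_SIZE; ++m) {
            // Each thread loads one element of the current A tile and B tile.
            As[threadIdx.y][threadIdx.x] = A[row * n + (m * BLOCK_SIZE + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(m * BLOCK_SIZE + threadIdx.y) * n + col];
            __syncthreads();

            for (int k = 0; k < BLOCK_SIZE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = sum;
    }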
Benchmarks Figure: Naive matrix multiplication (matrices A, B and C; one thread per element (row, col) of C, reading a full row of A and a full column of B)
Benchmarks Figure: Matrix multiplication with shared memory (C computed one BLOCK_SIZE × BLOCK_SIZE sub-matrix Csub at a time; each thread block loads the corresponding blocks of A and B)
Benchmarks Figure: GPU using shared memory (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive, + Shared Memory)
What can we do with block size ◮ The block size is the number of threads in one thread block, which executes on a single SM (streaming multiprocessor) ◮ The total number of threads stays constant ◮ But the amount of data kept in the shared memory of the SM increases, decreasing the number of costly accesses to global memory ◮ Block size is limited to 22, since the maximum number of threads in one block is 512 (22² = 484 ≤ 512 < 23² = 529); see the launch sketch below
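As a sketch, the host-side launch configuration then uses 22 × 22 threads per block, assuming the hypothetical matmul_shared kernel above is recompiled with BLOCK_SIZE set to 22 and n is a multiple of BLOCK_SIZE:

    // With BLOCK_SIZE raised to 22 at compile time: 22 x 22 = 484 threads
    // per block, just under the 512-threads-per-block limit of this hardware.
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(n / BLOCK_SIZE, n / BLOCK_SIZE);   // one thread block per tile of C
    matmul_shared<<<grid, threads>>>(d_A, d_B, d_C, n);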
Benchmarks Figure: GPU with larger blocksize (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive, + Shared Memory, + large blocksize)
CUBLAS library ◮ A C library provided by nVidia implementing the BLAS (Basic Linear Algebra Subprograms) specification ◮ Could not find documentation of what it actually does internally, but it clearly does something clever (a minimal usage sketch follows below)
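For context, a single-precision GEMM through CUBLAS is one call once the data is on the device; this sketch uses the current handle-based cublas_v2 API (the 2010-era library exposed an older handle-free interface, but the call is analogous):

    #include <cublas_v2.h>

    // d_A, d_B, d_C are device pointers to n x n matrices already on the GPU.
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // CUBLAS assumes column-major storage; for a synthetic benchmark of
    // random matrices the storage layout does not affect the timing.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cublasDestroy(handle);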
Benchmarks Figure: CUBLAS library implementation (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive, + Shared Memory, + large blocksize, CUBLAS)
Benchmarks Figure: This is interesting (same series as above; note the spikes in the CUBLAS curve)
Benchmarks (Zoomed) Figure: Zoom on spikes (CUBLAS only; time in ms vs. matrix widths 1008 to 1200)
◮ CUBLAS is about twice as fast when the width of the matrix is divisible by 16 ◮ Noticed by O. Schenk et al. in Algorithmic performance studies on graphics processing units, stating that: "When the matrix is not divisible by 16, there are conflicts in shared memory regarding multiple threads accessing the same bank at the same time. This forces one thread to be put in a queue while the other thread is accessing the memory, increasing the amount of time for all memory accesses to be completed." ◮ The question is: why aren't the smaller matrices padded to become divisible by 16? (A padding sketch follows below.)
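The padding asked about in the last point would be cheap to do on the host side; a minimal sketch of the idea, using a hypothetical helper that is not part of CUBLAS (zero-padding keeps the original product in the top-left n × n block):

    #include <cstddef>
    #include <vector>
    #include <algorithm>

    // Copy an n x n row-major matrix into a zero-filled buffer whose width is
    // rounded up to the next multiple of 16, so the GEMM hits the fast path.
    std::vector<float> pad_to_multiple_of_16(const float* M, int n, int& padded_n) {
        padded_n = ((n + 15) / 16) * 16;
        std::vector<float> out(static_cast<std::size_t>(padded_n) * padded_n, 0.0f);
        for (int row = 0; row < n; ++row)
            std::copy(M + row * n, M + (row + 1) * n, out.begin() + row * padded_n);
        return out;
    }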
Profit ratio ◮ Tesla C1060 costs about $1200, and calculates a 2000 × 2000 matrix in 50 ms ◮ Core i7 920 costs about $300, and calculates a 2000 × 2000 matrix in 2000 ms ◮ CUBLAS is about 40 times faster than IT++, while a Tesla costs only about 4 times more than a Core i7 ◮ So the profit ratio becomes tenfold: ($300 × 2000 ms) / ($1200 × 50 ms) = 10
Summary ◮ GPGPU is fast :) ◮ But without proper memory management it isn’t as fast as it could be. ◮ Even the libraries aren’t as fast as they could be