Seminar on GPGPU Programming: Optimising Matrix Multiplications with CUDA Axel Eirola 28.01.2010
Table of Contents Introduction Multiplication with CPU Naive implementation IT++ Multiplication with CUDA Naive implementation Using Shared Memory Optimising Block Size CUBLAS Discussion
Introduction ◮ Matrix multiplication for square, uniformly random matrices ◮ C = AB where A, B, C ∈ R^(n×n) ◮ Synthetic benchmarking, since we do not know anything about the matrices ◮ In real-life problems we usually have information about the matrix: it can be symmetric or orthogonal, or have some other pattern which can be exploited in the computations
About the benchmarks ◮ Executed on miranda ◮ CPU code in C++, GPU code in CUDA ◮ Measurements are the average of 5 runs after one warm-up run ◮ Calculations performed in single-precision floating point ◮ Only the actual calculation is timed, no allocation or copying between host and device ◮ Matrices of sizes 100 × 100 (40 KB) to 4000 × 4000 (64 MB) were used
Naive CPU implementation ◮ Simple "by definition" implementation ◮ Loops through the elements of the output matrix C and calculates each element separately (see the sketch below) ◮ No multithreading, no smart fetching of elements from A and B
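A minimal sketch of what such a "by definition" triple loop looks like for row-major n × n matrices (function and variable names are illustrative, not the original benchmark code):

    // Naive C = A * B for n x n row-major matrices: one triple loop,
    // each output element computed independently and written once.
    void matmul_naive_cpu(const float* A, const float* B, float* C, int n) {
        for (int row = 0; row < n; ++row) {
            for (int col = 0; col < n; ++col) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += A[row * n + k] * B[k * n + col];
                C[row * n + col] = sum;
            }
        }
    }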
Benchmarks Figure: Naive CPU implementation (time in ms vs. matrix width, log-log scale; series: CPU Naive)
BLAS library IT++ ◮ A general-purpose linear algebra and signal processing library for C++ (usage sketched below) ◮ Utilizes underlying BLAS implementations ◮ Seems to do multithreading and smarter memory management ◮ Does not seem to use Strassen's (or any other sub-cubic) matrix multiplication algorithm
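For reference, multiplying two uniformly random matrices with IT++ is essentially a one-liner; this is a hedged sketch assuming the itpp::mat type and itpp::randu generator, not the original benchmark code:

    #include <itpp/itbase.h>

    int main() {
        const int n = 1000;
        // Uniformly random n x n matrices, as in the benchmarks.
        itpp::mat A = itpp::randu(n, n);
        itpp::mat B = itpp::randu(n, n);
        itpp::mat C = A * B;   // operator* dispatches to the underlying BLAS
        return 0;
    }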
Benchmarks Figure: IT++ library (time in ms vs. matrix width; series: CPU Naive, IT++)
Naive GPU implementation ◮ Trivial reimplementation of the naive CPU code in CUDA ◮ Replaces the loops with threading: one thread is created for each element of the output matrix C ◮ All data is retrieved from the global memory of the GPU (see the kernel sketch below)
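A minimal sketch of such a kernel, with one thread per output element and all operands read straight from global memory (names are illustrative, not the original code):

    // Naive CUDA kernel: thread (row, col) computes one element of C,
    // reading a full row of A and a full column of B from global memory.
    __global__ void matmul_naive(const float* A, const float* B, float* C, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];
            C[row * n + col] = sum;
        }
    }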
Benchmarks Figure: Naive GPU implementation (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive)
Speed it up with Shared Memory ◮ The naive GPU implementation only used global memory for accessing matrices A and B ◮ Since each element is accessed multiple times, it is faster to store the elements somewhere close, such as the SM (streaming multiprocessor) shared memory ◮ Give each thread block the responsibility to calculate one block of the output matrix C ◮ Store the data needed to calculate that block in shared memory (see the kernel sketch below)
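A sketch of the tiled approach described above, assuming n is a multiple of BLOCK_SIZE (illustrative, simplified; not the exact benchmark code):

    // Each thread block computes one BLOCK_SIZE x BLOCK_SIZE tile of C,
    // staging the matching tiles of A and B in shared memory so every
    // global-memory element is loaded only once per tile.
    #define BLOCK_SIZE 16

    __global__ void matmul_shared(const float* A, const float* B, float* C, int n) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;

        for (int m = 0; m < n / BLOCK_SIZE; ++m) {
            // Each thread loads one element of the current A tile and B tile.
            As[threadIdx.y][threadIdx.x] = A[row * n + (m * BLOCK_SIZE + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(m * BLOCK_SIZE + threadIdx.y) * n + col];
            __syncthreads();

            for (int k = 0; k < BLOCK_SIZE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = sum;
    }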
Benchmarks Figure: Naive matrix multiplication (matrices A, B and C; one thread per element (row, col) of C, reading a full row of A and a full column of B)
Benchmarks Figure: Matrix multiplication with shared memory (C computed one BLOCK_SIZE × BLOCK_SIZE sub-matrix Csub at a time; each thread block loads the corresponding blocks of A and B)
Benchmarks Figure: GPU using shared memory (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive, + Shared Memory)
What can we do with block size ◮ The block size is the number of threads in one thread block, which executes on a single SM (streaming multiprocessor) ◮ The total number of threads stays constant ◮ But the amount of data kept in the shared memory of the SM increases, decreasing the number of costly accesses to global memory ◮ Block size is limited to 22, since the maximum number of threads in one block is 512 (22² = 484 ≤ 512 < 23² = 529); see the launch sketch below
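As a sketch, the host-side launch configuration then uses 22 × 22 threads per block, assuming the hypothetical matmul_shared kernel above is recompiled with BLOCK_SIZE set to 22 and n is a multiple of BLOCK_SIZE:

    // With BLOCK_SIZE raised to 22 at compile time: 22 x 22 = 484 threads
    // per block, just under the 512-threads-per-block limit of this hardware.
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(n / BLOCK_SIZE, n / BLOCK_SIZE);   // one thread block per tile of C
    matmul_shared<<<grid, threads>>>(d_A, d_B, d_C, n);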
Benchmarks Figure: GPU with larger blocksize (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive, + Shared Memory, + large blocksize)
CUBLAS library ◮ A C library provided by nVidia implementing the BLAS (Basic Linear Algebra Subprograms) specification ◮ Could not find documentation of what it actually does internally, but it clearly does something clever (a minimal usage sketch follows below)
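For context, a single-precision GEMM through CUBLAS is one call once the data is on the device; this sketch uses the current handle-based cublas_v2 API (the 2010-era library exposed an older handle-free interface, but the call is analogous):

    #include <cublas_v2.h>

    // d_A, d_B, d_C are device pointers to n x n matrices already on the GPU.
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // CUBLAS assumes column-major storage; for a synthetic benchmark of
    // random matrices the storage layout does not affect the timing.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cublasDestroy(handle);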
Benchmarks Figure: CUBLAS library implementation (time in ms vs. matrix width; series: CPU Naive, IT++, GPU Naive, + Shared Memory, + large blocksize, CUBLAS)
Benchmarks Figure: This is interesting (same series as above; note the spikes in the CUBLAS curve)
Benchmarks (Zoomed) Figure: Zoom on spikes (CUBLAS only; time in ms vs. matrix widths 1008 to 1200)
◮ CUBLAS is about twice as fast when the width of the matrix is divisible by 16 ◮ Noticed by O. Schenk et al. in Algorithmic performance studies on graphics processing units, stating that: "When the matrix is not divisible by 16, there are conflicts in shared memory regarding multiple threads accessing the same bank at the same time. This forces one thread to be put in a queue while the other thread is accessing the memory, increasing the amount of time for all memory accesses to be completed." ◮ The question is: why aren't the smaller matrices padded to become divisible by 16? (A padding sketch follows below.)
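The padding asked about in the last point would be cheap to do on the host side; a minimal sketch of the idea, using a hypothetical helper that is not part of CUBLAS (zero-padding keeps the original product in the top-left n × n block):

    #include <cstddef>
    #include <vector>
    #include <algorithm>

    // Copy an n x n row-major matrix into a zero-filled buffer whose width is
    // rounded up to the next multiple of 16, so the GEMM hits the fast path.
    std::vector<float> pad_to_multiple_of_16(const float* M, int n, int& padded_n) {
        padded_n = ((n + 15) / 16) * 16;
        std::vector<float> out(static_cast<std::size_t>(padded_n) * padded_n, 0.0f);
        for (int row = 0; row < n; ++row)
            std::copy(M + row * n, M + (row + 1) * n, out.begin() + row * padded_n);
        return out;
    }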
Profit ratio ◮ Tesla C1060 costs about $1200, and calculates a 2000 × 2000 matrix in 50 ms ◮ Core i7 920 costs about $300, and calculates a 2000 × 2000 matrix in 2000 ms ◮ CUBLAS is about 40 times faster than IT++, while a Tesla costs only about 4 times more than a Core i7 ◮ So the profit ratio becomes tenfold: ($300 × 2000 ms) / ($1200 × 50 ms) = 10
Summary ◮ GPGPU is fast :) ◮ But without proper memory management it isn’t as fast as it could be. ◮ Even the libraries aren’t as fast as they could be