GPU programming Dr. Bernhard Kainz 1
Overview Last week: • About myself • Motivation • GPU hardware and system architecture • GPU programming languages • GPU programming paradigms This week: • Example program • Memory model • Tiling • Reduction • State-of-the-art applications 2 Dr Bernhard Kainz
Distinguishing between threads • blockId and threadId [figure: a 2D grid of thread blocks, each block labelled with its (blockIdx.x, blockIdx.y) coordinates and each thread within a block labelled with its (threadIdx.x, threadIdx.y) coordinates] 3 Dr Bernhard Kainz
2D Kernel example • threadIdx and blockIdx select the execution path of each thread • blockDim and gridDim give the number of threads

__global__ void myfunction(float *input, float *output)
{
    uint bid = blockIdx.x + blockIdx.y * gridDim.x;
    uint tid = bid * (blockDim.x * blockDim.y)
             + (threadIdx.y * blockDim.x) + threadIdx.x;
    output[tid] = input[tid];
}

dim3 blockSize(32, 32, 1);
dim3 gridSize((iSpaceX + blockSize.x - 1) / blockSize.x,
              (iSpaceY + blockSize.y - 1) / blockSize.y, 1);
myfunction<<<gridSize, blockSize>>>(input, output);

4 Dr Bernhard Kainz
Matrix Multiplication Example [figure: matrices C, A and B] $D_{jk} = \sum_{l=1}^{n} B_{jl} C_{lk}$ 5 Dr Bernhard Kainz
Matrix Multiplication Example Loop-based parallelism 6 Dr Bernhard Kainz
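The loop nest itself is not included in this extract; a minimal sketch of the sequential, loop-based version that the GPU kernel parallelises might look like this (the matrix names and the sizes A_rows, A_cols, B_cols follow the host code on slide 8; row-major storage is assumed):

// Sequential reference: C = A * B, one independent dot product per output element.
void matMulCPU(const float* A, const float* B, float* C,
               int A_rows, int A_cols, int B_cols)
{
    for (int i = 0; i < A_rows; ++i) {           // one output row
        for (int j = 0; j < B_cols; ++j) {       // one output column
            float sum = 0.0f;
            for (int l = 0; l < A_cols; ++l)     // dot product over the shared dimension
                sum += A[i * A_cols + l] * B[l * B_cols + j];
            C[i * B_cols + j] = sum;
        }
    }
}

The two outer loops are fully independent, which is exactly what maps onto one GPU thread per output element.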
Matrix Multiplication Example 7 Dr Bernhard Kainz
Matrix Multiplication Example
float* A = new float[A_rows*A_cols];
float* B = new float[B_rows*B_cols];
float* C = new float[B_cols*A_rows];
//some matrix initialization
float *d_A, *d_B, *d_C;
cudaMalloc((void**)&d_A, A_rows*A_cols*sizeof(float));
cudaMalloc((void**)&d_B, B_rows*B_cols*sizeof(float));
cudaMalloc((void**)&d_C, B_cols*A_rows*sizeof(float));
cudaMemcpy(d_A, A, A_rows*A_cols*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, B_rows*B_cols*sizeof(float), cudaMemcpyHostToDevice);
//kernel launch goes here
cudaMemcpy(C, d_C, B_cols*A_rows*sizeof(float), cudaMemcpyDeviceToHost);
//free stuff
8 Dr Bernhard Kainz
Matrix Multiplication Example 9 Dr Bernhard Kainz
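The kernel shown on this slide is not included in this extract; a hedged sketch of a naive matrix multiplication kernel in the spirit of the example could be (kernel name, argument list and bounds check are assumptions, matching the host code above):

// Naive kernel: one thread computes one element of C = A * B.
// Every thread reads a whole row of A and a whole column of B from
// slow global memory - the problem the tiled version below addresses.
__global__ void matMulNaive(const float* A, const float* B, float* C,
                            int A_rows, int A_cols, int B_cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
    if (row < A_rows && col < B_cols) {
        float sum = 0.0f;
        for (int l = 0; l < A_cols; ++l)
            sum += A[row * A_cols + l] * B[l * B_cols + col];
        C[row * B_cols + col] = sum;
    }
}

It would be launched with the 2D grid/block setup from the 2D kernel example on slide 4, e.g. dim3 blockSize(32,32,1) and a grid covering B_cols x A_rows output elements.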
Memory Model [diagram: CPU with its host memory] 10 Dr Bernhard Kainz
Matrix Multiplication Example • Many memory accesses, but only little computation • All memory accesses go to slow global device memory • Within a block, multiple threads need the same data: use shared memory to load one tile of data, consume the data together, then advance to the next block of data 11 Dr Bernhard Kainz
Matrix Multiplication Example • Without tiling, the same data is loaded BLOCK_SIZE times! • Load tiles, work on tiles, load next tiles … 13 Dr Bernhard Kainz
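A short back-of-the-envelope count of global-memory traffic (a sketch, assuming n x n matrices and TILE_WIDTH x TILE_WIDTH thread blocks as on the next slide): in the naive kernel every thread fetches its whole row of A and column of B itself, whereas in the tiled kernel each tile element is fetched from global memory only once per block, so per thread block

\[ \frac{\text{loads}_{\text{naive}}}{\text{loads}_{\text{tiled}}} = \frac{2n \cdot \mathrm{TILE\_WIDTH}^2}{2n \cdot \mathrm{TILE\_WIDTH}} = \mathrm{TILE\_WIDTH}, \]

i.e. tiling reduces global-memory loads by a factor of TILE_WIDTH (16x for 16x16 tiles).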
Matrix Multiplication Example Blocksize: TILE_WIDTH x TILE_WIDTH 14 Dr Bernhard Kainz
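The tiled kernel itself is not included in this extract; as a hedged sketch under simplifying assumptions (square n x n matrices, n a multiple of TILE_WIDTH, kernel and variable names invented here), a shared-memory version could look like this:

#define TILE_WIDTH 16   // assumed tile size; blocks are TILE_WIDTH x TILE_WIDTH threads

// Each block loads one TILE_WIDTH x TILE_WIDTH tile of A and of B into fast
// shared memory, all threads consume the tiles together, then the block
// advances to the next pair of tiles.
__global__ void matMulTiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE_WIDTH; ++t) {
        // every thread loads exactly one element of the current A tile and B tile
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE_WIDTH + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * n + col];
        __syncthreads();                      // wait until the whole tile is loaded

        for (int l = 0; l < TILE_WIDTH; ++l)  // consume the tile from shared memory
            sum += As[threadIdx.y][l] * Bs[l][threadIdx.x];
        __syncthreads();                      // wait before the tile is overwritten
    }
    C[row * n + col] = sum;
}

The two __syncthreads() barriers are what the following "problems" slides are about: without them, threads race on the shared tiles.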
Matrix multiplication problems 15 Dr Bernhard Kainz
Matrix multiplication problems • Race condition: a thread may read a shared-memory element before the thread responsible for loading it has written it! 16 Dr Bernhard Kainz
Matrix multiplication problems 17 Dr Bernhard Kainz
Matrix multiplication problems 18 Dr Bernhard Kainz
Matrix multiplication problems 19 Dr Bernhard Kainz
Memory statistics: non-tiled 20 Dr Bernhard Kainz
Memory statistics: tiled 21 Dr Bernhard Kainz
Parallel Reduction Illustrations by Mark Harris, Nvidia 22
Parallel Reduction - Common and important data parallel primitive - Easy to implement in CUDA - Harder to get it right - Serves as a great optimization example - Several different versions possible - Demonstrates several important optimization strategies 23 Dr Bernhard Kainz
Parallel Reduction - Tree-based approach used within each thread block - Need to be able to use multiple thread blocks - To process very large arrays - To keep all multiprocessors on the GPU busy - Each thread block reduces a portion of the array - Communicate partial results between thread blocks? 24 Dr Bernhard Kainz
Parallel Reduction - If we could synchronize across all thread blocks, could easily reduce very large arrays, right? - Global sync after each block produces its result - Once all blocks reach sync, continue recursively - But CUDA has no global synchronization. Why? - Expensive to build in hardware for GPUs with high processor count - Would force programmer to run fewer blocks (no more than # multiprocessors * # resident blocks / multiprocessor) to avoid deadlock, which may reduce overall efficiency - Solution: decompose into multiple kernels - Kernel launch serves as a global synchronization point - Kernel launch has negligible HW overhead, low SW overhead 25 Dr Bernhard Kainz
Parallel Reduction - Avoid global sync by decomposing computation into multiple kernel invocations - In the case of reductions, code for all levels is the same - Recursive kernel invocation 26 Dr Bernhard Kainz
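The reduction kernels themselves are not included in this extract; as a hedged sketch of the scheme described above (tree-based reduction in shared memory inside each block, then repeated kernel launches acting as the global synchronisation points), one straightforward version might be:

// Each block reduces blockDim.x input elements to a single partial sum.
__global__ void reduceSum(const float* in, float* out, int n)
{
    extern __shared__ float sdata[];                 // one float per thread
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;             // load one element per thread
    __syncthreads();

    // tree-based reduction in shared memory (blockDim.x assumed a power of two)
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];                  // one partial result per block
}

// Host side: re-launch on the partial sums until one value remains.
// The kernel-launch boundary acts as the global synchronisation point.
void reduce(float* d_in, float* d_out, int n, int blockSize)
{
    while (n > 1) {
        int blocks = (n + blockSize - 1) / blockSize;
        reduceSum<<<blocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_out, n);
        float* tmp = d_in; d_in = d_out; d_out = tmp;   // output becomes next input
        n = blocks;
    }
    // final result is now in d_in[0]
}

Because the code is the same at every level, the same kernel is simply invoked again on its own output, which is the "recursive kernel invocation" mentioned above.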