Introduction to GPU Programming
Mubashir Adnan Qureshi
http://www.ncsa.illinois.edu/People/kindr/projects/hpca/files/singapore_p1.pdf
http://developer.download.nvidia.com/CUDA/training/NVIDIA_GPU_Computing_Webinars_CUDA_Memory_Optimization.pdf
Tutorial Goals
• NVIDIA GPU architecture
• NVIDIA GPU application development flow
• Write and run simple NVIDIA GPU kernels in CUDA
• Be aware of performance-limiting factors and understand performance tuning strategies
Introduction
• Why use Graphics Processing Units (GPUs) for general-purpose computing
• Modern GPU architecture – NVIDIA
• GPU programming overview
  – CUDA C
  – OpenCL
GPU vs. CPU Silicon Use
(Figure omitted; graph is courtesy of NVIDIA)
NVIDIA GPU Architecture
• N multiprocessors called SMs
• Each has M cores called SPs
• SIMD
  – Same instruction executed on the SPs
• Device memory shared across all SMs
(Figure is courtesy of NVIDIA)
NVIDIA GeForce 9400M G GPU
• 16 streaming processors arranged as 2 streaming multiprocessors
• At 0.8 GHz this provides
  – 54 GFLOPS in single-precision (SP)
• 128-bit interface to off-chip GDDR3 memory
  – 21 GB/s bandwidth
(Block diagram: one TPC with geometry controller and SMC; two SMs, each with instruction cache, MT issue, constant cache, 8 SPs, SFUs, and shared memory; texture units with texture L1; 128-bit interconnect to L2, ROPs, and DRAM)
NVIDIA Tesla C1060 GPU
• 240 streaming processors arranged as 30 streaming multiprocessors
• At 1.3 GHz this provides
  – 1 TFLOPS SP
  – 86.4 GFLOPS DP
• 512-bit interface to off-chip GDDR3 memory
  – 102 GB/s bandwidth
(Block diagram: 10 TPCs, each with geometry controller and SMC; three SMs per TPC, each with instruction cache, MT issue, constant cache, 8 SPs, SFUs, and shared memory; texture units with texture L1; 512-bit interconnect to L2, ROPs, and DRAM)
NVIDIA Tesla S1070 Computing Server
• 4 T10 GPUs, each with 4 GB of GDDR3 SDRAM
(Block diagram: four Tesla GPUs connected through NVIDIA switches to two PCIe x16 interfaces, plus power supply, thermal monitoring, and system management. Graph is courtesy of NVIDIA)
GPU Use/Programming
• GPU libraries
  – NVIDIA's CUDA BLAS and FFT libraries
  – Many 3rd-party libraries
• Low-abstraction, lightweight GPU programming toolkits
  – CUDA C
  – OpenCL
nvcc
• Any source file containing CUDA C language extensions must be compiled with nvcc
• nvcc is a compiler driver that invokes many other tools to accomplish the job
• Basic nvcc usage
  – nvcc <filename>.cu [-o <executable>]
    • Builds release mode
  – nvcc -deviceemu <filename>.cu
    • Builds device emulation mode (all code runs on CPU)
  – nvprof <executable>
    • Profiles the code
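For context, a minimal .cu file that the commands above could compile and profile. This is a sketch, not part of the original slides; the file and kernel names are hypothetical, and it uses the memory-management calls introduced later in the tutorial.

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Each thread records its global index in the output array.
    __global__ void whoAmI(int* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = i;
    }

    int main(void)
    {
        const int N = 8;
        int h[N];
        int* d;
        cudaMalloc((void**)&d, N * sizeof(int));
        whoAmI<<<2, 4>>>(d);                                   // 2 blocks x 4 threads = 8 threads
        cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < N; i++)
            printf("%d ", h[i]);
        printf("\n");
        cudaFree(d);
        return 0;
    }

For example, this could be built with nvcc whoami.cu -o whoami and profiled with nvprof ./whoami.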
Anatomy of a GPU Application
• Host side
• Device side
Reference CPU Version

void vecAdd(int N, float* A, float* B, float* C)   // computational kernel
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

int main(int argc, char **argv)
{
    int N = 16384;                                  // default vector size

    float *A = (float*)malloc(N * sizeof(float));   // memory allocation
    float *B = (float*)malloc(N * sizeof(float));
    float *C = (float*)malloc(N * sizeof(float));

    vecAdd(N, A, B, C);                             // kernel invocation

    free(A);                                        // memory de-allocation
    free(B);
    free(C);
}
Adding GPU support
(Diagram: the host holds the CPU and host memory with arrays A, B, C; the GPU card holds the GPU and device memory with the corresponding arrays gA, gB, gC)
Memory Spaces
• CPU and GPU have separate memory spaces
  – Data is moved across the PCIe bus
  – Use functions to allocate/set/copy memory on the GPU
• Host (CPU) manages device (GPU) memory
  – cudaMalloc(void** pointer, size_t nbytes)
  – cudaFree(void* pointer)
  – cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction);
    • returns after the copy is complete
    • blocks the CPU thread until all bytes have been copied
    • does not start copying until previous CUDA calls complete
  – enum cudaMemcpyKind
    • cudaMemcpyHostToDevice
    • cudaMemcpyDeviceToHost
    • cudaMemcpyDeviceToDevice
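A small sketch of this allocate/copy/free pattern with basic error checking (not from the original slides; the buffer names and sizes are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        size_t nbytes = 1024 * sizeof(float);
        float *hostBuf = (float*)malloc(nbytes);
        float *devBuf;

        // Allocate device memory and check the returned status.
        cudaError_t err = cudaMalloc((void**)&devBuf, nbytes);
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        // Host-to-device copy; blocks the CPU thread until all bytes are copied.
        cudaMemcpy(devBuf, hostBuf, nbytes, cudaMemcpyHostToDevice);

        // ... kernels operating on devBuf would be launched here ...

        // Device-to-host copy of the results.
        cudaMemcpy(hostBuf, devBuf, nbytes, cudaMemcpyDeviceToHost);

        cudaFree(devBuf);
        free(hostBuf);
        return 0;
    }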
Adding GPU support

int main(int argc, char **argv)
{
    int N = 16384;                                  // default vector size

    float *A = (float*)malloc(N * sizeof(float));
    float *B = (float*)malloc(N * sizeof(float));
    float *C = (float*)malloc(N * sizeof(float));

    // Memory allocation on the GPU card
    float *devPtrA, *devPtrB, *devPtrC;
    cudaMalloc((void**)&devPtrA, N * sizeof(float));
    cudaMalloc((void**)&devPtrB, N * sizeof(float));
    cudaMalloc((void**)&devPtrC, N * sizeof(float));

    // Copy data from the CPU (host) memory to the GPU (device) memory
    cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);
Adding GPU support

    // Kernel invocation
    vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC);

    // Copy results from device memory to the host memory
    cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Device memory de-allocation
    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);

    free(A);
    free(B);
    free(C);
}
GPU Kernel
• CPU version

void vecAdd(int N, float* A, float* B, float* C)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

• GPU version

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}
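The GPU version above assumes the launch configuration covers exactly N elements. A common defensive variant (a sketch, not part of the original slides) passes N and guards the index so extra threads in the last block do nothing:

    __global__ void vecAddGuarded(int N, float* A, float* B, float* C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)                  // threads past the end of the arrays simply exit
            C[i] = A[i] + B[i];
    }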
CUDA Programming Model
• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SIMD)
  – Each thread has an ID that it uses to compute memory addresses and make control decisions

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

• Threads are arranged as a grid of thread blocks
  – Threads within a block have access to a segment of shared memory
(Diagram: a grid of thread blocks 0 to N-1, each block with its own shared memory)
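A small sketch (not from the original slides; the kernel name scaleNonNegative is hypothetical) of a kernel in which each thread uses its ID both to compute an address and to make a control decision:

    __global__ void scaleNonNegative(int N, const float* input, float* output, float factor)
    {
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // unique global ID of this thread
        if (threadID < N) {                                    // control decision based on the ID
            float x = input[threadID];                         // ID used to compute the address
            output[threadID] = (x < 0.0f) ? 0.0f : factor * x; // per-thread, data-dependent branch
        }
    }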
Kernel Invocation Syntax

    vecAdd<<<32, 512>>>(devPtrA, devPtrB, devPtrC);

The <<<...>>> parameters give the grid and thread block dimensionality: here, a grid of 32 thread blocks with 512 threads per block. Inside the kernel,

    int i = blockIdx.x * blockDim.x + threadIdx.x;

combines the block ID within the grid (blockIdx.x), the number of threads per block (blockDim.x), and the thread ID within the thread block (threadIdx.x) into a unique global index.
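A worked example of this arithmetic, plus the usual rounding-up of the grid size when N is not an exact multiple of the block size (a sketch, not from the original slides):

    // For vecAdd<<<32, 512>>>: 32 blocks x 512 threads = 16384 threads, one per element.
    // The thread with blockIdx.x = 3 and threadIdx.x = 5 computes
    //     i = 3 * 512 + 5 = 1541
    // and therefore handles element 1541.

    // If N were not a multiple of the block size, round the grid size up (ceiling division)
    // and guard the index inside the kernel as shown earlier:
    int threadsPerBlock = 512;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(devPtrA, devPtrB, devPtrC);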
Mapping Threads to the Hardware
• Blocks of threads are transparently assigned to SMs
  – A block of threads executes on one SM and does not migrate
  – Several blocks can reside concurrently on one SM
• Blocks must be independent
  – Any possible interleaving of blocks should be valid
  – Blocks may coordinate but not synchronize
  – Thread blocks can run in any order
(Diagram: the same kernel grid of 8 blocks scheduled over time onto a 2-SM device and a 4-SM device; each block can execute in any order relative to other blocks)
Slide is courtesy of NVIDIA
GPU Memory Hierarchy
• Global (device) memory
  – Accessible by all threads as well as host (CPU)
  – Data lifetime is from allocation to deallocation
(Diagram: cudaMemcpy() moves data between host memory and each device's memory)
GPU Memory Hierarchy
• Global (device) memory
(Diagram: the thread blocks of kernel 0 and kernel 1 all access the same per-device global memory)
GPU Memory Hierarchy
• Local storage
  – Each thread has its own local storage
  – Mostly registers (managed by the compiler)
  – Data lifetime = thread lifetime
• Shared memory
  – Each thread block has its own shared memory
    • Accessible only by threads within that block
  – Data lifetime = block lifetime
(Diagram: per-thread local memory and per-block shared memory)
GPU Memory Hierarchy
(Diagram: host side with CPU, chipset, and DRAM; device side with GPU multiprocessors holding registers and shared memory, constant and texture caches, and device DRAM holding local, global, constant, and texture memory)

Memory    Location  Cached  Access  Scope                   Lifetime
Register  On-chip   N/A     R/W     One thread              Thread
Local     Off-chip  No      R/W     One thread              Thread
Shared    On-chip   N/A     R/W     All threads in a block  Block
Global    Off-chip  No      R/W     All threads + host      Application
Constant  Off-chip  Yes     R       All threads + host      Application
Texture   Off-chip  Yes     R       All threads + host      Application
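To make the table concrete, a sketch (not from the original slides; the kernel name and sizes are hypothetical, and it assumes 256 threads per block) of how these memory spaces appear in CUDA C:

    __constant__ float coeff[16];               // constant memory: read-only in kernels, written by the host

    __global__ void memorySpacesDemo(const float* in, float* out)  // in/out point to global memory
    {
        __shared__ float tile[256];             // shared memory: one copy per thread block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = in[i];                        // x lives in a register (per-thread local storage)

        tile[threadIdx.x] = x;                  // stage the value in this block's shared memory
        __syncthreads();                        // wait until every thread in the block has written

        out[i] = tile[threadIdx.x] * coeff[0];  // combine shared-memory and constant-memory data
    }

The host would fill coeff with cudaMemcpyToSymbol() before launching the kernel.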
Optimizing Algorithms for GPUs
• Maximize independent parallelism
• Maximize arithmetic intensity (math/bandwidth)
• Sometimes it's better to recompute than to cache
  – The GPU spends its transistors on ALUs, not memory
• Do more computation on the GPU to avoid costly data transfers
  – Even low-parallelism computations can sometimes be faster than transferring back and forth to the host
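As an illustration of the last point (a sketch, not from the original slides; scaleKernel and addKernel are hypothetical kernels, and devA, devB, devC, hostC are assumed to be allocated already): rather than copying intermediate results back to the host between steps, chain kernels on device-resident data and copy only the final result.

    scaleKernel<<<N/512, 512>>>(devA, devB);        // step 1: B = f(A); result stays in device memory
    addKernel<<<N/512, 512>>>(devB, devC);          // step 2: C = g(B); reads B directly from device memory
    cudaMemcpy(hostC, devC, N * sizeof(float),      // only the final result crosses the PCIe bus
               cudaMemcpyDeviceToHost);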