2110412 Parallel Comp Arch CUDA: Parallel Programming on GPU Natawut Nupairoj, Ph.D. Department of Computer Engineering, Chulalongkorn University
Outline Overview Parallel Computing with GPU Introduction to CUDA CUDA Thread Model CUDA Memory Hierarchy and Memory Spaces CUDA Synchronization
Overview Modern graphics accelerators are called GPUs (Graphics Processing Units). Two ways GPUs speed up graphics: Pipelining: similar to pipelining in CPUs. CPUs like the Pentium 4 have about 20 pipeline stages; GPUs typically have 600-800 stages -- very few branches, and most of the functionality is fixed. Source: Leigh, "Graphics Hardware Architecture & Miscellaneous Real Time Special Effects"
Rocket Engines (example image): the alpha channel of the image ranges from 100% opaque to 100% transparent
Typical Parallel Graphics Architecture (diagram): Application → multiple Geometry Units (G) → multiple Rasterizer Units (R) → Display. The Geometry Stage transforms geometry (scale, rotate, translate, ...); the Rasterizer Stage turns geometry into pixels (fragment generation, z-buffer merging).
Transformation Performs a sequence of math operations on each vertex
Rasterization: enumerates the pixels covered by triangles. Fragment Processing: gives the individual triangle pixels a color.
Imagine this is my screen and the polygons that will occupy my screen
How Polygons Are Processed (Sort-Last Fragment): Equally divide up the polygons among the geometry units (G), so geometry processing is balanced. Generate fragments for each group of polygons (FG), so rendering is balanced. Sort out where portions of the fragments need to go to merge to form the whole image; merging (FM) involves compositing color and z-buffer. Display.
Overview Parallelizing: process the data in parallel within the GPU -- in essence, multiple pipelines running in parallel. The basic model is SIMD (Single Instruction Multiple Data), i.e. the same graphics algorithms but lots of polygons to process. Source: Leigh, "Graphics Hardware Architecture & Miscellaneous Real Time Special Effects"
SIMD (revisited): One control unit (Ctrl) broadcasts each instruction (I) to all processing elements (P), which compute at the same time, each on its own data (D) in its own memory (M). Examples: TMC/CM-1, MasPar MP-1, modern GPUs.
Modern GPU is More General Purpose – Lots of ALUs
GPU Case: nVidia G80 Architecture
The nVidia G80 GPU ► 128 streaming floating-point processors @ 1.5 GHz ► 1.5 GB shared RAM with 86 GB/s bandwidth ► 500 GFLOPS on one chip (single precision)
nVidia G80 GPU Architecture Overview • 16 Multiprocessor Blocks • Each MP Block Has: • 8 Streaming Processors (IEEE 754 single-precision floating-point compliant) • 16 KB Shared Memory • 64 KB Constant Cache • 8 KB Texture Cache • Each processor can access all of the memory at 86 GB/s, but with different latencies: • Shared – 2-cycle latency • Device – 300-cycle latency
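To illustrate why this latency gap matters, a kernel can stage data it reuses in the fast on-chip shared memory. A minimal hedged sketch (the kernel name, the 256-element block size, and the neighbor-difference computation are made up for illustration):

__global__ void DiffNeighbor(float* in, float* out)
{
    __shared__ float tile[256];       // on-chip shared memory: ~2-cycle latency
    int i = threadIdx.x;
    tile[i] = in[i];                  // each element is read from device memory (~300-cycle latency) only once
    __syncthreads();                  // wait until every thread has filled its slot
    if (i < 255)
        out[i] = tile[i + 1] - tile[i];   // the neighbor's value is reused from fast shared memory
}

// Launch with a single block of 256 threads:
// DiffNeighbor<<<1, 256>>>(d_in, d_out);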
Programming Interface Interface to the GPU via nVidia's proprietary API – CUDA (very C-like). Looks a lot like UPC (simplified CUDA below):
__global__ void AddVectors(float *r, float *a, float *b)
{
    int tx = threadIdx.x;   // ~processor rank
    r[tx] = a[tx] + b[tx];  // executed in parallel
}
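A hedged sketch of launching this kernel, assuming d_r, d_a, and d_b are device pointers that each already hold N floats (the names are illustrative; a fuller example with memory transfers appears later):

void LaunchAddVectors(float* d_r, float* d_a, float* d_b, int N)
{
    // One block of N threads: thread tx handles element tx
    AddVectors<<<1, N>>>(d_r, d_a, d_b);
    cudaDeviceSynchronize();   // block until the kernel has finished
}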
Still A Specialized Processor. Very efficient for: fast parallel floating-point processing; single-instruction-multiple-data operations; high computation per memory access. Not as efficient for: double precision (need to test performance); logical operations on integer data; branching-intensive operations; random-access, memory-intensive operations.
Source: Kirk, “ Parallel Computing: What has changed lately? ”
GPU Case: Cell Architecture
History Idea generated by SCEI in 1999 after the release of the PS2. STI group (Sony, Toshiba, IBM) formed in 2000. In 2001 the first design center opened in the US. Fall 2002: US patent released. Since then, prototypes have been developed and clocked at over 4.5 GHz. February 2005: final architecture revealed to the public. In 2005 it was announced that the first commercial product using the Cell would be released in 2006. Source: Lemieux, "The Cell Processor: from conception to deployment"
Cell Architecture Overview
Cell Architecture Overview Intended to be configurable Basic Configuration consists of: 1 PowerPC Processing Element (PPE) 8 Synergistic Processing Elements (SPE) Element Interconnect Bus (EIB) Rambus Memory Interface Controller (MIC) Rambus FlexIO interface 512 KB system Level 2 cache
The Cell Processor (block diagram): the PPE (32 KB L1 I/D cache, 512 KB L2 cache) and seven SPEs (SPE 0-6, each with a 256 KB Local Store and a DMA engine) sit on the Element Interconnect Bus, together with the Memory Interface Controller (MIC) with its XIO interface to memory and two FlexIO interfaces (FlexIO 0 and FlexIO 1) behind the I/O controller. Source: Perthuis, "Introduction to the graphics pipeline of the PS3"
Power Processing Element (PPE) Acts as the host processor and performs scheduling for the SPEs. 64-bit processor based on the IBM POWER architecture (Performance Optimization With Enhanced RISC). Dual-threaded, in-order execution. 32 KB Level 1 cache, connected to the 512 KB system Level 2 cache. Contains a VMX (AltiVec) unit and IBM hypervisor technology to allow two operating systems to run concurrently (such as Linux and a real-time OS for gaming).
Synergistic Processing Unit (SPU) SIMD vector processor that acts independently. Handles most of the computational workload. Again in-order execution, but dual-issue. Contains 256 KB of local store memory. Contains 128 × 128-bit registers.
Synergistic Processing Unit (SPU) Operates on registers, which are read from or written to the local store. SPEs cannot act directly on main memory; they have to move data to and from the local stores. A DMA device in each SPE handles moving data between main memory and the local store. Local Store addresses are aliased in the PPE address map, and transfers between a Local Store and memory (including other Local Stores) are coherent in the system.
Sony's PS3
PS3 Specs Cell processor @ 3.2 GHz. 7 functional SPEs. Total 218 SP GFLOPS. nVidia RSX GPU (1.8 TFLOPS). 256 MB XDR RAM. 256 MB GDDR3 VRAM. Up to 7 Bluetooth controllers. Backwards compatible; WiFi connectivity with the PSP.
Parallel Programming with CUDA
Source: CUDA Tutorial Workshop, ISC-2009
SETI@home and CUDA Runs 5x to 10x faster than the CPU-only version
Introduction to CUDA nVidia introduced CUDA in November 2006. It utilizes the parallel computing engine in the GPU to solve complex computational problems. CUDA is based on industry-standard C: a subset of C with extensions. Write a program for one thread and instantiate it on many parallel threads. Familiar programming model and language. CUDA is a scalable parallel programming model: a program runs on any number of processors without recompiling.
CUDA Concept Co-Execution between Host (CPU) and Device (GPU) Parallel portions are executed on the device as kernels One kernel is executed at a time Many threads execute each kernel All threads run the same code Each thread has an ID that it uses to compute memory addresses and make control decisions Serial program with parallel kernels, all in C Serial C code executes in a CPU thread Parallel kernel C code executes in thread blocks across multiple processing elements
CUDA Development: nvcc
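The nvcc driver separates a .cu source file into host code (handed to the host C/C++ compiler) and device code (compiled for the GPU). A minimal build command, assuming the CUDA source below is saved in a file named vecadd.cu (the file name is illustrative):

nvcc -o vecadd vecadd.cu

Running ./vecadd then executes the host code on the CPU, which launches the device portions on the GPU.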
Normal C Program
void VecAdd_CPU(float* A, float* B, float* C, int N)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

int main(void)
{
    // A, B, C, and N assumed to be set up elsewhere
    VecAdd_CPU(A, B, C, N);
}
CUDA Program
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main(void)
{
    // Kernel invocation: one block of N threads
    // (A, B, C assumed to be device pointers, N set up elsewhere)
    VecAdd<<<1, N>>>(A, B, C);
}
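The slide elides memory management. A hedged sketch of what a complete host program might look like, assuming one block of N = 256 threads (the size and variable names are illustrative):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main(void)
{
    const int N = 256;
    size_t size = N * sizeof(float);

    // Allocate and initialize host arrays
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);
    for (int i = 0; i < N; i++) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    // Allocate device arrays
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy inputs host -> device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Kernel invocation: one block of N threads
    VecAdd<<<1, N>>>(d_A, d_B, d_C);

    // Copy the result device -> host and clean up
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("C[1] = %f\n", h_C[1]);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}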
Source: High Performance Computing with CUDA, DoD HPCMP: 2009
CUDA Thread Model A CUDA thread index can be one-dimensional, two-dimensional, or three-dimensional. Thread Hierarchy: Grid (2-D) → Block (3-D) → Thread.
Calling CUDA Kernel Modified C function call syntax: kernel<<<dim3 dG, dim3 dB>>>(…) Execution Configuration (“<<< >>>”) dG - dimension and size of grid in blocks Two-dimensional: x and y Blocks launched in the grid: dG.x*dG.y dB - dimension and size of blocks in threads: Three-dimensional: x, y, and z Threads per block: dB.x*dB.y*dB.z Unspecified dim3 fields initialize to 1
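As a concrete illustration of the execution configuration (the kernel name MyKernel and the chosen dimensions are made up for this example):

__global__ void MyKernel(float* data) { /* ... */ }

void Launch(float* d_data)
{
    dim3 dG(4, 2);                  // dG.x * dG.y = 8 blocks in the grid
    dim3 dB(8, 8, 4);               // dB.x * dB.y * dB.z = 256 threads per block
    MyKernel<<<dG, dB>>>(d_data);   // 8 blocks * 256 threads = 2048 threads in total
}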
Example: Adding 2-D Matrix
// Kernel definition
__global__ void MatAdd(float A[M][N], float B[M][N], float C[M][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main(void)
{
    // Kernel invocation: one block of M x N threads
    dim3 dimBlock(M, N);
    MatAdd<<<1, dimBlock>>>(A, B, C);
}
CUDA Built-In Device Variables All __global__ and __device__ functions have access to these automatically defined variables dim3 gridDim; Dimensions of the grid in blocks (at most 2D) dim3 blockDim; Dimensions of the block in threads dim3 blockIdx; Block index within the grid dim3 threadIdx; Thread index within the block
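These variables are typically combined to compute a global element index so that one kernel can span many blocks. A minimal sketch (the kernel name and the 256-thread block size are illustrative):

__global__ void VecAddMulti(float* A, float* B, float* C, int N)
{
    // blockIdx selects the block, blockDim gives its width, threadIdx selects the thread inside it
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)                      // the last block may be only partially filled
        C[i] = A[i] + B[i];
}

// Host side: enough 256-thread blocks to cover all N elements
// VecAddMulti<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);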