2110412 Parallel Comp Arch CUDA: Parallel Programming on GPU


  • 2110412 Parallel Comp Arch CUDA: Parallel Programming on GPU Natawut Nupairoj, Ph.D. Department of Computer Engineering, Chulalongkorn University

  • Outline  Overview  Parallel Computing with GPU  Introduction to CUDA  CUDA Thread Model  CUDA Memory Hierarchy and Memory Spaces  CUDA Synchronization

  • Overview  Modern graphics accelerators are called GPUs (Graphics Processing Units)  2 ways GPUs speed up graphics:  Pipelining: similar to pipelining in CPUs.  CPUs like Pentium 4 has 20 pipeline stages.  GPUs typically have 600-800 stages. -- very few branches & most of the functionality is fixed. Source: Leigh, “ Graphics Hardware Architecture & Miscellaneous Real Time Special Effects ”

  • Rocket Engines [Figure: rocket-engine image demonstrating the alpha channel, ranging from 100% opaque to 100% transparent]

  • Typical Parallel Graphics Architecture [Figure: the Application feeds several parallel Geometry Unit (G) → Rasterizer Unit (R) pipelines whose outputs merge at the Display]  Geometry Stage: transforms geometry (scale, rotate, translate, ...)  Rasterizer Stage: turns geometry into pixels (fragment generation, z-buffer merging)

  • Transformation  Performs a sequence of math operations on each vertex
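Concretely (my sketch, not the slides' code), the per-vertex math is a 4x4 matrix-vector product in homogeneous coordinates; scaling, rotation, and translation are all expressed as such matrices:

    /* Apply a row-major 4x4 transform m to a vertex v = (x, y, z, w). */
    void transform_vertex(const float m[16], const float v[4], float out[4]) {
        for (int row = 0; row < 4; row++)
            out[row] = m[4*row + 0] * v[0] + m[4*row + 1] * v[1]
                     + m[4*row + 2] * v[2] + m[4*row + 3] * v[3];
    }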

  • Rasterization  Rasterization enumerates the pixels covered by each triangle  Fragment Processing gives the individual triangle pixels a color
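As an illustrative sketch of what "enumerating covered pixels" means (my code, not the slides'): walk the triangle's bounding box and keep the pixel centers that pass all three edge tests:

    #include <math.h>
    #include <stdio.h>

    /* Signed area test: > 0 when p lies to the left of edge a->b. */
    static float edge(float ax, float ay, float bx, float by, float px, float py) {
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
    }

    /* Enumerate the pixels covered by a counter-clockwise triangle. */
    void rasterize(float x0, float y0, float x1, float y1, float x2, float y2) {
        int minx = (int)floorf(fminf(x0, fminf(x1, x2)));
        int maxx = (int)ceilf(fmaxf(x0, fmaxf(x1, x2)));
        int miny = (int)floorf(fminf(y0, fminf(y1, y2)));
        int maxy = (int)ceilf(fmaxf(y0, fmaxf(y1, y2)));
        for (int y = miny; y <= maxy; y++)
            for (int x = minx; x <= maxx; x++) {
                float px = x + 0.5f, py = y + 0.5f;      /* test the pixel center */
                if (edge(x0, y0, x1, y1, px, py) >= 0.0f &&
                    edge(x1, y1, x2, y2, px, py) >= 0.0f &&
                    edge(x2, y2, x0, y0, px, py) >= 0.0f)
                    printf("fragment (%d, %d)\n", x, y); /* handed to fragment processing */
            }
    }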

  • Imagine this is my screen and the polygons that will occupy my screen

  • How Polygons Are Processed (Sort-Last Fragment)  Equally divide up the polygons, so geometry processing (G) is balanced.  Generate fragments for each group of polygons (FG), so rendering is balanced.  Sort out where portions of the fragments need to go to merge into the whole image; merging (FM) involves compositing color and z-buffer.  Display

  • Overview  Parallelizing  Process the data in parallel within the GPU; in essence, multiple pipelines running in parallel.  The basic model is SIMD (Single Instruction Multiple Data), i.e. the same graphics algorithms but lots of polygons to process. Source: Leigh, “Graphics Hardware Architecture & Miscellaneous Real Time Special Effects”

  • SIMD (revisited)  One control unit tells processing elements to compute (at the same time). [Figure: one control unit (Ctrl) broadcasting an instruction stream (I) to multiple processors (P), each with its own data (D) in memory (M)]  Examples  TMC/CM-1, MasPar MP-1, Modern GPU

  • Modern GPU is More General Purpose – Lots of ALUs

  • GPU Case: nVidia G80 Architecture

  • The nVidia G80 GPU ► 128 streaming floating-point processors @ 1.5 GHz ► 1.5 GB shared RAM with 86 GB/s bandwidth ► 500 GFLOPS on one chip (single precision)

  • nVidia G80 GPU Architecture Overview • 16 multiprocessor blocks • Each MP block has: • 8 streaming processors (IEEE 754 single-precision floating-point compliant) • 16 KB shared memory • 64 KB constant cache • 8 KB texture cache • Each processor can access all of the memory at 86 GB/s, but with different latencies: • Shared – 2 cycle latency • Device – 300 cycle latency
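That latency gap is why CUDA kernels stage reused data in on-chip shared memory. A minimal sketch of the pattern (my illustration, assuming a launch with 256 threads per block; shared memory is covered later in the outline):

    __global__ void block_sum(const float *in, float *out) {
        __shared__ float buf[256];                    // on-chip shared memory: ~2 cycle latency
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid]; // one trip to device memory: ~300 cycles
        __syncthreads();                              // wait until the whole block has loaded
        if (tid == 0) {                               // further accesses hit the fast shared buffer
            float s = 0.0f;
            for (int i = 0; i < blockDim.x; i++)
                s += buf[i];
            out[blockIdx.x] = s;
        }
    }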

  • Programming Interface  Interface to GPU via nVidia’s proprietary API – CUDA (very C-like)  Looks a lot like UPC (simplified CUDA below)
    __global__ void AddVectors(float *r, float *a, float *b) {
        int tx = threadIdx.x;    // ~processor rank
        r[tx] = a[tx] + b[tx];   // executed in parallel
    }

  • Still A Specialized Processor  Very Efficient For  Fast Parallel Floating Point Processing  Single Instruction Multiple Data Operations  High Computation per Memory Access  Not As Efficient For  Double Precision (need to test performance)  Logical Operations on Integer Data  Branching-Intensive Operations  Random Access, Memory-Intensive Operations

  • Source: Kirk, “ Parallel Computing: What has changed lately? ”

  • GPU Case: Cell Architecture

  • History  Idea generated by SCEI in 1999 after the release of the PS2  STI group (Sony, Toshiba, IBM) formed in 2000  In 2001 the first design center opened in the US  Fall 2002 US patent released  Since then prototypes have been developed and clocked at over 4.5 GHz  February 2005 final architecture revealed to the public  In 2005 it was announced that the first commercial product of the Cell would be released in 2006 Source: Lemieux, “The Cell Processor: from conception to deployment”

  • Cell Architecture Overview

  • Cell Architecture Overview  Intended to be configurable  Basic Configuration consists of:  1 PowerPC Processing Element (PPE)  8 Synergistic Processing Elements (SPE)  Element Interconnect Bus (EIB)  Rambus Memory Interface Controller (MIC)  Rambus FlexIO interface  512 KB system Level 2 cache

  • The Cell Processor [Figure: block diagram of the Cell: the PPE with L1 (32 KB I/D) and L2 (512 KB) caches; SPE 0 through SPE 6, each with a 256 KB Local Store (LS) and a DMA engine; the Memory Interface Controller (MIC) with XIO memory; and the I/O Interface Controller with Flex-IO 0 and Flex-IO 1] Source: Perthuis, “Introduction to the graphics pipeline of the PS3”

  • Power Processing Element (PPE)  Acts as the host processor and performs scheduling for the SPEs  64-bit processor based on the IBM POWER architecture (Performance Optimization With Enhanced RISC)  Dual threaded, in-order execution  32 KB Level 1 cache, connected to the 512 KB system Level 2 cache  Contains a VMX (AltiVec) unit and IBM hypervisor technology to allow two operating systems to run concurrently (such as Linux and a real-time OS for gaming)

  • Synergistic Processing Unit (SPU)  SIMD vector processor that acts independently  Handles most of the computational workload  Again in-order execution, but dual issue  Contains 256 KB local store memory  Contains 128 × 128-bit registers

  • Synergistic Processing Unit (SPU)  Operates on registers, which are read from or written to the local store.  SPEs cannot act directly on main memory; they have to move data to and from the local stores.  A DMA device in each SPE handles moving data between main memory and the local store.  Local Store addresses are aliased in the PPE address map, and transfers to and from Local Store to memory (including other Local Stores) are coherent in the system.
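As flavor for how that DMA pattern looks in code (my illustration, not from the slides; buffer name and sizes are made up), an SPU program issues a tagged transfer through the MFC intrinsics in spu_mfcio.h and waits for completion before touching the data:

    #include <spu_mfcio.h>

    #define TAG 3
    /* Local-store buffer; DMA endpoints should be 16-byte (ideally 128-byte) aligned. */
    volatile float buf[1024] __attribute__((aligned(128)));

    void fetch(unsigned long long main_mem_ea) {
        mfc_get(buf, main_mem_ea, sizeof(buf), TAG, 0, 0); /* pull 4 KB from main memory */
        mfc_write_tag_mask(1 << TAG);                      /* select which tag to wait on */
        mfc_read_tag_status_all();                         /* block until the DMA completes */
        /* ... compute on buf in the local store, then mfc_put() results back ... */
    }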

  • Sony’s PS3

  • PS3 Specs  Cell processor @ 3.2 GHz  7 functional SPEs  218 SP GFLOPS in total  nVidia RSX GPU (1.8 TFLOPS)  256 MB XDR RAM  256 MB GDDR3 VRAM  Up to 7 Bluetooth controllers  Backwards compatible; WiFi connectivity with the PSP

  • Parallel Programming with CUDA

  • Source: CUDA Tutorial Workshop, ISC-2009

  • SETI@home and CUDA  Runs 5x to 10x faster than the CPU-only version

  • Introduction to CUDA  nVidia introduced CUDA in November 2006  Utilize parallel computing engine in GPU to solve complex computational problems  CUDA is industry-standard C  Subset of C with extensions  Write a program for one thread  Instantiate it on many parallel threads  Familiar programming model and language  CUDA is a scalable parallel programming model  Program runs on any number of processors without recompiling

  • CUDA Concept  Co-Execution between Host (CPU) and Device (GPU)  Parallel portions are executed on the device as kernels  One kernel is executed at a time  Many threads execute each kernel  All threads run the same code  Each thread has an ID that it uses to compute memory addresses and make control decisions  Serial program with parallel kernels, all in C  Serial C code executes in a CPU thread  Parallel kernel C code executes in thread blocks across multiple processing elements
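A standard idiom for the "ID computes addresses and control decisions" point (illustrative, assuming a 1-D launch):

    __global__ void scale(float *data, float factor, int n) {
        // Unique global index: which element this thread owns.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                 // control decision based on the ID
            data[i] *= factor;     // memory address computed from the ID
    }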

  • CUDA Development: nvcc
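nvcc separates a .cu file into host code (built with the system C/C++ compiler) and device code (built for the GPU), then links them into one executable. A typical invocation, assuming a source file named vecadd.cu:

    nvcc vecadd.cu -o vecadd    # compile host + device code into one executable
    ./vecadd                    # run it like a normal program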

  • Normal C Program
    void VecAdd_CPU(float* A, float* B, float* C, int N) {
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }
    void main() {
        VecAdd_CPU(A, B, C, N);
    }

  • CUDA Program
    // Kernel definition
    __global__ void VecAdd(float* A, float* B, float* C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }
    void main() {
        // Kernel invocation
        VecAdd<<<1, N>>>(A, B, C);
    }
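The slide elides device memory management; in a full program the host must allocate GPU buffers and copy data across. A minimal self-contained sketch (my addition, using the standard CUDA runtime API):

    #include <cuda_runtime.h>

    __global__ void VecAdd(float* A, float* B, float* C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main() {
        const int N = 256;
        size_t bytes = N * sizeof(float);
        float hA[N], hB[N], hC[N];
        for (int i = 0; i < N; i++) { hA[i] = i; hB[i] = 2.0f * i; }

        float *dA, *dB, *dC;                       // device (GPU) pointers
        cudaMalloc(&dA, bytes);
        cudaMalloc(&dB, bytes);
        cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        VecAdd<<<1, N>>>(dA, dB, dC);              // launch 1 block of N threads

        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }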

  • Source: High Performance Computing with CUDA, DoD HPCMP: 2009

  • CUDA Thread Model  CUDA thread indices can be  one-dimensional  two-dimensional  three-dimensional  Thread Hierarchy  a Grid holds a (2-D) array of Blocks; a Block holds a (3-D) array of Threads

  • Calling CUDA Kernel  Modified C function call syntax: kernel<<<dim3 dG, dim3 dB>>>(…)  Execution Configuration (“<<< >>>”)  dG - dimension and size of grid in blocks  Two-dimensional: x and y  Blocks launched in the grid: dG.x*dG.y  dB - dimension and size of blocks in threads:  Three-dimensional: x, y, and z  Threads per block: dB.x*dB.y*dB.z  Unspecified dim3 fields initialize to 1
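Putting that configuration syntax together (illustrative values; kernel and args are placeholders):

    dim3 dG(4, 3);              // grid: 4*3 = 12 blocks
    dim3 dB(8, 8, 2);           // block: 8*8*2 = 128 threads
    kernel<<<dG, dB>>>(args);   // 12 blocks * 128 threads = 1536 threads total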

  • Example: Adding 2-D Matrix
    // Kernel definition
    __global__ void MatAdd(float A[M][N], float B[M][N], float C[M][N]) {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }
    void main() {
        // Kernel invocation: one M x N block, one thread per element
        dim3 dimBlock(M, N);
        MatAdd<<<1, dimBlock>>>(A, B, C);
    }

  • CUDA Built-In Device Variables  All __global__ and __device__ functions have access to these automatically defined variables  dim3 gridDim;  Dimensions of the grid in blocks (at most 2D)  dim3 blockDim;  Dimensions of the block in threads  dim3 blockIdx;  Block index within the grid  dim3 threadIdx;  Thread index within the block
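One common pattern (my illustration) that exercises all four built-ins at once is the grid-stride loop, which lets a fixed grid cover an array of any length:

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int start  = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's first element
        int stride = gridDim.x * blockDim.x;                 // total threads in the grid
        for (int i = start; i < n; i += stride)
            y[i] = a * x[i] + y[i];
    }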