

  1. 2110412 Parallel Comp Arch CUDA: Parallel Programming on GPU Natawut Nupairoj, Ph.D. Department of Computer Engineering, Chulalongkorn University

  2. Outline  Overview  Parallel Computing with GPU  Introduction to CUDA  CUDA Thread Model  CUDA Memory Hierarchy and Memory Spaces  CUDA Synchronization

  3. Overview  Modern graphics accelerators are called GPUs (Graphics Processing Units)  2 ways GPUs speed up graphics:  Pipelining: similar to pipelining in CPUs.  A CPU like the Pentium 4 has about 20 pipeline stages.  GPUs typically have 600-800 stages -- possible because there are very few branches and most of the functionality is fixed. Source: Leigh, "Graphics Hardware Architecture & Miscellaneous Real Time Special Effects"

  4. Rocket Engines  (Figure: rocket engine image and its alpha channel, ranging from 100% opaque to 100% transparent)

  5. Typical Parallel Graphics Architecture  (Diagram) The Application feeds multiple parallel pipelines of Geometry Units (G) and Rasterizer Units (R), whose outputs are combined for Display.  Geometry Stage: transforms geometry (scale, rotate, translate, ...).  Rasterizer Stage: turns geometry into pixels (fragment generation, z-buffer merging).

  6. Transformation  Performs a sequence of math operations on each vertex, as sketched below
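To make this concrete, here is a minimal sketch (not from the slides) of the kind of per-vertex math the transformation stage performs: multiplying each homogeneous vertex by a 4x4 transform matrix. The names vec4, mat4, and transform_vertices are illustrative only.

/* Illustrative sketch: apply one 4x4 transform (scale/rotate/translate) to every vertex. */
typedef struct { float x, y, z, w; } vec4;   /* homogeneous vertex        */
typedef struct { float m[4][4]; } mat4;      /* 4x4 transformation matrix */

void transform_vertices(const mat4 *T, const vec4 *in, vec4 *out, int n)
{
    for (int v = 0; v < n; v++) {
        const float p[4] = { in[v].x, in[v].y, in[v].z, in[v].w };
        float r[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
        for (int i = 0; i < 4; i++)          /* r = T * p */
            for (int j = 0; j < 4; j++)
                r[i] += T->m[i][j] * p[j];
        out[v].x = r[0]; out[v].y = r[1]; out[v].z = r[2]; out[v].w = r[3];
    }
}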

  7. Rasterization  Rasterization: enumerates the pixels covered by triangles.  Fragment Processing: gives the individual triangle pixels a color.

  8. Imagine this is my screen and the polygons that will occupy my screen

  9. How Polygons Are Processed (Sort-Last Fragment)  Equally divide up the polygons among the geometry units (G): geometry processing is balanced.  Generate fragments for each group of polygons (FG): rendering is balanced.  Sort out where portions of the fragments need to go to merge to form the whole image (FM): merging involves compositing color and z-buffer.  Display.

  10. Overview  Parallelizing  Process the data in parallel within the GPU; in essence, multiple pipelines running in parallel.  The basic model is SIMD (Single Instruction Multiple Data), i.e. the same graphics algorithm applied to lots of polygons. Source: Leigh, "Graphics Hardware Architecture & Miscellaneous Real Time Special Effects"

  11. SIMD (revisited)  One control unit (Ctrl) issues each instruction to all processing elements (P), each with its own data memory (M), so they all compute at the same time.  Examples  TMC CM-1, MasPar MP-1, modern GPUs

  12. Modern GPU is More General Purpose – Lots of ALUs

  13. GPU Case: nVidia G80 Architecture

  14. The nVidia G80 GPU ► 128 streaming floating-point processors @ 1.5 GHz ► 1.5 GB shared RAM with 86 GB/s bandwidth ► 500 GFLOPS on one chip (single precision)

  15. nVidia G80 GPU Architecture Overview • 16 multiprocessor blocks • Each MP block has: • 8 streaming processors (IEEE 754 single-precision floating-point compliant) • 16 KB shared memory • 64 KB constant cache • 8 KB texture cache • Each processor can access all of the memory at 86 GB/s, but with different latencies: • Shared – 2-cycle latency • Device – 300-cycle latency

  16. Programming Interface  Interface to the GPU is via nVidia's proprietary API – CUDA (very C-like)  Looks a lot like UPC (simplified CUDA below)
__global__ void AddVectors(float *r, float *a, float *b)
{
    int tx = threadIdx.x;      // ~processor rank
    r[tx] = a[tx] + b[tx];     // executed in parallel
}
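For context (not on the original slide), the host would launch this kernel with an execution configuration, covered in the CUDA slides later on; this assumes the device pointers and the vector length N are already set up:

AddVectors<<<1, N>>>(r, a, b);   // one block of N threads, each adding one element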

  17. Still A Specialized Processor  Very Efficient For  Fast Parallel Floating Point Processing  Single Instruction Multiple Data Operations  High Computation per Memory Access  Not As Efficient For  Double Precision (need to test performance)  Logical Operations on Integer Data  Branching-Intensive Operations  Random Access, Memory-Intensive Operations

  18. Source: Kirk, “ Parallel Computing: What has changed lately? ”

  19. GPU Case: Cell Architecture

  20. History  Idea generated by SCEI in 1999 after the release of the PS2  STI group (Sony, Toshiba, IBM) formed in 2000  In 2001 the first design center opened in the US  Fall 2002: US patent released  Since then, prototypes have been developed and clocked at over 4.5 GHz  February 2005: final architecture revealed to the public  In 2005 it was announced that the first commercial Cell product would be released in 2006 Source: Lemieux, "The Cell Processor: from conception to deployment"

  21. Cell Architecture Overview

  22. Cell Architecture Overview  Intended to be configurable  Basic Configuration consists of:  1 PowerPC Processing Element (PPE)  8 Synergistic Processing Elements (SPE)  Element Interconnect Bus (EIB)  Rambus Memory Interface Controller (MIC)  Rambus FlexIO interface  512 KB system Level 2 cache

  23. The Cell Processor  (Block diagram) PPE with 32 KB L1 I/D caches and a 512 KB L2 cache; SPEs 0-6, each with a 256 KB Local Store (LS) and a DMA engine; Rambus XIO memory interface controller (MIC); I/O interface controller with two FlexIO interfaces. Source: Perthuis, "Introduction to the graphics pipeline of the PS3"

  24. Power Processing Element (PPE)  Acts as the host processor and performs scheduling for the SPEs  64-bit processor based on the IBM POWER architecture (Performance Optimization With Enhanced RISC)  Dual-threaded, in-order execution  32 KB Level 1 cache, connected to the 512 KB system Level 2 cache  Contains a VMX (AltiVec) unit and IBM hypervisor technology to allow two operating systems to run concurrently (such as Linux and a real-time OS for gaming)

  25. Synergistic Processing Unit (SPU)  SIMD vector processor that acts independently  Handles most of the computational workload  Again in-order execution, but dual-issue  Contains 256 KB of local store memory  Contains 128 × 128-bit registers

  26. Synergistic Processing Unit (SPU)  Operates on registers, which are read from or written to the local store.  SPEs cannot act directly on main memory; they have to move data to and from the local stores.  A DMA device in each SPE handles moving data between main memory and the local store.  Local Store addresses are aliased in the PPE address map, and transfers between a Local Store and memory (including other Local Stores) are coherent in the system.

  27. Sony's PS3

  28. PS3 Specs  Cell processor @ 3.2 GHz  7 functional SPEs  Total of 218 single-precision GFLOPS  nVidia RSX GPU (1.8 TFLOPS)  256 MB XDR RAM  256 MB GDDR3 VRAM  Up to 7 Bluetooth controllers  Backwards compatible; WiFi connectivity with the PSP

  29. Parallel Programming with CUDA

  30. Source: CUDA Tutorial Workshop, ISC-2009

  31. SETI@home and CUDA  Runs 5x to 10x faster than the CPU-only version

  32. Introduction to CUDA  nVidia introduced CUDA in November 2006  Utilize parallel computing engine in GPU to solve complex computational problems  CUDA is industry-standard C  Subset of C with extensions  Write a program for one thread  Instantiate it on many parallel threads  Familiar programming model and language  CUDA is a scalable parallel programming model  Program runs on any number of processors without recompiling

  33. CUDA Concept  Co-Execution between Host (CPU) and Device (GPU)  Parallel portions are executed on the device as kernels  One kernel is executed at a time  Many threads execute each kernel  All threads run the same code  Each thread has an ID that it uses to compute memory addresses and make control decisions  Serial program with parallel kernels, all in C  Serial C code executes in a CPU thread  Parallel kernel C code executes in thread blocks across multiple processing elements
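A minimal sketch (not from the slides) of this co-execution model for the vector-add kernel shown on a later slide: serial C code on the CPU allocates device memory, copies the inputs over, launches the parallel kernel, and copies the result back. The array size N and the host arrays are assumptions for illustration.

// Host (CPU) code driving a device (GPU) kernel -- illustrative sketch
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;                     // each thread handles one element
    C[i] = A[i] + B[i];
}

int main()
{
    const int N = 256;
    size_t size = N * sizeof(float);
    float h_A[N], h_B[N], h_C[N];            // host arrays (assumed initialized)

    float *d_A, *d_B, *d_C;                  // device arrays
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // host -> device
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    VecAdd<<<1, N>>>(d_A, d_B, d_C);         // parallel kernel executes on the GPU

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);   // device -> host

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}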

  34. CUDA Development: nvcc
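As an illustration (the file name is an assumption), a CUDA source file is compiled with nvcc, which separates the host code for the CPU compiler from the device code for the GPU:

nvcc VecAdd.cu -o vecadd      # compile host + device code into one executable
./vecadd                      # run: serial code on the CPU, kernels on the GPU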

  35. Normal C Program
void VecAdd_CPU(float* A, float* B, float* C, int N)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

void main()
{
    VecAdd_CPU(A, B, C, N);
}

  36. CUDA Program
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

void main()
{
    // Kernel invocation
    VecAdd<<<1, N>>>(A, B, C);
}

  37. Source: High Performance Computing with CUDA, DoD HPCMP: 2009

  38. CUDA Thread Model  CUDA Thread can be  one-dimensional  two-dimensional  three-dimensional  Thread Hierarchy  Grid  (2-D) Block  (3-D) Thread

  39. Calling CUDA Kernel  Modified C function call syntax: kernel<<<dim3 dG, dim3 dB>>>(…)  Execution Configuration (“<<< >>>”)  dG - dimension and size of grid in blocks  Two-dimensional: x and y  Blocks launched in the grid: dG.x*dG.y  dB - dimension and size of blocks in threads:  Three-dimensional: x, y, and z  Threads per block: dB.x*dB.y*dB.z  Unspecified dim3 fields initialize to 1
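A short illustrative example of this execution configuration (the kernel name, argument, and sizes are assumptions, not from the slide):

dim3 dG(4, 4);                    // grid: 4 x 4 = 16 blocks
dim3 dB(8, 8);                    // block: 8 x 8 x 1 = 64 threads (unspecified z defaults to 1)
MyKernel<<<dG, dB>>>(d_data);     // total threads launched: 16 * 64 = 1024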

  40. Example: Adding 2-D Matrix
// Kernel definition
__global__ void MatAdd(float A[M][N], float B[M][N], float C[M][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

void main()
{
    // Kernel invocation
    dim3 dimBlock(M, N);
    MatAdd<<<1, dimBlock>>>(A, B, C);
}

  41. CUDA Built-In Device Variables  All __global__ and __device__ functions have access to these automatically defined variables  dim3 gridDim;  Dimensions of the grid in blocks (at most 2D)  dim3 blockDim;  Dimensions of the block in threads  dim3 blockIdx;  Block index within the grid  dim3 threadIdx;  Thread index within the block
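As an illustrative sketch (not from the slides) of how these variables are combined in practice, a vector-add kernel can compute a global element index so it works across a whole grid of blocks; the launch sizes below are assumptions:

__global__ void VecAddLarge(float* A, float* B, float* C, int n)
{
    // global index = block offset + thread index within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // guard the last, partially filled block
        C[i] = A[i] + B[i];
}

// Host-side launch with enough blocks to cover n elements:
// int threadsPerBlock = 256;
// int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
// VecAddLarge<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);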
