  1. GPGPU: General-Purpose Computation on GPUs Prekshu Ajmera 03d05006

  2. Overview
     1. Motivation: Why GPGPU?
     2. CPU-GPU Analogies
     3. GPU Resources
        − The Graphics Pipeline
        − Textures
        − Programmable Vertex Processor
        − Fixed-Function Rasterizer
        − Programmable Fragment Processor
        − Feedback
     4. GPU Program Flow Control
     5. GPGPU Techniques
        − Reduction: Max
        − Sort
        − Search
        − Matrix Multiplication

  3. Why GPGPU?
     The GPU has evolved into an extremely flexible and powerful processor.
     - Programmability
       − Programmable pixel and vertex engines
       − High-level language support
     - Precision
       − 32-bit floating point throughout the pipeline
     - Performance
       − 3 GHz Pentium 4, theoretical: 12 GFLOPS
       − GeForce 6800 Ultra, observed: 53 GFLOPS

  4. CPU-GPU Analogies

  5. GPU Textures = CPU Arrays
     Textures are the GPU equivalent of arrays.
     - Native data layout: rectangular (2D) textures.
     - Size limitation: 4096 texels in each dimension.
     - Data formats: one channel (LUMINANCE) to four channels (RGBA).
     - They provide a natural data structure for vector data types with 2 to 4 components.
     - Supported floating-point formats: 16-bit, 24-bit, 32-bit.
     - Most basic operations:
       − array (memory) read == texture lookup
       − array index == texture coordinates
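
     Since the deck contains no code, here is a small illustrative sketch of the address mapping this analogy implies (hypothetical helper names, plain host-side code in a CUDA/C++ source file): a 1-D array of n elements stored row-major in a 2-D texture puts element i at texel (i mod width, i div width).

         #include <cstdio>

         // Illustrative sketch (not from the slides): where a 1-D array
         // element lives once the array is stored row-major in a
         // rectangular 2-D texture.
         struct Texel { int x, y; };

         Texel indexToTexel(int i, int width) {
             Texel t;
             t.x = i % width;   // column within the row
             t.y = i / width;   // row
             return t;
         }

         int main() {
             const int width = 4096;   // max texels per dimension (slide 5)
             Texel t = indexToTexel(10000, width);
             printf("element 10000 -> texel (%d, %d)\n", t.x, t.y);
             return 0;
         }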

  6. Feedback = Texture Update
     - Feedback: results of an intermediate computation are used as input to the next pass.
     - Trivially implemented on a CPU using variables and arrays that can both be read and written.
     - Not trivial on GPUs:
       − Output of the fragment processor is always written to the frame buffer.
       − Think of the frame buffer as a 2-D array that cannot be read directly.
     - Solution? Use a texture as the frame buffer so the GPU can write intermediate results to it. This is called render-to-texture.

  7. GPU Fragment Programs = CPU Loop Bodies
     - Consider a 2-D grid. A CPU implementation uses a pair of nested loops to iterate over every cell in the grid and perform the same computation at each cell.
     - GPUs do not have the capability to perform this inner loop over each texel in a texture. Solution?
     - The fragment pipeline is designed to perform identical computations at each fragment simultaneously. It is similar to having a processor for each fragment.
     - Thus, the GPU analog of computation inside nested loops over an array is a fragment program applied in data-parallel fashion to each fragment, as sketched below.
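
     A minimal sketch of the analogy in CUDA (a modern stand-in for a fragment program; the kernel and names are illustrative, not from the deck). The CPU version iterates with nested loops; the GPU version keeps only the loop body, and the launch configuration replaces the loops.

         #include <cuda_runtime.h>

         // CPU: nested loops visit every cell of a 2-D grid,
         // performing the same computation at each cell.
         void scaleGridCPU(float* grid, int w, int h, float s) {
             for (int y = 0; y < h; ++y)
                 for (int x = 0; x < w; ++x)
                     grid[y * w + x] *= s;
         }

         // GPU: only the loop body remains; one thread runs it per cell.
         __global__ void scaleGridGPU(float* grid, int w, int h, float s) {
             int x = blockIdx.x * blockDim.x + threadIdx.x;
             int y = blockIdx.y * blockDim.y + threadIdx.y;
             if (x < w && y < h)
                 grid[y * w + x] *= s;
         }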

  8. The Modern Graphics Pipeline
     - Each stage in the graphics pipeline can be independently configured through graphics APIs like OpenGL or DirectX.
     - Programmable graphics pipeline:
       − Fixed-function operations on vertices, such as transformation and lighting calculations, are replaced by a user-defined vertex program.
       − Fixed-function operations that determine a fragment's color are replaced by a user-defined fragment program.

  9. Textures
     Textures are the equivalent of arrays.
     - Size limitation: 4096 texels in each dimension.
     - Native data layout: rectangular (2D) textures.
     - Data formats: one channel (LUMINANCE) to four channels (RGBA).
     - Supported floating-point formats: 16-bit, 24-bit, 32-bit.

  10. Programmable Vertex Processor
      - Input: stream of geometry.
      - Transforms each vertex in homogeneous coordinates (XYZW) independently of the other vertices; works on 4-tuples simultaneously (see the sketch below).
      - Fully programmable (SIMD/MIMD); processes 4-component vectors (RGBA/XYZW).
      - Capable of scatter but not gather:
        − Can change the location of the current vertex.
        − Cannot read information from other vertices.
        − Limited gather capability: can fetch from a texture, but not from the current vertex stream.
      - Output: stream of transformed vertices and triangles.
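
      A sketch of the per-vertex work described above, written as a CUDA kernel for concreteness (illustrative only; real vertex programs of this era were written in shading languages such as Cg): each vertex's homogeneous position is multiplied by a 4x4 matrix, independently of all other vertices.

          #include <cuda_runtime.h>

          // Illustrative: one thread transforms one vertex, as a vertex
          // processor would. m is a 4x4 row-major matrix; in/out hold
          // homogeneous (x, y, z, w) positions.
          __global__ void transformVertices(const float4* in, float4* out,
                                            const float* m, int n) {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              if (i >= n) return;
              float4 v = in[i];
              out[i] = make_float4(
                  m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w,
                  m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w,
                  m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w,
                  m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w);
          }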

  11. Fixed-Function Rasterizer
      - Input: stream of transformed vertices and triangles.
      - Generates a fragment for each pixel covered by the transformed geometry.
      - Interpolates vertex attributes linearly.
      - Output: stream of fragments.
      - Fixed-function part of the pipeline.

  12. Programmable Fragment Processor
      - Input: stream of fragments with interpolated attributes.
      - Applies the fragment program to each fragment independently.
      - Capable of gather but not scatter:
        − Indirect memory read (texture fetch), but no indirect memory write.
        − Output address is fixed to a specific pixel.
      - Fully programmable (SIMD); processes 4-component vectors (RGBA/XYZW).
      - Output: pixels to be displayed.

  13. Feedback: Render-to-Texture
      - Textures can be used as render targets!
      - Within a pass, a texture is either read-only or write-only.
      - Feedback loop: render intermediate results into a texture, then use it as input in the subsequent pass (sketched below).
      - Visualization: render a single quad into the frame buffer, textured with the last intermediate result.
      - Further processing on the CPU: read back the texture data.
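
      This feedback loop is commonly called ping-pong buffering. A minimal sketch in CUDA terms (illustrative names; the per-pass computation is a placeholder): two buffers alternate roles as the input "texture" and the output "render target".

          #include <cuda_runtime.h>

          // Placeholder per-element pass, standing in for one pass's
          // fragment program.
          __global__ void passKernel(const float* in, float* out, int n) {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              if (i < n) out[i] = 0.5f * in[i];
          }

          // Ping-pong: each pass reads the previous pass's output.
          void runPasses(float* bufA, float* bufB, int n, int passes) {
              float *src = bufA, *dst = bufB;
              for (int p = 0; p < passes; ++p) {
                  passKernel<<<(n + 255) / 256, 256>>>(src, dst, n);
                  float* tmp = src; src = dst; dst = tmp;  // swap roles
              }
          }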

  14. GPGPU Terminology

  15. Arithmetic Intensity
      - Arithmetic intensity = math operations per word transferred (computation / bandwidth).
      - Ideal applications to target for GPGPU have:
        − Large data sets
        − High parallelism
        − Minimal dependencies between data elements
        − High arithmetic intensity
        − Lots of work to do without CPU intervention
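
      As an illustrative worked example (not from the deck): an element-wise update such as y[i] = a*x[i] + y[i] performs 2 floating-point operations while transferring 3 words per element (two reads, one write), an arithmetic intensity of 2/3 operations per word, which leaves the processor waiting on memory. A kernel that performs dozens of math operations per word fetched is a far better GPGPU candidate.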

  16. Data Streams & Kernels
      - Streams
        − Collections of records requiring similar computation
        − Thus they provide data parallelism
      - Kernels
        − Functions applied to each element in the stream (transforms, PDEs, ...)
        − No dependencies between stream elements
        − Encourage high arithmetic intensity

  17. Scatter vs. Gather
      - Gather (sketched in kernel form below)
        − Indirect read from memory (x = a[i])
        − Maps naturally to a texture fetch
        − Used to access data structures and data streams
      - Scatter
        − Indirect write to memory (a[i] = x)
        − Difficult to emulate: it would require either changing the frame-buffer write location of a fragment or a dependent texture write, and neither operation is available on these GPUs.
        − Solutions? Rewrite the problem in terms of gather, or use the vertex processor.
        − Needed for building many data structures; usually done on the CPU.
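
      Both access patterns, expressed as CUDA kernels for concreteness (illustrative names, not from the deck); the fragment processors of this generation could execute only the first:

          #include <cuda_runtime.h>

          // Gather: indirect READ (x = a[i]); maps to a texture fetch.
          __global__ void gatherKernel(const float* a, const int* idx,
                                       float* out, int n) {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              if (i < n) out[i] = a[idx[i]];
          }

          // Scatter: indirect WRITE (a[i] = x); impossible in a fragment
          // program, whose output address is fixed to its own pixel.
          __global__ void scatterKernel(const float* x, const int* idx,
                                        float* a, int n) {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              if (i < n) a[idx[i]] = x[i];
          }

      Rewriting scatter as gather amounts to inverting the index map: instead of each source element pushing its value to a[idx[i]], each destination element pulls from the source positions that target it.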

  18. GPU Program Flow Control
      - The highly parallel nature of GPUs!
      - Limitations of branching on GPUs?
      - Techniques for iteration and decision making?

  19. Hardware Mechanisms for Flow Control
      Three basic implementations of data-parallel branching on GPUs:
      - Predication
      - Single Instruction Multiple Data (SIMD)
      - Multiple Instruction Multiple Data (MIMD)

  20. Hardware Mechanisms for Flow Control
      - Predication
        − No true data-dependent branch instructions.
        − The GPU evaluates both sides of the branch and discards one of the results based on the value of the Boolean branch condition (see the sketch below).
        − Disadvantage: evaluating both branches can be costly.
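
      A sketch of what predication amounts to (illustrative CUDA device function, not from the deck): both sides are computed unconditionally, and the condition merely selects which result survives.

          // Both sides execute; the Boolean condition only selects
          // which result is kept.
          __device__ float predicatedSelect(float x, bool cond) {
              float taken    = x * 2.0f;   // "then" side, always evaluated
              float notTaken = x + 1.0f;   // "else" side, always evaluated
              return cond ? taken : notTaken;
          }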

  21. Hardware Mechanisms for Flow Control
      - SIMD branching
        − All active processors execute the same instruction at the same time.
        − When the branch condition evaluates identically on all active processors, only the taken side of the branch is executed.
        − When it differs, both sides are evaluated and the results predicated. Thus divergence in the branching of simultaneously processed fragments leads to reduced performance.
      - MIMD branching
        − Different processors can follow different paths through the program.

  22. Other Techniques for Flow Control
      - Static branch resolution
        − Avoid branching inside inner loops.
        − Results in loops that contain efficient, branch-free code (see the sketch below).
      - Pre-computation
        − When the result of a branch is constant over a large domain of input values or a number of iterations, evaluate the branch only when its result is known to change, and store the result for use over many subsequent iterations.
      - Z-cull
        − A hardware feature for avoiding shading of pixels that will never be seen.
        − Fragments that fail the depth test are discarded before their colors are calculated in the fragment processor. A lot of work saved!
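
      A sketch of static branch resolution (illustrative, in CUDA): rather than testing "am I on the boundary?" at every cell of a grid-wide stencil kernel, launch one branch-free kernel over the interior and handle the boundary cells separately.

          #include <cuda_runtime.h>

          // Branch-free interior pass of a Jacobi-style stencil: the
          // launch domain is restricted to interior cells, so no
          // per-cell boundary test is needed in the inner computation.
          __global__ void interiorKernel(const float* in, float* out,
                                         int w, int h) {
              int x = blockIdx.x * blockDim.x + threadIdx.x + 1; // skip col 0
              int y = blockIdx.y * blockDim.y + threadIdx.y + 1; // skip row 0
              if (x < w - 1 && y < h - 1)
                  out[y*w + x] = 0.25f * (in[y*w + x - 1] + in[y*w + x + 1] +
                                          in[(y-1)*w + x] + in[(y+1)*w + x]);
          }
          // A second, much smaller kernel (not shown) would cover only
          // the boundary cells.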

  23. GPGPU: 4 Problems
      - Reduction: Max
      - Sorting
      - Searching
      - Matrix Multiplication

  24. Simple Fragment Application Flow
      1. Write data to texture
      2. Load fragment program
      3. Bind fragment program
      4. Bind textures
      5. Configure OpenGL for 1:1 rendering
      6. Draw a large quad
      7. Write results to texture

  25. Reduction (Max)
      - Goal: find the maximum element in an array of n elements.
      - Approach: each fragment processor finds the max of 4 adjacent array elements (each pass processes 16 elements).
      - Input: array of n elements stored as a 2D texture.
      - Output: array of n/4 elements written to the frame buffer (each pass overwrites the array).

  26. Reduction on GPU
      - Store the array as a 2D texture.
      - The max() comparison runs as a fragment program; each fragment compares 4 texels and returns their max (see the kernel sketch below).
      - The frame buffer stores the max from each fragment, so the buffer is one quarter the original array size.
      - The frame buffer overwrites the previous texture, and the process repeats.
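
      The reduction pass, sketched as a CUDA kernel for concreteness (the deck's version is a fragment program; names here are illustrative): each thread reads a 2x2 block of the input and writes one max, shrinking the array to a quarter of its size per pass.

          #include <cuda_runtime.h>

          // One reduction pass: output is (inW/2) x (inH/2).
          // Launch repeatedly, halving each dimension, until 1x1 remains.
          __global__ void maxReducePass(const float* in, float* out,
                                        int inW, int inH) {
              int x = blockIdx.x * blockDim.x + threadIdx.x;
              int y = blockIdx.y * blockDim.y + threadIdx.y;
              int outW = inW / 2, outH = inH / 2;
              if (x >= outW || y >= outH) return;
              float a = in[(2*y)     * inW + 2*x];      // 2x2 input block
              float b = in[(2*y)     * inW + 2*x + 1];
              float c = in[(2*y + 1) * inW + 2*x];
              float d = in[(2*y + 1) * inW + 2*x + 1];
              out[y * outW + x] = fmaxf(fmaxf(a, b), fmaxf(c, d));
          }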

  27. Another look at Reduction Loop

  28. Sorting on GPU
      - Sort an array of n floats.
      - CPU implementation: standard merge sort, O(n lg n).
      - GPU implementation: bitonic merge sort, O(lg n · lg n) = O(lg² n) passes.

  29. The Bitonic Merge Sort: A Classic (Parallel) Algorithm
      - Repeatedly build bitonic lists and then sort them.
      - A bitonic list is two monotonic lists concatenated together, one increasing and one decreasing:
        − List A: (3, 4, 7, 8), monotonically increasing
        − List B: (6, 5, 2, 1), monotonically decreasing
        − List AB: (3, 4, 7, 8, 6, 5, 2, 1), bitonic

  30. Similar to parallelizing Classic Merge Sort

  31. The Bitonic Sort
      Input: 3 7 4 8 6 2 1 5
      - 8x monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5)
      - 4x bitonic lists: (3,7) (4,8) (6,2) (1,5)
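
      For reference, here is the standard compare-exchange step of bitonic merge sort as a CUDA kernel (a textbook formulation, not taken from the slides; n must be a power of two). The host runs it O(lg² n) times with varying (k, j), mirroring the passes sketched above.

          #include <cuda_runtime.h>

          // One compare-exchange pass: element i is paired with element
          // i XOR j, and bit k of i selects the sort direction for its
          // subsequence.
          __global__ void bitonicStep(float* a, unsigned j, unsigned k) {
              unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
              unsigned partner = i ^ j;
              if (partner > i) {
                  bool ascending = ((i & k) == 0);
                  if ((a[i] > a[partner]) == ascending) {
                      float t = a[i]; a[i] = a[partner]; a[partner] = t;
                  }
              }
          }

          // Host loop (n a power of two, one thread per element):
          //   for (unsigned k = 2; k <= n; k <<= 1)
          //       for (unsigned j = k >> 1; j > 0; j >>= 1)
          //           bitonicStep<<<n / 256, 256>>>(a, j, k);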
