GPGPU: General-Purpose Computation on GPUs
Prekshu Ajmera
03d05006
Overview
1. Motivation: Why GPGPU?
2. CPU-GPU Analogies
3. GPU Resources
   − The Graphics Pipeline
   − Textures
   − Programmable Vertex Processor
   − Fixed-Function Rasterizer
   − Programmable Fragment Processor
   − Feedback
4. GPU Program Flow Control
5. GPGPU Techniques
   − Reduction: Max
   − Sort
   − Search
   − Matrix Multiplication
Why GPGPU?
− The GPU has evolved into an extremely flexible and powerful processor.
Programmability
− Programmable pixel and vertex engines
− High-level language support
Precision
− 32-bit floating point throughout the pipeline
Performance
− 3 GHz Pentium 4, theoretical: 12 GFLOPS
− GeForce 6800 Ultra, observed: 53 GFLOPS
CPU-GPU Analogies
GPU Textures = CPU Arrays
− Textures are the GPU equivalent of arrays.
− Native data layout: rectangular (2D) textures.
− Size limitation: 4096 texels in each dimension.
− Data formats: one channel (LUMINANCE) to four channels (RGBA). They provide a natural data structure for vector data types with 2 to 4 components.
− Supported floating-point formats: 16-bit, 24-bit, 32-bit.
− Most basic operations:
  array (memory) read == texture lookup
  array index == texture coordinates
(A texture-upload sketch follows below.)
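As a concrete illustration of the analogy, here is a minimal C++/OpenGL sketch of "array write == texture upload"; the helper name is ours, and float textures assume the ARB_texture_float extension available on GeForce 6-class hardware:

#include <GL/gl.h>
#include <GL/glext.h>

// Store a CPU float array as a 2D RGBA float texture:
// one 4-component array element per texel.
GLuint createFloatTexture(const float* data, int width, int height) {
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    // Nearest filtering: texels are exact array values, not colors to blend.
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, width, height, 0,
                 GL_RGBA, GL_FLOAT, data);  // later array reads == texture lookups
    return tex;
}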
Feedback = Texture Update
− Feedback: results of an intermediate computation used as input to the next pass.
− Trivially implemented on a CPU using variables and arrays that can both be read and written.
− Not trivial on GPUs: the output of the fragment processor is always written to the frame buffer. Think of the frame buffer as a 2D array that cannot be read directly.
− Solution? Use a texture as the frame buffer so that the GPU can write intermediate results to it. This is called Render-to-Texture.
GPU Fragment Programs = CPU Loop Bodies
− Consider a 2D grid. A CPU implementation uses a pair of nested loops to iterate over each cell in the grid and performs the same computation at each cell.
− GPUs do not have the capability to perform this inner loop over each texel in a texture.
− Solution? The fragment pipeline is designed to perform identical computations at each fragment simultaneously, as if there were a processor for each fragment.
− Thus, the GPU analog of the computation inside nested loops over an array is a fragment program applied in data-parallel fashion to each fragment. (A sketch of the correspondence follows below.)
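A minimal sketch of the correspondence (the grid update and the scale factor k are illustrative): the CPU loop body below and the GLSL fragment program perform the same per-cell computation; on the GPU, the rasterizer supplies the iteration.

// CPU version: nested loops apply the same body to every cell.
void scaleGridCPU(float* grid, int w, int h, float k) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            grid[y * w + x] *= k;   // the loop body is the "kernel"
}

// GPU version: the loop body becomes a fragment program; the rasterizer
// generates one fragment per texel, supplying the (x, y) iteration.
const char* scaleGridFragmentShader =
    "uniform sampler2D grid;                          \n"
    "uniform float k;                                 \n"
    "void main() {                                    \n"
    "    vec4 v = texture2D(grid, gl_TexCoord[0].xy); \n"
    "    gl_FragColor = k * v;                        \n"
    "}                                                \n";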
The Modern Graphics Pipeline
− Each stage in the graphics pipeline can be independently configured through graphics APIs like OpenGL or DirectX.
Programmable Graphics Pipeline
− Fixed-function operations on vertices, such as transformations and lighting calculations, are replaced by a user-defined vertex program.
− Fixed-function operations on fragments, which determine a fragment's color, are replaced by a user-defined fragment program.
Textures
− Textures are the GPU equivalent of arrays.
− Native data layout: rectangular (2D) textures.
− Size limitation: 4096 texels in each dimension.
− Data formats: one channel (LUMINANCE) to four channels (RGBA).
− Supported floating-point formats: 16-bit, 24-bit, 32-bit.
Programmable Vertex Processor
− Input: stream of geometry.
− Transforms each vertex in homogeneous coordinates (XYZW) independently of the other vertices; works on 4-tuples simultaneously.
− Fully programmable (SIMD/MIMD).
− Processes 4-component vectors (RGBA/XYZW).
− Capable of scatter but not gather: it can change the location of the current vertex, but cannot read information from other vertices.
− Limited gather capability: it can fetch from a texture, but not from the current vertex stream.
− Output: stream of transformed vertices and triangles.
Fixed-Function Rasterizer
− Input: stream of transformed vertices and triangles.
− Generates a fragment for each pixel covered by the transformed geometry.
− Interpolates vertex attributes linearly.
− Output: stream of fragments.
− Fixed-function part of the pipeline.
Programmable Fragment Processor
− Input: stream of fragments with interpolated attributes.
− Applies the fragment program to each fragment independently.
− Capable of gather but not scatter: indirect memory read (texture fetch), but no indirect memory write; the output address is fixed to a specific pixel.
− Fully programmable (SIMD).
− Processes 4-component vectors (RGBA/XYZW).
− Output: pixels to be displayed.
Feedback: Render-to-Texture
− Textures can be used as render targets!
− Textures are either read-only or write-only within a pass.
− Feedback loop: render intermediate results into a texture, then use it as input in the subsequent pass, "ping-ponging" between two textures (see the sketch below).
− Visualization: render a single quad into the frame buffer, textured with the last intermediate result.
− Further processing on the CPU: read back the texture data.
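A minimal ping-pong sketch using the EXT_framebuffer_object extension; drawFullScreenQuad is a hypothetical helper that issues one fragment per texel, and the texture pair is assumed to be created as in the earlier texture sketch:

#include <GL/gl.h>
#include <GL/glext.h>
#include <algorithm>  // std::swap

void drawFullScreenQuad(int w, int h);  // hypothetical helper

void pingPong(GLuint fbo, GLuint tex[2], int nPasses, int w, int h) {
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
    int src = 0, dst = 1;
    for (int pass = 0; pass < nPasses; ++pass) {
        // Write side: the destination texture becomes the "frame buffer".
        glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                                  GL_TEXTURE_2D, tex[dst], 0);
        // Read side: the fragment program samples the source texture.
        glBindTexture(GL_TEXTURE_2D, tex[src]);
        drawFullScreenQuad(w, h);   // one fragment per texel
        std::swap(src, dst);        // this pass's output feeds the next pass
    }
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0);  // back to the window
}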
GPGPU Terminology
Arithmetic Intensity
− Arithmetic intensity = math operations per word transferred (computation / bandwidth).
− Ideal applications for GPGPU have:
  large data sets;
  high parallelism;
  minimal dependencies between data elements;
  high arithmetic intensity;
  lots of work to do without CPU intervention.
(A small worked example follows below.)
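As a worked example (the operation is illustrative, not from the slides): a SAXPY-style update y[i] = a*x[i] + y[i] performs 2 floating-point operations per element but transfers 3 words (read x[i], read y[i], write y[i]), an arithmetic intensity of 2/3 of an operation per word, so it is bandwidth-bound. A kernel that performs many math operations on each texel before writing it out has far higher arithmetic intensity and is a much better GPGPU fit.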
Data Streams & Kernels
Streams
− Collection of records requiring similar computation; thus they provide data parallelism.
Kernels
− Functions applied to each element in a stream (transforms, PDEs, …).
− No dependencies between stream elements.
− Encourage high arithmetic intensity.
Scatter vs. Gather
Gather
− Indirect read from memory (x = a[i]).
− Maps naturally to a texture fetch.
− Used to access data structures and data streams.
Scatter
− Indirect write to memory (a[i] = x).
− Difficult to emulate: it would require changing the frame-buffer write location of a fragment, or a dependent texture write, neither of which is available on GPUs.
− Solutions? Rewrite the problem in terms of gather (see the sketch below), or use the vertex processor.
− Needed for building many data structures; usually done on the CPU.
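A minimal sketch of the scatter-to-gather rewrite in C++ (the permutation example and the precomputed inverse permutation inv are illustrative; inv would typically be built on the CPU):

#include <cstddef>
#include <vector>

// Scatter form: each input element writes to a computed address.
// This is exactly what the fragment processor cannot do.
void permuteScatter(const std::vector<float>& in, const std::vector<int>& p,
                    std::vector<float>& out) {
    for (std::size_t i = 0; i < in.size(); ++i)
        out[p[i]] = in[i];       // indirect write: a[i] = x
}

// Gather form: each *output* element reads from a computed address.
// On the GPU, inv is stored as a texture and the indirect read becomes
// a (dependent) texture fetch inside the fragment program.
void permuteGather(const std::vector<float>& in, const std::vector<int>& inv,
                   std::vector<float>& out) {
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = in[inv[i]];     // indirect read: x = a[i]
}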
GPU Program Flow Control
− The highly parallel nature of GPUs!
− Limitations of branching on GPUs?
− Techniques for iteration and decision making?
Hardware Mechanisms for Flow Control
Three basic implementations of data-parallel branching on GPUs:
− Predication
− Single Instruction Multiple Data (SIMD) branching
− Multiple Instruction Multiple Data (MIMD) branching
Hardware Mechanisms for Flow Control
Predication
− No true data-dependent branch instructions.
− The GPU evaluates both sides of a branch and discards one of the results, based on the value of the boolean branch condition.
− Disadvantage: evaluating both branches can be costly. (See the shader sketch below.)
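A sketch of what predication means at the shader level (illustrative GLSL embedded as a C++ string; the condition and the two sides are made up): both sides are computed for every fragment, and the condition merely selects one result.

// No jump is taken: the cost is cost(a) + cost(b) for every fragment.
const char* predicatedShader =
    "uniform sampler2D data;                                          \n"
    "void main() {                                                    \n"
    "    float x = texture2D(data, gl_TexCoord[0].xy).r;              \n"
    "    float a = sin(x) * sin(x);  /* 'then' side, always run */    \n"
    "    float b = cos(x) * cos(x);  /* 'else' side, always run */    \n"
    "    float r = (x > 0.5) ? a : b;  /* select, don't branch */     \n"
    "    gl_FragColor = vec4(r);                                      \n"
    "}                                                                \n";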
Hardware Mechanisms for Flow Control
SIMD branching
− All active processors execute the same instructions at the same time.
− When the evaluation of a branch condition is identical on all active processors, only the taken side of the branch is evaluated.
− When it differs, both sides are evaluated and the results predicated. Thus divergence in the branching of simultaneously processed fragments can reduce performance.
MIMD branching
− Different processors can follow different paths through the program.
Other Techniques for Flow Control
Static Branch Resolution
− Avoid branching inside inner loops.
− Results in loops that contain efficient code without branches. (See the sketch below.)
Pre-Computation
− When the result of a branch is constant over a large domain of input values or a number of iterations, evaluate the branch only when its result is known to change, and store the result for use over many subsequent iterations.
Z-Cull
− A feature to avoid shading pixels that will not be seen.
− Fragments that fail the depth test are discarded before their pixel colors are calculated in the fragment processor. A lot of work saved!
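A sketch of static branch resolution for a grid computation (the program handles and helpers here are hypothetical): rather than one fragment program that branches on "am I on the boundary?", compile two branch-free programs and draw the interior and the boundary separately.

extern GLuint interiorProgram, boundaryProgram;  // hypothetical prebuilt shaders
void drawQuad(int x0, int y0, int x1, int y1);   // hypothetical helper

void runPass(int w, int h) {
    glUseProgram(interiorProgram);   // branch-free interior kernel
    drawQuad(1, 1, w - 1, h - 1);    // all interior cells in one draw
    glUseProgram(boundaryProgram);   // boundary-case kernel
    drawQuad(0, 0, w, 1);            // bottom row
    drawQuad(0, h - 1, w, h);        // top row
    drawQuad(0, 0, 1, h);            // left column
    drawQuad(w - 1, 0, w, h);        // right column
}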
GPGPU: 4 Problems
− Reduction: Max
− Sorting
− Searching
− Matrix Multiplication
Simple Fragment Application Flow
1. Write data to texture
2. Bind textures
3. Load fragment program
4. Bind fragment program
5. Configure OpenGL for 1:1 rendering
6. Draw large quad
7. Write results to texture
(A host-code sketch of this flow follows below.)
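A host-code sketch of the seven steps (assuming an OpenGL 2.0 context; loadFragmentProgram and drawFullScreenQuad are hypothetical helpers, and createFloatTexture is the sketch from the textures slide):

GLuint loadFragmentProgram(const char* source);  // hypothetical helper

void gpgpuPass(const float* input, float* output,
               const char* kernelSource, int w, int h) {
    // 1-2. Write data to a texture and bind it.
    GLuint inTex = createFloatTexture(input, w, h);
    glBindTexture(GL_TEXTURE_2D, inTex);

    // 3-4. Load and bind the fragment program (the kernel).
    glUseProgram(loadFragmentProgram(kernelSource));

    // 5. Configure OpenGL for 1:1 rendering: one fragment per texel.
    glViewport(0, 0, w, h);
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(0, w, 0, h, -1, 1);

    // 6. Draw a large quad covering the viewport; the rasterizer turns it
    //    into one kernel invocation per output texel.
    drawFullScreenQuad(w, h);

    // 7. Results are in the render target; read them back, or render to
    //    texture and keep them on the GPU for the next pass.
    glReadPixels(0, 0, w, h, GL_RGBA, GL_FLOAT, output);
}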
Reduction (max)
Goal
− Find the maximum element in an array of n elements.
Approach
− Each fragment processor finds the max of 4 adjacent array elements, so each pass reduces the array by a factor of 4.
− Input: array of n elements stored as a 2D texture.
− Output: array of n/4 elements to the frame buffer (each pass overwrites the array).
Reduction on GPU
− Store the array as a 2D texture.
− The max() comparison runs as a fragment program: each fragment compares 4 texels and returns the max.
− The frame buffer stores the max from each fragment (the buffer is a quarter of the original array size).
− The frame buffer overwrites the previous texture, and the process repeats until one element remains. (A kernel sketch follows below.)
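A sketch of the max kernel (GLSL embedded as a C++ string; the texel-addressing convention is illustrative and assumes a square source texture). Each fragment gathers a 2x2 block of the source and writes its max, so each pass shrinks the array by 4x:

const char* maxReduceShader =
    "uniform sampler2D data;   /* the n-element array, as a 2D texture */ \n"
    "uniform float texelSize;  /* 1.0 / source texture width           */ \n"
    "void main() {                                                       \n"
    "    vec2 t = gl_TexCoord[0].xy;                                     \n"
    "    /* gather a 2x2 block of the source and keep the max */         \n"
    "    vec4 a = texture2D(data, t);                                    \n"
    "    vec4 b = texture2D(data, t + vec2(texelSize, 0.0));             \n"
    "    vec4 c = texture2D(data, t + vec2(0.0, texelSize));             \n"
    "    vec4 d = texture2D(data, t + vec2(texelSize, texelSize));       \n"
    "    gl_FragColor = max(max(a, b), max(c, d));                       \n"
    "}                                                                   \n";

// Host side: ping-pong between two textures, halving each dimension
// per pass, until a 1x1 texture holds the overall maximum.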
Another Look at the Reduction Loop
Sorting on GPU
− Sort an array of n floats.
− CPU implementation: standard merge sort, O(n lg n).
− GPU implementation: bitonic merge sort, which runs as O(lg² n) rendering passes (O(n lg² n) comparisons in total).
The Bitonic Merge Sort
− A classic (parallel) algorithm: repeatedly build bitonic lists and then sort them.
− A bitonic list is two monotonic lists concatenated together, one increasing and one decreasing.
  List A: (3, 4, 7, 8), monotonically increasing
  List B: (6, 5, 2, 1), monotonically decreasing
  List AB: (3, 4, 7, 8, 6, 5, 2, 1), bitonic
Similar to parallelizing the classic merge sort.
The Bitonic Sort
Initial array: 3 7 4 8 6 2 1 5
− 8x monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5)
− 4x bitonic lists: (3,7) (4,8) (6,2) (1,5)
(A compare-exchange sketch of the full sort follows below.)
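A CPU reference of the bitonic sorting network, as a sketch of what the GPU version does (assuming n is a power of two). On the GPU, each inner (k, j) stage becomes one full-screen pass, with the i loop executed in parallel by the fragment processors:

#include <algorithm>
#include <cstddef>

void bitonicSort(float* a, std::size_t n) {  // n must be a power of two
    for (std::size_t k = 2; k <= n; k *= 2) {        // build bitonic lists of size k
        for (std::size_t j = k / 2; j > 0; j /= 2) { // then sort them
            for (std::size_t i = 0; i < n; ++i) {    // data-parallel on the GPU
                std::size_t partner = i ^ j;         // compare-exchange partner
                if (partner > i) {
                    bool ascending = ((i & k) == 0); // direction of this sublist
                    if ((a[i] > a[partner]) == ascending)
                        std::swap(a[i], a[partner]);
                }
            }
        }
    }
}

The two outer loops give the O(lg² n) passes counted on the previous slide.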