How Powerful are GPUs?
Pat Hanrahan
Computer Science Department
Stanford University
Computer Forum 2007

Modern Graphics Pipeline
[Figure: pipeline stages Application, Command, Geometry, Rasterization, Texture, Fragment, Display]
A Pitch from 5 Years Ago …
Cinematic games and media drive GPU market
Current GPUs faster than CPUs (at graphics)
Gap between the GPU and the CPU increasing
Why? Efficiently use VLSI resources
Programmable GPUs ≈ Stream processors
Many applications map to stream processing
Therefore, a $50 high-performance, massively parallel computer will soon ship with every PC
Pat Hanrahan, circa 2002-2005

What Happened?
Now
  AMD and Intel gave up on sequential CPUs with high clock rates and went multi-core (2-4)
  Gap between GPU and CPU stabilized
  GPUs are data parallel (64-128 cores)
  DX10 mandates unified graphics pipeline
  GPGPU: many algorithms implemented
Future
  Two main types of processors
    CPU: fast sequential processor
    GPU: fast data parallel processor
  Hybrid CPU/GPU
Overview
Current programmable GPUs
Performance
Programming model: Stream abstraction
Applications
How General?

Programmable GPUs
ATI R600 (X2X00)
• 80 nm process
• ~700 million transistors
• 64 4-wide unified shaders, ~700 MHz clock
• 512-bit GDDR memory
  GDDR3 @ 900 MHz = 115 GB/s
  GDDR4 @ 1100 MHz = 140 GB/s
• 230 Watts
(Image: R300, not R600)

NVIDIA G80 (8800)
• 90 nm TSMC process
• 681 million transistors
• 480 mm^2
• 128 scalar processors, 1.3 GHz clock rate
• 384-bit GDDR memory
  GDDR3 @ 900 MHz = 86.4 GB/s
• 130 Watts
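The bandwidth figures on these two slides follow from bus width and memory clock. The sketch below reproduces that arithmetic. The factor of 2 transfers per clock (double data rate) is not stated on the slides and is an assumption here, and the function name `peak_bandwidth_gb_s` is made up for illustration.

```python
def peak_bandwidth_gb_s(bus_width_bits, memory_clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=2 assumes double-data-rate signalling; the slides
    only give bus width and clock, so this factor is an assumption.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# Slide values: bus width in bits, memory clock in MHz.
r600_gddr3 = peak_bandwidth_gb_s(512, 900)    # slide quotes 115 GB/s
r600_gddr4 = peak_bandwidth_gb_s(512, 1100)   # slide quotes 140 GB/s
g80_gddr3 = peak_bandwidth_gb_s(384, 900)     # slide quotes 86.4 GB/s

for name, value in [("R600 GDDR3", r600_gddr3),
                    ("R600 GDDR4", r600_gddr4),
                    ("G80 GDDR3", g80_gddr3)]:
    print(f"{name}: {value:.1f} GB/s")
```

Under that assumption the R600 numbers come out slightly above the rounded values on the slide, and the G80 number matches the quoted 86.4 GB/s.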
GeForce 8800 Series GPU
[Figure: block diagram. Host; Input Assembler; Rasterization; Vertex, Geometry, and Pixel Thread issue; Thread Processor; arrays of SP and TF units with L1 caches; L2 caches; FB partitions]

Shader Model 4.0 Architecture
[Figure: shader program model. Blocks: Input, Parameters, Registers, Program, Textures, Output. Sizes recoverable from the diagram: Parameters 64K 32-bit; Program 64K insts; Output 8 4x32-bit; the label "32 4x32-bit" appears twice, next to Input and Registers]
Simple Graphics Pipeline

# c[0-3] = modelview projection (composite) matrix
# c[4-7] = modelview inverse transpose
# c[32] = eye-space light direction
# c[33] = constant eye-space half-angle vector
# c[35].x = pre-multiplied diffuse light color & diffuse mat.
# c[35].y = pre-multiplied ambient light color & diffuse mat.
# c[36] = specular color; c[38].x = specular power
DP4 o[HPOS].x, c[0], v[OPOS];      # Transform position.
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
DP3 R0.x, c[4], v[NRML];           # Transform normal.
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];
DP3 R1.x, c[32], R0;               # R1.x = L DOT N'
DP3 R1.y, c[33], R0;               # R1.y = H DOT N'
MOV R1.w, c[38].x;                 # R1.w = specular power
LIT R2, R1;                        # Compute lighting
MAD R3, c[35].x, R2.y, c[35].y;    # diffuse + ambient
MAD o[COL0].xyz, c[36], R2.z, R3;  # + specular
END

G80 = Data Parallel Computer
[Figure: block diagram. Host; Input Assembler; Thread Execution Manager; SIMD cores, each with a Parallel Data Cache; Load/store; Global Memory]
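The vertex program on the Simple Graphics Pipeline slide can be read as ordinary arithmetic. The sketch below restates it in Python as a reading aid, not as the hardware's implementation. The helper names `dot`, `lit`, and `vertex_program` are made up for illustration, the constant bank `c` is modelled as a dictionary, and `lit` is a simplified model of the LIT instruction that ignores any clamping of the specular power.

```python
def dot(a, b, n):
    """Dot product over the first n components (DP3 -> n=3, DP4 -> n=4)."""
    return sum(a[i] * b[i] for i in range(n))


def lit(n_dot_l, n_dot_h, power):
    """Simplified LIT: returns (1, diffuse, specular, 1)."""
    diffuse = max(n_dot_l, 0.0)
    # Specular is only non-zero when the surface faces the light.
    specular = max(n_dot_h, 0.0) ** power if n_dot_l > 0.0 else 0.0
    return (1.0, diffuse, specular, 1.0)


def vertex_program(c, position, normal):
    """c maps constant-register index -> 4-component list, as in the listing."""
    # DP4 x4: transform the object-space position by rows c[0..3].
    hpos = [dot(c[i], position, 4) for i in range(4)]
    # DP3 x3: transform the normal by rows c[4..6].
    n_eye = [dot(c[4 + i], normal, 3) for i in range(3)]
    # DP3 x2 and MOV: L.N', H.N', and the specular power.
    n_dot_l = dot(c[32], n_eye, 3)
    n_dot_h = dot(c[33], n_eye, 3)
    _, diffuse, specular, _ = lit(n_dot_l, n_dot_h, c[38][0])
    # MAD R3: diffuse term plus ambient term (scalars, broadcast to all channels).
    r3 = c[35][0] * diffuse + c[35][1]
    # MAD o[COL0].xyz: add the specular color scaled by the specular factor.
    color = [c[36][k] * specular + r3 for k in range(3)]
    return hpos, color
```

Each vertex is processed independently of every other vertex, which is what lets the hardware run many copies of this program in parallel.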
G80 “core”
Each core
• 8 functional units
• SIMD 16/32 “warp”
• 8-10 stage pipeline
• Thread scheduler
• 128-512 threads/core
• 16 KB shared memory (Parallel Data Cache)
Total #threads/chip: 16 * 512 = 8K

GPU Multi-threading (version 1)
Change threads each cycle (round robin)
[Figure: timeline of frag1 through frag4 interleaved across instr1, instr2, instr3]
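A minimal sketch of the round-robin policy in version 1, assuming four fragment threads that each run the same three instructions, as in the figure. The function name `round_robin_issue_order` is hypothetical, and the model ignores pipeline depth and stalls; it only shows the issue order. The thread-count arithmetic uses the 16 cores and 512 threads per core from the slide.

```python
CORES_PER_CHIP = 16          # from the slide
MAX_THREADS_PER_CORE = 512   # from the slide


def round_robin_issue_order(threads, num_instructions):
    """Return (thread, instruction) pairs in issue order.

    One instruction issues per cycle, and the scheduler moves to the next
    thread every cycle, so a thread's next instruction issues only after
    every other thread has had a turn.
    """
    order = []
    for instr in range(1, num_instructions + 1):
        for thread in threads:
            order.append((thread, f"instr{instr}"))
    return order


total_threads = CORES_PER_CHIP * MAX_THREADS_PER_CORE  # the slide's "8K"

order = round_robin_issue_order(["frag1", "frag2", "frag3", "frag4"], 3)
for cycle, (thread, instr) in enumerate(order):
    print(cycle, thread, instr)
```

With this policy, consecutive instructions of one thread are separated by as many cycles as there are threads, which is how a core with enough resident threads can hide pipeline latency.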
GPU Multi-threading (version 2)
Change thread after texture fetch/stall
[Figure: timeline of frag1 through frag4; each thread runs until it stalls at a texture fetch (multiple instructions)]

8800GTX Peak Performance
575 MHz * 128 processors * 2 flop/inst * 2 inst/clock = 332.8 GFLOPS
(2 flop/inst: MAD instruction)
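The peak-performance product can be checked directly. The sketch below evaluates the factors exactly as they appear on the slide, and also evaluates 128 processors issuing one MAD (2 flops) per clock at the 1.3 GHz clock rate listed on the NVIDIA G80 slide. Only the second product equals the quoted 332.8 GFLOPS, so the 575 MHz and 2 inst/clock factors look inconsistent with the stated total. The function name `peak_gflops` is made up for illustration.

```python
def peak_gflops(clock_mhz, processors, flops_per_instruction, instructions_per_clock=1):
    """Peak arithmetic throughput in GFLOPS from per-clock issue assumptions."""
    flops_per_second = (clock_mhz * 1e6 * processors
                        * flops_per_instruction * instructions_per_clock)
    return flops_per_second / 1e9


# Factors as written on the peak-performance slide.
as_written = peak_gflops(575, 128, 2, instructions_per_clock=2)

# Same processor count, one MAD per clock, at the 1.3 GHz clock from the
# G80 specification slide.
at_processor_clock = peak_gflops(1300, 128, 2)

print(f"as written:         {as_written:.1f} GFLOPS")
print(f"at 1.3 GHz, 1 MAD:  {at_processor_clock:.1f} GFLOPS")
```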
Instructions Issue Rate
http://graphics.stanford.edu/projects/gpubench/
[Figure: instruction issue rate chart, ATI X1900XTX vs. NVIDIA 7900GTX]

Instructions Issue Rate
http://graphics.stanford.edu/projects/gpubench/
[Figure: instruction issue rate chart, NVIDIA 7900GTX vs. NVIDIA 8800GTX]
Measured BLAS Performance
SAXPY
• X1900 (DX9): 6 GFlops
• X1900 (CTM): 6 GFlops
• 8800GTX (DX9): 12 GFlops
SGEMV
• X1900 (DX9): 4 GFlops
• X1900 (CTM): 6 GFlops
• 8800GTX (DX9): 14 GFlops
SGEMM
• X1900 (DX9): 30 GFlops
• X1900 (CTM): 120 GFlops
• 8800GTX (DX9): 105 GFlops
• 3 GHz Core 2: 40 GFlops

Programming Abstractions
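For reference, the three BLAS kernels on the Measured BLAS Performance slide can be sketched as follows. This is a plain-Python illustration of what each kernel computes, not the benchmarked GPU code. The helper `flops_per_word` uses the conventional flop counts (2n, 2n², 2n³ for square problems of size n) and an idealized traffic model in which each operand is read or written exactly once; that traffic model is an assumption, chosen only to show why SGEMM can reach much higher rates than SAXPY or SGEMV.

```python
def saxpy(alpha, x, y):
    """y <- alpha * x + y (vector update)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def sgemv(a, x):
    """y <- A x (matrix-vector product, scaling factors omitted)."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def sgemm(a, b):
    """C <- A B (matrix-matrix product, scaling factors omitted)."""
    columns = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in columns]
            for row in a]


def flops_per_word(kernel, n):
    """Flops per word moved, assuming each operand is touched once."""
    if kernel == "saxpy":
        return (2 * n) / (3 * n)              # read x, read y, write y
    if kernel == "sgemv":
        return (2 * n * n) / (n * n + 2 * n)  # read A, read x, write y
    if kernel == "sgemm":
        return (2 * n ** 3) / (3 * n * n)     # read A, read B, write C
    raise ValueError(f"unknown kernel: {kernel}")


for kernel in ("saxpy", "sgemv", "sgemm"):
    print(kernel, round(flops_per_word(kernel, 1024), 2))
```

Under this model SAXPY and SGEMV perform at most about two flops per word moved, so they are limited by memory bandwidth, while SGEMM's ratio grows with n.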
Approach I
Run application using graphics library
Graphics library-based programming models
• NVIDIA’s Cg
• Microsoft’s HLSL
• OpenGL Shading Language
• RapidMind Sh [McCool et al. 2004]

Approach II
Map application to parallel computer
Communicating sequential processes (CSP)
• Threads: pthreads, Occam, UPC, …
• Message passing: MPI
Data parallel programming
• APL, SETL, S, Fortran90, …
• C* (*Lisp), NESL, …
Stream languages
• StreamIt, StreamC/KernelC
• MS Accelerator, CUDA, DPVM, PeakStream
Stream Programming Environment
Collections stored in memory
• Multidimensional arrays (stencils)
• Graphs and meshes (topology)
Data parallel operators
• Application: map
• Reductions: scan, reduce (fold)
• Communication: send, sort, gather, scatter
• Filter (|O|<|I|) and generate (|O|>|I|)

Brook
Ian Buck
PhD Thesis
Stanford University
Brook for GPUs: Stream computing on graphics hardware, I. Buck, T. Foley, D. Horn, J. Sugarman, K. Fatahalian, M. Houston, P. Hanrahan, SIGGRAPH 2004
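The data parallel operators listed on the Stream Programming Environment slide can be illustrated with sequential Python stand-ins. These are semantic sketches only: the function names are made up for illustration and are not Brook's API, the scan shown is the inclusive variant, and a real stream system would evaluate each operator in parallel.

```python
from functools import reduce as fold
from itertools import accumulate


def stream_map(kernel, *streams):
    """Application: apply a kernel independently to each element."""
    return [kernel(*elements) for elements in zip(*streams)]


def stream_reduce(op, stream, initial):
    """Reduction: combine all elements into one value."""
    return fold(op, stream, initial)


def stream_scan(op, stream):
    """Reduction: inclusive prefix combination of the elements."""
    return list(accumulate(stream, op))


def gather(stream, indices):
    """Communication: output[k] = stream[indices[k]] (indexed read)."""
    return [stream[i] for i in indices]


def scatter(values, indices, size, fill=0):
    """Communication: output[indices[k]] = values[k] (indexed write)."""
    out = [fill] * size
    for value, i in zip(values, indices):
        out[i] = value
    return out


def stream_filter(predicate, stream):
    """Filter: output has at most as many elements as the input."""
    return [x for x in stream if predicate(x)]


def add(p, q):
    return p + q


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(stream_map(add, a, b))
print(stream_reduce(add, a, 0))
print(stream_scan(add, a))
print(gather(b, [3, 0]))
print(scatter([7, 9], [2, 0], 4))
print(stream_filter(lambda x: x % 2 == 0, a))
```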
Brook Example

kernel void foo (float a<>, float b<>, out float result<>)
{
  result = a + b;
}

float a<100>;
float b<100>;
float c<100>;

foo(a,b,c);

Equivalent loop:

for (i=0; i<100; i++)
  c[i] = a[i]+b[i];

Classical N-Body Simulation
Stellar dynamics
• Gravitational acceleration
• Gravitational accel. + jerk
Molecular dynamics
• Implicit solvent models
• Lennard-Jones
• Coulomb
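As a rough illustration of the first stellar dynamics kernel, the sketch below computes all-pairs gravitational acceleration: the outer loop is a data parallel map over bodies, and the inner loop gathers every other body. The function name, the gravitational constant `g`, and the `softening` term (used to avoid division by zero for coincident bodies) are assumptions for illustration; the slide does not specify them.

```python
import math


def gravitational_acceleration(positions, masses, g=1.0, softening=0.0):
    """Acceleration on each body from all other bodies (O(N^2) pairs).

    positions: list of (x, y, z) tuples; masses: list of floats.
    g and softening are illustrative parameters, not from the slides.
    """
    accelerations = []
    for i, (xi, yi, zi) in enumerate(positions):   # independent per body
        ax = ay = az = 0.0
        for j, (xj, yj, zj) in enumerate(positions):
            if i == j:
                continue
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            r2 = dx * dx + dy * dy + dz * dz + softening * softening
            scale = g * masses[j] / (r2 * math.sqrt(r2))  # G m_j / r^3
            ax += scale * dx
            ay += scale * dy
            az += scale * dz
        accelerations.append((ax, ay, az))
    return accelerations


# Two bodies one unit apart on the x axis.
acc = gravitational_acceleration([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 2.0])
print(acc)
```

Each body's result depends only on reads of the other bodies' positions and masses, so the outer loop maps directly onto one kernel invocation per body.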
Folding@Home Performance
Vijay Pande Group
GROMACS on Brook
GPU : CPU core = 40:1
CPU: 3.0 GHz P4
GPU: ATI X1900X

Current Statistics: March 19, 2007

Client type       Current TFLOPS*   Current Processors
Windows           150               157457
Mac OS X/PPC      7                 8710
Mac OS X/Intel    7                 2520
Linux             34                24639
GPU               40                682
PS/3              26                877
Total             223               1824132

*TFLOPs is actual flops from software cores, not peak values
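One way to read the statistics table is to divide each client type's TFLOPS by its processor count. The sketch below does that with the per-client rows exactly as they appear on the slide; the names `CLIENT_STATS` and `gflops_per_processor` are made up for illustration, and the Total row is left out.

```python
# (current TFLOPS, current processors) per client type, from the slide.
CLIENT_STATS = {
    "Windows": (150, 157457),
    "Mac OS X/PPC": (7, 8710),
    "Mac OS X/Intel": (7, 2520),
    "Linux": (34, 24639),
    "GPU": (40, 682),
    "PS/3": (26, 877),
}


def gflops_per_processor(tflops, processors):
    """Average sustained GFLOPS contributed by one processor."""
    return tflops * 1000.0 / processors


per_client = {
    name: gflops_per_processor(tflops, processors)
    for name, (tflops, processors) in CLIENT_STATS.items()
}

for name, value in sorted(per_client.items(), key=lambda item: -item[1]):
    print(f"{name:15s} {value:6.2f} GFLOPS per processor")
```

On these numbers the GPU clients average a little under 60 GFLOPS each, while the Windows clients average just under 1 GFLOPS each.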
Folding@Home GPU Cluster
25 nodes
• Nforce4 SLI
• Dual core Opteron
• 2x ATI X1900XTX
• Linux
5 TFlops of folding “power”
(Image: not actual machine)

Future
Summary
Cinematic games and media drive GPU market
GPU evolving into a high throughput processor
• “Data parallel multi-threaded machine”
Many applications map to GPUs
Processor of the future likely to be a CPU/GPU
• Small number of traditional CPU cores
• Large number of GPU cores

Opportunities
Current hardware not optimal
• Incredible opportunity for architectural innovation
Current software environment immature
• Incredible opportunity for reinventing parallel computing software, programming environments and languages
Acknowledgements
Bill Dally
Ian Buck
Eric Darve
Mattan Erez
Vijay Pande
Kayvon Fatahalian
Bill Mark
Tim Foley
John Owens
Daniel Horn
Kurt Akeley
Michael Houston
Mark Horowitz
Jeremy Sugarman
Funding: DARPA, DOE, ATI, IBM, NVIDIA, SONY

Questions?