S9624: Performance Analysis of GPU-Accelerated Applications using the Roofline Model GTC 2019, San Jose Samuel Williams Charlene Yang Application Performance Specialist Senior Staff Scientist NERSC, LBNL CRD, LBNL swwilliams@lbl.gov cjyang@lbl.gov
You just bought a $10,000 throughput-optimized GPU! Are you making good use of your investment? 1
You could just run benchmarks § Imagine a mix of benchmarks or kernels… § GFLOP/s alone may not be particularly insightful GFLOP/s § Moreover, speedup relative to a Xeon may seem random Kernel (or apps) 2
Making good use of your GPU? 1. Are you operating it in the throughput-limited regime? Not sensitive to Amdahl effects o Not sensitive to D2H/H2D transfers o Not sensitive to launch overheads o Not sensitive to latencies o 2. If in the throughput-limited regime, are you making good use of the GPU’s compute and bandwidth capabilities? 3
The Roofline Model § Roofline Model is a throughput- oriented performance model § Premised on the interplay between FLOP/s, bandwidth, and reuse § Tracks rates not times § Independent of ISA and architecture (applies to CPUs, GPUs, Google TPUs, etc…) https://crd.lbl.gov/departments/computer-science/PAR/research/roofline Jouppi et al, “In-Datacenter Performance Analysis of a Tensor Processing Unit”, ISCA, 2017. 4
(DRAM) Roofline § One could hope to always attain peak performance (GFLOP/s) GPU (compute, GFLOP/s) § However, finite locality (reuse) and bandwidth limit performance. DRAM Bandwidth (GB/s) § Assume: DRAM Idealized processor/caches o (data, GB) Cold start (data in DRAM) o #FLOPs / Peak GFLOP/s Time = max #Bytes / Peak GB/s 5
(DRAM) Roofline § One could hope to always attain peak performance (GFLOP/s) GPU (compute, GFLOP/s) § However, finite locality (reuse) and bandwidth limit performance. DRAM Bandwidth (GB/s) § Assume: DRAM Idealized processor/caches o (data, GB) Cold start (data in DRAM) o Peak GFLOP/s GFLOP/s = min AI * Peak GB/s Note, Arithmetic Intensity (AI) = FLOPs / Bytes (as presented to DRAM ) 6
Arithmetic Intensity § Arithmetic Intensity is the most important concept in Roofline. § Measure of data locality (data reuse) § Ratio of Total FLOPs performed to Total Bytes moved § For the DRAM Roofline… Total Bytes to/from DRAM and includes all cache and prefetcher effects o Can be very different from total loads/stores (bytes requested) due to cache reuse o 7
(DRAM) Roofline § Plot Roofline bound using Arithmetic Intensity as the x-axis Peak GFLOP/s § Log-log scale makes it easy to Attainable GFLOP/s doodle, extrapolate performance along Moore’s Law, etc… § Kernels with AI less than DRAM-bound Compute-bound machine balance are ultimately DRAM bound (we’ll refine this Arithmetic Intensity (FLOP:Byte) later…) Transition @ AI == Peak Gflop/s / Peak GB/s == ‘Machine Balance’ 8
Example § Consider 3 kernels (A,B,C) calculate or measure the Arithmetic o C Intensity for each Peak GFLOP/s B Attainable GFLOP/s Determine the Roofline intercept for o each kernel Ø kernels A and B are bound by A memory bandwidth Ø kernel C is bound by peak FLOP/s Arithmetic Intensity (FLOP:Byte) 9
Scaling to Future GPUs § Imagine you run on a future GPU C 2x GFLOP/s with twice the peak FLOPs… B Ø kernel C’s performance could double Attainable GFLOP/s kernels A and B will be no faster ✘ A Arithmetic Intensity (FLOP:Byte) 10
Scaling to Future GPUs § What if that future GPU also C 2x GFLOP/s doubled its memory bandwidth… B Ø kernel A and B’s performance could Attainable GFLOP/s also double A Arithmetic Intensity (FLOP:Byte) 11
Why is Roofline Useful? § Think back to our mix of loop nests where GFLOP/s alone wasn’t useful… GFLOP/s Kernel (or apps) 12
Why is Roofline Useful? § We can sort kernels by AI … GFLOP/s Arithmetic Intensity (FLOP:Byte) 13
Why is Roofline Useful? § We can sort kernels by AI … § … and compare performance Peak GFLOP/s relative to machine capabilities GFLOP/s Arithmetic Intensity (FLOP:Byte) 14
Why is Roofline Useful? § Kernels near the roofline are making good use of Peak GFLOP/s computational resources… Ø kernels can have low performance 50% of Peak GFLOP/s (GFLOP/s), but make good use of a machine Ø kernels can have high performance (GFLOP/s), but make poor use of a machine Arithmetic Intensity (FLOP:Byte) 15
Can Performance Be Below Roofline? § Analogous to asking whether one can always attain either… Peak GFLOP/s Peak Bandwidth o Attainable GFLOP/s Peak GFLOP/s o § Sure, there can be other performance bottlenecks… Cache bandwidth / locality o Lack of FMA / tensor instructions o Arithmetic Intensity (FLOP:Byte) Thread divergence / predication o Too many non-FP instructions o … o 16
Cache Effects… § Hierarchical Roofline Model § Construct superposition of Peak GFLOP/s Rooflines… Attainable GFLOP/s Measure AI and bandwidth for each o level of memory/cache Loop nests will have multiple AI’s and o L2 Bound multiple performance bounds… L2 AI*BW is less than … but performance is ultimately the o DDR AI*BW minimum of these bounds. Arithmetic Intensity (FLOP:Byte) 17
Cache Effects… § Hierarchical Roofline Model § Construct superposition of Peak GFLOP/s Rooflines… Attainable GFLOP/s Measure AI and bandwidth for each o level of memory/cache Loop nests will have multiple AI’s and o multiple performance bounds… … but performance is ultimately the o minimum of these bounds. § Extend to other memories… Arithmetic Intensity (FLOP:Byte) L1 / Shared o System o 18
Insights – Exploiting Caches § Widely separated Arithmetic Intensities indicate high reuse in Peak GFLOP/s the cache Attainable GFLOP/s High Reuse Arithmetic Intensity (FLOP:Byte) 19
Insights – Exploiting Caches § Widely separated Arithmetic Intensities indicate high reuse in Peak GFLOP/s the cache Attainable GFLOP/s § Similar Arithmetic Intensities indicate effectively no cache reuse ( == streaming ) § As one changes problem size, no reuse (streaming) L2 and DRAM arithmetic intensities can behave very Arithmetic Intensity (FLOP:Byte) differently 20
Failure to Exploit CISC Instructions § Death of Moore’s Law is motivating a return of Complex Instruction Set Computing (CISC) § Modern CPUs and GPUs are increasingly reliant on special (fused) instructions that perform multiple operations. FMA (Fused Multiply Add): z=a*x+y … z,x,y are vectors or scalars o 4FMA (quad FMA): z=A*x+z … A is a FP32 matrix; x,z are vectors o HMMA (Tensor Core): Z=AB+C …Z,A,B,C are FP16 matrices o … o Ø Performance is now a weighted average of Mul/Add, FMA, and HMMA operations. 21
Failure to Exploit CISC Instructions § Total lack of FMA reduces Volta performance by 2x… FMA.f64 Peak creates ADD.f64 ceiling o Attainable GFLOP/s Partial FMA ADD.f64 Ceiling § In reality, applications are a mix of FMA.f64, ADD.f64, and MUL.f64… Performance is a weighted average o Ø Produces a partial FMA ceiling that Arithmetic Intensity (FLOP:Byte) bounds kernel performance 22
Failure to Exploit CISC Instructions § On Volta, Tensor cores provide 125 TFLOPs of FP16 performance (vs. 15 for FP32) HMMA.f16 Peak Attainable GFLOP/s § However, kernels/apps will mix Partial HMMA Ceiling HMMA with FMA, MULs, ADDs, … ADD.f32 Ceiling Ø A few non-HMMA operations can quickly limit Tensor core performance Arithmetic Intensity (FLOP:Byte) 23
Using Roofline To Drive Optimization
Driving Performance Optimization § Broadly speaking, there are three approaches to improving Peak GFLOP/s performance: No FMA GFLOP/s Arithmetic Intensity (FLOP:Byte) 25
Driving Performance Optimization § Broadly speaking, there are three approaches to improving Peak GFLOP/s performance: No FMA § Maximize SM performance GFLOP/s (e.g. minimize predication) Current AI Arithmetic Intensity (FLOP:Byte) 26
Driving Performance Optimization § Broadly speaking, there are three approaches to improving Peak GFLOP/s performance: No FMA § Maximize SM performance (e.g. GFLOP/s minimize predication) § Maximize memory bandwidth Current AI (e.g. avoid pathological memory access patterns) Arithmetic Intensity (FLOP:Byte) 27
Driving Performance Optimization § Broadly speaking, there are three approaches to improving Peak GFLOP/s performance: No FMA § Maximize SM performance (e.g. GFLOP/s minimize predication) Compulsory AI § Maximize memory bandwidth Current AI (e.g. avoid pathological memory access patterns) Arithmetic Intensity (FLOP:Byte) § Minimize data movement (i.e. exploit reuse) 28
Estimating Arithmetic Intensity
DRAM vs L1 Arithmetic Intensity § Consider a 7-point constant GPU coefficient stencil… (compute, GFLOP/s) 7 FLOPs o 8 memory references (7 reads, 1 store) per point o AI = 0.11 FLOPs per byte (L1) o #pragma omp parallel for for(k=1;k<dim+1;k++){ DRAM Bandwidth for(j=1;j<dim+1;j++){ (GB/s) for(i=1;i<dim+1;i++){ new[k][j][i] = -6.0*old[k ][j ][i ] + old[k ][j ][i-1] DRAM + old[k ][j ][i+1] + old[k ][j-1][i ] (data, GB) + old[k ][j+1][i ] + old[k-1][j ][i ] + old[k+1][j ][i ]; }}} 30
Recommend
More recommend