  1. Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories. Khaled Ibrahim, Sam Williams, and Leonid Oliker. Bench'19 Conference, Nov. 14, 2019, Denver, Colorado. University of California.

  2. Acknowledgements
  § This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
  § This material is based upon work supported by the DOE RAPIDS SciDAC Institute.
  § This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under Contract DE-AC02-05CH11231.
  § This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

  3. Layout and Contributions
  Contributions:
  • The roofline scaling trajectory technique for performance analysis on GPU architectures.
  • Comparison between the CUDA and OpenACC models for GPU offloading.
  • Case studies of tuning for performance (> 2x improvement).
  [Figures: a scaling plot of MFlop/s vs. SM count (2 to 80) for Classes A, B, and C, and an annotated roofline trajectory for SMs = 2, 32, and 80. Annotations: potential throughput improvement with AI degradation; measured < potential indicates loss of occupancy while scaling; inefficiency at low SM count is typically correlated with low warp efficiency; AI degradation is due to excessive HBM data movement to the L2 cache.]

  4. Performance Analysis: Micro vs. Macro
  Micro analysis (typically attributed to a particular code site): miss rate, vectorization, load imbalance. But are they actually impacting performance?
  Macro analysis (model-based): are we utilizing resources effectively? Roofline technique, LogGP, etc. Clear performance expectation, but no source-code attribution.
  Vendor hardware events: could serve both micro and macro analyses, but they are numerous, hard to understand, and time consuming to collect.
  Example on the slide (sparse matrix-vector product):

      #pragma omp parallel for
      for (int i = 0; i < n; i++) {
          double s = 0;                            /* Is it vectorized? */
          for (int j = ia[i]; j < ia[i+1]; j++)    /* Is it balanced?   */
              s += A[j] * x[ja[j]];
          y[i] = s;
      }

  5. Roofline Performance Model (Empirical)
  § Performance model bounds: DRAM bandwidth and peak Flop/s.
  § Typical HPC application characterization: performance in Flop/s, arithmetic intensity (AI) in Flop/Byte.
  § DRAM Roofline: define an AI for each level of cache/memory.
  § CARM Roofline (cache-aware roofline): define a single AI for all levels of memory, based on L1 data movement.
  [Figure: attainable Flop/s vs. arithmetic intensity (Flop:Byte), with ceilings for DRAM GB/s, L2 cache GB/s, and peak Flop/s separating the DRAM-bound and compute-bound regions.]
  Williams et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM, 2009.
  Aleksandar Ilic et al., "Cache-aware Roofline Model: Upgrading the Loft", IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 21-24, January 2014.
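  A note beyond the slide: the empirical DRAM roofline bound is simply min(peak Flop/s, AI x DRAM bandwidth). The C sketch below illustrates the calculation; the peak and bandwidth values are assumed, roughly V100-class numbers, not measured ceilings from the paper.

      #include <stdio.h>

      /* Empirical roofline: attainable Flop/s is limited either by the
       * memory ceiling (AI x bandwidth) or by the compute ceiling. */
      double roofline_gflops(double ai_flop_per_byte,
                             double peak_gflops,
                             double mem_bw_gb_per_s)
      {
          double mem_bound = ai_flop_per_byte * mem_bw_gb_per_s;
          return mem_bound < peak_gflops ? mem_bound : peak_gflops;
      }

      int main(void)
      {
          /* Assumed V100-class ceilings, for illustration only:
           * ~7800 GFlop/s double precision, ~900 GB/s HBM2. */
          double peak = 7800.0, bw = 900.0;
          for (double ai = 0.01; ai <= 16.0; ai *= 2.0)
              printf("AI = %6.2f Flop/Byte -> bound = %7.1f GFlop/s\n",
                     ai, roofline_gflops(ai, peak, bw));
          return 0;
      }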

  6. GPU Parallelism and Performance
  Multiple levels of parallelism:
  • Warp level: ideally, threads execute the same instruction.
  • Thread-block level: cooperative execution.
  • SM level: resource sharing to hide latency; the occupancy of multiple blocks depends on their resource requirements.
  • GPU level: occupancy scaling depends on the level of parallelism.
  Performance-influencing factors: warp efficiency, occupancy, locality (especially spatial, not necessarily temporal), etc.
  [Figure: GPU hierarchy; each streaming multiprocessor (SM) holds warps, a register file, and cache/shared memory, and all SMs share an L2 cache and global memory (HBM).]
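  As an illustration of this hierarchy (hypothetical code, not from the slides), the CUDA sketch below launches a trivial kernel and reports how the launch decomposes into thread blocks and warps of 32; how many of those blocks an SM can co-schedule is then determined by their register and shared-memory footprint.

      #include <cstdio>

      __global__ void scale(float *y, const float *x, float a, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
          if (i < n)
              y[i] = a * x[i];
      }

      int main()
      {
          int n = 1 << 20;
          float *x, *y;
          cudaMallocManaged(&x, n * sizeof(float));
          cudaMallocManaged(&y, n * sizeof(float));

          int threadsPerBlock = 256;   /* 256 / 32 = 8 warps per block */
          int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
          scale<<<blocks, threadsPerBlock>>>(y, x, 2.0f, n);
          cudaDeviceSynchronize();

          printf("%d blocks of %d threads (%d warps per block)\n",
                 blocks, threadsPerBlock, threadsPerBlock / 32);
          cudaFree(x);
          cudaFree(y);
          return 0;
      }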

  7. Warp Efficiency
  "The ratio of the average active threads per warp to the maximum number of threads per warp."
  Warp level: 32 threads per warp.
  nvprof: event-based warp efficiency accounts only for control divergence.
  Control divergence: threads of the same warp take different branches, so the warp serializes both paths (X, Y and A, B) before reconverging at Z:

      if (threadIdx.x < N) {
          X;
          Y;
      } else {
          A;
          B;
      }
      Z;

  Latency divergence: threads of the same warp execute loops with different trip counts, e.g. rows of different lengths in SpMV:

      i = map_func(thread, block);
      s = 0;
      for (int j = ia[i]; j < ia[i+1]; j++)
          s += A[j] * x[ja[j]];
      y[i] = s;

  8. Occupancy
  "The ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor."
  nvprof: reports active occupancy (i.e., while the GPU is actively executing a kernel).
  Thread-block/SM level: 64 warps per SM.
  GPU level: 80+ SMs per Volta GPU.
  Resource sharing impacting occupancy: block-level resources (register file, shared memory).
  [Figure: the same SM and memory-hierarchy diagram as before.]
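  For reference (an illustrative sketch, not from the slides), the CUDA runtime can report the theoretical occupancy it predicts for a kernel at a given block size; the placeholder kernel and the block size of 256 below are hypothetical choices.

      #include <cstdio>

      __global__ void dummy(float *y) { y[threadIdx.x] = 0.0f; }  /* placeholder kernel */

      int main()
      {
          int numBlocksPerSm = 0, threadsPerBlock = 256;
          cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, dummy,
                                                        threadsPerBlock,
                                                        0 /* dynamic shared mem */);

          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, 0);
          int activeWarps = numBlocksPerSm * threadsPerBlock / prop.warpSize;
          int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;  /* 64 on Volta */
          printf("theoretical occupancy at this block size: %.0f%%\n",
                 100.0 * activeWarps / maxWarps);
          return 0;
      }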

  9. Scaling Curves
  Volta GPU: it is possible to control the SM count.
  NAS Parallel Benchmarks: LU, CUDA implementation.
  A typical scaling plot shows performance as the SM count changes, but gives no insight into root causes: why does Class B scale better than Class A, while Class C is not better than Class B?
  [Figure: MFlop/s vs. SM count (2 to 80) for Classes A, B, and C.]

  10. Roofline Strong Scaling Trajectory
  Roofline scaling trajectories: a diagnostic technique for scaling issues. Track performance on the roofline plot while changing the level of concurrency.
  Ideal behavior:
  • Δy = increase in computational resources or share of bandwidth.
  • Δx = 0 (no change in arithmetic intensity).
  [Figure: trajectory of roofline points at SMs = 2, 32, and 80 moving vertically between the SM-dependent ceilings.]

  11. Typical Strong Scaling Trajectory
  The arithmetic intensity is derived from measured DRAM traffic:

      AI = canonical flop count / ((dram_read_transactions + dram_write_transactions) × 32 bytes)

  Annotations on the trajectory (SMs = 2, 32, 80):
  • Potential throughput improvement with AI degradation.
  • Measured < potential indicates loss of occupancy while scaling.
  • Inefficiency at low SM count is typically correlated with low warp efficiency.
  • AI degradation is due to excessive HBM data movement to the L2 cache.
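  A small C sketch of how each trajectory point can be computed from the counters above; the field names mirror nvprof's dram_read_transactions and dram_write_transactions (32-byte transactions each), and the numbers in main are hypothetical, for illustration only.

      #include <stdio.h>

      /* One trajectory point: arithmetic intensity and throughput for a run
       * at a given SM count, from the canonical flop count, measured DRAM
       * transactions, and the kernel time. */
      typedef struct {
          int    sm_count;
          double flops;            /* canonical flop count for the class      */
          double dram_read_trans;  /* nvprof dram_read_transactions           */
          double dram_write_trans; /* nvprof dram_write_transactions          */
          double seconds;          /* kernel time                             */
      } run_t;

      void trajectory_point(const run_t *r, double *ai, double *gflops)
      {
          double bytes = (r->dram_read_trans + r->dram_write_trans) * 32.0;
          *ai     = r->flops / bytes;            /* x coordinate: Flop/Byte */
          *gflops = r->flops / r->seconds / 1e9; /* y coordinate: GFlop/s   */
      }

      int main(void)
      {
          /* Hypothetical counter values, for illustration only. */
          run_t runs[] = {
              {  2, 1.0e12, 6.0e9, 3.0e9, 10.0 },
              { 32, 1.0e12, 6.5e9, 3.5e9,  1.2 },
              { 80, 1.0e12, 8.0e9, 4.0e9,  0.7 },
          };
          for (int i = 0; i < 3; i++) {
              double ai, gflops;
              trajectory_point(&runs[i], &ai, &gflops);
              printf("SMs=%2d  AI=%.3f Flop/Byte  %.1f GFlop/s\n",
                     runs[i].sm_count, ai, gflops);
          }
          return 0;
      }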

  12. Evaluation Testbed
  Platform: OLCF Summit, POWER9 + V100 node.
  NAS Parallel Benchmarks: FT (spectral methods), CG (sparse linear algebra), MG (multi-grid PDE), LU (regular-sparse lower and upper triangular system), BT and SP (mini-app adaptive mesh benchmarks).
  Programming models: CUDA, OpenACC.
  Problem sizes: Classes A, B, and C (4x problem-size increase from class to class).
  Low-level tools: nvprof for data movement; the flop count is based on an application estimate (constant per class while scaling).

  13. GPU Programming Model Influence
  Pragma-based languages (OpenACC, OpenMP):
  • Incremental source-code annotation.
  • Preserve the data layout.
  • Same loop structure for CPU and GPU.
  Vendor programming language (CUDA):
  • Full code rewrite.
  • Full leverage of architectural features (e.g., texture, shared memory, etc.).
  • Possible change of data layout.
  [Figure: the same kernels/loops offloaded under both models.]
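  As an illustration of this contrast (hypothetical code, not from the NPB implementations used in the paper), the same vector-add loop offloaded both ways:

      /* Pragma-based (OpenACC): annotate the existing CPU loop. */
      void vadd_acc(int n, const float *x, const float *y, float *z)
      {
          #pragma acc parallel loop copyin(x[0:n], y[0:n]) copyout(z[0:n])
          for (int i = 0; i < n; i++)
              z[i] = x[i] + y[i];
      }

      /* Vendor language (CUDA): rewrite the loop body as a kernel and
       * manage the launch configuration explicitly. */
      __global__ void vadd_kernel(int n, const float *x, const float *y, float *z)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n)
              z[i] = x[i] + y[i];
      }

      void vadd_cuda(int n, const float *x, const float *y, float *z)  /* device pointers */
      {
          int threads = 256;
          int blocks = (n + threads - 1) / threads;
          vadd_kernel<<<blocks, threads>>>(n, x, y, z);
          cudaDeviceSynchronize();
      }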

  14. Roofline Scaling Trajectories: Case Studies
  Good scaling (e.g., ACC MG): locality is preserved at high concurrency; some influence of problem size on locality.
  Needs improvement (e.g., ACC FT): scaling problem due to occupancy.
  [Figures: trajectories for Classes A, B, and C between the ceilings labeled HBM(SMs=2) (35) and HBM(SMs=80) (829), over arithmetic intensities of 0.01 to 1.00 Flop/Byte.]

  15. GPU Scaling Trajectories
  Scaling plot vs. roofline scaling trajectory.
  [Figures: left, MFlop/s vs. SM count (2 to 80) for Classes A, B, and C; right, the roofline trajectory with ceilings ADD(SMs=2) (88), ADD(SMs=80) (3536), HBM(SMs=2) (35), and HBM(SMs=80) (829), over arithmetic intensities of 0.01 to 5.00 Flop/Byte.]

  16. Scaling of NAS BT
  Behavior at low concurrency (warp efficiency) vs. occupancy; influence of the programming model.
  [Figures: BT OpenACC implementation (left) and BT CUDA implementation (right) for Classes A, B, and C, with ceilings HBM(SMs=2) (35) and HBM(SMs=80) (829); an annotation on the OpenACC plot marks the CUDA AI for comparison.]
