Guided Profiling for Auto-Tuning Array Layouts on GPUs Nicolas Weber, Sandra C. Amend and Michael Goesele TU Darmstadt
Motivation
• Memory access is one of the most important performance factors in CUDA applications
• CUDA Programming Guide
  • One of its three basic optimization strategies is to “Optimize memory usage to achieve maximum memory throughput”
• Performance differs by up to an order of magnitude between the best and the worst implementation
• Experience alone does not guarantee finding the optimal configuration
Motivation
• Tedious to optimize in big GPU applications
  • Layouts: Array of Structs (AoS), Structure of Arrays (SoA), AoSoA
  • Transpositions of multi-dimensional arrays
  • Size of L1 cache / shared memory
  • Memory placement: global, texture, shared, local and constant memory
• Changing GPU architectures require re-optimization
  • The memory hierarchy has changed with every architecture
Automated optimization for most GPUs and algorithms
• We develop an open source auto-tuner that automatically optimizes array access in CUDA applications (with minimal programming overhead)
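The layouts named on this slide differ only in how an (element, field) pair is mapped to a memory offset. The following is a minimal sketch of those mappings, not the auto-tuner's code. It assumes all fields have the same size, so offsets are counted in fields rather than bytes, and the function names and the `tile` parameter are made up for illustration.

```python
def aos_index(i, field, num_fields):
    """Array of Structs: all fields of element i are stored next to each other."""
    return i * num_fields + field


def soa_index(i, field, num_elements):
    """Structure of Arrays: each field is stored as one contiguous array."""
    return field * num_elements + i


def aosoa_index(i, field, num_fields, tile):
    """AoSoA: elements are grouped into tiles of `tile` elements; SoA inside each tile."""
    block, lane = divmod(i, tile)
    return (block * num_fields + field) * tile + lane


# Offsets touched when four consecutive threads read field 0 of elements 0..3
# of a struct with 4 fields (8 elements in total, tile size 4).
aos_offsets = [aos_index(i, 0, 4) for i in range(4)]
soa_offsets = [soa_index(i, 0, 8) for i in range(4)]
aosoa_offsets = [aosoa_index(i, 0, 4, 4) for i in range(4)]
print(aos_offsets, soa_offsets, aosoa_offsets)
```

With AoS the four threads touch offsets that are `num_fields` apart, while SoA and AoSoA give adjacent offsets, which is the property that matters for coalesced access on a GPU.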
What is the optimal configuration for a kernel?
• Difficult to find an analytical solution
  • Memory access can be sensitive to the input data
  • Different optima for varying input data
  • Many GPU architectures with different memory hierarchies
Empirical profiling
• Requires compiling and executing many different implementations
• Very time intensive
High Dimensionality
[Diagram: tree of tunable dimensions. Application; Kernel A, Kernel B, Kernel C; function optimizations: L1 cache size; Array A, Array B, Array C, each with layout, memory and transposition]
• Up to several million configurations!
• $1\,000\,000 \cdot \left( \frac{5\,\text{s (compilation time)}}{16\ \text{(cores)}} + 0.5\,\text{s (execution time)} \right) \geq 9\ \text{days}$
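The cost estimate on this slide can be reproduced with a few lines of arithmetic. The sketch below uses the slide's figures (one million configurations, 5 s of compilation spread over 16 cores, 0.5 s of execution); the function and parameter names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def exhaustive_profiling_days(num_configs, compile_s, cores, exec_s):
    """Days needed to compile (in parallel on `cores`) and run every configuration."""
    seconds_per_config = compile_s / cores + exec_s
    return num_configs * seconds_per_config / SECONDS_PER_DAY


days = exhaustive_profiling_days(1_000_000, compile_s=5.0, cores=16, exec_s=0.5)
print(f"{days:.1f} days")  # prints "9.4 days"
```

Each configuration costs 0.8125 s under these assumptions, so one million of them take 812,500 s, a little more than nine days.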
Measured Kernel Execution Time
[Plot: measured time in ms (0 to 560) for 5184 configurations ordered by runtime; annotated groups: A == 0 && B == 0, A != B, A == 1 && B == 1; legend: Measured]
Toy Example: Performance Estimation 2D
[Figure: grid of memory (Texture, Global) against layout (AoS, SoA, AoSoA); measured cells labeled 7ms, 5ms and 4ms]
Toy Example: Performance Estimation 2D
• Predicted execution time
  • Execution time of Base + Sum( Δ(Base, Support Configurations) )
[Figure: grid of memory (Texture, Global) against layout (AoS, SoA, AoSoA); cells labeled 7ms, 6ms, 5ms and 4ms, with deltas +2ms and -1ms]
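The prediction rule "Base + Sum( Δ(Base, Support Configurations) )" can be sketched as follows. This is an illustration, not the authors' implementation. The assignment of the slide's numbers to grid cells is an assumption: Global/AoS is taken as the base (5 ms), Texture/AoS (7 ms) and Global/SoA (4 ms) as support configurations. All names are hypothetical.

```python
def predict(base_config, base_time, support_times, config):
    """Predict a configuration's time as base time plus one delta per changed dimension.

    support_times maps (dimension, value) to the measured time of the support
    configuration that differs from the base in exactly that dimension.
    """
    predicted = base_time
    for dim, value in config.items():
        if value != base_config[dim]:
            predicted += support_times[(dim, value)] - base_time
    return predicted


# Assumed reading of the toy example (times in ms).
base_config = {"memory": "Global", "layout": "AoS"}
base_time = 5.0
support_times = {
    ("memory", "Texture"): 7.0,  # +2 ms relative to the base
    ("layout", "SoA"): 4.0,      # -1 ms relative to the base
}

texture_soa = predict(base_config, base_time, support_times,
                      {"memory": "Texture", "layout": "SoA"})
print(texture_soa)  # prints 6.0
```

Under this reading, the unmeasured Texture/SoA cell is predicted as 5 ms + 2 ms - 1 ms = 6 ms, which matches the 6ms label on the slide.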
Toy Example: Performance Estimation 3D
[Figure: the grid of memory (Texture, Global) against layout (AoS, SoA, AoSoA), extended by a third dimension]
Non-Linear Relationship
• Not all configurations are linearly related to each other
• Shared dimensions
  • Affect all arrays
  • L1 cache size
• Independent dimensions
  • Only affect one array
  • Layout, memory and transposition
Toy Example: Prediction Domains
[Figure: three grids of memory (Texture, Global) against layout (AoS, SoA, AoSoA), one per L1 cache setting: EQ, SM and L1]
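One way to read the prediction domains is that every value of a shared dimension (here the L1 cache setting) gets its own base configuration and its own support configurations, so deltas are never carried across domains. The sketch below enumerates the configurations that would have to be measured under that reading. Picking the first value of each independent dimension as the base is an assumption made for illustration; the slides do not state how the base is chosen.

```python
from itertools import product


def plan_measurements(shared, independent):
    """List the base and support configurations to measure, one group per domain."""
    plan = []
    shared_names = list(shared)
    for shared_values in product(*(shared[name] for name in shared_names)):
        domain = dict(zip(shared_names, shared_values))
        # Assumption: the base uses the first value of every independent dimension.
        base = {**domain, **{dim: values[0] for dim, values in independent.items()}}
        plan.append(base)
        # One support configuration per alternative value of each independent dimension.
        for dim, values in independent.items():
            for value in values[1:]:
                plan.append({**base, dim: value})
    return plan


shared = {"l1_cache": ["EQ", "SM", "L1"]}
independent = {"memory": ["Global", "Texture"], "layout": ["AoS", "SoA", "AoSoA"]}

plan = plan_measurements(shared, independent)
total = 3 * 2 * 3
print(f"{len(plan)} of {total} configurations measured")
```

For the toy example this gives 4 measurements per domain and 12 in total, against 18 possible configurations. The saving grows quickly as more independent dimensions are added, because measurements grow with the sum of their sizes rather than the product.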
Real Example: Measured Time
[Plot: time in ms (0 to 560) for 5184 configurations ordered by runtime; legend: Measured]
Real Example: Base Configurations
[Same plot; legend: Measured, Base]
Real Example: Support Configurations
[Same plot; legend: Measured, Support, Base]
Real Example: Prediction
[Same plot; legend: Measured, Predicted, Support, Base]
Real Example: Prediction (zoom in)
[Plot: time in ms (65 to 90) for the 600 fastest configurations ordered by runtime; legend: Predicted, Measured]
Real Example: Prediction (zoom in)
[Plot: time in ms (65 to 90) for the 600 fastest configurations ordered by runtime; legend: Predicted, Measured, Best Predicted]
• Measured: 44 / 5184 (0.85%)
• Our result: 72.52ms (+3.59ms)
• Min: 68.93ms
• Max: 526.48ms
• Avg: 300.75ms
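The figures on this slide can be cross-checked with simple arithmetic. The sketch below only uses numbers from the slide; the relative gap to the optimum is derived from them and is not stated in the deck.

```python
measured, total = 44, 5184
result_ms, min_ms = 72.52, 68.93

measured_share = round(100 * measured / total, 2)        # percent of configurations measured
gap_ms = round(result_ms - min_ms, 2)                    # absolute gap to the best configuration
gap_percent = round(100 * (result_ms - min_ms) / min_ms, 1)  # derived relative gap

print(measured_share, gap_ms, gap_percent)  # prints 0.85 3.59 5.2
```

So the selected configuration is about 5% slower than the true optimum while fewer than 1% of all configurations were actually compiled and run.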
Evaluation
1. Benchmark: BitonicSort
struct { long a; int b; short c; char d; }
• Sorting for each field, A < B < C < D
• Values limited to 0…1023 to cause equal columns
• 2 Kernels
• 27 configurations
2. Benchmark: KD-Tree Builder
• 9 Kernels
• > 570k configurations
3. Benchmark: REYES
• 4 Kernels
• > 2.4M configurations
Profiling Algorithms
• Exhaustive Search [Muralidharan et al. 2014]
  • Tries all possible configurations
• Greedy Profiling [Liu et al. 2008]
  • Optimizes one dimension after another
• Evolutionary Algorithm [Jordan et al. 2012]
  • Starts with a random population of configurations
  • Good configurations are stored
  • Bad configurations are mutated, combined or randomly sampled
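As a point of comparison, greedy profiling can be sketched as below. This is one plausible reading of "one dimension after another", not the cited implementation. The `toy_measure` cost model is invented purely so the example runs; the +2 ms and -1 ms deltas echo the toy example, and the AoSoA value is made up.

```python
def greedy_profile(dimensions, measure):
    """Tune one dimension at a time, keeping the best value found before moving on."""
    config = {dim: values[0] for dim, values in dimensions.items()}
    best_time = measure(config)
    measurements = 1
    for dim, values in dimensions.items():
        for value in values[1:]:
            candidate = {**config, dim: value}
            candidate_time = measure(candidate)
            measurements += 1
            if candidate_time < best_time:
                config, best_time = candidate, candidate_time
    return config, best_time, measurements


def toy_measure(config):
    """Hypothetical stand-in for compiling and timing a kernel (times in ms)."""
    deltas = {"Texture": 2.0, "SoA": -1.0, "AoSoA": 1.0}  # invented cost model
    return 5.0 + sum(deltas.get(value, 0.0) for value in config.values())


dimensions = {"memory": ["Global", "Texture"], "layout": ["AoS", "SoA", "AoSoA"]}
best_config, best_time, measurements = greedy_profile(dimensions, toy_measure)
print(best_config, best_time, measurements)
```

Greedy search needs only 4 of the 6 possible measurements here, but it can miss the optimum when dimensions interact, because a value rejected early is never revisited.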
Evaluation
• Profiling Algorithms
  • Exhaustive Search (E)
  • Greedy Algorithm (G)
  • Evolutionary Algorithm (A)
  • Our Algorithm (P)
• GPUs
  • GeForce GTX980 (Maxwell)
  • Tesla K20 (Kepler)
• CUDA WatchDog: kills configurations whose execution time exceeds that of the best configuration found so far
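The effect of the WatchDog on profiling time can be illustrated with a small simulation. This is a sketch under assumptions: it models a killed configuration as costing exactly the best time found so far, ignores compilation time, and uses made-up runtimes rather than measurements from the deck.

```python
def profiling_time(runtimes, watchdog):
    """Return (best runtime, total time spent executing) for a sequence of configurations."""
    best = float("inf")
    total = 0.0
    for runtime in runtimes:
        if watchdog and runtime > best:
            # Killed as soon as it runs longer than the best configuration so far.
            total += best
        else:
            total += runtime
            best = min(best, runtime)
    return best, total


runtimes_ms = [100.0, 500.0, 80.0, 300.0]  # invented example values
best_plain, total_plain = profiling_time(runtimes_ms, watchdog=False)
best_wd, total_wd = profiling_time(runtimes_ms, watchdog=True)
print(best_plain, total_plain, best_wd, total_wd)
```

Both runs find the same best configuration, but the watchdog run spends far less time on slow candidates. The saving depends on evaluation order: the earlier a fast configuration is found, the more the later ones are cut short.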
Quality
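Execution Speed Up: GTX980 w/o WatchDog
[Bar chart: execution speed up (axis 1.00 to 1.50, higher is better) for Bitonic, KD-Tree and Reyes; series: E, G, A, P]
Execution Speed Up: Tesla K20 w/o WatchDog
[Bar chart: execution speed up (axis 1.00 to 1.50, higher is better) for Bitonic, KD-Tree and Reyes; series: E, G, A, P]
Execution Speed Up: Tesla K20 with WatchDog
[Bar chart: execution speed up (axis 1.00 to 1.50, higher is better) for Bitonic, KD-Tree and Reyes; series: E, EW, G, GW, P, PW]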
Speed Up
Profiling Speed Up: BitonicSort
[Bar chart: profiling speed up (higher is better) on GTX980 and K20; series: E, EW, G, GW, A, P, PW; bar labels: 1.03, 1.00, 1.00, 1.00, 1.00, 1.00, 0.96, 0.95, 0.94, 0.92, 0.91, 0.90, 0.85, 0.84]
Profiling Speed Up: KD-Tree Builder
[Bar chart: profiling speed up (higher is better) on GTX980 and K20; series: E, EW, G, GW, A, P, PW; bar labels: 1,588.2, 1,327.8, 956.4, 865.2, 387.5, 374.8, 335.6, 158.1, 138.6, 70.6, 1.0, 1.0, 1.0, 1.1]
Profiling Speed Up: REYES
[Bar chart: profiling speed up (higher is better) on GTX980 and K20; series: E, EW, G, GW, A, P, PW; bar labels: 9,395.9, 8,657.3, 3,547.5, 3,387.1, 3,111.3, 2,882.7, 1,900.5, 987.3, 930.3, 949.4, 1.0, 1.0, 1.0, 1.0]
Summary
• Introduced a prediction-guided profiling algorithm
  • up to 5.5x faster than other state-of-the-art methods
  • while achieving comparable results
  • up to 9300x faster than exhaustive search
    • 10 days 20 hours 1 minute 40 seconds
• Limitations
  • No global optimization: only one kernel is optimized at a time
Thank you for your attention! Source Code available @ http://tinyurl.com/matog (BSD 3-Clause license) Contact: matog@gris.tu-darmstadt.de