Guided Profiling for Auto-Tuning Array Layouts on GPUs
Nicolas Weber, Sandra C. Amend and Michael Goesele
TU Darmstadt
Motivation
• Memory access is one of the most important performance factors in CUDA applications
• CUDA Programming Guide: "Optimize memory usage to achieve maximum memory throughput" is one of its three basic optimization strategies
• Performance differences of up to an order of magnitude between the best and the worst implementation
• Experience alone does not guarantee finding the optimal configuration
Motivation
• Tedious to optimize in big GPU applications
  • Layouts: Array of Structs (AoS), Structure of Arrays (SoA), AoSoA
  • Transpositions of multi-dimensional arrays
  • Size of L1 cache / shared memory
  • Memory placement: global, texture, shared, local and constant memory
• Changing GPU architectures require re-optimization
  • The memory hierarchy has changed with every architecture
Automated optimization for most GPUs and algorithms
• We develop an open-source auto-tuner to automatically optimize array accesses in CUDA applications (with minimal programming overhead)
What is the optimal configuration for a kernel?
• Difficult to find an analytical solution
  • Memory access can be input-data sensitive
  • Different optima for varying input data
  • Many GPU architectures with different memory hierarchies
• Empirical profiling
  • Requires compiling & executing many different implementations
  • Very time intensive
High Dimensionality
[Figure: an application consists of several kernels (A, B, C); each kernel has function optimizations (L1 cache size) and accesses arrays (A, B, C), each with its own layout, memory and transposition optimizations]
• Up to several million configurations!
• 1,000,000 · (5 s compilation time / 16 cores + 0.5 s execution time) ≥ 9 days
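A quick check of this estimate (assuming, as the slide suggests, that the 5 s compile time per configuration is spread across the 16 CPU cores while the 0.5 s execution is serialized on the GPU):

    \[
      10^{6} \cdot \left( \frac{5\,\mathrm{s}}{16} + 0.5\,\mathrm{s} \right)
      = 812{,}500\,\mathrm{s} \approx 9.4\ \text{days}
    \]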
Measured Kernel Execution Time
[Plot: measured execution time (0–560 ms) of all 5184 configurations, ordered by runtime; clusters annotated with A == 0 && B == 0, A != B, and A == 1 && B == 1]
Toy Example: Performance Estimation (2D)
[Figure: 2D configuration space spanned by layout (AoS, SoA, AoSoA) and memory (Global, Texture); three configurations have been measured at 7 ms, 5 ms and 4 ms]
Toy Example: Performance Estimation (2D)
• Predicted execution time = execution time of base + Σ Δ(base, support configurations)
[Figure: the unmeasured Texture/SoA configuration is predicted as 6 ms from the 5 ms base by adding the deltas to the two support configurations, +2 ms and −1 ms]
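To make the additive model concrete, here is a minimal C++ sketch of it. This is an illustration of the idea on the slide, not the authors' MATOG implementation; the Config type, predictTime function and the toy values are made up for the example.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // A configuration assigns one value per optimization dimension,
    // e.g. {"layout": "SoA", "memory": "Texture"}.
    using Config = std::map<std::string, std::string>;

    // Predict the execution time of 'target' from the measured base configuration
    // and the measured support configurations, each of which differs from the
    // base in exactly one dimension:
    //   predicted(target) = time(base) + sum of (time(support) - time(base))
    // over all supports whose differing value also appears in 'target'.
    double predictTime(const Config& base, double baseTime,
                       const std::vector<std::pair<Config, double>>& supports,
                       const Config& target) {
        double predicted = baseTime;
        for (const auto& [supportCfg, supportTime] : supports) {
            for (const auto& [dim, value] : supportCfg) {
                if (base.at(dim) != value && target.at(dim) == value)
                    predicted += supportTime - baseTime;  // add delta(base, support)
            }
        }
        return predicted;
    }

    int main() {
        // Toy example from the slide: base Global/AoS = 5 ms,
        // supports Texture/AoS = 7 ms (+2 ms) and Global/SoA = 4 ms (-1 ms).
        const Config base   = {{"layout", "AoS"}, {"memory", "Global"}};
        const Config target = {{"layout", "SoA"}, {"memory", "Texture"}};
        const std::vector<std::pair<Config, double>> supports = {
            {{{"layout", "AoS"}, {"memory", "Texture"}}, 7.0},
            {{{"layout", "SoA"}, {"memory", "Global"}},  4.0},
        };
        // Prints 6 ms, matching the predicted value on the slide.
        std::printf("%g ms\n", predictTime(base, 5.0, supports, target));
        return 0;
    }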
Toy Example: Performance Estimation (3D)
[Figure: the layout (AoS, SoA, AoSoA) × memory (Global, Texture) space extended by a third optimization dimension]
Non-Linear Relationship
• Not all configurations are linearly related to each other
• Shared dimensions
  • Affect all arrays
  • e.g. L1 cache size
• Independent dimensions
  • Only affect one array
  • Layout, memory and transposition
(a sketch of how this splits the space into prediction domains follows after the toy example below)
Toy Example: Prediction Domains
[Figure: three separate layout (AoS, SoA, AoSoA) × memory (Global, Texture) grids, one per L1 cache setting (EQ, SM, L1)]
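A small sketch of how such a split could be implemented (again an illustration of my reading of the slides, not the authors' actual code; the dimension names are assumptions): configurations are grouped by the values of their shared dimensions, and the additive prediction from above is then applied separately inside each group.

    #include <map>
    #include <string>
    #include <vector>

    using Config = std::map<std::string, std::string>;

    // Group configurations into prediction domains: all configurations that agree
    // on the shared dimensions (here only the L1 cache setting) form one domain.
    // The independent dimensions (layout, memory, transposition) are predicted
    // additively within each domain, since only they behave (near-)linearly.
    std::map<std::string, std::vector<Config>>
    splitIntoDomains(const std::vector<Config>& configs,
                     const std::vector<std::string>& sharedDims = {"l1cache"}) {
        std::map<std::string, std::vector<Config>> domains;
        for (const Config& cfg : configs) {
            std::string key;
            for (const std::string& dim : sharedDims)
                key += cfg.at(dim) + "|";   // domain key = shared-dimension values
            domains[key].push_back(cfg);
        }
        return domains;
    }

For the toy example this yields three domains (L1 cache: EQ, SM, L1), each containing its own layout × memory grid.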
Real Example: Measured Time
[Plot: measured execution time (0–560 ms) of all 5184 configurations, ordered by runtime]
Real Example: Base Configurations
[Plot: the same measurements with the base configurations highlighted]
Real Example: Support Configurations
[Plot: the same measurements with the base and support configurations highlighted]
Real Example: Prediction
[Plot: measured and predicted execution time for all 5184 configurations, with base and support configurations highlighted]
Real Example: Prediction (zoom in)
[Plot: measured and predicted execution time (65–90 ms) for the ~600 fastest configurations, ordered by runtime]
Real Example: Prediction (zoom in)
[Plot: as before, with the best predicted configuration marked]
• Measured: 44 / 5184 configurations (0.85%)
• Our result: 72.52 ms (+3.59 ms above the optimum)
• Min: 68.93 ms, Max: 526.48 ms, Avg: 300.75 ms
Evaluation
1. Benchmark: BitonicSort
struct { long a; int b; short c; char d; }
• Sorting for each field, A < B < C < D
• Values limited to 0…1023 to cause equal columns
• 2 kernels
• 27 configurations
(possible layouts for this struct are sketched below)
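For illustration, the three layout options from the motivation could look like this for the BitonicSort struct (a hand-written sketch; the auto-tuner generates such layouts automatically, and the element count and the AoSoA block size of 32 are arbitrary choices for the example):

    #include <cstddef>

    constexpr std::size_t N = 1024;  // number of elements (arbitrary for the example)

    // Array of Structs (AoS): fields of one element are interleaved in memory.
    struct Element { long a; int b; short c; char d; };
    Element aos[N];

    // Structure of Arrays (SoA): one contiguous array per field.
    struct SoA {
        long  a[N];
        int   b[N];
        short c[N];
        char  d[N];
    } soa;

    // Array of Structs of Arrays (AoSoA): fixed-size SoA blocks stored consecutively.
    constexpr std::size_t B = 32;    // block size, e.g. one warp
    struct Block {
        long  a[B];
        int   b[B];
        short c[B];
        char  d[B];
    };
    Block aosoa[(N + B - 1) / B];

    // Accessing field b of element i in each layout:
    //   AoS:   aos[i].b
    //   SoA:   soa.b[i]
    //   AoSoA: aosoa[i / B].b[i % B]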
2. Benchmark: KD-Tree Builder
• 9 kernels
• > 570k configurations
3. Benchmark: REYES
• 4 kernels
• > 2.4M configurations
Profiling Algorithms
• Exhaustive Search [Muralidharan et al. 2014]
  • Tries all possible configurations
• Greedy Profiling [Liu et al. 2008]
  • Optimizes one dimension after the other (see the sketch below)
• Evolutionary Algorithm [Jordan et al. 2012]
  • Starts with a random population of configurations
  • Good configurations are stored
  • Bad configurations are mutated, combined or randomly sampled
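A rough sketch of the greedy scheme (a simplified illustration, not the actual algorithm of Liu et al.; measure() stands for compiling and timing one configuration): fix every dimension to a default value, then optimize one dimension after the other, keeping the best value found so far.

    #include <functional>
    #include <map>
    #include <string>
    #include <vector>

    using Config = std::map<std::string, std::string>;

    // Greedy profiling: optimize one dimension after the other. Only
    // sum(#values per dimension) configurations are measured instead of
    // their product, at the risk of missing cross-dimension interactions.
    Config greedyProfile(const std::map<std::string, std::vector<std::string>>& dimensions,
                         const std::function<double(const Config&)>& measure) {
        Config best;
        for (const auto& [dim, values] : dimensions)
            best[dim] = values.front();          // start from the first value everywhere
        double bestTime = measure(best);

        for (const auto& [dim, values] : dimensions) {
            for (const std::string& value : values) {
                Config candidate = best;
                candidate[dim] = value;
                const double time = measure(candidate);
                if (time < bestTime) {            // keep the fastest value for this dimension
                    bestTime = time;
                    best = candidate;
                }
            }
        }
        return best;
    }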
Evaluation
• Profiling algorithms
  • Exhaustive Search (E)
  • Greedy Algorithm (G)
  • Evolutionary Algorithm (A)
  • Our Algorithm (P)
• GPUs
  • GeForce GTX 980 (Maxwell)
  • Tesla K20 (Kepler)
• CUDA WatchDog: kills configurations that exceed the execution time of the best configuration found so far
Quality
Execution Speed Up: GTX 980 w/o WatchDog
[Bar chart: execution speed-up (higher is better, 1.00–1.50) of the configurations found by E, G, A and P for Bitonic, KD-Tree and Reyes]
Execution Speed Up: Tesla K20 w/o WatchDog
[Bar chart: execution speed-up (higher is better, 1.00–1.50) of the configurations found by E, G, A and P for Bitonic, KD-Tree and Reyes]
Execution Speed Up: Tesla K20 with WatchDog
[Bar chart: execution speed-up (higher is better, 1.00–1.50) for E, EW, G, GW, P and PW (W = with WatchDog) on Bitonic, KD-Tree and Reyes]
Speed Up
Profiling Speed Up: BitonicSort
[Bar chart: profiling speed-up over exhaustive search (higher is better) for E, EW, G, GW, A, P and PW on the GTX 980 and K20; all methods lie between roughly 0.84 and 1.03]
Profiling Speed Up: KD-Tree Builder
[Bar chart: profiling speed-up over exhaustive search for E, EW, G, GW, A, P and PW on the GTX 980 and K20; the guided methods reach roughly 70x to 1,588x, while the exhaustive variants stay near 1.0x]
Profiling Speed Up: REYES
[Bar chart: profiling speed-up over exhaustive search for E, EW, G, GW, A, P and PW on the GTX 980 and K20; the guided methods reach roughly 930x to 9,396x, while the exhaustive variants stay near 1.0x]
Summary
• Introduced a prediction-guided profiling algorithm
  • Up to 5.5x faster than other state-of-the-art methods, while achieving comparable results
  • Up to 9300x faster than exhaustive search (which took 10 days, 20 hours, 1 minute and 40 seconds)
• Limitations
  • No global optimization: only one kernel at a time is optimized
Thank you for your attention!
Source code available @ http://tinyurl.com/matog (BSD 3-Clause license)
Contact: matog@gris.tu-darmstadt.de