Guided Profiling for Auto-Tuning Array Layouts on GPUs

Nicolas Weber, Sandra C. Amend and Michael Goesele, TU Darmstadt


  1. Guided Profiling for Auto-Tuning Array Layouts on GPUs. Nicolas Weber, Sandra C. Amend and Michael Goesele, TU Darmstadt

  2. Motivation
     • Memory access is one of the most important performance factors in CUDA applications
     • CUDA Programming Guide: one of the three basic optimization strategies is to “Optimize memory usage to achieve maximum memory throughput”
     • Performance can differ by up to an order of magnitude between the best and the worst implementation
     • Experience alone does not guarantee finding the optimal configuration

  3. Motivation
     • Tedious to optimize in big GPU applications
       • Layouts: Array of Structs (AoS), Structure of Arrays (SoA), AoSoA
       • Transpositions of multi-dimensional arrays
       • Size of L1 cache / shared memory
       • Memory placement: global, texture, shared, local and constant memory
     • Changing GPU architectures require reoptimization
       • The memory hierarchy has changed with every architecture
     → Automated optimization for most GPUs and algorithms
     • We develop an open-source auto-tuner to automatically optimize array access in CUDA applications (with minimal programming overhead)
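
The three layouts named on this slide differ only in how an (element, field) pair maps to a position in memory. The following is a minimal sketch of that mapping, not code from the auto-tuner: the function names, the tile size of 32 and the simplification that every field occupies one slot are assumptions made for illustration.

```python
def aos_index(i, f, n, num_fields):
    """Array of Structs: all fields of element i sit next to each other."""
    return i * num_fields + f


def soa_index(i, f, n, num_fields):
    """Structure of Arrays: all values of field f sit next to each other."""
    return f * n + i


def aosoa_index(i, f, n, num_fields, tile=32):
    """AoSoA: elements are grouped into tiles; each tile is a small SoA.

    Assumes n is a multiple of tile (hypothetical simplification).
    """
    block, lane = divmod(i, tile)
    return block * num_fields * tile + f * tile + lane


# Slots touched when four consecutive elements read field 0 of a 4-field struct.
n, num_fields = 64, 4
for name, index in [("AoS", aos_index), ("SoA", soa_index), ("AoSoA", aosoa_index)]:
    print(name, [index(i, 0, n, num_fields) for i in range(4)])
```

With AoS, neighbouring elements are `num_fields` slots apart, while SoA and AoSoA place them in adjacent slots. Which spacing performs better depends on the access pattern of the kernel, which is why the layout is treated as a tunable dimension.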

  4. What is the optimal configuration for a kernel?
     • Difficult to find an analytical solution
       • Memory access can be sensitive to the input data
       • Different optima for varying input data
       • Many GPU architectures with different memory hierarchies
     → Empirical profiling
       • Requires compiling and executing many different implementations
       • Very time-intensive

  5. High Dimensionality
     [Diagram: an application with Kernels A, B and C; each kernel has function optimizations (L1 cache size) and arrays (Array A, B, C); each array has the dimensions layout, memory and transposition]
     • Up to several million configurations!
     • $1\,000\,000 \cdot \left( \frac{5\,\text{s (compilation time)}}{16\ \text{(cores)}} + 0.5\,\text{s (execution time)} \right) \geq 9\ \text{days}$
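
The cost estimate on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes, as the slide's figures suggest, that compilation (5 s per configuration) is spread over 16 cores while execution (0.5 s per configuration) runs one configuration at a time; the function name is hypothetical.

```python
def exhaustive_profiling_days(num_configs, compile_s, exec_s, cores):
    """Wall-clock days needed to compile and run every configuration once."""
    seconds = num_configs * (compile_s / cores + exec_s)
    return seconds / 86_400  # seconds per day


days = exhaustive_profiling_days(1_000_000, compile_s=5.0, exec_s=0.5, cores=16)
print(f"{days:.1f} days")
```

Under these assumptions one million configurations take a little over nine days, which is the motivation for measuring only a small subset.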

  6. Measured Kernel Execution Time
     [Plot: measured time (ms), 0 to 560, for 5184 configurations ordered by runtime; annotated groups: A == 0 && B == 0, A != B, A == 1 && B == 1]

  7. Toy Example: Performance Estimation 2D
     [Figure: grid of memory (Texture, Global) by layout (AoS, SoA, AoSoA); values shown: 7 ms, 5 ms, 4 ms]

  8. Toy Example: Performance Estimation 2D
     • Predicted execution time = execution time of Base + Sum(Δ(Base, Support Configurations))
     [Figure: grid of memory (Texture, Global) by layout (AoS, SoA, AoSoA); values shown: 7 ms, 6 ms, +2 ms, -1 ms, 5 ms, 4 ms]
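
The prediction rule on this slide, base time plus the sum of the deltas between the base and each support configuration, can be sketched as follows. The assignment of the slide's values to grid cells is an assumption: Global/AoS is taken as the base (5 ms), Texture/AoS (7 ms) and Global/SoA (4 ms) as the support configurations. The function and dictionary names are hypothetical.

```python
def predict(base_config, base_time, support_times, config):
    """Base time plus one delta per dimension in which config differs from base."""
    predicted = base_time
    for dim, value in config.items():
        if value != base_config[dim]:
            # Delta of the support configuration that changes only this dimension.
            predicted += support_times[(dim, value)] - base_time
    return predicted


base_config = {"memory": "Global", "layout": "AoS"}
base_time = 5.0  # ms, assumed to be the Global/AoS cell
support_times = {
    ("memory", "Texture"): 7.0,  # ms, measured with layout left at AoS
    ("layout", "SoA"): 4.0,      # ms, measured with memory left at Global
}

texture_soa = predict(base_config, base_time, support_times,
                      {"memory": "Texture", "layout": "SoA"})
print(texture_soa)
```

Under this reading the unmeasured Texture/SoA cell is estimated as 5 ms + 2 ms - 1 ms, matching the 6 ms shown in the figure.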

  9. Toy Example: Performance Estimation 3D
     [Figure: 3D version of the memory (Texture, Global) by layout (AoS, SoA, AoSoA) grid]

  10. Non-Linear Relationship
      • Not all configurations are linearly related to each other
      • Shared dimensions
        • Affect all arrays
        • L1 cache size
      • Independent dimensions
        • Only affect one array
        • Layout, memory and transposition

  11. Toy Example: Prediction Domains
      [Figure: one memory (Texture, Global) by layout (AoS, SoA, AoSoA) grid per L1 cache setting: EQ, SM, L1]

  12. Toy Example: Prediction Domains
      [Figure: the same three grids as on slide 11]
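
One way to read the prediction domains is that each value of a shared dimension (here the L1 cache setting) gets its own base and support measurements, because deltas do not carry over between domains. The sketch below plans the measurements for the toy example under that reading. Picking the first value of every independent dimension as the base is an assumption, and all names are hypothetical.

```python
from itertools import product


def plan_measurements(shared, independent):
    """Per domain: one base plus one support per non-base independent value."""
    plan = []
    shared_names = list(shared)
    for shared_values in product(*shared.values()):
        domain = dict(zip(shared_names, shared_values))
        # Base: first value of every independent dimension (assumption).
        base = {**domain, **{dim: vals[0] for dim, vals in independent.items()}}
        plan.append(base)
        for dim, vals in independent.items():
            for value in vals[1:]:
                plan.append({**base, dim: value})  # support: one change vs. base
    return plan


shared = {"l1_cache": ["EQ", "SM", "L1"]}
independent = {"memory": ["Global", "Texture"], "layout": ["AoS", "SoA", "AoSoA"]}

plan = plan_measurements(shared, independent)
total = 3 * 2 * 3
print(f"measure {len(plan)} of {total} configurations")
```

In the toy example the saving is small, but the number of measurements grows with the sum of the independent dimension sizes per domain, while the full search space grows with their product.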

  13. Real Example: Measured Time
      [Plot: time (ms), 0 to 560, for 5184 configurations ordered by runtime; series: Measured]

  14. Real Example: Base Configurations
      [Plot: same axes; series: Measured, Base]

  15. Real Example: Support Configurations
      [Plot: same axes; series: Measured, Support, Base]

  16. Real Example: Prediction
      [Plot: same axes; series: Measured, Predicted, Support, Base]

  17. Real Example: Prediction (zoom in)
      [Plot: time (ms), 65 to 90, for configurations 0 to 600 ordered by runtime; series: Predicted, Measured]

  18. Real Example: Prediction (zoom in)
      [Plot: same axes and series as slide 17]

  19. Real Example: Prediction (zoom in)
      [Plot: same axes and series as slide 17]

  20. Real Example: Prediction (zoom in)
      [Plot: time (ms), 65 to 90, for configurations 0 to 600 ordered by runtime; series: Predicted, Measured, Best Predicted]
      • Measured: 44 / 5184 (0.85%)
      • Our result: 72.52 ms (+3.59 ms)
      • Min: 68.93 ms
      • Max: 526.48 ms
      • Avg: 300.75 ms

  21. Evaluation

  22. Benchmark 1: BitonicSort
      struct { long a; int b; short c; char d; }
      • Sorting for each field, A < B < C < D
      • Values limited to 0…1023 to cause equal columns
      • 2 kernels
      • 27 configurations

  23. Benchmark 2: KD-Tree Builder
      • 9 kernels
      • > 570k configurations

  24. Benchmark 3: REYES
      • 4 kernels
      • > 2.4M configurations

  25. Profiling Algorithms
      • Exhaustive Search [Muralidharan et al. 2014]
        • Tries all possible configurations
      • Greedy Profiling [Liu et al. 2008]
        • Optimizes one dimension after another
      • Evolutionary Algorithm [Jordan et al. 2012]
        • Starts with a random population of configurations
        • Good configurations are stored
        • Bad configurations are mutated, combined or randomly sampled
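
The greedy strategy, optimizing one dimension at a time while keeping the best value found for the previous ones, can be sketched generically as below. This is an illustration of the idea, not the implementation from the cited work; the `measure` callback and the synthetic cost table are hypothetical stand-ins for compiling and timing a real kernel.

```python
def greedy_search(dimensions, measure):
    """Tune dimensions one after another, fixing the best value of each."""
    best = {dim: vals[0] for dim, vals in dimensions.items()}
    best_time = measure(best)
    for dim, vals in dimensions.items():
        for value in vals[1:]:
            candidate = {**best, dim: value}
            elapsed = measure(candidate)
            if elapsed < best_time:
                best, best_time = candidate, elapsed
    return best, best_time


# Synthetic, purely additive cost model in ms (made up for illustration).
MEMORY_COST = {"Global": 0.0, "Texture": 2.0}
LAYOUT_COST = {"AoS": 0.0, "SoA": -1.0, "AoSoA": 0.5}


def measure(config):
    return 5.0 + MEMORY_COST[config["memory"]] + LAYOUT_COST[config["layout"]]


dimensions = {"memory": ["Global", "Texture"], "layout": ["AoS", "SoA", "AoSoA"]}
best, best_time = greedy_search(dimensions, measure)
print(best, best_time)
```

With an additive cost model like this one, the greedy pass finds the global optimum. When dimensions interact, it can stop at a configuration that is only locally best.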

  26. Evaluation
      • Profiling algorithms
        • Exhaustive Search (E)
        • Greedy Algorithm (G)
        • Evolutionary Algorithm (A)
        • Our Algorithm (P)
      • GPUs
        • GeForce GTX980 (Maxwell)
        • Tesla K20 (Kepler)
      • CUDA WatchDog: kills configurations whose execution time exceeds that of the best configuration found

  27. Quality

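  28. Execution Speed Up: GTX980 w/o WatchDog
      [Bar chart: execution speed-up (higher is better; axis 1.00 to 1.50) for Bitonic, KD-Tree and Reyes; series: E, G, A, P]

  29. Execution Speed Up: Tesla K20 w/o WatchDog
      [Bar chart: execution speed-up (higher is better; axis 1.00 to 1.50) for Bitonic, KD-Tree and Reyes; series: E, G, A, P]

  30. Execution Speed Up: Tesla K20 with WatchDog
      [Bar chart: execution speed-up (higher is better; axis 1.00 to 1.50) for Bitonic, KD-Tree and Reyes; series: E, EW, G, GW, P, PW]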

  31. Speed Up

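  32. Profiling Speed Up: BitonicSort
      [Bar chart: profiling speed-up (higher is better; axis 0.8 to 1.2) on GTX980 and K20; series: E, EW, G, GW, A, P, PW; bar labels as extracted, assignment to series not recoverable: 1.03, 1.00, 1.00, 1.00, 1.00, 1.00, 0.96, 0.95, 0.94, 0.92, 0.91, 0.90, 0.85, 0.84]

  33. Profiling Speed Up: KD-Tree Builder
      [Bar chart: profiling speed-up (higher is better; axis 0 to 1600) on GTX980 and K20; series: E, EW, G, GW, A, P, PW; bar labels as extracted, assignment to series not recoverable: 1,588.2, 1,327.8, 956.4, 865.2, 387.5, 374.8, 335.6, 158.1, 138.6, 70.6, 1.0, 1.0, 1.0, 1.1]

  34. Profiling Speed Up: REYES
      [Bar chart: profiling speed-up (higher is better; axis 0 to 10000) on GTX980 and K20; series: E, EW, G, GW, A, P, PW; bar labels as extracted, assignment to series not recoverable: 9,395.9, 8,657.3, 3,547.5, 3,387.1, 3,111.3, 2,882.7, 1,900.5, 987.3, 930.3, 949.4, 1.0, 1.0, 1.0, 1.0]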

  35. Summary
      • Introduced a prediction-guided profiling algorithm
        • up to 5.5x faster than other state-of-the-art methods
        • while achieving comparable results
        • up to 9300x faster than exhaustive search
          • 10 days 20 hours → 1 minute 40 seconds
      • Limitations
        • No global optimization → only one kernel is optimized at a time
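
As a quick consistency check of the summary figures, the ratio of the two quoted durations can be computed directly. The variable names are hypothetical. The durations on the slide are rounded, so the result is expected to match the reported factor only approximately.

```python
from datetime import timedelta

exhaustive = timedelta(days=10, hours=20)  # quoted exhaustive search time
guided = timedelta(minutes=1, seconds=40)  # quoted prediction-guided time

speedup = exhaustive / guided  # dividing two timedeltas yields a float
print(f"{speedup:.0f}x")
```

The rounded durations give a factor in the low 9000s, in line with the "up to 9300x" stated above.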

  36. Thank you for your attention!
      Source code available at http://tinyurl.com/matog (BSD 3-Clause license)
      Contact: matog@gris.tu-darmstadt.de
