
GPU Program Optimizations: Yixun Liu, Eddy Z. Zhang, Xipeng Shen



  1. A Cross-Input Adaptive Framework for GPU Program Optimizations
     Yixun Liu, Eddy Z. Zhang, Xipeng Shen
     Computer Science Department, The College of William & Mary

  2. Outline
     • GPU overview
     • G-ADAPT Framework
     • Evaluation
     • Related & Future Work
     • Conclusion

  3. GPU (Graphics Processing Unit)
     • Architecture
       ▫ SIMD parallel
       ▫ Multithreaded
       ▫ Many-core
     • Features
       ▫ Tremendous computational horsepower
       ▫ High memory bandwidth
     • Applications
       ▫ Traditional: graphics rendering
       ▫ Emerging: general data-parallel computing

  4. Programming GPU
     • High-level model
       ▫ Abstraction to a multithreaded platform
       ▫ C-like programming
       ▫ No explicit mapping to graphics rendering
       ▫ E.g., CUDA, Brook+, OpenCL
     • NVIDIA CUDA
       ▫ Kernel functions run on the GPU
       ▫ Threads -> blocks -> grids
     [Figure: from the CUDA manual]
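To make the threads -> blocks -> grids hierarchy concrete, here is a minimal plain-Python model of the index arithmetic a one-dimensional CUDA kernel commonly uses (`blockIdx.x * blockDim.x + threadIdx.x`). The function names and sizes are illustrative assumptions and do not come from the slides.

```python
def grid_size(n_elements, block_size):
    """Blocks needed so every element gets one thread (ceiling division)."""
    return (n_elements + block_size - 1) // block_size


def global_thread_ids(n_elements, block_size):
    """Enumerate the global index each (block, thread) pair would compute.

    Mirrors the usual 1D expression blockIdx.x * blockDim.x + threadIdx.x,
    including the bounds check that masks surplus threads in the last block.
    """
    ids = []
    for block_idx in range(grid_size(n_elements, block_size)):
        for thread_idx in range(block_size):
            gid = block_idx * block_size + thread_idx
            if gid < n_elements:  # the last block may be only partially used
                ids.append(gid)
    return ids


# 1000 elements with 256 threads per block need 4 blocks (1024 threads);
# the bounds check leaves the last 24 threads idle.
blocks = grid_size(1000, 256)
```

The block size is exactly the kind of execution-configuration parameter that the rest of the deck treats as tunable.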

  5. Optimization Challenges
     • Goal
       ▫ Maximize throughput
         ▪ Increase occupancy; reduce latency and dynamic instructions
     • Difficulties
       ▫ Hard to predict optimization effects
         ▪ Non-linearity, coupling, undisclosed CUDA details
       ▫ GPU hardware complexities
         ▪ Limits: 512 threads per block, 768 threads per SM, etc.
         ▪ Various types of memory: constant, texture, etc.
       ▫ Input sensitivity
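The hardware limits on this slide interact in a non-linear way, which a simplified occupancy estimate can illustrate. The thread limits come from this slide, and the register and shared-memory sizes come from the evaluation-platform slide. The cap of 8 resident blocks per SM is an assumption about this GPU generation, and the model ignores warp and allocation granularity, so treat it as a rough sketch, not NVIDIA's occupancy calculation.

```python
MAX_THREADS_PER_BLOCK = 512      # from this slide
MAX_THREADS_PER_SM = 768         # from this slide
REGISTERS_PER_SM = 8192          # from the evaluation-platform slide
SHARED_MEM_PER_SM = 16 * 1024    # bytes, from the evaluation-platform slide
MAX_BLOCKS_PER_SM = 8            # assumption, not stated in the slides


def occupancy(threads_per_block, regs_per_thread, shared_bytes_per_block):
    """Estimate the fraction of an SM's thread slots that stay occupied."""
    if not 0 < threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be in 1..512")
    # Each resource independently caps the number of resident blocks.
    limits = [MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block]
    if regs_per_thread > 0:
        limits.append(REGISTERS_PER_SM // (regs_per_thread * threads_per_block))
    if shared_bytes_per_block > 0:
        limits.append(SHARED_MEM_PER_SM // shared_bytes_per_block)
    resident_blocks = min(limits)
    return resident_blocks * threads_per_block / MAX_THREADS_PER_SM


full = occupancy(256, 10, 0)        # 3 blocks of 256 threads fill all 768 slots
partial = occupancy(512, 10, 0)     # only 1 block of 512 fits: 512 / 768
reg_bound = occupancy(256, 16, 0)   # registers cap residency at 2 blocks
```

Under these assumptions, doubling the block size from 256 to 512 lowers occupancy from 1.0 to about 0.67, and so does raising register use from 10 to 16 per thread. This is one reason optimization effects are hard to predict by inspection.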

  6. Matrix-Vector Multiplication

  7. Outline
     • GPU Overview
     • G-ADAPT Framework
     • Evaluation
     • Related & Future Work
     • Conclusion

  8. G-ADAPT
     • Empirical search-based optimization
       ▫ Three obstacles to address
         ▪ Construction of the optimization space
         ▪ Space pruning
         ▪ Cross-input adaptation

  9. G-ADAPT: Overview
     • Source-to-source compiler
     • Cross-input adaptation
     • Automatic search & transformations
     • Easy integration of user knowledge through pragmas
     [Diagram: Code with pragmas & input → Stage 1: Empirical search & data collection → <Input, best optimizations> → Stage 2: Pattern recognition & code generation → Optimized input-adaptive GPU program]

  10. G-ADAPT Pragmas
      • Supports a programmer-compiler synergy
      • Covers 2 levels of optimizations
        ▫ Execution configurations
          ▪ E.g., thread block dimensions
        ▫ Code transformations
          ▪ E.g., loop tile size, unrolling levels
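Combining the two levels of optimization yields a search space that is the Cartesian product of all parameter ranges. The sketch below enumerates such a space. The parameter names and candidate values are hypothetical and are not taken from G-ADAPT.

```python
from itertools import product


def build_space(param_values):
    """Return every combination of parameter values as a list of dicts."""
    names = sorted(param_values)
    combos = product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical candidate values for illustration only.
example_params = {
    "block_size": [64, 128, 256, 512],   # execution configuration
    "tile_size": [8, 16, 32],            # code transformation
    "unroll": [1, 2, 4, 8, 16],          # code transformation
}

space = build_space(example_params)
space_size = len(space)  # 4 * 3 * 5 = 60 configurations
```

Even three small ranges give 60 configurations, and every added parameter multiplies the count, which is why exhaustive measurement quickly becomes impractical.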

  11. Pragma Examples
        #pragma erange 64, 512, 2
        #define BLKSZ 256
        #pragma lpur_lrange 0, min(BLKSZ, 16), 2
        for (i = 1; i < BLKSZ; i++) { …… }

  12. Stage I: Search & Collect
      [Diagram: Code with pragmas & inputs → G-ADAPT compiler → Optimized GPU code → Performance Calibrator → Performance → Optimization Agent → Optimization parameters → back to the G-ADAPT compiler; also shown: Perf. DB]

  13. G-ADAPT Compiler
      • Two functionalities
        ▫ Recognize the optimization space
        ▫ Program transformations
      • Based on Cetus [Purdue Univ.]
      • Source-to-source
      • GPU extensions
      • Supports G-ADAPT pragmas
      [Diagram: the Stage I pipeline from slide 12]

  14. Performance Calibrator
      • Invokes the CUDA compiler and runs the executable
      • Collects running time and GPU occupancy
      [Diagram: the Stage I pipeline from slide 12]
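The timing half of a calibrator can be sketched as a small harness that runs a command several times and keeps the best wall-clock time. This is an assumption-laden illustration: `time_command` is a hypothetical helper, it times the whole process rather than the kernel alone, and it does not collect GPU occupancy.

```python
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Run `cmd` several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,                   # fail loudly if the run crashes
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in command so the sketch runs anywhere. A real calibrator would first
# build the generated code with the CUDA compiler, then time that binary.
elapsed = time_command([sys.executable, "-c", "pass"])
```

Taking the minimum over repeats is one common way to damp timing noise; the slides do not say how G-ADAPT handles measurement variance.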

  15. Optimization Agent
      • Determines the optimization parameters to try next
      • Uses hill climbing to overcome the space explosion problem
      [Diagram: the Stage I pipeline from slide 12]
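The slide names hill climbing but gives no details, so the following is a generic best-improvement variant, not G-ADAPT's actual agent. It moves one step along one parameter at a time and stops when no neighbour is faster. The cost function is a synthetic stand-in for a measured running time, and all names are hypothetical.

```python
def neighbours(space, config):
    """Yield configs that differ from `config` by one step in one parameter."""
    for name, values in space.items():
        idx = values.index(config[name])
        for j in (idx - 1, idx + 1):
            if 0 <= j < len(values):
                yield dict(config, **{name: values[j]})


def hill_climb(space, start, cost):
    """Greedy descent; returns (best_config, best_cost, evaluations)."""
    current, current_cost = dict(start), cost(start)
    evaluations = 1
    while True:
        scored = [(cost(n), n) for n in neighbours(space, current)]
        evaluations += len(scored)
        if not scored:
            return current, current_cost, evaluations
        best_cost, best = min(scored, key=lambda pair: pair[0])
        if best_cost >= current_cost:      # no neighbour improves: local optimum
            return current, current_cost, evaluations
        current, current_cost = best, best_cost


def toy_cost(cfg):
    """Synthetic stand-in for running time, minimal at block_size=256, unroll=4."""
    return abs(cfg["block_size"] - 256) / 64 + abs(cfg["unroll"] - 4)


space = {"block_size": [64, 128, 256, 512], "unroll": [1, 2, 4, 8, 16]}
best, best_cost, evals = hill_climb(space, {"block_size": 64, "unroll": 1}, toy_cost)
```

On this smooth toy surface the search reaches the global minimum. On a real, non-linear performance surface hill climbing can stop at a local optimum, which is the usual trade-off for avoiding exhaustive search.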

  16. G-ADAPT: Overview
      [Diagram: Code with pragmas & input → Stage 1: Empirical search & data collection → <Input, best optimizations> → Stage 2: Pattern recognition & code generation → Optimized input-adaptive GPU program]

  17. Stage II: PR & Code Gen
      • Pattern recognizer
        ▫ Recognizes the mapping from input to best parameters
        ▫ Regression trees with least mean square
      • Options for code generator
        ▫ Multiple versions
        ▫ JIT compilers
        ▫ Linker
      [Diagram: Perf. DB → Pattern Recognizer → G-ADAPT Code Generator → Optimized input-adaptive GPU program]
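A regression tree fitted by least squares can be sketched in a few lines for a single numeric input feature. The slides only name the technique, so the feature choice, the tree representation, and the training pairs below are all assumptions made for illustration.

```python
def sse(ys):
    """Sum of squared errors of `ys` around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def fit_tree(samples, min_leaf=1):
    """Fit a regression tree on non-empty (feature, target) pairs.

    Each split minimises the summed squared error of its two sides;
    a leaf predicts the mean target of the samples that reach it.
    """
    samples = sorted(samples)
    ys = [y for _, y in samples]
    best = None
    for i in range(min_leaf, len(samples) - min_leaf + 1):
        if samples[i - 1][0] == samples[i][0]:
            continue  # cannot split between equal feature values
        err = sse(ys[:i]) + sse(ys[i:])
        if best is None or err < best[0]:
            best = (err, i)
    if best is None or best[0] >= sse(ys):
        return {"leaf": sum(ys) / len(ys)}  # no split reduces the error
    i = best[1]
    return {
        "threshold": (samples[i - 1][0] + samples[i][0]) / 2,
        "left": fit_tree(samples[:i], min_leaf),
        "right": fit_tree(samples[i:], min_leaf),
    }


def predict(tree, x):
    """Walk the tree from the root to a leaf and return its prediction."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["leaf"]


# Hypothetical <input size, best block size> pairs, as Stage I might record.
training = [(128, 64), (256, 64), (512, 64), (1024, 256), (2048, 256), (4096, 256)]
tree = fit_tree(training)
small_choice = predict(tree, 300)    # falls in the left leaf
large_choice = predict(tree, 3000)   # falls in the right leaf
```

With the "multiple versions" option, a generated program could evaluate such a tree at launch time and dispatch to the kernel version built for the predicted parameter value.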

  18. G-ADAPT
      [Diagram: Code with pragmas & inputs → G-ADAPT compiler → Optimized GPU code → Performance Calibrator → Performance → Optimization Agent → Optimization parameters → back to the G-ADAPT compiler; Perf. DB → Pattern Recognizer → G-ADAPT Code Generator → Final input-adaptive GPU program]

  19. Outline
      • GPU overview
      • G-ADAPT Framework
      • Evaluation
      • Related & Future Work
      • Conclusion

  20. Evaluation: Platform
      • GPU: NVIDIA GeForce 8800 GT
        ▫ 14 multiprocessors (MP), 112 cores
        ▫ 512 MB global memory, 16 KB shared memory/MP, 8192 registers/MP
        ▫ CUDA 2.0
      • Host: Intel Xeon 3.6 GHz, SUSE Linux 2.6.22

  21. Benchmarks
      Benchmark     Description                                        # of inputs
      convolution   Convolution filter of a 2D signal                  10
      matrixMul     Dense matrix multiplication                        9
      mvMul         Dense matrix-vector multiplication (by Fujimoto)   15
      reduction     Sum of an array                                    15
      scalarProd    Scalar products                                    7
      transpose     Matrix transpose                                   18
      transpose-co  Coalescing matrix transpose                        18

  22. Training and Prediction
      Benchmark     Training iterations   Training time (s)   Prediction accuracy
      convolution   200                   2825                100%
      matrixMul     196                   2539                100%
      mvMul         124                   124                 93.3%
      reduction     75                    29                  80%
      scalarProd    93                    237                 100%
      transpose     54                    1639                100%
      transpose-co  54                    631                 100%

  23. Matrix Vector Multiplication
      [Figures: Best Parameter vs. Input; Speedup vs. Input]

  24. Speedup over default

  25. Speedup over default

  26. Speedup over default
      http://www.cs.wm.edu/~xshen/Publications/ipdps09.pdf

  27. Outline
      • GPU overview
      • G-ADAPT Framework
      • Evaluation
      • Related & Future Work
      • Conclusion

  28. Related Work
      • Ryoo+: CGO’08
        ▫ Efficiency and utilization model for search
        ▫ Manual transformation; assumptions on applications
      • Baskaran+: ICS’08
        ▫ Polyhedral model for optimizing memory access
        ▫ Limited to affine loop nests
      • Features of G-ADAPT
        ▫ First generally applicable framework
        ▫ Cross-input adaptation

  29. Future Work
      • More optimization options
        ▫ Algorithm selection
        ▫ Memory optimization
        ▫ Divergence elimination
      • General support
        ▫ Cetus is an ANSI C compiler
          ▪ Non-ANSI C features, C++
        ▫ CUDA built-in types
          ▪ E.g., float4, texture, etc.

  30. Outline
      • GPU overview
      • G-ADAPT Framework
      • Evaluation
      • Related & Future Work
      • Conclusion

  31. Conclusion
      • A general tool for GPU optimization
      • Cross-input adaptation
      • Synergy between compilers and programmers
      • An alternative to manual tuning, enabling easy adaptation across architectures

  32. Acknowledgement
      • Cetus authors at Purdue
        ▫ Group led by Eigenmann and Midkiff
      • John Owens
      • NVIDIA
        ▫ Donation of device
      • NSF grants

  33. Thank you!
