Overview of Performance Prediction Tools for Better Development and Tuning Support


  1. Overview of Performance Prediction Tools for Better Development and Tuning Support. Universidade Federal Fluminense. Rommel Anatoli Quintanilla Cruz / Master's Student, Esteban Clua / Associate Professor. GTC 2016, San Jose, CA, USA, April 7th, 2016

  2. What you will learn from this talk ...

  3. Outline • Motivation • Performance models • Applications • Challenges

  4. Performance Optimization Cycle*: 1. Profile Application, 2. Identify Performance Limiters, 3. Analyze Profile, 4. Reflect & Find Indicators, 5. Change and Test Code. * Adapted from S5173 CUDA Optimization with NVIDIA NSIGHT ECLIPSE Edition, GTC 2015

  5. Performance Analysis Tools: the NVIDIA Visual Profiler, the PAPI CUDA Component, the NVIDIA CUDA Profiling Tools Interface (CUPTI)

  6. Performance tools are still evolving: CUDA 7.5 added instruction-level profiling to the NVIDIA Visual Profiler

  7. Performance tools are still evolving, but it's still not enough for: Concurrent Kernel Execution, Power, Streaming

  8. Outline • Motivation • Performance models • Applications • Challenges

  9. Performance models

  10. Performance models. Input: pseudocode, source code, PTX, CUBIN, and target-device information feed the performance model

  11. Performance models. Input: pseudocode, source code, PTX, CUBIN, and target-device information. Output: power consumption estimation, execution time prediction on a target device, and performance bottleneck identification
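As a toy illustration of such an input/output contract, the sketch below feeds hypothetical device information and kernel statistics into a roofline-style estimator. All class and field names here are invented for illustration; this is not the API of any real tool, and the 128 FLOP/cycle/SM figure is an assumption that varies by architecture.

```python
from dataclasses import dataclass

# Hypothetical device description; the field names are illustrative,
# not a real profiling or driver API.
@dataclass
class DeviceInfo:
    num_sms: int              # streaming multiprocessors
    core_clock_ghz: float     # core clock, GHz
    mem_bandwidth_gbs: float  # peak DRAM bandwidth, GB/s

# Hypothetical per-kernel statistics (e.g. extracted from PTX or a profile).
@dataclass
class KernelStats:
    flops: float        # total floating-point operations
    bytes_moved: float  # total DRAM traffic, bytes

def predict(kernel: KernelStats, dev: DeviceInfo) -> dict:
    """Roofline-style estimate: the kernel is limited by whichever
    resource (compute throughput or memory bandwidth) takes longer.
    Assumes 128 FLOP/cycle per SM, which varies by architecture."""
    peak_flops = dev.num_sms * dev.core_clock_ghz * 1e9 * 128
    t_compute = kernel.flops / peak_flops
    t_memory = kernel.bytes_moved / (dev.mem_bandwidth_gbs * 1e9)
    bottleneck = "memory" if t_memory > t_compute else "compute"
    return {"time_s": max(t_compute, t_memory), "bottleneck": bottleneck}
```

Even this crude model already yields two of the outputs above: a time prediction and a bottleneck label.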

  12. Types of performance models: Analytical Models, Statistical Models, Simulation. Each comes with advantages & disadvantages

  13. Analytical models: the MWP-CWP model [Hong & Kim 2009]. MWP: memory warp parallelism. CWP: computation warp parallelism
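A heavily simplified sketch of the MWP-CWP idea follows. The actual model in Hong & Kim 2009 also accounts for memory bandwidth limits, synchronization, and SM partitioning; the version below keeps only the core min() terms and the memory-bound/compute-bound case split, so treat it as an assumption-laden approximation rather than the published model.

```python
def mwp_cwp_cycles(n_warps, comp_cycles, mem_cycles, mem_latency, departure_delay):
    """Toy estimate of total execution cycles per SM, loosely after
    Hong & Kim 2009. comp_cycles/mem_cycles are one warp's computation
    and memory cycles; departure_delay is the gap between consecutive
    memory requests leaving the SM."""
    # MWP: how many warps can overlap their memory accesses.
    mwp = min(mem_latency / departure_delay, n_warps)
    # CWP: how many warps' computation fits inside one memory period.
    cwp = min((mem_cycles + comp_cycles) / comp_cycles, n_warps)
    if cwp >= mwp:
        # Memory-bound: memory periods dominate, MWP warps overlap at a time.
        return mem_cycles * n_warps / mwp + comp_cycles
    # Compute-bound: the computation of all warps hides the memory latency.
    return comp_cycles * n_warps + mem_cycles
```

The appeal of analytical models is visible even here: the estimate is a closed-form expression over a handful of statically derivable quantities, with no execution required.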

  14. Statistical models * Wu, Gene, et al. "GPGPU performance and power estimation using machine learning."
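Statistical models learn the mapping from profiling counters to runtime (or power) from training data. Wu et al. use clustering and neural networks; as a minimal stand-in for the idea, the sketch below fits ordinary least squares from a single hypothetical counter to measured kernel runtime. The counter name and all numbers are invented for illustration.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y ≈ a*x + b, where x is a profiling
    counter (say, global memory transactions) and y is measured runtime."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

# Synthetic training data generated as runtime = 2 * counter + 5.
a, b = fit_linear([1, 2, 3, 4], [7, 9, 11, 13])
```

The trade-off relative to analytical models is also visible here: the fit needs measured training runs, but no understanding of the microarchitecture.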

  15. Simulation: GPU Ocelot takes a PTX kernel and supports PTX emulation, LLVM translation, and native GPU execution

  16. Outline • Motivation • Performance models • Applications • Challenges

  17. Applications of performance models. Successfully used to: schedule concurrent kernels (covered today!), drive auto-tuning, estimate power consumption, identify performance bottlenecks, and balance workloads

  18. Auto-tuning • Optimization goals • Parameters • Large search space
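With a performance model as the cost function, auto-tuning reduces to searching the parameter space for the predicted optimum. The sketch below does a brute-force search over two hypothetical parameters; the cost model is invented purely for illustration, and real auto-tuners use the model to prune the large search space rather than enumerate it.

```python
def autotune(cost_model, block_sizes, unroll_factors):
    """Evaluate a (model-predicted) cost for every parameter
    combination and keep the cheapest configuration."""
    best = None
    for blocks in block_sizes:
        for unroll in unroll_factors:
            t = cost_model(blocks, unroll)
            if best is None or t < best[0]:
                best = (t, blocks, unroll)
    return best

# Invented cost model that happens to prefer 128-thread blocks and unroll 4.
def cost(blocks, unroll):
    return abs(blocks - 128) / 128 + abs(unroll - 4) / 4

t, blocks, unroll = autotune(cost, [64, 128, 256], [1, 2, 4, 8])
```

Because the cost comes from a model rather than real runs, the whole search completes without ever compiling or executing a kernel, which is exactly why prediction tools matter for auto-tuning.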

  19. Concurrent Kernel Execution Supported since Fermi Limitations: Registers, Shared Memory, Occupancy * Image from http://www.turkpaylasim.com/cevahir

  20. Outline • Motivation • Performance models • Applications • Challenges

  21. Challenges • Multi-GPU systems and heterogeneous systems • Each microarchitecture has its own features • More complex execution behavior is harder to model accurately

  22. References and Further reading
  Hong, Sunpyo, and Hyesoon Kim. "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness." ACM SIGARCH Computer Architecture News 37.3 (2009).
  Kim, Hyesoon, et al. "Performance analysis and tuning for general purpose graphics processing units (GPGPU)." Synthesis Lectures on Computer Architecture 7.2 (2012): 1-96.
  Lopez-Novoa, Unai, Alexander Mendiburu, and José Miguel-Alonso. "A survey of performance modeling and simulation techniques for accelerator-based computing." IEEE Transactions on Parallel and Distributed Systems 26.1 (2015): 272-281.
  Zhong, Jianlong, and Bingsheng He. "Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling." IEEE Transactions on Parallel and Distributed Systems 25.6 (2014): 1522-1532.

  23. Acknowledgements

  24. Thank you! #GTC16 http://medialab.ic.uff.br Contact: rquintanillac@ic.uff.br esteban@ic.uff.br

  25. Questions & Answers

  26. Backup Slides

  27. Simplified compilation flow. nvcc drives the CUDA compiler: the CUDA front end (cudafe) splits a .cu file into host code and device code; cicc, the high-level optimizer and PTX generator, emits .ptx (the virtual instruction set); ptxas, the PTX optimizing assembler, produces a .cubin (CUDA binary file); the fatbinary is combined with the host code by the host compiler into the CUDA executable

  28. Concurrent Kernel Execution. Leftover policy timeline: K1 runs in full waves of 16 blocks; K2 co-runs only at the end, where K1's leftover 4 blocks share the GPU with 12 blocks of K2. Kernel slicing timeline: both kernels are sliced so that every wave co-runs 6 blocks of K1 with 10 blocks of K2. * Jiao, Qing, et al. "Improving GPGPU energy-efficiency through concurrent kernel execution and DVFS."
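The two policies above can be sketched as wave schedules. The block counts and per-wave capacity below follow the slide's example; the functions and their names are illustrative simplifications, not code from the cited paper.

```python
def leftover_schedule(k1_blocks, k2_blocks, capacity):
    """Leftover policy: K1's full waves occupy the whole GPU; K2 gets
    resources only alongside K1's final, partially filled wave."""
    waves = []
    while k1_blocks >= capacity:
        waves.append({"K1": capacity, "K2": 0})
        k1_blocks -= capacity
    waves.append({"K1": k1_blocks, "K2": min(k2_blocks, capacity - k1_blocks)})
    return waves

def sliced_schedule(k1_blocks, k2_blocks, k1_slice, k2_slice):
    """Kernel slicing: both kernels are cut into fixed-size slices so
    each wave co-runs blocks from both, as long as both have work left."""
    waves = []
    while k1_blocks > 0 or k2_blocks > 0:
        waves.append({"K1": min(k1_slice, k1_blocks),
                      "K2": min(k2_slice, k2_blocks)})
        k1_blocks = max(0, k1_blocks - k1_slice)
        k2_blocks = max(0, k2_blocks - k2_slice)
    return waves
```

Running both on the slide's numbers (36 K1 blocks, 12 K2 blocks, 16 blocks per wave, 6+10 slicing) shows the difference: under the leftover policy K2 waits until K1's last wave, while slicing interleaves the two kernels from the first wave on.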
