
A Cross-Input Adaptive Framework for GPU Program Optimizations
Yixun Liu, Eddy Z. Zhang, Xipeng Shen
Computer Science Department, The College of William & Mary

Outline
• GPU overview
• G-Adapt Framework
• Evaluation
• Related & Future Work
• Conclusion

GPU (Graphics Processing Unit)
• Architecture
  ▫ SIMD parallel
  ▫ Multithreaded
  ▫ Many core
• Feature
  ▫ Tremendous computational horsepower
  ▫ High memory bandwidth
• Applications
  ▫ Traditional: graphics rendering
  ▫ Emerging: general data parallel computing

Programming GPU
▫ High level model
  ▪ Abstraction to multithread platform
  ▪ C-like programming
  ▪ No explicit mapping to graphics rendering
  ▪ E.g., CUDA, Brook+, OpenCL
▫ NVIDIA CUDA
  ▪ Kernel functions run on the GPU
  ▪ Threads → blocks → grids
[Figure: thread hierarchy, from the CUDA manual]

Optimization Challenges
• Goal
  ▫ Maximize throughput
    ▪ Increase occupancy; reduce latency and dynamic instruction count
• Difficulties
  ▫ Hard to predict optimization effects
    ▪ Non-linearity, coupling, undisclosed CUDA details
  ▫ GPU hardware complexities
    ▪ Limits: 512 threads per block, 768 threads per SM, etc.
    ▪ Various types of memories: constant, texture, etc.
  ▫ Input sensitivity
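The non-linearity mentioned above can be illustrated with the two thread limits quoted on this slide. The sketch below is a simplification, not the deck's own model: it assumes only the 512-threads-per-block and 768-threads-per-SM limits matter and ignores register, shared-memory, and resident-block limits. The function names are hypothetical.

```c
#include <assert.h>

/* Limits quoted on the slide. */
#define MAX_THREADS_PER_BLOCK 512
#define MAX_THREADS_PER_SM    768

/* Number of blocks that fit on one SM when only the thread limits apply.
   ASSUMPTION: register, shared-memory and block-count limits are ignored. */
int resident_blocks(int block_size)
{
    assert(block_size > 0 && block_size <= MAX_THREADS_PER_BLOCK);
    return MAX_THREADS_PER_SM / block_size;
}

/* Occupancy as a truncated percentage of the 768 thread slots per SM. */
int occupancy_percent(int block_size)
{
    return 100 * resident_blocks(block_size) * block_size / MAX_THREADS_PER_SM;
}
```

Under these assumptions a block size of 256 fills the SM with three blocks, while the larger block size of 512 leaves a third of the thread slots idle because only one block fits.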

Matrix-Vector Multiplication


G-ADAPT
• Empirical search-based optimization
  ▫ Three obstacles to address
    ▪ Construction of the optimization space
    ▪ Space pruning
    ▪ Cross-input adaptation

G-ADAPT: Overview
• Source-to-source compiler
• Cross-input adaptation
• Automatic search & transformations
• Easy integration of user knowledge through pragmas
[Diagram: Stage 1 takes code with pragmas & input, performs empirical search & data collection, and produces <Input, best optimizations> pairs; Stage 2 performs pattern recognition & code generation and produces the optimized input-adaptive GPU program]

G-ADAPT Pragmas
• Supports a programmer-compiler synergy
• Covers 2 levels of optimizations
  ▫ Execution configurations
    ▪ E.g., thread block dimensions
  ▫ Code transformations
    ▪ E.g., loop tile size, unrolling levels

Pragma Examples
    #pragma erange 64, 512, 2
    #define BLKSZ 256
    #pragma lpur_lrange 0, min(BLKSZ, 16), 2
    for (i = 1; i < BLKSZ; i++) { …… }
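The slide does not define the three pragma arguments. One plausible reading of `#pragma erange 64, 512, 2` is a lower bound, an upper bound, and a multiplicative step, which would yield the candidates 64, 128, 256, 512. The sketch below enumerates a range under that assumption only; `expand_range` is a hypothetical helper, not part of G-ADAPT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes lo, lo*factor, lo*factor^2, ... (while <= hi)
   into out[] and returns how many values were written (at most cap).
   ASSUMPTION: the third pragma argument is a multiplicative step; the slide
   does not say so. */
size_t expand_range(int lo, int hi, int factor, int *out, size_t cap)
{
    assert(lo > 0 && factor > 1);
    size_t n = 0;
    for (int v = lo; v <= hi && n < cap; v *= factor)
        out[n++] = v;
    return n;
}
```

If the step were additive instead, the same interface would apply with `v += step` in the loop; the slide alone does not settle which is intended.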

Stage I: Search & Collect
[Diagram components: Code with pragmas & inputs; G-ADAPT compiler; Optimization Parameters; Optimized GPU code; Performance Calibrator; Performance; Perf. DB; Optimization Agent]

G-ADAPT Compiler
• Two functionalities
  ▫ Recognize opt. space
  ▫ Program transformations
• Based on Cetus [Purdue Univ]
• Source-to-source
• GPU extensions
• Support G-ADAPT pragmas

Performance Calibrator
• Invokes the CUDA compiler and runs the executable
• Collects running time and GPU occupancy

Optimization Agent
• Determines the optimization param. to try next
• Uses hill climbing to overcome the space explosion problem
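Hill climbing, as named on this slide, can be sketched in a few lines. The version below is a generic illustration and not the authors' implementation: it assumes a one-dimensional list of candidate configurations, a hypothetical cost callback that returns a measured running time (lower is better), and a neighbourhood of the two adjacent indices. `table_cost` is a stand-in that reads costs from an array instead of running a kernel.

```c
#include <assert.h>

/* Hypothetical cost callback: running time of configuration `index`. */
typedef double (*cost_fn)(int index, void *ctx);

/* Stand-in cost function for illustration: looks the cost up in a table. */
double table_cost(int index, void *ctx)
{
    return ((const double *)ctx)[index];
}

/* Greedy hill climbing over n candidate configurations.
   Moves to the cheaper neighbour until no neighbour improves on the
   current point, then returns that index (a local optimum). */
int hill_climb(int n, int start, cost_fn cost, void *ctx)
{
    assert(n > 0 && start >= 0 && start < n);
    int cur = start;
    double cur_cost = cost(cur, ctx);
    for (;;) {
        int best = cur;
        double best_cost = cur_cost;
        if (cur > 0) {
            double c = cost(cur - 1, ctx);
            if (c < best_cost) { best = cur - 1; best_cost = c; }
        }
        if (cur + 1 < n) {
            double c = cost(cur + 1, ctx);
            if (c < best_cost) { best = cur + 1; best_cost = c; }
        }
        if (best == cur)
            return cur;          /* no neighbour is better: stop */
        cur = best;
        cur_cost = best_cost;
    }
}
```

The search evaluates only the points along its path rather than the whole space, which is the saving the slide refers to. The price is that it can stop at a local optimum that depends on the starting point.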

G-ADAPT: Overview
[Diagram: Stage 1 takes code with pragmas & input, performs empirical search & data collection, and produces <Input, best optimizations> pairs; Stage 2 performs pattern recognition & code generation and produces the optimized input-adaptive GPU program]

Stage II: PR & Code Gen
• Pattern recognizer
  ▫ Recognizes the mapping from input to best parameters
  ▫ Regression Trees with Least Mean Square
• Options for code generator
  ▫ Multiple versions
  ▫ JIT compilers
  ▫ Linker
[Diagram components: Perf. DB; Pattern Recognizer; G-ADAPT Code Generator; Optimized input-adaptive GPU program]
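The core step of a regression tree fitted by least squares is choosing the split that minimizes the summed squared error of the two resulting groups. The sketch below shows that single step only and is an illustration under stated assumptions, not the G-ADAPT pattern recognizer: samples are assumed to be sorted by one input feature (for example, problem size), `y` holds the best parameter found for each sample, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of squared deviations from the mean of y[lo..hi). */
static double sse(const double *y, size_t lo, size_t hi)
{
    double mean = 0.0, s = 0.0;
    for (size_t i = lo; i < hi; i++)
        mean += y[i];
    mean /= (double)(hi - lo);
    for (size_t i = lo; i < hi; i++)
        s += (y[i] - mean) * (y[i] - mean);
    return s;
}

/* Returns the split position k (1 <= k < n) that minimises
   sse(y[0..k)) + sse(y[k..n)). Samples must already be sorted by the
   input feature; a full tree would apply this recursively to each side. */
size_t best_split(const double *y, size_t n)
{
    assert(n >= 2);
    size_t best_k = 1;
    double best = sse(y, 0, 1) + sse(y, 1, n);
    for (size_t k = 2; k < n; k++) {
        double e = sse(y, 0, k) + sse(y, k, n);
        if (e < best) { best = e; best_k = k; }
    }
    return best_k;
}
```

With best block sizes of 64, 64, 64, 256, 256 for five inputs of increasing size, the split falls between the third and fourth sample, so each leaf predicts a single block size.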

G-ADAPT
[Diagram components: Code with pragmas & inputs; G-ADAPT compiler; Optimization Parameters; Optimized GPU code; Performance Calibrator; Performance; Perf. DB; Optimization Agent; Pattern Recognizer; G-ADAPT Code Generator; Final input-adaptive GPU program]

Evaluation - Platform
• GPU: NVIDIA GeForce 8800 GT
  ▫ 14 multiprocessors (MP), 112 cores
  ▫ 512 MB global mem, 16 KB shared mem/MP, 8192 registers/MP
  ▫ CUDA 2.0
• Host: Intel Xeon 3.6 GHz, SUSE Linux 2.6.22

Benchmarks

Benchmark      Description                                        # of Inputs
convolution    Convolution filter of a 2D signal                  10
matrixMul      Dense matrix multiplication                        9
mvMul          Dense matrix vector multiplication (by Fujimoto)   15
reduction      Sum of an array                                    15
scalarProd     Scalar products                                    7
transpose      Matrix transpose                                   18
transpose-co   Coalescing matrix transpose                        18

Training and Prediction

Benchmark      Training iterations   Training time (s)   Prediction accuracy
convolution    200                   2825                100%
matrixMul      196                   2539                100%
mvMul          124                   124                 93.3%
reduction      75                    29                  80%
scalarProd     93                    237                 100%
transpose      54                    1639                100%
transpose-co   54                    631                 100%

Matrix Vector Multiplication
[Figure panels: Best Parameter vs. Input; Speedup vs. Input]

Speedup over default
[Figures: speedup over default, shown over three slides]
Paper: http://www.cs.wm.edu/~xshen/Publications/ipdps09.pdf

Related Work
• Ryoo+: CGO’08
  ▫ Efficiency and utilization model for search
  ▫ Manual transformation; assumptions on applications
• Baskaran+: ICS’08
  ▫ Polyhedral model for optimizing memory access
  ▫ Limited to affine loop nests
Features of G-Adapt
• First generally applicable framework
• Cross-input adaptation

Future Work
• More optimization options
  ▫ Algorithm selection
  ▫ Memory optimization
  ▫ Divergence elimination
• General support
  ▫ Cetus is an ANSI C compiler
    ▪ Non-ANSI C features, C++
  ▫ CUDA built-in types
    ▪ E.g., float4, texture, etc.

Conclusion
• A general tool for GPU optimization
• Cross-input adaptation
• Synergy between compilers and programmers
• Alternative to manual tuning, enabling easy adaptation across architectures

Acknowledgement
• Cetus authors at Purdue
  ▫ Group led by Eigenmann and Midkiff
• John Owens
• NVIDIA
  ▫ Donation of device
• NSF grants

Thank you!
