Woodpecker-DL: an efficient compiler for accelerating deep learning on heterogeneous computing architectures
Yong Chen, Ant Financial, China
Dec. 19, 2019
Outline
- Woodpecker-DL Overview
- Key Components and Technology
  - Graph Optimization
  - Auto Search with GA, RL Algorithm
  - DSL Compiler (Halide)
- Integration with TensorRT and Experiment Figures
Introduction
Accelerating model training and inference is crucial to deep learning:
- Graph-level optimizations increase per-node compute efficiency and reduce inter-node data-movement overhead.
- Operator-level optimizations speed up the execution of mathematical functions.
- Specialized hardware targeting deep learning is being actively explored.
Woodpecker-DL aims to accelerate deep learning on heterogeneous computing architectures by means of compiling techniques:
- Explores multi-level optimizations from the compute graph down to the hardware.
- Exploits machine-learning-based fast hyper-parameter search to yield better math-function kernels.
- Supports diverse hardware including CPUs, GPUs, and FPGAs.
Woodpecker-DL is part of Woodpecker, a generic compiler framework for heterogeneous computing developed at Ant Financial.
Deep Learning Compilers
[Diagram: the deep-learning compiler optimization stack — a compute graph optimization framework lowers models to tensor expressions of the graph; auto-tuners and a DSL compiler then target software backends (LLVM, CUDA, Metal) and hardware backends (Verilog, HLS, Spatial), alongside expert-optimized libraries.]
Woodpecker-DL Architecture
[Diagram: the Woodpecker Frontend performs graph optimization (in-place, pruning, fusion) and shape inference; math functions in the optimized graph feed the Woodpecker AutoSearch Optimizer, which handles ordinary and composite functions (with a safeguard fallback) and generates CUDA/assembly code; Woodpecker Addons provide custom TensorFlow ops, custom PyTorch extensions, and TensorRT plugins; the Woodpecker Runtime Engine serves the optimized model through a proprietary engine, TensorFlow, PyTorch, or TensorRT.]
Outline (current section): Graph Optimization
Graph Optimization
Supports multiple deep learning frameworks: TensorFlow, PyTorch, Caffe, CoreML.
Compute graph optimization:
- Simplification, removal, and fusion of operators
- Horizontal or vertical compositional transformation
- Shape inference of operators
[Diagram: a well-known graph optimization example — a Convolution → Batch Normalization → Activation chain is simplified and fused into Convolution + BiasAdd ("smart batch norm") → Activation by merging the batch-norm parameters into the convolution.]
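The conv + batch-norm merge shown in the diagram can be sketched in a few lines of NumPy: at inference time the BatchNorm affine transform folds into the convolution's weights and bias, leaving a single operator. The function and parameter names here are illustrative, not Woodpecker's actual API.

```python
import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding convolution.

    W: conv weights (C_out, C_in, kH, kW); b: conv bias (C_out,)
    gamma, beta, mean, var: BatchNorm parameters, all shape (C_out,)
    Returns (W', b') with conv(x, W') + b' == BN(conv(x, W) + b),
    so the fused graph executes one operator instead of two.
    """
    scale = gamma / np.sqrt(var + eps)        # per-output-channel scale
    W_fused = W * scale[:, None, None, None]  # scale each output filter
    b_fused = (b - mean) * scale + beta       # fold mean/shift into bias
    return W_fused, b_fused
```

The key identity is BN(z) = scale * (z - mean) + beta with z = conv(x, W) + b, which distributes over the (linear) convolution.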
Outline (current section): Auto Search with GA, RL Algorithm
AutoSearch Optimizer
A machine-learning-based framework for automated mathematical kernel optimization.
[Diagram: algorithms from various domains (deep learning, graph computing, data analysis) are expressed as parameterized programs in DSLs such as Halide, GraphIt, Weld, Spatial, or CUDA; optimization algorithms (genetic, RL, Bayesian, MCMC, SA) search the schedule space using feedback from measurement (profiling, performance models, historical data) on target hardware (CPU, GPU, FPGA, Ali-NPU, Plasticine, mobile/embedded), yielding efficient programs or hardware.]
AutoSearch: Genetic Algorithm
- Varies the population size according to the scale of the actual search space
- Joins all hyper-parameters (genes) in order to form a chromosome
- Uses roulette-wheel selection
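A minimal sketch of the scheme the slide describes: the hyper-parameters are joined into one fixed-length chromosome and parents are drawn by roulette-wheel (fitness-proportionate) selection. All names, rates, and the fitness interface are illustrative; Woodpecker's actual tuner additionally adapts the population size to the search-space scale and evaluates fitness by measuring compiled kernels on hardware.

```python
import random

def roulette_select(population, fitnesses):
    """Fitness-proportionate (roulette wheel) selection of one chromosome.
    Requires strictly positive fitness values."""
    pick = random.uniform(0, sum(fitnesses))
    acc = 0.0
    for chrom, fit in zip(population, fitnesses):
        acc += fit
        if acc >= pick:
            return chrom
    return population[-1]

def genetic_search(space, fitness_fn, pop_size=32, generations=20,
                   crossover_p=0.8, mutate_p=0.1):
    """Evolve chromosomes of joined hyper-parameters (genes).
    space: list of (lo, hi) integer ranges, one per gene; needs >= 2 genes."""
    pop = [[random.randint(lo, hi) for lo, hi in space] for _ in range(pop_size)]
    best = max(pop, key=fitness_fn)               # keep the best ever seen
    for _ in range(generations):
        fits = [fitness_fn(c) for c in pop]
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = roulette_select(pop, fits), roulette_select(pop, fits)
            child = list(p1)
            if random.random() < crossover_p:     # one-point crossover
                cut = random.randrange(1, len(space))
                child = p1[:cut] + p2[cut:]
            for i, (lo, hi) in enumerate(space):  # per-gene mutation
                if random.random() < mutate_p:
                    child[i] = random.randint(lo, hi)
            nxt.append(child)
        pop = nxt
        best = max(pop + [best], key=fitness_fn)
    return best
```

In the real system the fitness of a chromosome would be the measured speed of the kernel compiled from that schedule; here any positive-valued function will do.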
AutoSearch: Search Space TileRZ Shared Thread Options ThreadX ThreadY ThreadZ TileX TileY TileZ Mem LoopOrder Range (1, 56) (1, 56) (1, 64) (1, 8) (1, 8) (1, 8) (1, 6) (1, 4) (1, 6) Take convolution as an example : Image size (1, 64, 56, 56), filter size (64, 64, 1, 1) 9 optimizing dimensions: data splitting dimension, granularity, processing order, caching or not 56 * 56 * 64 * 8 * 8* 8 * 6 * 4 * 6 = 14 billion choices Brute force: 14 billion * 100 ms per iteration → 22 22 years ars Brute force with pruning: 230 thousands choices → 1. 1.35 35 da days Genetic search: 1600 choices → 12 12 mi minu nutes tes
AutoSearch Performance: Genetic Algorithm
- Converges in 10 minutes with a population size of 64
- 2.8x faster than NVIDIA cuDNN, 1.5x faster than TVM
AutoSearch: Reinforcement Learning
- Customized environment and policy graph
- Uses the RLlib scalable reinforcement learning framework
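The slide's setup uses RLlib with a customized environment and policy graph. As a library-free stand-in, here is a toy tabular Q-learning loop over a schedule-tuning environment of the same general shape: state = the current schedule, action = nudge one hyper-parameter up or down, reward = negative cost from a synthetic surrogate (NOT real kernel profiling). Every name and number here is illustrative, not the authors' RLlib code.

```python
import random

class ScheduleEnv:
    """Toy schedule-tuning environment: the reward is -cost under a
    synthetic quadratic surrogate with a hidden optimum schedule."""
    def __init__(self, space, optimum):
        self.space = space        # list of (lo, hi) per hyper-parameter
        self.optimum = optimum    # hidden best schedule for the surrogate
        self.reset()

    def reset(self):
        self.state = [random.randint(lo, hi) for lo, hi in self.space]
        return tuple(self.state)

    def step(self, action):
        idx, up = divmod(action, 2)          # which parameter, +1 or -1
        lo, hi = self.space[idx]
        self.state[idx] = min(hi, max(lo, self.state[idx] + (1 if up else -1)))
        cost = sum((s, o) == () or (s - o) ** 2
                   for s, o in zip(self.state, self.optimum))
        return tuple(self.state), -cost      # reward: higher is faster

def q_learn(env, episodes=200, steps=30, eps=0.2, alpha=0.5, gamma=0.9):
    """Tabular epsilon-greedy Q-learning over the toy environment."""
    n_actions = 2 * len(env.space)
    Q = {}
    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps):
            if random.random() < eps or s not in Q:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2, r = env.step(a)
            Q.setdefault(s, [0.0] * n_actions)
            best_next = max(Q.get(s2, [0.0] * n_actions))
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s2
    return Q
```

In the real system the per-step reward would come from measuring the compiled kernel, and RLlib would handle the policy network and distributed rollouts.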
AutoSearch Performance: Reinforcement Learning
Operations taken from a convolutional model for Ant Financial's payment business.
RL finds better-performing schedules than GA in some cases (within the same search time), but RL does not always outperform GA.
[Chart: relative speedup over cuDNN for operators conv1a, conv1b, conv2, conv3, and conv4, comparing Woodpecker GA and Woodpecker RL; speedups range from 0.75x to 2.67x.]
Outline (current section): DSL Compiler (Halide)
DSL Compiler: Halide
A domain-specific language (DSL) and compiler for image processing pipelines.
- Separates the algorithm from the schedule
- Enables more efficient and flexible optimizations
- Open source: https://github.com/halide/Halide

Algorithm:
  g(x, y) = x + y
  f(x, y) = (g(x, y - 1) + g(x, y) + g(x, y + 1)) / 3
Schedule:
  f.gpu_tile(x, y, xo, yo, xi, yi, 8, 8)
Intermediate Codes Generated by Halide
[Figure: listing of the intermediate code Halide generates for the example algorithm and schedule.]
Halide Schedules
Drawbacks:
- Domain-specific knowledge and skill are still needed to get good performance.
- Even for a specific architecture, there is a considerable number of schedules to explore.
- Some schedules are architecture-aware, so different hardware needs different schedules.
Example schedules: loop split, reorder, unroll, tile, storage layout, etc.
Stage (storage) granularity:
- Coarse-grained: insufficient shared memory, limiting other schedules
- Fine-grained: insufficient data reuse and inefficient loads/stores
Schedules are crucial for achieving high performance on a given math function, which motivated the development of automated search for optimal schedules.
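To make "loop split, reorder, tile" concrete, here is what tiling does to a plain matrix-multiply loop nest, written out by hand in Python. Halide would derive the equivalent transformed nest from a one-line schedule; this sketch only illustrates the transformation itself.

```python
def matmul_tiled(A, B, tile=4):
    """Matrix multiply with the loop nest split and reordered into tiles,
    mimicking what Halide's split/reorder/tile schedules do to a loop nest.
    A: n x k, B: k x m (lists of lists); tile need not divide the sizes."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, tile):                 # outer loops walk tiles...
        for jj in range(0, m, tile):
            for kk in range(0, k, tile):
                for i in range(ii, min(ii + tile, n)):   # ...inner loops
                    for p in range(kk, min(kk + tile, k)):  # walk one tile
                        a = A[i][p]              # A[i][p] reused across j
                        for j in range(jj, min(jj + tile, m)):
                            C[i][j] += a * B[p][j]
    return C
```

The products accumulated are identical to the naive triple loop; only the iteration order changes, so that each tile of A, B, and C stays hot in cache (or, on a GPU, fits in shared memory) while it is reused.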
An Example Schedule: With vs. Without Layout Optimization
Storage transform (put C_o innermost): (N, C_o, H, W) → (N, H, W, C_o)
- N: batch size; C_o: output channels; H: output height; W: output width
[Chart: performance profiling — the layout-optimized schedule runs 1.625x faster than the schedule without layout optimization, with substantially higher global load efficiency, global store efficiency, shared-memory efficiency, and occupancy.]
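The storage transform itself is just an axis permutation; a NumPy sketch (illustrative, not Woodpecker code) of the (N, C_o, H, W) → (N, H, W, C_o) layout change:

```python
import numpy as np

def to_nhwc(t):
    """Move output channels C_o innermost: (N, C_o, H, W) -> (N, H, W, C_o).
    With C_o innermost, adjacent GPU threads handling adjacent channels
    touch adjacent addresses, so global loads/stores coalesce."""
    return np.ascontiguousarray(t.transpose(0, 2, 3, 1))

x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)   # (N, C_o, H, W)
y = to_nhwc(x)
```

The `ascontiguousarray` call is what actually rewrites memory; a bare `transpose` only changes strides, leaving the physical layout (and the access pattern) unchanged.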
Outline (current section): Integration with TensorRT and Experiment Figures
Runtime Engines
Supports multiple inference engines: a proprietary engine, plus external serving via TensorFlow, PyTorch, and TensorRT.
[Diagram: Woodpecker-DL integration with TensorRT.]
Performance: ResNet-18 (Breakdown)
Relative speedups for the individual convolution operations c1–c12 in ResNet-18 (higher is better).
[Chart: per-operator relative speedup for cuDNN (baseline), TVM, and Woodpecker across c1–c12; values range from 0.53x to 5.40x.]
Performance: ResNet-18 (Summation)
Summing the runtimes of all convolution operations in ResNet-18 (higher is better):
- cuDNN: 1.00x (baseline)
- TVM: 1.67x
- Woodpecker: 2.04x
Performance: Ant Financial Payment Model
End-to-end comparison on the Ant Financial payment business model, with dynamic batching enabled (higher is better).
[Chart: relative speedup of TensorRT vs. Woodpecker across batch sizes 1–16; speedups range from 1.20x to 2.12x.]
References
- S. Chetlur et al. (2014). cuDNN: Efficient Primitives for Deep Learning. arXiv:1410.0759v3.
- T. Chen et al. (2018). TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18).
- E. Liang et al. (2018). RLlib: Abstractions for Distributed Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR 80:3053-3062.
- J. Ragan-Kelley et al. (2018). Halide: Decoupling Algorithms from Schedules for High-Performance Image Processing. Communications of the ACM, 61(1):106-115.
- NVIDIA TensorRT (2019). Programmable Inference Accelerator. https://developer.nvidia.com/tensorrt
Team Members
Chen, Yong; Liu, Yongchao; Ou, Hang; Jin, Yue; Zhao, Rui; Zhang, Yao; Teng, Teng

Thank You!