
Woodpecker-DL: an efficient compiler for accelerating deep learning on heterogeneous computing architectures



  1. Woodpecker-DL: an efficient compiler for accelerating deep learning on heterogeneous computing architectures. Yong Chen, Ant Financial, China. Dec. 19, 2019.

  2. Outline
     - Woodpecker-DL Overview
     - Key Components and Technology
       - Graph Optimization
       - Auto Search with GA and RL Algorithms
       - DSL Compiler (Halide)
     - Integration with TensorRT and Experiment Figures

  3. Introduction
     - Accelerating model training and inference is crucial to deep learning:
       - Graph-level optimizations increase inner-node compute efficiency and reduce inter-node data movement overhead
       - Operator-level optimizations speed up mathematical function execution
       - Specialized hardware targeting deep learning is being explored
     - Woodpecker-DL aims to accelerate deep learning on heterogeneous computing architectures by means of compiling techniques:
       - Explores multi-level optimizations from the compute graph down to hardware
       - Exploits machine-learning-based fast hyper-parameter search approaches to yield better math function kernels
       - Supports diverse hardware including CPUs, GPUs, FPGAs, et al.
     - Woodpecker-DL is part of Woodpecker, a generic compiler framework for heterogeneous computing developed at Ant Financial.

  4. Deep Learning Compilers
     [Diagram: a typical deep learning compiler stack — Framework → Compute Graph Optimization → Tensor Expressions of Graph → Auto-Tuners → DSL Compiler, lowering to software DL back ends (LLVM, CUDA, Metal), hardware DL back ends (Verilog, HLS, Spatial), and expert-optimized libraries.]

  5. Woodpecker-DL Architecture
     [Diagram: the Woodpecker Frontend performs graph optimization (in-place, pruning, fusion) and shape inference on the input model to produce an optimized graph; the math functions in the optimized graph (ordinary and composite functions, with a safeguard) are tuned by the Woodpecker AutoSearch Optimizer, which generates CUDA/assembly code; Woodpecker Addons provide custom TF ops, TensorRT plugins, and custom PyTorch extensions; the Woodpecker Runtime Engine executes via a proprietary engine, TensorFlow, PyTorch, or TensorRT.]

  6. Outline — next section: Graph Optimization

  7. Graph Optimization
     - Supports multiple deep learning frameworks: TensorFlow, PyTorch, Caffe, CoreML
     - Compute graph optimization:
       - Simplification, removal, and fusion
       - Horizontal or vertical compositional transformation
     - Shape inference of operators
     [Figure: a well-known graph optimization example — before merging, the graph contains Convolution → Batch Normalization → BiasAdd → Activation; fusing and simplifying merges the convolution and batch norm into a single smart operator followed by the activation.]
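To make the convolution + batch-norm fusion concrete, here is a minimal NumPy sketch (illustrative only, not Woodpecker-DL code; the function name and tensor shapes are assumptions) of folding a BatchNorm layer's parameters into the preceding convolution's weights and bias:

```python
import numpy as np

def fold_batchnorm_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = BN(conv(x, W) + b) into a single convolution.

    W:     conv weights, shape (C_out, C_in, kH, kW)
    b:     conv bias, shape (C_out,) -- pass zeros if the conv has no bias
    gamma, beta, mean, var: per-channel BatchNorm parameters, shape (C_out,)
    """
    scale = gamma / np.sqrt(var + eps)            # per-output-channel scale
    W_folded = W * scale[:, None, None, None]     # rescale each output filter
    b_folded = (b - mean) * scale + beta          # fold the shift into the bias
    return W_folded, b_folded

# After folding, conv(x, W_folded) + b_folded == BN(conv(x, W) + b), so the
# BatchNorm node can be removed from the inference graph.
```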

  8. Outline — next section: Auto Search with GA and RL Algorithms

  9. AutoSearch Optimizer
     - A machine-learning-based framework for automated mathematical kernel optimization
     [Diagram: algorithms from various domains (deep learning, graph computing, data analysis, ...) are expressed as parameterized programs in DSLs such as Halide, GraphIt, Weld, Spatial, and CUDA; optimization algorithms (genetic, RL, Bayesian, MCMC, SA, ...) explore math and hardware optimizations, guided by measurement feedback from profiling, performance models, and historical data, to produce an efficient program or hardware design for targets including CPU, GPU, FPGA, Ali-NPU, Plasticine, and mobile/embedded devices.]

  10. AutoSearch: Genetic Algorithm
     - Varies the population size according to the scale of the real search space
     - Joins all hyper-parameters (genes) to form a chromosome
     - Uses roulette-wheel selection
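Below is a minimal, generic sketch of such a genetic search loop (not Woodpecker-DL's implementation): a chromosome is the joined list of schedule hyper-parameters, and parents are picked by roulette-wheel selection. The parameter ranges mirror the convolution search-space table on the next slide; `measure_latency` is a caller-supplied callback assumed to compile and time a candidate schedule on the target device.

```python
import random

# Hypothetical search dimensions; each gene is a value in its range, and a
# chromosome is simply the joined list of genes.
RANGES = {"thread_x": 56, "thread_y": 56, "thread_z": 64,
          "tile_x": 8, "tile_y": 8, "tile_z": 8,
          "tile_rz": 6, "shared_mem": 4, "loop_order": 6}

def random_chromosome():
    return [random.randint(1, hi) for hi in RANGES.values()]

def fitness(chrom, measure_latency):
    # Lower latency -> higher fitness.
    return 1.0 / measure_latency(dict(zip(RANGES, chrom)))

def roulette_select(population, scores):
    # Roulette wheel: selection probability proportional to fitness.
    r = random.uniform(0, sum(scores))
    acc = 0.0
    for chrom, s in zip(population, scores):
        acc += s
        if acc >= r:
            return chrom
    return population[-1]

def genetic_search(measure_latency, pop_size=64, generations=25,
                   crossover_p=0.8, mutation_p=0.1):
    population = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c, measure_latency) for c in population]
        nxt = []
        while len(nxt) < pop_size:
            a = roulette_select(population, scores)
            b = roulette_select(population, scores)
            if random.random() < crossover_p:          # one-point crossover
                cut = random.randrange(1, len(a))
                a = a[:cut] + b[cut:]
            a = [random.randint(1, hi) if random.random() < mutation_p else g
                 for g, hi in zip(a, RANGES.values())]  # mutation
            nxt.append(a)
        population = nxt
    scores = [fitness(c, measure_latency) for c in population]
    return max(zip(scores, population))[1]              # best chromosome found
```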

  11. AutoSearch: Search Space
     Options and ranges: ThreadX (1, 56), ThreadY (1, 56), ThreadZ (1, 64), TileX (1, 8), TileY (1, 8), TileZ (1, 8), TileRZ (1, 6), Shared Mem (1, 4), LoopOrder (1, 6)
     - Take convolution as an example:
       - Image size (1, 64, 56, 56), filter size (64, 64, 1, 1)
       - 9 optimizing dimensions: data splitting dimension, granularity, processing order, caching or not
       - 56 * 56 * 64 * 8 * 8 * 8 * 6 * 4 * 6 ≈ 14 billion choices
     - Brute force: 14 billion choices at 100 ms per iteration → 22 years
     - Brute force with pruning: 230 thousand choices → 1.35 days
     - Genetic search: 1,600 choices → 12 minutes
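As a quick back-of-the-envelope check on the numbers above (a standalone snippet, not Woodpecker-DL code; the 25-generation figure is an assumption chosen only to match the 1,600 measured candidates):

```python
# Full Cartesian search space implied by the ranges in the table above.
space = 56 * 56 * 64 * 8 * 8 * 8 * 6 * 4 * 6
print(f"{space:,} candidate schedules")        # 14,797,504,512 -> "14 billion choices"

# A genetic search only ever measures population_size * generations candidates,
# e.g. a population of 64 evolved for 25 generations:
print(64 * 25, "schedules actually measured")  # 1600 -> the "1,600 choices" above
```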

  12. AutoSearch Performance: Genetic Algorithm
     - Converges in 10 minutes with a population size of 64
     - 2.8x faster than NVIDIA cuDNN, 1.5x faster than TVM

  13. AutoSearch: Reinforcement Learning
     - Customized environment and policy graph (a toy environment sketch follows below)
     - Uses the RLlib scalable reinforcement learning framework
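To illustrate how schedule search can be posed as a reinforcement-learning task (an assumed sketch, not Woodpecker-DL's actual environment or policy graph), the search can be wrapped as a Gym-style environment and handed to RLlib; the observation/action encoding and the `measure_latency` callback are placeholders.

```python
import gym
import numpy as np
from gym import spaces

class ScheduleSearchEnv(gym.Env):
    """Toy schedule-search environment: each step fixes one hyper-parameter."""

    def __init__(self, config=None):
        config = config or {}
        # Hypothetical per-dimension upper bounds (mirrors the search-space slide).
        self.bounds = config.get("bounds", [56, 56, 64, 8, 8, 8, 6, 4, 6])
        # measure_latency compiles and profiles a candidate schedule; stubbed here.
        self.measure_latency = config.get("measure_latency",
                                          lambda genes: 1.0 + sum(genes))
        self.action_space = spaces.Discrete(max(self.bounds))
        self.observation_space = spaces.Box(0.0, 1.0, shape=(len(self.bounds),),
                                            dtype=np.float32)
        self.reset()

    def _obs(self):
        return np.array([g / b for g, b in zip(self.genes, self.bounds)],
                        dtype=np.float32)

    def reset(self):
        self.genes = [0] * len(self.bounds)
        self.idx = 0
        return self._obs()

    def step(self, action):
        # Map the chosen action into this dimension's valid range (1..bound).
        self.genes[self.idx] = 1 + int(action) % self.bounds[self.idx]
        self.idx += 1
        done = self.idx == len(self.bounds)
        # Reward only once a full schedule is chosen: negative measured latency.
        reward = -float(self.measure_latency(self.genes)) if done else 0.0
        return self._obs(), reward, done, {}

# Assumed RLlib usage (not shown on the slide):
#   from ray import tune
#   tune.run("PPO", config={"env": ScheduleSearchEnv,
#                           "env_config": {"measure_latency": my_profiler}})
```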

  14. AutoSearch Performance: Reinforcement Learning
     - Operations taken from a convolutional model used in Ant Financial's payment business
     - RL finds better performance than GA in some cases (within the same search time)
     [Bar chart: relative speedups over cuDNN of Woodpecker GA and Woodpecker RL for conv1a, conv1b, conv2, conv3, and conv4.]
     - Note: RL does not always outperform GA.

  15. Outline — next section: DSL Compiler (Halide)

  16. DSL Compiler: Halide
     - A Domain-Specific Language (DSL) and compiler for image processing pipelines
     - Separates the algorithm from the schedule, enabling more efficient and flexible optimizations
     - Open source: https://github.com/halide/Halide
     - Algorithm:
         g(x, y) = x + y
         f(x, y) = (g(x, y - 1) + g(x, y) + g(x, y + 1)) / 3
     - Schedule:
         f.gpu_tile(x, y, xo, yo, xi, yi, 8, 8)

  17. Intermediate Code Generated by Halide

  18. Halide Schedules
     - Drawbacks:
       - Still requires domain-specific knowledge and skill to get good performance
       - For a given architecture there is a considerable number of schedules to explore
       - Some schedules are architecture-aware, so different hardware needs to exploit different schedules
     - Example schedules (a small granularity sketch follows below):
       - Loop: split, reorder, unroll, tile, storage layout, et al.
       - Stage granularity:
         - Coarse-grained: insufficient shared memory, limiting other schedules (storage granularity)
         - Fine-grained: insufficient data reuse and inefficient loads/stores
     - Schedules are crucial for getting high performance out of a given math function, which motivated the development of automated search approaches for optimal schedules.
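To illustrate the loop and stage-granularity schedules on the blur pipeline from the previous slides, here is a small sketch written against Halide's Python bindings (an assumed, illustrative rendering; the slide itself uses the C++ front end, and Woodpecker-DL's generated schedules are not shown here):

```python
import halide as hl

x, y, xo, yo, xi, yi = (hl.Var(n) for n in ("x", "y", "xo", "yo", "xi", "yi"))
g, f = hl.Func("g"), hl.Func("f")

# Algorithm (unchanged by any schedule below).
g[x, y] = x + y
f[x, y] = g[x, y - 1] + g[x, y] + g[x, y + 1]

# Loop schedules: tile the output and parallelize over tile rows.
f.tile(x, y, xo, yo, xi, yi, 8, 8)
f.parallel(yo)

# Stage granularity, coarse-grained: materialize all of g before f runs.
# Maximum reuse of g, but the whole intermediate must be stored.
g.compute_root()

# Fine-grained alternative: recompute g inside each 8x8 tile of f.
# Little extra storage, but rows of g on tile boundaries are recomputed.
# g.compute_at(f, xo)
```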

  19. An Example Schedule without Layout Optimization
     - Storage transform (put C_o innermost): (N, C_o, H, W) → (N, H, W, C_o)
       - N: batch size, C_o: output channels, H: output height, W: output width
     - Performance profiling: [Bar charts comparing the layout-optimized schedule against the one without layout optimization on global load efficiency, global store efficiency, shared memory efficiency, and occupancy; the layout-optimized schedule achieves a 1.625x relative speedup.]
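The storage transform above simply moves the output-channel dimension innermost; in NumPy terms (an illustrative stand-in for the layout choice made by the generated kernel, not Woodpecker-DL code):

```python
import numpy as np

t_nchw = np.zeros((1, 64, 56, 56), dtype=np.float32)        # (N, C_o, H, W)
t_nhwc = np.ascontiguousarray(t_nchw.transpose(0, 2, 3, 1)) # (N, H, W, C_o)

# With C_o innermost, adjacent threads handling neighbouring output channels
# touch consecutive addresses, which is what improves the global load/store
# efficiency reported in the profile above.
assert t_nhwc.shape == (1, 56, 56, 64)
```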

  20. Outline — next section: Integration with TensorRT and Experiment Figures

  21. Runtime Engines
     - Supports multiple inference engines: a proprietary engine and external serving via TensorFlow, PyTorch, and TensorRT
     [Diagram showing Woodpecker-DL integration with TensorRT.]

  22. Performance: ResNet-18 (Breakdown)
     - For separate convolution operations
     [Bar chart: ResNet-18, higher is better — relative speedups of cuDNN, TVM, and Woodpecker for twelve convolution operations (c1–c12).]

  23. Performance: ResNet-18 (Summation)
     - Sums the runtimes of all convolution operations
     [Bar chart: ResNet-18, higher is better — accumulated relative speedup: cuDNN 1.0 (baseline), TVM 1.67, Woodpecker 2.04.]

  24. Performance: Ant Financial Payment Model
     - Dynamic batching enabled
     [Bar chart: Ant Financial payment business, higher is better — relative speedup of Woodpecker over TensorRT for batch sizes 1–16, ranging from about 2.1x at small batch sizes down to roughly 1.2–1.3x at larger ones.]

  25. References
     - S. Chetlur et al. (2014). cuDNN: Efficient Primitives for Deep Learning. arXiv:1410.0759v3.
     - T. Chen et al. (2018). TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI '18).
     - E. Liang et al. (2018). RLlib: Abstractions for Distributed Reinforcement Learning. Proceedings of the 35th International Conference on Machine Learning, PMLR 80:3053-3062.
     - J. Ragan-Kelley et al. (2018). Halide: Decoupling Algorithms from Schedules for High-Performance Image Processing. Communications of the ACM, 61(1):106-115.
     - NVIDIA TensorRT (2019). Programmable Inference Accelerator. https://developer.nvidia.com/tensorrt

  26. Team Members
     Chen, Yong; Ou, Hang; Jin, Yue; Liu, Yongchao; Zhao, Rui; Zhang, Yao; Teng, Teng
     Thank You!
