Woodpecker-DL: an efficient compiler for accelerating deep learning on heterogeneous computing architectures
Yong Chen, Ant Financial, China
Dec. 19, 2019
Outline
- Woodpecker-DL Overview
- Key Components and Technology
  - Graph Optimization
  - Auto Search with GA, RL Algorithm
  - DSL Compiler (Halide)
- Integration with TensorRT and Experiment Figures
Introduction
Accelerating model training and inference is crucial to deep learning:
- Graph-level optimizations increase per-node compute efficiency and reduce inter-node data-movement overhead.
- Operator-level optimizations speed up the execution of mathematical functions.
- Specialized hardware targeting deep learning is being actively explored.
Woodpecker-DL aims to accelerate deep learning on heterogeneous computing architectures by means of compiling techniques:
- Explores multi-level optimizations from the compute graph down to the hardware.
- Exploits machine-learning-based fast hyper-parameter search to yield better math-function kernels.
- Supports diverse hardware including CPUs, GPUs, and FPGAs.
Woodpecker-DL is part of Woodpecker, a generic compiler framework for heterogeneous computing developed at Ant Financial.
Deep Learning Compilers
[Diagram: the deep-learning compiler optimization stack — a compute graph optimization framework lowers models to tensor expressions of the graph; auto-tuners and a DSL compiler then target software backends (LLVM, CUDA, Metal) and hardware backends (Verilog, HLS, Spatial), alongside expert-optimized libraries.]
Woodpecker-DL Architecture
[Diagram: the Woodpecker Frontend performs graph optimization (in-place, pruning, fusion) and shape inference; math functions in the optimized graph feed the Woodpecker AutoSearch Optimizer, which handles ordinary and composite functions (with a safeguard fallback) and generates CUDA/assembly code; Woodpecker Addons provide custom TensorFlow ops, custom PyTorch extensions, and TensorRT plugins; the Woodpecker Runtime Engine serves the optimized model through a proprietary engine, TensorFlow, PyTorch, or TensorRT.]
Outline (current section): Graph Optimization
Graph Optimization
Supports multiple deep learning frameworks: TensorFlow, PyTorch, Caffe, CoreML.
Compute graph optimization:
- Simplification, removal, and fusion of operators
- Horizontal or vertical compositional transformation
- Shape inference of operators
[Diagram: a well-known graph optimization example — a Convolution → Batch Normalization → Activation chain is simplified and fused into Convolution + BiasAdd ("smart batch norm") → Activation by merging the batch-norm parameters into the convolution.]
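The conv + batch-norm merge shown in the diagram can be sketched in a few lines of NumPy: at inference time the BatchNorm affine transform folds into the convolution's weights and bias, leaving a single operator. The function and parameter names here are illustrative, not Woodpecker's actual API.

```python
import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding convolution.

    W: conv weights (C_out, C_in, kH, kW); b: conv bias (C_out,)
    gamma, beta, mean, var: BatchNorm parameters, all shape (C_out,)
    Returns (W', b') with conv(x, W') + b' == BN(conv(x, W) + b),
    so the fused graph executes one operator instead of two.
    """
    scale = gamma / np.sqrt(var + eps)        # per-output-channel scale
    W_fused = W * scale[:, None, None, None]  # scale each output filter
    b_fused = (b - mean) * scale + beta       # fold mean/shift into bias
    return W_fused, b_fused
```

The key identity is BN(z) = scale * (z - mean) + beta with z = conv(x, W) + b, which distributes over the (linear) convolution.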
Outline (current section): Auto Search with GA, RL Algorithm
AutoSearch Optimizer
A machine-learning-based framework for automated mathematical kernel optimization.
[Diagram: algorithms from various domains (deep learning, graph computing, data analysis) are expressed as parameterized programs in DSLs such as Halide, GraphIt, Weld, Spatial, or CUDA; optimization algorithms (genetic, RL, Bayesian, MCMC, SA) search the schedule space using feedback from measurement (profiling, performance models, historical data) on target hardware (CPU, GPU, FPGA, Ali-NPU, Plasticine, mobile/embedded), yielding efficient programs or hardware.]
AutoSearch: Genetic Algorithm
- Varies the population size according to the scale of the actual search space
- Joins all hyper-parameters (genes) in order to form a chromosome
- Uses roulette-wheel selection
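A minimal sketch of the scheme the slide describes: the hyper-parameters are joined into one fixed-length chromosome and parents are drawn by roulette-wheel (fitness-proportionate) selection. All names, rates, and the fitness interface are illustrative; Woodpecker's actual tuner additionally adapts the population size to the search-space scale and evaluates fitness by measuring compiled kernels on hardware.

```python
import random

def roulette_select(population, fitnesses):
    """Fitness-proportionate (roulette wheel) selection of one chromosome.
    Requires strictly positive fitness values."""
    pick = random.uniform(0, sum(fitnesses))
    acc = 0.0
    for chrom, fit in zip(population, fitnesses):
        acc += fit
        if acc >= pick:
            return chrom
    return population[-1]

def genetic_search(space, fitness_fn, pop_size=32, generations=20,
                   crossover_p=0.8, mutate_p=0.1):
    """Evolve chromosomes of joined hyper-parameters (genes).
    space: list of (lo, hi) integer ranges, one per gene; needs >= 2 genes."""
    pop = [[random.randint(lo, hi) for lo, hi in space] for _ in range(pop_size)]
    best = max(pop, key=fitness_fn)               # keep the best ever seen
    for _ in range(generations):
        fits = [fitness_fn(c) for c in pop]
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = roulette_select(pop, fits), roulette_select(pop, fits)
            child = list(p1)
            if random.random() < crossover_p:     # one-point crossover
                cut = random.randrange(1, len(space))
                child = p1[:cut] + p2[cut:]
            for i, (lo, hi) in enumerate(space):  # per-gene mutation
                if random.random() < mutate_p:
                    child[i] = random.randint(lo, hi)
            nxt.append(child)
        pop = nxt
        best = max(pop + [best], key=fitness_fn)
    return best
```

In the real system the fitness of a chromosome would be the measured speed of the kernel compiled from that schedule; here any positive-valued function will do.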
AutoSearch: Search Space TileRZ Shared Thread Options ThreadX ThreadY ThreadZ TileX TileY TileZ Mem LoopOrder Range (1, 56) (1, 56) (1, 64) (1, 8) (1, 8) (1, 8) (1, 6) (1, 4) (1, 6) Take convolution as an example : Image size (1, 64, 56, 56), filter size (64, 64, 1, 1) 9 optimizing dimensions: data splitting dimension, granularity, processing order, caching or not 56 * 56 * 64 * 8 * 8* 8 * 6 * 4 * 6 = 14 billion choices Brute force: 14 billion * 100 ms per iteration → 22 22 years ars Brute force with pruning: 230 thousands choices → 1. 1.35 35 da days Genetic search: 1600 choices → 12 12 mi minu nutes tes
AutoSearch Performance: Genetic Algorithm
- Converges in 10 minutes with a population size of 64
- 2.8x faster than NVIDIA cuDNN, 1.5x faster than TVM
AutoSearch: Reinforcement Learning
- Customized environment and policy graph
- Uses the RLlib scalable reinforcement learning framework
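The slide's setup uses RLlib with a customized environment and policy graph. As a library-free stand-in, here is a toy tabular Q-learning loop over a schedule-tuning environment of the same general shape: state = the current schedule, action = nudge one hyper-parameter up or down, reward = negative cost from a synthetic surrogate (NOT real kernel profiling). Every name and number here is illustrative, not the authors' RLlib code.

```python
import random

class ScheduleEnv:
    """Toy schedule-tuning environment: the reward is -cost under a
    synthetic quadratic surrogate with a hidden optimum schedule."""
    def __init__(self, space, optimum):
        self.space = space        # list of (lo, hi) per hyper-parameter
        self.optimum = optimum    # hidden best schedule for the surrogate
        self.reset()

    def reset(self):
        self.state = [random.randint(lo, hi) for lo, hi in self.space]
        return tuple(self.state)

    def step(self, action):
        idx, up = divmod(action, 2)          # which parameter, +1 or -1
        lo, hi = self.space[idx]
        self.state[idx] = min(hi, max(lo, self.state[idx] + (1 if up else -1)))
        cost = sum((s, o) == () or (s - o) ** 2
                   for s, o in zip(self.state, self.optimum))
        return tuple(self.state), -cost      # reward: higher is faster

def q_learn(env, episodes=200, steps=30, eps=0.2, alpha=0.5, gamma=0.9):
    """Tabular epsilon-greedy Q-learning over the toy environment."""
    n_actions = 2 * len(env.space)
    Q = {}
    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps):
            if random.random() < eps or s not in Q:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2, r = env.step(a)
            Q.setdefault(s, [0.0] * n_actions)
            best_next = max(Q.get(s2, [0.0] * n_actions))
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s2
    return Q
```

In the real system the per-step reward would come from measuring the compiled kernel, and RLlib would handle the policy network and distributed rollouts.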
AutoSearch Performance: Reinforcement Learning
Operations taken from a convolutional model for Ant Financial's payment business.
RL finds better-performing schedules than GA in some cases (within the same search time), but RL does not always outperform GA.
[Chart: relative speedup over cuDNN for operators conv1a, conv1b, conv2, conv3, and conv4, comparing Woodpecker GA and Woodpecker RL; speedups range from 0.75x to 2.67x.]
Outline (current section): DSL Compiler (Halide)
DSL Compiler: Halide
A domain-specific language (DSL) and compiler for image processing pipelines.
- Separates the algorithm from the schedule
- Enables more efficient and flexible optimizations
- Open source: https://github.com/halide/Halide

Algorithm:
  g(x, y) = x + y
  f(x, y) = (g(x, y - 1) + g(x, y) + g(x, y + 1)) / 3
Schedule:
  f.gpu_tile(x, y, xo, yo, xi, yi, 8, 8)
Intermediate Codes Generated by Halide
[Figure: listing of the intermediate code Halide generates for the example algorithm and schedule.]
Halide Schedules
Drawbacks:
- Domain-specific knowledge and skill are still needed to get good performance.
- Even for a specific architecture, there is a considerable number of schedules to explore.
- Some schedules are architecture-aware, so different hardware needs different schedules.
Example schedules: loop split, reorder, unroll, tile, storage layout, etc.
Stage (storage) granularity:
- Coarse-grained: insufficient shared memory, limiting other schedules
- Fine-grained: insufficient data reuse and inefficient loads/stores
Schedules are crucial for achieving high performance on a given math function, which motivated the development of automated search for optimal schedules.
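To make "loop split, reorder, tile" concrete, here is what tiling does to a plain matrix-multiply loop nest, written out by hand in Python. Halide would derive the equivalent transformed nest from a one-line schedule; this sketch only illustrates the transformation itself.

```python
def matmul_tiled(A, B, tile=4):
    """Matrix multiply with the loop nest split and reordered into tiles,
    mimicking what Halide's split/reorder/tile schedules do to a loop nest.
    A: n x k, B: k x m (lists of lists); tile need not divide the sizes."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, tile):                 # outer loops walk tiles...
        for jj in range(0, m, tile):
            for kk in range(0, k, tile):
                for i in range(ii, min(ii + tile, n)):   # ...inner loops
                    for p in range(kk, min(kk + tile, k)):  # walk one tile
                        a = A[i][p]              # A[i][p] reused across j
                        for j in range(jj, min(jj + tile, m)):
                            C[i][j] += a * B[p][j]
    return C
```

The products accumulated are identical to the naive triple loop; only the iteration order changes, so that each tile of A, B, and C stays hot in cache (or, on a GPU, fits in shared memory) while it is reused.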
An Example Schedule: With vs. Without Layout Optimization
Storage transform (put C_o innermost): (N, C_o, H, W) → (N, H, W, C_o)
- N: batch size; C_o: output channels; H: output height; W: output width
[Chart: performance profiling — the layout-optimized schedule runs 1.625x faster than the schedule without layout optimization, with substantially higher global load efficiency, global store efficiency, shared-memory efficiency, and occupancy.]
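The storage transform itself is just an axis permutation; a NumPy sketch (illustrative, not Woodpecker code) of the (N, C_o, H, W) → (N, H, W, C_o) layout change:

```python
import numpy as np

def to_nhwc(t):
    """Move output channels C_o innermost: (N, C_o, H, W) -> (N, H, W, C_o).
    With C_o innermost, adjacent GPU threads handling adjacent channels
    touch adjacent addresses, so global loads/stores coalesce."""
    return np.ascontiguousarray(t.transpose(0, 2, 3, 1))

x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)   # (N, C_o, H, W)
y = to_nhwc(x)
```

The `ascontiguousarray` call is what actually rewrites memory; a bare `transpose` only changes strides, leaving the physical layout (and the access pattern) unchanged.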
Outline (current section): Integration with TensorRT and Experiment Figures
Runtime Engines
Supports multiple inference engines: a proprietary engine, plus external serving via TensorFlow, PyTorch, and TensorRT.
[Diagram: Woodpecker-DL integration with TensorRT.]
Performance: ResNet-18 (Breakdown)
Relative speedups for the individual convolution operations c1–c12 in ResNet-18 (higher is better).
[Chart: per-operator relative speedup for cuDNN (baseline), TVM, and Woodpecker across c1–c12; values range from 0.53x to 5.40x.]
Performance: ResNet-18 (Summation)
Summing the runtimes of all convolution operations in ResNet-18 (higher is better):
- cuDNN: 1.00x (baseline)
- TVM: 1.67x
- Woodpecker: 2.04x
Performance: Ant Financial Payment Model
End-to-end comparison on the Ant Financial payment business model, with dynamic batching enabled (higher is better).
[Chart: relative speedup of TensorRT vs. Woodpecker across batch sizes 1–16; speedups range from 1.20x to 2.12x.]
References
- S. Chetlur et al. (2014). cuDNN: Efficient Primitives for Deep Learning. arXiv:1410.0759v3.
- T. Chen et al. (2018). TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18).
- E. Liang et al. (2018). RLlib: Abstractions for Distributed Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR 80:3053-3062.
- J. Ragan-Kelley et al. (2018). Halide: Decoupling Algorithms from Schedules for High-Performance Image Processing. Communications of the ACM, 61(1):106-115.
- NVIDIA TensorRT (2019). Programmable Inference Accelerator. https://developer.nvidia.com/tensorrt
Team Members
Chen, Yong; Liu, Yongchao; Ou, Hang; Jin, Yue; Zhao, Rui; Zhang, Yao; Teng, Teng

Thank You!