AI Compiler @ Alibaba
Xiaoyong Liu
Presenting the work of many people
PAI (Platform of AI), Alibaba Cloud Intelligence
AI Compiler Stack
[Stack diagram: PAI services (PAI EAS, PAI TensorFlow, PAI Blade) sit on the AI Compiler layer (TVM, PAI TAO) and high-performance libraries, which target CPU, GPU, ASIC, and FPGA.]
How TVM is used @ Alibaba
• An end-to-end deep learning compiler
Ø Empowers AI services
Ø Generates high-performance operators (a minimal compute/schedule sketch follows below)
• subgraph & kernel
• heterogeneous computing
• An optimizer & compiler
Ø Enables chips such as CPU, GPU, DSP, etc., and potentially FPGA and AI chips
Ø Deploys algorithms automatically
• All scenarios
Ø Cloud, edge & IoT
Ø Training & inference
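To ground the compute/schedule split this stack is built on, here is a minimal sketch in TVM's te API. It is a generic matmul, not one of Alibaba's production kernels; shapes and tile factors are illustrative only:

```python
# Minimal sketch of TVM's compute/schedule split: the algorithm (matmul) is
# declared once, and optimizations (here, a simple tiling) are applied as a
# separate schedule. Shapes and tile sizes are illustrative only.
import tvm
from tvm import te

M, N, K = 1024, 1024, 1024
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
# Tile the output loops for cache locality; AutoTVM would search these factors.
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], x_factor=32, y_factor=32)
func = tvm.build(s, [A, B, C], target="llvm", name="matmul")
```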
TVM + AI Service → PAI-Blade
Things We Experienced
• The current approach takes too much engineering effort, which makes it difficult to run as a platform service
• TVM is good at:
Ø Generating high-performance compute-intensive kernels
• Automation is the key
Ø Being heterogeneous-hardware friendly, if an ISA is provided
• Performance portability
Ø Being software-architect friendly via AutoTVM / schedules…
Ø Whole-graph optimization
• Challenges
Ø Ease of deployment, including coverage, quality & compatibility
• Correctness, performance & ease of enabling new devices
Ø Systems don't interoperate
Ø Maturity / standardization…
Contributed to TVM Community
• Automatic Tensor Core scheduling
Ø Nvidia Tensor Cores in V100/T4, for NMT (neural machine translation) workloads
• Schedule algorithm enablement, e.g., batch matmul
Ø Know what/how
• Support for TFLite models (see the import sketch below)
Ø Automatic
• C++ RPC server
Ø Tune your program in an embedded environment without Python
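A hedged sketch of what the last two items look like from the host side: importing a TFLite model into Relay, then running it on a device that exposes a TVM RPC server (the C++ server needs no Python on the device). The model path, IP address, and input name are hypothetical, and the flatbuffer loading call varies across versions of the tflite pip package:

```python
# Sketch: TFLite import + remote execution over TVM RPC. Hypothetical paths/IPs.
import tflite  # older versions of the package use tflite.Model.Model.GetRootAsModel
import tvm
from tvm import relay, rpc
from tvm.contrib import graph_executor  # `graph_runtime` in 2019-era TVM

buf = open("mobilenet_v2.tflite", "rb").read()
model = tflite.Model.GetRootAsModel(buf, 0)
mod, params = relay.frontend.from_tflite(
    model, shape_dict={"input": (1, 224, 224, 3)}, dtype_dict={"input": "float32"}
)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm -mtriple=aarch64-linux-gnu", params=params)

# Exporting as .tar ships object files; the RPC server on the device links them,
# so no cross-linker is needed on the host.
lib.export_library("net.tar")
remote = rpc.connect("192.168.1.42", 9090)   # device running the (C++) RPC server
remote.upload("net.tar")
rlib = remote.load_module("net.tar")
module = graph_executor.GraphModule(rlib["default"](remote.cpu(0)))
```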
Ongoing Effort to Community
• Automatic Tensor Core scheduling enhancement
Ø vthread support
• Operators: new ops in Relay / TVM
Ø HashTable, Embedding…
• Standardize GraphRuntime exports into a single DLL (see the sketch below)
Ø A way to unify runtime model exports
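For context, a sketch of the single-artifact idea as it later landed in mainline TVM, where relay.build returns a factory module whose export bundles graph definition, weights, and kernels into one shared library. The toy ReLU network and the filename are hypothetical:

```python
# Sketch: one .so carries graph JSON, parameters, and compiled kernels, so
# deployment loads a single file instead of separate .so + .json + .params.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

data = relay.var("data", shape=(1, 8), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([data], relay.nn.relu(data)))
with tvm.transform.PassContext(opt_level=3):
    factory = relay.build(mod, target="llvm", params={})
factory.export_library("deploy.so")          # the single DLL/.so artifact

loaded = tvm.runtime.load_module("deploy.so")
gmod = graph_executor.GraphModule(loaded["default"](tvm.cpu(0)))
gmod.set_input("data", np.random.randn(1, 8).astype("float32"))
gmod.run()
out = gmod.get_output(0).numpy()             # .asnumpy() in older TVM
```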
Product-Driven TVM Enhancement
• Brings online inference service
• Enhances infrastructure
• Compiles for heterogeneous hardware in cloud & edge:
Ø Nvidia server GPU: V100/T4 on FP16/INT8/INT4/INT1
Ø Intel x86 server CPU: INT8/FP32/BF16
Ø Intel GPU: INT8/FP32
Ø HIFI4 DSP
Ø Hexagon DSP
Ø PowerVR GPU
Ø ARM64 CPU
Ø ARM32 CPU
Any general solution is planned to be contributed back to the TVM community!
TVM Delivered Higher Performance
• Compared against:
Ø Chip suppliers' latest manually optimized high-performance libraries
Ø Assembly-level optimized edge machine learning frameworks
• Optimized to reach decent performance on various products:
Ø Server Nvidia GPU: automatic Tensor Core scheduling + tensorization + Tensor Cores
Ø Edge ARM64
Ø IoT ARM32
Performance on V100 (FP16)

M, N, K         cuBLAS TensorCore   TVM TensorCore   Speedup
512, 16, 512    7.7470 us           5.2570 us        1.47x
512, 32, 512    8.0140 us           6.0220 us        1.33x
512, 64, 512    8.7530 us           6.2390 us        1.40x
512, 128, 512   9.0290 us           7.1610 us        1.26x
256, 256, 256   6.9380 us           4.5930 us        1.51x
1024, 32, 512   8.3320 us           6.3770 us        1.30x
2048, 32, 512   9.0640 us           7.5070 us        1.21x
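A hedged sketch of how such kernel latencies are typically collected with TVM's time_evaluator. The deliberately naive CUDA schedule below only illustrates the timing harness; it is nowhere near the Tensor Core schedules the table reports:

```python
# Sketch: timing a GEMM with TVM's time_evaluator (M,N,K = 512,16,512, matching
# the first table row). The schedule is naive on purpose.
import numpy as np
import tvm
from tvm import te

M, N, K = 512, 16, 512
A = te.placeholder((M, K), dtype="float16", name="A")
B = te.placeholder((K, N), dtype="float16", name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
i, j = s[C].op.axis
s[C].bind(i, te.thread_axis("blockIdx.x"))
s[C].bind(j, te.thread_axis("threadIdx.x"))
func = tvm.build(s, [A, B, C], target="cuda")

dev = tvm.cuda(0)  # tvm.gpu(0) in older releases
a = tvm.nd.array(np.random.rand(M, K).astype("float16"), dev)
b = tvm.nd.array(np.random.rand(K, N).astype("float16"), dev)
c = tvm.nd.array(np.zeros((M, N), dtype="float16"), dev)
timer = func.time_evaluator(func.entry_name, dev, number=200)
print("mean latency: %.4f us" % (timer(a, b, c).mean * 1e6))
```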
Performance on T4
AliOS Enhances TVM on Vehicles
• To accelerate NLU and AR-navigation models
• ARM64 CPU performance on INT8 / FP32
Ø NHWC layout, im2col + packing, no tensorize; co-optimized with LLVM (see the im2col sketch below)
Ø Planning to contribute back to the community
• Hexagon DSP
Ø vrmpy tensorization / LLVM codegen
Ø Can run the end-to-end MobileNet V2 INT8 model
• Intel GPU
Ø Schedule algorithm
Ø 1.6x performance boost on the LaneNet model
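For readers unfamiliar with the "img2col+pack" trick, a small NumPy sketch of the im2col idea: convolution becomes a single GEMM once input patches are unfolded into rows. The shapes and the omission of padding/dilation are illustrative simplifications:

```python
# Sketch: im2col turns NHWC conv2d into one GEMM. No padding/dilation, for brevity.
import numpy as np

def im2col(x, kh, kw, stride=1):
    """x: NHWC input -> (N*OH*OW, KH*KW*C) patch matrix."""
    n, h, w, c = x.shape
    oh, ow = (h - kh) // stride + 1, (w - kw) // stride + 1
    cols = np.empty((n * oh * ow, kh * kw * c), dtype=x.dtype)
    row = 0
    for b in range(n):
        for i in range(oh):
            for j in range(ow):
                patch = x[b, i*stride:i*stride+kh, j*stride:j*stride+kw, :]
                cols[row] = patch.ravel()  # (KH, KW, C) flattened, matching the weight layout
                row += 1
    return cols

x = np.random.randn(1, 8, 8, 3).astype("float32")
w = np.random.randn(3, 3, 3, 16).astype("float32")   # KH, KW, C, OC
out = im2col(x, 3, 3) @ w.reshape(-1, 16)            # (36, 16) -> NHWC (1, 6, 6, 16)
```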
Performance on ARM64 INT8
[Bar chart: INT8 performance comparison on Raspberry Pi 3B+ (AArch64), normalized speedups (baseline 1.00). Series: TFLite 1 core, TFLite 4 cores, QNNPACK 1 core, QNNPACK 4 cores, TVM 1 core, TVM 4 cores; models: MobileNetV1, MobileNetV2, LaneNet. TVM gives the largest speedups, peaking at 8.87x.]
Performance on ARM64 FP32
[Bar chart: FP32 performance ratio of TVM over MNN on AArch64 (Cortex-A53 and A72) for MobileNet V1 and V2; ratios fall between roughly 1.03x and 1.17x.]
AI Labs Compiles TmallGenie Models
• ARM32 CPU
Ø Overflow-aware quantization (INT16 = INT8 * INT8); see the sketch below
Ø GEMM tensorization
• HIFI4 DSP
Ø GEMM tensorization, 10x speedup
• PowerVR GPU
Ø Schedule algorithm
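A hedged NumPy illustration of why the quantization must be overflow-aware: each int8 * int8 product fits in int16, but summing many int16 products can overflow, so accumulation length has to be bounded. The production scheme constrains quantization during training; the chunked reduction here just makes the bound visible:

```python
# Sketch: int8*int8 products live in int16; only a bounded number of them can be
# accumulated in int16 before widening. Values kept in [-127, 127], as symmetric
# int8 quantization typically does.
import numpy as np

a = np.random.randint(-127, 128, size=(256,), dtype=np.int8)
b = np.random.randint(-127, 128, size=(256,), dtype=np.int8)

# Worst-case |product| is 127*127 = 16129, so floor(32767 / 16129) = 2 products
# can always be summed safely in int16.
CHUNK = 2
acc = np.int32(0)
for i in range(0, a.size, CHUNK):
    partial = np.sum(
        a[i:i + CHUNK].astype(np.int16) * b[i:i + CHUNK].astype(np.int16),
        dtype=np.int16,  # stays within int16 by construction
    )
    acc += np.int32(partial)  # widen only at chunk boundaries
assert acc == np.dot(a.astype(np.int32), b.astype(np.int32))
```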
Performance on CPU
[Bar chart: MobileNetV2_1.0_224 on MTK8167S (ARM32 Cortex-A35 @ 1.5 GHz); lower is better. Approximate readings: TF Lite 8-bit 336, NCNN 8-bit 322, QNNPACK 8-bit 230, MNN 8-bit 214, ACE overflow-aware (assembly) 140.]
A DL Compiler in the T-HEAD SoC
• TVM has been integrated into the WuJian SoC toolchain
• Caffe frontend support
Ø Tests pass for AlexNet / ResNet-50 / MobileNet v1 / MobileNet v2 / …
[Pipeline diagram: TensorFlow / Caffe → TVM → T-HEAD NN / LLVM → WuJian SoC (customized AI accelerator)]
TVM Roadmap @ Alibaba
• Keep contributing general work back to the community
• Auto-scheduling (with the Berkeley team)
Ø Auto* is the key to building machine-learning-powered systems
• Interoperate with top frameworks
• Automatic heterogeneous hardware placement at the system level
• Infrastructure maturity
Ø Completeness & seamless deployment, e.g., quantization, model compatibility
• Workload characterization
Ø To improve the key workloads within the community
• AI services & operators
Ø More chips, more models
Alibaba & Open Source
Embrace open source. Contribute to open source. Win-win with open source.
Takeaways
• A golden age of deep learning compilers
• Industry-grade deep learning compilation solutions are still evolving
• We are working to contribute to TVM
Ø Development & research
• Welcome to join us and contribute to TVM together
Ø Xiaoyong Liu (xiaoyong.liu@Alibaba-inc.com)
Thank you!