End to End Optimization Stack for Deep Learning
Presenter: Tianqi Chen
Paul G. Allen School of Computer Science & Engineering, University of Washington
Collaborators
University of Washington: Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Carlos Guestrin, Luis Ceze, Arvind Krishnamurthy
AWS AI Team, ARM, and many more contributors in the DMLC community
(NNVM pipeline, ML, software stack, hardware stack, GPU)
Deep Learning System Research is Exciting but Hard
The stack today: frameworks (CNTK, …) → computational graph → operator libraries (cuDNN, NNPack, MKL-DNN) → hardware
Deep Learning System Research is Exciting but Hard
Built a new accelerator? You need the entire software stack on top of it: layout transformation, quantization, operator kernel optimization, benchmarking, …
Deep Learning System Research is Exciting but Hard
New graph-level techniques (data layout optimization, operator fusion, serving) need an optimized hardware kernel for each operator variant, on each hardware back-end!
The End to End System Challenge
Frameworks → hardware back-ends
The End to End System Challenge
Frameworks → intermediate representation → hardware back-ends
Computational Graph IR and Remaining Gap
Examples: NGraph, XLA, NNVM, DLVM, …
Computational graph → auto differentiation, memory plan, operator fusion → backends
Computational Graph IR and Remaining Gap
Below the graph level there are too many possible choices: precision, layout, fused pattern, device, threading, …
We need a low-level IR to express these choices to the backends explicitly.
TVM: Low Level IR Framework
• Concise and compact description
• Explicit control over codegen
• Ease of deployment
• Support for new hardware backends
Stack: NNVM graph (auto differentiation, memory plan) → TVM → hardware backends
Tensor Index Expression
Declare the computation C = dot(A, B.T):

import tvm
m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h')
# inputs
A = tvm.placeholder((m, h), name='A')
B = tvm.placeholder((n, h), name='B')
k = tvm.reduce_axis((0, h), name='k')
# shape of C, together with the computation rule
C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k))
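As a hedged sketch (not on the slide), the same declaration can be lowered and compiled with the TVM API of that era; tvm.create_schedule, tvm.lower, and tvm.build later moved to other namespaces in newer releases.

# Continuing from the declaration above: build a default schedule and generate code.
s = tvm.create_schedule(C.op)                      # plain nested loops, no optimization yet
print(tvm.lower(s, [A, B, C], simple_mode=True))   # inspect the low-level IR
fmatmul = tvm.build(s, [A, B, C], target='llvm')   # CPU code via the LLVM backend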
Challenge: Hardware Diversities
One IR has to target all of them:
Memory subsystem: CPU (L3, L2, L1D + L1I, RF; implicitly managed), GPU (L2, TX/L1 per SM, RF; mixed), accelerators (unified buffer subsystem with FIFOs and accumulators; explicitly managed)
Compute primitives: scalar, vector, tensor
Data type: fp32, fp16, int8
Unified Schedule Optimizations across Hardware
Flow: algorithm described in IR → scheduling optimization → lowering → generated code (LLVM, CUDA, OpenCL, …)
Scheduling optimizations: data layout, tiling, thread cooperation, latency hiding, tensorization
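A hedged sketch (not from the slides) of what some of these scheduling primitives look like in the TVM API of that era, applied to a fixed-size version of the matmul above; the tile sizes and thread bindings are illustrative assumptions, not a tuned schedule.

import tvm

n = 1024
A = tvm.placeholder((n, n), name='A')
B = tvm.placeholder((n, n), name='B')
k = tvm.reduce_axis((0, n), name='k')
C = tvm.compute((n, n), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k))

s = tvm.create_schedule(C.op)
# Tiling: split the output into 32x32 blocks that fit the memory hierarchy
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], x_factor=32, y_factor=32)
# Thread cooperation: map tiles onto GPU blocks and threads
s[C].bind(xo, tvm.thread_axis("blockIdx.x"))
s[C].bind(yo, tvm.thread_axis("blockIdx.y"))
s[C].bind(xi, tvm.thread_axis("threadIdx.x"))
s[C].bind(yi, tvm.thread_axis("threadIdx.y"))
# On a CPU target one would instead parallelize/vectorize, e.g. s[C].vectorize(yi)
f = tvm.build(s, [A, B, C], target='cuda')   # requires a CUDA-enabled TVM build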
Separation of Compilation and Deployment
Compilation stack (heavy optimizations): framework frontends → NNVM → TVM
Deploy a TVM graph module to the TVM runtime (lightweight, 300 to 600 KB)
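A minimal sketch of that split, assuming artifact names like deploy.so / deploy.json / deploy.params and the graph runtime API of the NNVM-era releases; only the small runtime is needed on the target.

# On the target device: load artifacts with the lightweight TVM runtime only.
import numpy as np
import tvm
from tvm.contrib import graph_runtime   # renamed graph_executor in newer TVM

lib = tvm.module.load("deploy.so")                     # compiled kernels
graph_json = open("deploy.json").read()                # serialized graph
params_bytes = bytearray(open("deploy.params", "rb").read())

module = graph_runtime.create(graph_json, lib, tvm.cpu(0))
module.load_params(params_bytes)
data = np.random.uniform(size=(1, 3, 224, 224)).astype("float32")   # assumed input shape
module.set_input("data", tvm.nd.array(data))
module.run()
out = module.get_output(0, tvm.nd.empty((1, 1000), "float32"))      # assumed output shape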
Remote Execution and Profiling
A server with the TVM compiler drives devices running the TVM runtime over TVM RPC.
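A sketch of the RPC flow, assuming the device runs the stock RPC server (python -m tvm.exec.rpc_server) and that f_cross is a module cross-compiled for it; the host, port, and file names are placeholders.

import tvm
from tvm import rpc
from tvm.contrib import util

# Connect to a device that is running the TVM RPC server.
remote = rpc.connect("192.168.1.42", 9090)

# Ship the cross-compiled module and load it on the device.
tmp = util.tempdir()
path = tmp.relpath("mylib.o")
f_cross.save(path)            # f_cross: assumed module built with e.g. an ARM target
remote.upload(path)
f_remote = remote.load_module("mylib.o")

# Profile on the device itself.
ctx = remote.cpu(0)
timer = f_remote.time_evaluator(f_remote.entry_name, ctx, number=10)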
Performance: Portable against State of the Art
Raspberry Pi 3 (baseline: MXNet with OpenBLAS and NNPack; two undergrad weeks of effort): NNVM compiler is 1.2x faster on ResNet18 and 2.2x faster on MobileNet.
Nvidia K80 (baseline: MXNet with cuDNN, auto tune enabled; one grad student month of effort): NNVM compiler is 1.2x faster on ResNet18 and 11.5x faster on MobileNet.
Credit: Leyuan Wang (AWS/UC Davis), Yuwei Hu (TuSimple), Zheng Jiang (AWS/FDU)
Coming Soon: Target New Accelerators
Tensorization, latency hiding, and an FPGA example for building a new hardware backend; open source soon.
NNVM Compiler: Open Compiler for AI Systems
Frontends (mix of supported, externally supported, and work in progress): MXNet, Keras, CoreML, ONNX, Caffe, Caffe2, PyTorch, CNTK
NNVM: graph optimizations
TVM: TVM primitives
Backends: CUDA, Metal, OpenCL, LLVM (x86, ARM, AMDGPUs, Javascript/WASM); more hardware backends to come
Joint work with the AWS AI Team and the DMLC community
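A rough sketch of the NNVM compiler flow (the model object, input shape, and target are illustrative assumptions; the calls follow the nnvm frontend/compiler API of that release).

import nnvm
import nnvm.compiler

# Import a model from one of the framework frontends (MXNet shown here).
sym, params = nnvm.frontend.from_mxnet(mxnet_model)   # mxnet_model: an assumed pretrained model

# Graph-level optimizations in NNVM, then TVM code generation for the chosen backend.
shape_dict = {"data": (1, 3, 224, 224)}
graph, lib, params = nnvm.compiler.build(
    sym, target="llvm", shape=shape_dict, params=params)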
Deep Learning System Research is Just Exciting
With the new stack (frameworks → NNVM graph optimizations → TVM primitives → hardware):
"My new optimizations work on all platforms!"
"I can program my new accelerators from Python :)"
You can be part of it!