End to End Deep Learning Solution on Arm Architecture

End to End Deep Learning Solution on Arm Architecture, Jan. 14, 2019 (PowerPoint presentation)



  1. End to End Deep Learning Solution on Arm Architecture. Jan. 14, 2019, Jammy Zhou

  2. HPC and AI convergence

     TOP500 trend
     - More than 50 percent of the additional flops in the latest TOP500 rankings came from Nvidia Tesla GPUs, according to the TOP500 report.
     - Half of the TOP10 systems use Nvidia GPUs, and 122 systems in the TOP500 use Nvidia GPUs (64 systems use P100 GPUs, 46 systems use V100 GPUs, 12 systems use Kepler GPUs).
     - More AI/ML/DL workloads are being added to HPC applications with the wide adoption of Nvidia GPUs.
     - Besides Nvidia GPUs, there are other accelerator options in the market, for example MI60/MI50 Radeon Instinct GPUs from AMD, Xilinx and Intel FPGAs, customized ASIC products, etc.

     Arm on the road
     - Astra at Sandia National Lab in the US is the first Arm based supercomputer to enter the TOP500 list, ranked 203 in the latest ranking.
     - Good momentum of Arm based supercomputers around the world: Post-K from Japan, Tianhe-3 from China, and Catalyst UK, GW4 Isambard and the CEA system from Europe.
     - Arm SVE is enabled by Post-K together with the Tofu D interconnect and HBM2 memory, and will be used for some AI workloads.

  3. HPC and AI in the Cloud

     Services: HPC Services, AI & ML Services

     Arm on the road
     - Science Cloud with Arm based HPC from HPC Systems (supporting Hisilicon Hi1616 and Marvell Thunder X2)
     - Amazon EC2 A1 instances based on the AWS Graviton Arm 64-bit processor for scale-out and Arm based workloads

     Building blocks
     - CPU: Arm Neoverse continuous improvement
     - Accelerator: accelerators (GPUs, FPGAs, ASICs)
     - Storage: fast and scalable storage, such as NVMe based local SSD
     - Network: 100 Gbps Ethernet, InfiniBand, Omni-Path, RDMA and RoCE
     - HPC & AI software stack (languages, frameworks, libraries, drivers, compilers, etc), multi-node distributed support and MPI

  4. HW Diversity & SW Fragmentation

     Software and hardware stack (top to bottom)
     - Big Data Analytics: TensorFlowOnSpark, SparkFlow, CaffeOnSpark, ...
     - DL Frameworks: PyTorch, Keras, Caffe, TensorFlow, MXNet, Theano, Caffe2, CNTK, Chainer, PaddlePaddle, ...
     - Model Formats: framework specific, ONNX, NNEF
     - Deep Learning Compilers: TVM, Glow, XLA, ONNC, etc
     - Framework support for multiple accelerators
     - Libraries: MIOpen, ACL, CMSIS-NN, cuDNN, FFT, RNG, SPARSE, Eigen, BLAS, ...
     - HAL and Drivers
     - Hardware: CPU, GPU, FPGA, ASIC, DSP

     Problems
     1. Difficult for application and algorithm developers to switch between frameworks
     2. Different backends for framework developers to maintain for the various accelerators
     3. Multiple frameworks for chip and IP vendors to support, with duplicated efforts and out-of-tree support by forking the upstream
     4. Multiple configurations for OEMs/ODMs and cloud vendors to support
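The duplicated effort described on this slide can be made concrete with a rough count. The sketch below is only an illustration under a simplifying assumption that is not stated in the deck: every framework needs one integration per hardware target when there is no shared layer, and one exporter or one importer per party when a common model format sits in between. The framework and hardware names are taken from the slide; the function names are hypothetical.

```python
# Names copied from the slide; the counting model itself is an assumption.
frameworks = [
    "PyTorch", "Keras", "Caffe", "TensorFlow", "MXNet",
    "Theano", "Caffe2", "CNTK", "Chainer", "PaddlePaddle",
]
hardware = ["CPU", "GPU", "FPGA", "ASIC", "DSP"]


def direct_integrations(n_frameworks: int, n_targets: int) -> int:
    """Every framework maintains its own backend for every target."""
    return n_frameworks * n_targets


def shared_format_integrations(n_frameworks: int, n_targets: int) -> int:
    """Each framework writes one exporter, each target one importer."""
    return n_frameworks + n_targets


direct = direct_integrations(len(frameworks), len(hardware))
shared = shared_format_integrations(len(frameworks), len(hardware))

print(f"direct framework-to-hardware integrations: {direct}")
print(f"integrations with a shared model format:   {shared}")
```

Real integrations differ widely in cost, so the numbers only show how the work scales, not how much effort is saved.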

  5. Open Neural Network eXchange Ecosystem

     Framework interoperability & hardware optimizations
     - Components: ONNX Format, ONNX Models, ONNX Tools, ONNXIFI, ONNX Runtime
     - Workflow: Create, Convert, Optimize, Deploy

  6. ONNX Specifications

     Neural-network-only ONNX
     - Defines an extensible computation graph model, built-in operators and standard data types
     - Supports only tensors as input/output data types

     ONNX-ML Extension
     - Classical machine learning extension
     - Also supports sequence and map data types, and extends the ONNX operator set with ML algorithms not based on neural networks

     ONNX v1.3 (released on Sep. 1st, 2018)
     - Control flow support
     - Functions (composable operators, experimental)
     - Enhanced shape inference
     - Additional optimization passes
     - ONNXIFI 1.0 (C-backend for accelerators)

     More to come...
     - Quantization
     - Test/Compliance
     - Data pipelines
     - Edge/Mobile/IoT
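To illustrate what "an extensible computation graph model with built-in operators" means in practice, here is a minimal pure-Python sketch. It does not use the ONNX library or its file format; the `Node` class, the `OPERATORS` registry and `run_graph` are hypothetical names introduced for illustration, and tensors are simplified to flat lists of floats. The one property borrowed from ONNX is that nodes are listed in topological order, so a single pass over the list is enough to evaluate the graph.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Tensor = List[float]  # simplification: a tensor is a flat list of floats


@dataclass
class Node:
    op_type: str        # name of the operator to apply
    inputs: List[str]   # names of the values the node consumes
    outputs: List[str]  # names of the values the node produces


# Hypothetical operator registry; adding an entry extends the operator set.
OPERATORS: Dict[str, Callable[..., Tensor]] = {
    "Add": lambda a, b: [x + y for x, y in zip(a, b)],
    "Mul": lambda a, b: [x * y for x, y in zip(a, b)],
    "Relu": lambda a: [max(0.0, x) for x in a],
}


def run_graph(nodes: List[Node], feeds: Dict[str, Tensor],
              output_names: List[str]) -> List[Tensor]:
    """Evaluate nodes in order; assumes they are topologically sorted."""
    values = dict(feeds)
    for node in nodes:
        args = [values[name] for name in node.inputs]
        values[node.outputs[0]] = OPERATORS[node.op_type](*args)
    return [values[name] for name in output_names]


# y = Relu(x * w + b)
graph = [
    Node("Mul", ["x", "w"], ["xw"]),
    Node("Add", ["xw", "b"], ["z"]),
    Node("Relu", ["z"], ["y"]),
]
result = run_graph(
    graph,
    {"x": [1.0, -2.0], "w": [3.0, 3.0], "b": [0.5, 0.5]},
    ["y"],
)
print(result)
```

Because the graph only refers to operators by name, the same description can be handed to any runtime that implements those operators, which is the interoperability argument the ONNX slides make.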

  7. ONNX Interface for Framework Integration

     ONNXIFI
     - Standardized interface for NN inference on different accelerators
     - Runtime discovery and selection of execution backends, as well as of the ONNX operators supported on each backend
     - Supports the ONNX format & online model conversion

     ONNXIFI Backend
     - A combination of software layer and hardware device used to run an ONNX graph
     - The same software layer can expose multiple backends
     - A heterogeneous type of backend can distribute work across multiple device types internally

     Diagram
     - Callers: Applications, Frameworks, ONNX Models, ...
     - ONNXIFI: libonnxifi.so
     - Backend libraries: libonnxifi-glow.so (Glow), libonnxifi-a.so (Library A), libonnxifi-b.so (Library B), libonnxifi-c.dll (Library C), libonnxifi-d.dylib (Library D)
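The "runtime discovery and selection of execution backends" idea can be sketched in a few lines. The code below is a conceptual illustration only: ONNXIFI itself is a C interface, and the `Backend` class, `discover_backends`, `select_backend`, the backend names and the operator lists are all hypothetical and not taken from the ONNXIFI headers. It assumes the simplest possible policy, which is to pick the first discovered backend that supports every operator in the model.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Backend:
    name: str                      # hypothetical vendor library name
    device: str                    # device type the backend drives
    supported_ops: FrozenSet[str]  # ONNX operator names it can run


def discover_backends() -> List[Backend]:
    """Stand-in for runtime discovery; a real loader would query vendor libraries."""
    return [
        Backend("library-a", "NPU", frozenset({"Conv", "Relu"})),
        Backend("library-b", "GPU",
                frozenset({"Conv", "Relu", "MaxPool", "Gemm"})),
        Backend("library-c", "CPU",
                frozenset({"Conv", "Relu", "MaxPool", "Gemm", "Softmax"})),
    ]


def select_backend(model_ops: List[str],
                   backends: List[Backend]) -> Optional[Backend]:
    """Return the first backend that supports every operator in the model."""
    needed = set(model_ops)
    for backend in backends:
        if needed <= backend.supported_ops:
            return backend
    return None  # no single backend can run the whole graph


model_ops = ["Conv", "Relu", "MaxPool", "Gemm"]
chosen = select_backend(model_ops, discover_backends())
print(chosen.name if chosen else "no compatible backend")
```

A production policy would likely also weigh performance and could split a graph across backends; the first-match rule here is chosen only to keep the sketch short.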

  8. ONNX Runtime

     - High-performance and cross-platform inference engine for ONNX models
     - Fully implements the ONNX specification, including the ONNX-ML extension
     - Arm platforms are supported on both Linux (experimental) and Windows
     - TensorRT and nGraph support are work in progress
     - Diagram from https://github.com/Microsoft/onnxruntime/blob/master/docs/HighLevelDesign.md

  9. Machine Intelligence: A Linaro Strategic Initiative

     Provide best-in-class Deep Learning performance by leveraging Neural Network acceleration in IP and SoCs from the Arm ecosystem, through collaborative seamless integration with the ecosystem of AI/ML software frameworks and libraries.

  10. Scope from HPC to microcontroller

      Workload labels on the slide: training, inference

      Edge node & device
      - Initial focus on inference support for Cortex-A SoCs
      - Common model description format and APIs to the runtime
      - Common optimized runtime inference engine for Arm-based SoCs
      - Plug-in framework to support multiple 3rd party IPs (NPU, GPU, DSP, FPGA)
      - Continuous integration testing and benchmarking

      Microcontroller *
      - CMSIS-NN optimized frameworks/libraries on RTOS
      - Frameworks like uTensor and TensorFlow Lite (quantization, footprint reduction, etc)
      - IP based accelerator support & optimization

      HPC, Data Center & Cloud *
      - SVE based optimization for DL frameworks & libraries
      - PCIe/CCIX based heterogeneous accelerator support on Arm servers (drivers, compilers and framework integration, etc)
      - Scale out support for distributed training

      * under discussion

  11. ArmNN based collaborations - ongoing

      A good base for future collaborations:
      - 100 man-years of effort, 340,000 lines of code
      - Shipping in over 200 million Android devices, based on estimation
      - Impressive performance uplift from software-only improvements over a period of 6 months

      https://developer.arm.com/products/processors/machine-learning/arm-nn
      https://community.arm.com/tools/b/blog/posts/arm-nn-the-easy-way-to-deploy-edge-ml

  12. Thanks!
