Early Experience in Benchmarking Edge AI Processors with Object Detection Workloads
Bench 2019
Yujie Hui¹, Jeffrey Lien², and Xiaoyi Lu¹
¹ Department of Computer Science and Engineering, The Ohio State University, {hui.82, lu.932}@osu.edu
² NovuMind Inc., jlien@novumind.com
Overview
• Introduction
• Overview of Edge AI Processors
• Benchmarking Methodology
• Evaluation
• Conclusion
Edge Computing
[Figure: data (DATA) and applications (APP) placed at the network edge]
• Store and process data closer to the location where it is needed
• Deliver low latency to end users
Artificial Intelligence at the Edge
• Inference is moving to the edge
❖ Heavy workloads in datacenters
❖ Less computationally demanding
❖ Low power consumption
❖ Low cost
[Figure: machine learning pipeline (Data, Features, Training, Evaluation, Inference), shown first entirely in the datacenter (e.g., GPU) and then with Inference moved to edge devices]
Killer Applications for AI@Edge – Object Detection
[Figure: pie charts, "Machine Learning Use Cases in Facebook", with slices for Recommendation, Face ID, RNN ASR, RNN Translator, Object Segmentation, Image Classification, and Object Detection (slice values: 42%, 34%, 10%, 6%, 3%, 3%, 2%)]
• Object Detection:
❖ Higher resolution of input images
❖ Larger output tensors
❖ More complicated tasks
Wu et al., Machine Learning at Facebook: Understanding Inference at Edge, HPCA-2019
C. Wu, At-Scale Infrastructure Challenges for Machine Learning, IISWC-2019 (Invited Talk)
Object Detection Workloads - Demo
Real-life applications:
❖ Self-driving cars
❖ Tracking objects
❖ Face detection
❖ Pedestrian detection
❖ Medical imaging
❖ Robotics
Low-latency, high-accuracy inference needs high-performance edge devices!
Overview
• Introduction
• Overview of Edge AI Processors
  • Edge TPU
  • NVIDIA Xavier
  • NovuTensor
• Benchmarking Methodology
• Evaluation
• Conclusion
Edge AI Processors - EdgeTPU
• A single-board computer
• On-board Edge TPU coprocessor capable of performing 4 TOPS
• 1 GB LPDDR4 memory
• Precision: INT8
• Power: 2.5 watts
• Supports TensorFlow Lite models
https://coral.withgoogle.com/products/dev-board
Edge AI Processors - Xavier
• Volta GPU with 512 CUDA cores
• TOPS: 22.6/11.3/1.3
• 16 GB LPDDR4X memory
• Precision: INT8/FP16/FP32
• Power: 10/15/30 watts
• Supports CUDA, cuDNN, TensorRT
https://developer.nvidia.com/embedded/jetson-agx-xavier-developer-kit
Edge AI Processors – NovuTensor
• Domain-specific architecture focused on 3D tensor computation
• 2 GB DDR4 memory
• Precision: INT8
• TOPS: 15
• Power: 20 watts
• Supports PyTorch
[Figure: NovuTensor's 3D operation, a tensor convolution that combines a data tensor and a weight tensor into an output tensor [1]]
[1] https://patentscope.wipo.int/search/en/detail.jsf?docId=US225521272&tab=NATIONALBIBLIO
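The peak figures on the spec slides can be turned into a nominal TOPS-per-watt ratio. The sketch below does that arithmetic for the two processors whose slides list a single TOPS value and a single power value. Xavier is left out because the slides do not say which of its power modes each TOPS figure belongs to. The function name `tops_per_watt` is illustrative, and these are peak ratings, not measured efficiency.

```python
# Peak figures as listed on the spec slides (nominal ratings, not measurements).
SPECS = {
    "EdgeTPU": {"tops": 4.0, "watts": 2.5},
    "NovuTensor": {"tops": 15.0, "watts": 20.0},
}


def tops_per_watt(tops: float, watts: float) -> float:
    """Nominal peak efficiency in tera-operations per second per watt."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return tops / watts


for name, spec in SPECS.items():
    ratio = tops_per_watt(spec["tops"], spec["watts"])
    print(f"{name}: {ratio:.2f} TOPS/W")
```

A ratio computed this way only compares data-sheet numbers; the evaluation slides measure images per second per watt on real workloads instead.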
Challenges of Benchmarking Edge AI Processors
• Challenge-1: Workload Selection
❖ What are the representative models and datasets for benchmarking edge AI processors with object detection workloads?
• Challenge-2: Deployment
❖ How to deploy deep neural networks on edge devices, given that each edge device needs a specific framework?
• Challenge-3: Metrics and Dimensions
❖ How to select an essential set of metrics and dimensions to comprehensively evaluate edge AI devices?
Overview
• Introduction
• Overview of Edge AI Processors
• Benchmarking Methodology
  • Workload and Dataset Selection
  • Deployment Experience
  • Metrics and Dimensions Selection
• Evaluation
• Conclusion
Object Detection Workloads – YOLOv2
• A real-time object detection system that reports which objects appear in the input
• Tiny-YOLO is a lite version of YOLOv2
• Based on the Darknet framework; can detect objects in an image or a video
• Uses the Darknet-19 neural network
[Figure: Darknet-19]
https://pjreddie.com/darknet/yolov2/
YOLO9000: Better, Faster, Stronger. Joseph Redmon, Ali Farhadi
Object Detection Workloads – MS COCO
• 330K images (>200K labeled)
• 1.5 million object instances
• 80 object categories
❖ Images contain rich information, with many objects per image
❖ Large number of instances per category
[Figure: Microsoft COCO dataset examples]
http://cocodataset.org/#home
Microsoft COCO: Common Objects in Context. Lin et al.
Deployment Experience
• Model preparation: retrain the model using the ReLU activation function; modify the weights of the first convolutional layer
• EdgeTPU: DarkFlow [1] → TensorFlow model (32-bit) → post-training integer quantization [2] with calibration data → TensorFlow Lite model (8-bit) → EdgeTPU compiler → Edge TPU model → Edge TPU device
• Xavier: NVIDIA's DeepStream reference applications [3]; TensorRT 5.0.3; 15-watt and 30-watt modes
• NovuTensor: NovuSDK
[1] https://github.com/thtrieu/darkflow
[2] https://medium.com/tensorflow/tensorflow-model-optimization-toolkit-post-training-integer-quantization-b4964a1ea9ba
[3] https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps
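The post-training integer quantization step named on this slide maps 32-bit floating-point values to 8-bit integers using ranges observed on calibration data. The sketch below illustrates the general idea with a simplified per-tensor affine scheme in pure Python. It is an assumption-laden illustration, not the TensorFlow Lite implementation, and the names `choose_qparams`, `quantize`, and `dequantize` are hypothetical.

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # signed 8-bit integer range


def choose_qparams(calibration: List[float]) -> Tuple[float, int]:
    """Derive an affine scale and zero point from calibration values."""
    # Extend the observed range to include 0.0 so that zero is exactly representable.
    lo = min(min(calibration), 0.0)
    hi = max(max(calibration), 0.0)
    if hi == lo:
        return 1.0, 0
    scale = (hi - lo) / (QMAX - QMIN)
    zero_point = int(round(QMIN - lo / scale))
    zero_point = max(QMIN, min(QMAX, zero_point))
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float to an int8 code, clamping to the representable range."""
    q = int(round(x / scale)) + zero_point
    return max(QMIN, min(QMAX, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an int8 code back to the float it represents."""
    return (q - zero_point) * scale


# Illustrative calibration values, not taken from the benchmark.
calibration = [-1.0, 0.0, 0.7, 2.0]
scale, zero_point = choose_qparams(calibration)
for value in calibration:
    code = quantize(value, scale, zero_point)
    print(value, code, dequantize(code, scale, zero_point))
```

The round-trip error of each in-range value is bounded by half the scale, which is one source of the small accuracy differences that lower-precision arithmetic introduces.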
Metrics and Dimensions
• Latency (ms): execution time, covering
❖ Preprocess
❖ Execution
❖ Postprocess
• Accuracy: Mean Average Precision
$AP = \int_0^1 p(r)\,dr$
$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$
where $p(r)$ is precision as a function of recall $r$ and $N$ is the number of object categories
• Energy Efficiency (images/sec/watt): number of input images that can be fully processed per unit of power
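The accuracy metric on this slide integrates precision over recall per class and then averages across classes. The sketch below approximates that integral with a simple rectangle sum over a ranked detection list. It assumes each detection has already been marked as a true or false positive; benchmark toolkits typically add a matching step and precision interpolation that are omitted here. The function names are illustrative.

```python
from typing import Sequence


def average_precision(is_true_positive: Sequence[bool], num_ground_truth: int) -> float:
    """Approximate the area under the precision-recall curve for one class.

    is_true_positive lists detections sorted by descending confidence.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    ap = 0.0
    true_positives = 0
    previous_recall = 0.0
    for rank, hit in enumerate(is_true_positive, start=1):
        if hit:
            true_positives += 1
        precision = true_positives / rank
        recall = true_positives / num_ground_truth
        # Rectangle rule: precision at this rank times the recall gained.
        ap += precision * (recall - previous_recall)
        previous_recall = recall
    return ap


def mean_average_precision(per_class_ap: Sequence[float]) -> float:
    """Average the per-class AP values (the 1/N sum on the slide)."""
    if not per_class_ap:
        raise ValueError("need at least one class")
    return sum(per_class_ap) / len(per_class_ap)


# Illustrative ranked detections for two hypothetical classes.
ap_a = average_precision([True, False, True], num_ground_truth=2)
ap_b = average_precision([False, True], num_ground_truth=1)
print(ap_a, ap_b, mean_average_precision([ap_a, ap_b]))
```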
Overview
• Introduction
• Overview of Edge AI Processors
• Benchmarking Methodology
• Evaluation
• Conclusion
Accuracy Dimension
Evaluation Results - Accuracy
[Figure: mAP of Tiny-YOLO and YOLOv2 on Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti. Performance running YOLOv2 and Tiny-YOLO with 416x416 input images]
• The edge processors provide accurate results, with a 1% to 3% accuracy difference due to lower-precision arithmetic
• Accuracy degradation differs across devices because each implements quantization differently
Latency Dimension
Evaluation Results - Latency
[Figure: latency (ms) of Tiny-YOLO and YOLOv2 on Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti. Performance running YOLOv2 and Tiny-YOLO with 416x416 input images]
❖ EdgeTPU is 9.5X and 14.79X slower than the GPU when running Tiny-YOLO and YOLOv2
❖ NovuTensor and Xavier are 4.66X - 6.08X slower than the GPU
❖ Xavier is 2X and 5.28X faster than EdgeTPU in the max power mode
❖ NovuTensor is 2.04X and 3.8X faster than EdgeTPU for YOLOv2 and Tiny-YOLO
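Latency figures like these can be collected by timing the preprocess, execution, and postprocess stages for each image and summing the stage means. The sketch below shows one way to do that, assuming latency is the sum of the three stages and that each stage is a plain Python callable. The stage functions in the usage example are placeholders, not the benchmark's code, and `time_stages` and `slowdown` are hypothetical helper names.

```python
import time
from statistics import mean
from typing import Any, Callable, Dict, Iterable


def time_stages(
    images: Iterable[Any],
    preprocess: Callable[[Any], Any],
    execute: Callable[[Any], Any],
    postprocess: Callable[[Any], Any],
) -> Dict[str, float]:
    """Return the mean per-image time of each stage and their sum, in ms."""
    samples = {"preprocess_ms": [], "execute_ms": [], "postprocess_ms": []}
    for image in images:
        t0 = time.perf_counter()
        tensor = preprocess(image)
        t1 = time.perf_counter()
        raw = execute(tensor)
        t2 = time.perf_counter()
        postprocess(raw)
        t3 = time.perf_counter()
        samples["preprocess_ms"].append((t1 - t0) * 1000.0)
        samples["execute_ms"].append((t2 - t1) * 1000.0)
        samples["postprocess_ms"].append((t3 - t2) * 1000.0)
    if not samples["execute_ms"]:
        raise ValueError("need at least one image")
    report = {name: mean(values) for name, values in samples.items()}
    report["latency_ms"] = sum(report.values())
    return report


def slowdown(latency_ms: float, baseline_ms: float) -> float:
    """How many times slower a device is than a baseline device."""
    return latency_ms / baseline_ms


# Placeholder stages standing in for real image decoding and inference.
report = time_stages(
    images=range(10),
    preprocess=lambda image: [image] * 100,
    execute=lambda tensor: sum(tensor),
    postprocess=lambda raw: str(raw),
)
print(report)
```

Ratios such as "X times slower than the GPU" then follow from `slowdown` applied to two measured latencies.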
Energy Efficiency Dimension
Evaluation Results – Energy Efficiency
[Figure: energy efficiency (image/sec/watt) of Tiny-YOLO and YOLOv2 on Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti. Performance running YOLOv2 and Tiny-YOLO with 416x416 input images]
❖ All edge AI processors have higher energy efficiency than the GPU because of their low power consumption
❖ EdgeTPU delivers 2.9X and 1.13X higher energy efficiency than Xavier; 1.96X and 1.04X higher than NovuTensor
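The images/sec/watt metric can be derived from a measured latency and a power figure. The sketch below assumes images are processed one at a time, so throughput is the reciprocal of latency, and that power is constant during the run. The numbers in the usage lines are illustrative placeholders, not results from the slides.

```python
def throughput_images_per_sec(latency_ms: float) -> float:
    """Images per second when images are processed one at a time."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


def energy_efficiency(latency_ms: float, power_watts: float) -> float:
    """Images fully processed per second per watt."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return throughput_images_per_sec(latency_ms) / power_watts


# Hypothetical device: 50 ms per image at 10 W.
print(energy_efficiency(latency_ms=50.0, power_watts=10.0))
# Hypothetical device: 10 ms per image at 250 W.
print(energy_efficiency(latency_ms=10.0, power_watts=250.0))
```

The second placeholder device is five times faster yet less efficient per watt, which is the kind of trade-off this dimension is meant to expose.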
Evaluation Results – Large Images
[Figure: (a) Latency (ms) and (b) Energy Efficiency (image/sec/watt) on Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti. Performance running YOLOv2 and Tiny-YOLO with 1024x1024 input images]
• Xavier in the 15-watt mode delivers the best energy efficiency
• 1080Ti with TensorRT delivers the lowest latency