Deep Learning Deployment with NVIDIA TensorRT


  1. DEEP LEARNING DEPLOYMENT WITH NVIDIA TENSORRT Shashank Prasanna

  2. AGENDA
     • Deep Learning in Production - Current Approaches, Deployment Challenges
     • NVIDIA TensorRT - Programmable Inference Accelerator, Performance, Optimizations and Features
     • Example - Import, Optimize and Deploy TensorFlow Models with TensorRT
     • Key Takeaways and Additional Resources
     • Q&A

  3. DEEP LEARNING IN PRODUCTION: Speech Recognition, Recommender Systems, Autonomous Driving, Real-time Object Recognition, Robotics, Real-time Language Translation, Many More…

  4. CURRENT DEPLOYMENT WORKFLOW
     TRAINING: Data Management → Training (Training Data) → Model Assessment → Trained Neural Network, built on CUDA and the NVIDIA Deep Learning SDK (cuDNN, cuBLAS, NCCL)
     UNOPTIMIZED DEPLOYMENT: (1) Deploy the training framework, (2) Deploy a custom application using the NVIDIA DL SDK, or (3) Deploy a framework or custom CPU-only application

  5. CHALLENGES WITH CURRENT APPROACHES (Requirement → Challenges)
     • High Throughput: Unable to process high-volume, high-velocity data ➢ Impact: increased cost ($, time) per inference
     • Low Response Time: Applications don't deliver real-time results ➢ Impact: negatively affects user experience (voice recognition, personalized recommendations, real-time object detection)
     • Power and Memory Efficiency: Inefficient applications ➢ Impact: increased cost (running and cooling), makes deployment infeasible
     • Deployment-Grade Solution: Research frameworks not designed for production ➢ Impact: framework overhead and dependencies increase time to solution and affect productivity

  6. NVIDIA TENSORRT - Programmable Inference Accelerator
     FRAMEWORKS → TensorRT (Optimizer + Runtime) → GPU PLATFORMS: TESLA P4, JETSON TX2, DRIVE PX 2, NVIDIA DLA, TESLA V100
     developer.nvidia.com/tensorrt

  7. TENSORRT PERFORMANCE
     [Chart] 40x Faster CNNs on V100 vs. CPU-Only, Under 7ms Latency (ResNet50): inference throughput (images/sec) and latency (ms) for CPU-Only, V100 + TensorFlow, and V100 + TensorRT; V100 + TensorRT reaches roughly 5,700 images/sec at under 7 ms.
     [Chart] 140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT): inference throughput (sentences/sec) and latency (ms) for CPU-Only + Torch, V100 + Torch, and V100 + TensorRT; V100 + TensorRT reaches roughly 550 sentences/sec.
     ResNet50 configuration: V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK; score doubled to comprehend Intel's stated claim of 2x performance improvement on Skylake with AVX512.
     OpenNMT 692M configuration: V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On.
     developer.nvidia.com/tensorrt

  8. TENSORRT DEPLOYMENT WORKFLOW
     Step 1: Optimize trained model - Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)
     Step 2: Deploy optimized plans with runtime - Optimized Plans → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime (Data center, Automotive, Embedded)
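     As a rough illustration of step 1, here is a minimal sketch using the TensorRT Python API. It assumes a TensorRT 8.x-style builder and a model already exported to ONNX; the file names and the FP16 choice are illustrative and not from the presentation:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_plan(onnx_path, plan_path, use_fp16=True):
    """Import a trained model, optimize it, and serialize a plan file."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("ONNX parse failed: " + str(parser.get_error(0)))

    config = builder.create_builder_config()
    if use_fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)     # allow reduced-precision kernels

    # The builder applies layer fusion, precision calibration, kernel
    # auto-tuning, etc. and returns a serialized "plan".
    plan = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as f:
        f.write(plan)

# build_plan("resnet50.onnx", "resnet50_fp16.plan")   # illustrative paths
```

     The resulting plan file can then be shipped to the target platform and de-serialized by the TensorRT runtime, as in step 2.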

  9. MODEL IMPORTING (for AI Researchers, Data Scientists)
     • Model Importer (Python/C++ API) - Example: importing a TensorFlow model
     • Network Definition API (Python/C++ API) - for other frameworks
     • Runtime inference - C++ or Python API
     developer.nvidia.com/tensorrt
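     A minimal sketch of importing a frozen TensorFlow model, assuming the UFF-based workflow TensorRT shipped around the time of this talk (the uff converter has since been deprecated in favor of ONNX); the file names, node names and input shape are illustrative:

```python
import uff                     # TensorFlow-to-UFF converter bundled with older TensorRT releases
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Convert a frozen TensorFlow graph to UFF (node names are placeholders).
uff.from_tensorflow_frozen_model("frozen_vgg19.pb",
                                 output_nodes=["predictions/Softmax"],
                                 output_filename="vgg19.uff")

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()                 # implicit-batch network used by the UFF path
parser = trt.UffParser()
parser.register_input("input_1", (3, 224, 224))    # CHW input shape
parser.register_output("predictions/Softmax")
parser.parse("vgg19.uff", network)                 # populates the TensorRT network definition
```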

  10. TENSORRT LAYERS
     Built-in Layer Support: • Convolution • LSTM and GRU • Activation: ReLU, tanh, sigmoid • Pooling: max and average • Scaling • Element-wise operations • LRN • Fully-connected • SoftMax • Deconvolution
     Custom Layer API: Deployed Application → TensorRT Runtime → Custom Layer → CUDA Runtime
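     To illustrate how the built-in layers are exposed, here is a small hypothetical sketch using the TensorRT network definition Python API (TensorRT 8.x names; the layer sizes and zero-valued dummy weights are purely illustrative). Custom layers are instead implemented against the plugin interface and inserted into the network the same way:

```python
import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Describe a tiny network directly with built-in layers (dummy weights).
inp = network.add_input("input", trt.float32, (1, 3, 224, 224))
w = trt.Weights(np.zeros((16, 3, 3, 3), dtype=np.float32))
b = trt.Weights(np.zeros(16, dtype=np.float32))
conv = network.add_convolution_nd(inp, 16, (3, 3), w, b)      # Convolution
conv.stride_nd = (1, 1)
relu = network.add_activation(conv.get_output(0),
                              trt.ActivationType.RELU)        # Activation
pool = network.add_pooling_nd(relu.get_output(0),
                              trt.PoolingType.MAX, (2, 2))    # Pooling
network.mark_output(pool.get_output(0))
```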

  11. TENSORRT OPTIMIZATIONS: Layer & Tensor Fusion, Weights & Activation Precision Calibration, Kernel Auto-Tuning, Dynamic Tensor Memory
     ➢ Optimizations are completely automatic
     ➢ Performed with a single function call

  12. LAYER & TENSOR FUSION
     [Diagram] Un-Optimized Network: an Inception-style block with 1x1, 3x3 and 5x5 convolutions (each followed by bias and ReLU) and a max pool branch with a 1x1 convolution, joined by concat layers and feeding the next input. TensorRT Optimized Network: each conv + bias + ReLU sequence is fused into a single 1x1/3x3/5x5 CBR block, the max pool is kept, and the concat layers are eliminated.

  13. LAYER & TENSOR FUSION
     • Vertical Fusion • Horizontal Fusion • Layer Elimination
     Layers before → after fusion: VGG19 43 → 27; Inception V3 309 → 113; ResNet-152 670 → 159
     (Diagram as on slide 12.)

  14. FP16, INT8 PRECISION CALIBRATION
     Precision / Dynamic Range: FP32 (training precision): -3.4x10^38 ~ +3.4x10^38 | FP16 (no calibration required): -65504 ~ +65504 | INT8 (requires calibration): -128 ~ +127
     Precision calibration for INT8 inference:
     ➢ Minimizes information loss between FP32 and INT8 inference on a calibration dataset
     ➢ Completely automatic
     [Chart] Reduced Precision Inference Performance (ResNet50), images/second for CPU-Only (FP32), P4 (INT8 vs. FP32), and V100 (FP16 Tensor Core vs. FP32), y-axis up to ~6,000 images/sec

  15. FP16, INT8 PRECISION CALIBRATION (continued)
     FP32 vs. INT8 Top-1 accuracy after calibration: Googlenet 68.87% / 68.49% (difference 0.38%); VGG 68.56% / 68.45% (0.11%); Resnet-50 73.11% / 72.54% (0.57%); Resnet-152 75.18% / 74.56% (0.61%)
     (Dynamic range table and performance chart as on slide 14.)
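     The calibration dataset is supplied to the builder through a calibrator object. Below is a minimal sketch of an INT8 entropy calibrator using the TensorRT Python API together with PyCUDA; the class name, batch size, cache file name and the calibration_data array are all illustrative:

```python
import numpy as np
import pycuda.autoinit            # creates a CUDA context for this process
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds batches of representative data so TensorRT can pick INT8 scales."""
    def __init__(self, calibration_data, batch_size=8, cache_file="calib.cache"):
        super().__init__()
        self.data = calibration_data          # e.g. a NumPy array of preprocessed images
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.index = 0
        self.device_input = cuda.mem_alloc(self.data[0:batch_size].nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.index + self.batch_size > len(self.data):
            return None                       # no more calibration batches
        batch = np.ascontiguousarray(self.data[self.index:self.index + self.batch_size])
        cuda.memcpy_htod(self.device_input, batch)
        self.index += self.batch_size
        return [int(self.device_input)]

    def read_calibration_cache(self):
        return None                           # force fresh calibration in this sketch

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# Enable INT8 when building the engine (config as in the earlier build sketch):
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = EntropyCalibrator(calibration_images)
```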

  16. KERNEL AUTO-TUNING, DYNAMIC TENSOR MEMORY
     Kernel Auto-Tuning: • 100s of specialized kernels • Optimized for every GPU platform (Tesla V100, Jetson TX2, Drive PX2) • Multiple parameters: batch size, input dimensions, filter dimensions
     Dynamic Tensor Memory: • Reduces memory footprint and improves memory re-use • Manages memory allocation for each tensor only for the duration of its usage
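     Both features are automatic, but the builder configuration tells the auto-tuner what to optimize for. A small sketch, assuming a recent TensorRT Python API and a network with a dynamic batch dimension (the tensor name "input", the workspace budget and the shape ranges are illustrative):

```python
import tensorrt as trt

def configure_tuning(builder, config):
    """Hint the kernel auto-tuner: workspace budget plus the shapes to optimize for."""
    # Scratch memory the auto-tuner may use while timing candidate kernels.
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB

    # Optimization profile: the batch sizes / input dimensions kernels are tuned for.
    profile = builder.create_optimization_profile()
    profile.set_shape("input",
                      min=(1, 3, 224, 224),
                      opt=(8, 3, 224, 224),
                      max=(32, 3, 224, 224))
    config.add_optimization_profile(profile)
```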

  17. TENSORRT DEPLOYMENT WORKFLOW (recap of slide 8)
     Step 1: Optimize trained model - Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans
     Step 2: Deploy optimized plans with runtime - Optimized Plans → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime (Data center, Automotive, Embedded)

  18. EXAMPLE: DEPLOYING TENSORFLOW MODELS WITH TENSORRT
     Import, optimize and deploy TensorFlow models using the TensorRT Python API. Steps (a sketch follows below):
     • Start with a frozen TensorFlow model (the trained neural network)
     • Create a model parser
     • Optimize the model and create a runtime engine (TensorRT Optimizer → Optimized Runtime Engine)
     • Perform inference on new data using the optimized runtime engine (→ Inference Results)
     developer.nvidia.com/tensorrt
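     Continuing the UFF import sketch from slide 9, the optimize-and-create-engine step might look like this with the pre-TensorRT-8 builder API in use around the time of this talk (attribute and method names changed in later releases; the workspace size and precision flag are illustrative):

```python
# `builder` and `network` come from the UFF import sketch above.
builder.max_batch_size = 1                   # largest batch the engine must support
builder.max_workspace_size = 1 << 30         # scratch memory available to the optimizer
builder.fp16_mode = True                     # allow reduced precision where the GPU supports it

engine = builder.build_cuda_engine(network)  # runs all TensorRT optimizations in one call
context = engine.create_execution_context()  # used to perform inference on new data
```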

  19. 7 STEPS TO DEPLOYMENT WITH TENSORRT (steps 5 and 6 are sketched below)
     Step 1: Convert trained model into TensorRT format
     Step 2: Create a model parser
     Step 3: Register inputs and outputs
     Step 4: Optimize model and create a runtime engine
     Step 5: Serialize optimized engine
     Step 6: De-serialize engine
     Step 7: Perform inference
     developer.nvidia.com/tensorrt
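     Steps 1 through 4 correspond to the parser and builder sketches above. For steps 5 and 6, a minimal sketch of writing the optimized engine to a plan file and loading it back with the TensorRT runtime (the file name reuses the plan file named on the recap slide):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Step 5: serialize the optimized engine to a plan file.
with open("keras_vgg19_b1_fp32.engine", "wb") as f:
    f.write(engine.serialize())               # `engine` from the build step above

# Step 6: de-serialize the plan with the TensorRT runtime (typically on the target device).
runtime = trt.Runtime(TRT_LOGGER)
with open("keras_vgg19_b1_fp32.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
```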

  20. RECAP: DEPLOYMENT WORKFLOW
     Step 1: Optimize trained model - VGG19 (FP32/FP16, batch size 1) → Import Model → Serialize Engine → plan file keras_vgg19_b1_fp32.engine
     Step 2: Deploy optimized plan with runtime - plan file keras_vgg19_b1_fp32.engine → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime → prediction results on new flower images
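     For the final inference step (step 7), here is a rough sketch of running one image through a de-serialized engine, using PyCUDA for device memory. It assumes an explicit-batch engine like the ONNX-based sketch earlier (legacy implicit-batch engines use context.execute instead), and the binding-based calls follow TensorRT 8.x naming:

```python
import numpy as np
import pycuda.autoinit                      # creates a CUDA context for this process
import pycuda.driver as cuda
import tensorrt as trt

def infer(engine, image):
    """Run a single preprocessed image through a de-serialized engine."""
    context = engine.create_execution_context()

    bindings, host_outputs, device_outputs = [], [], []
    input_image = np.ascontiguousarray(image, dtype=np.float32)
    for binding in engine:                           # iterate over binding names
        if engine.binding_is_input(binding):
            d_input = cuda.mem_alloc(input_image.nbytes)
            cuda.memcpy_htod(d_input, input_image)   # copy input to the GPU
            bindings.append(int(d_input))
        else:
            shape = engine.get_binding_shape(binding)
            host_out = np.empty(trt.volume(shape), dtype=np.float32)
            d_out = cuda.mem_alloc(host_out.nbytes)
            bindings.append(int(d_out))
            host_outputs.append(host_out)
            device_outputs.append(d_out)

    context.execute_v2(bindings)                     # synchronous inference
    for host_out, d_out in zip(host_outputs, device_outputs):
        cuda.memcpy_dtoh(host_out, d_out)            # copy predictions back to the host
    return host_outputs
```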

  21. CHALLENGES ADDRESSED BY TENSORRT (Requirement → TensorRT Delivers)
     • High Throughput: Maximizes inference performance on NVIDIA GPUs ➢ INT8/FP16 precision calibration, layer & tensor fusion, kernel auto-tuning ➢ Up to 40x faster than CPU-only inference and 18x faster inference of TensorFlow models
     • Low Response Time: ➢ Under 7ms real-time latency
     • Power and Memory Efficiency: Performs target-specific optimizations ➢ Platform-specific kernels for Embedded (Jetson), Datacenter (Tesla GPUs) and Automotive (DrivePX) ➢ Dynamic tensor memory management improves memory re-use
     • Deployment-Grade Solution: Designed for production environments ➢ No framework overhead, minimal dependencies ➢ Multiple frameworks, Network Definition API ➢ C++ and Python APIs, Custom Layer API
