S8286: Quick and Easy DL Workflow Proof of Concept
Alec Gunny, Ken Hester


  1. S8286: QUICK AND EASY DL WORKFLOW PROOF OF CONCEPT (Alec Gunny, Ken Hester)

  2. AGENDA
     Deep Learning in Production: Current Approaches, Deployment Challenges
     NVIDIA TensorRT: Programmable Inference Accelerator; Performance, Optimizations and Features
     Example: Import, Optimize and Deploy TensorFlow Models with TensorRT
     Additional Resources
     Q&A

  3. SINGLE GPU PLATFORM BOOSTS ALL ACCELERATED WORKLOADS
     HPC, AI Training, AI Inference, Data Analytics: 450+ applications, 10M users, 40 years of video/day
     NVIDIA Deep Learning SDK and CUDA libraries (cuDNN, cuBLAS, NCCL), DeepStream SDK
     DGX and Tesla V100: the universal GPU

  4. WHERE TO TRAIN: At Your Desk, In-the-Cloud, On-Prem

  5. CURRENT DEPLOYMENT WORKFLOW
     Training: Training Data + Data Management → Training → Model Assessment → Trained Neural Network
     Unoptimized deployment options:
     1) Deploy the training framework
     2) Deploy a custom application using the NVIDIA DL SDK
     3) Framework- or custom CPU-only application
     All built on CUDA and the NVIDIA Deep Learning SDK (cuDNN, cuBLAS, NCCL)

  6. CHALLENGES WITH CURRENT APPROACHES
     High Throughput: unable to process high-volume, high-velocity data. Impact: increased cost ($, time) per inference.
     Low Response Time: applications don't deliver real-time results. Impact: negatively affects user experience (voice recognition, personalized recommendations, real-time object detection).
     Power and Memory Efficiency: inefficient applications. Impact: increased cost (running and cooling); can make deployment infeasible.
     Deployment-Grade Solution: research frameworks are not designed for production. Impact: framework overhead and dependencies increase time-to-solution and affect productivity.
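The "increased cost per inference" impact of low throughput can be made concrete with a little arithmetic. This is an illustrative sketch: the hourly server prices are made up, while the throughput figures (140 and 5,700 images/sec) come from the ResNet50 benchmark slide later in the deck.

```python
# Illustrative sketch (hypothetical prices): how throughput drives
# dollar cost per inference on a rented server.

def cost_per_million_inferences(throughput_per_sec, server_cost_per_hour):
    """Dollar cost to serve one million inferences on a single server."""
    seconds_needed = 1_000_000 / throughput_per_sec
    return server_cost_per_hour * seconds_needed / 3600

# CPU server at 140 inferences/sec vs. GPU server at 5,700 inferences/sec
# (throughputs from the ResNet50 slide; the $/hour figures are invented).
cpu_cost = cost_per_million_inferences(140, server_cost_per_hour=1.00)
gpu_cost = cost_per_million_inferences(5700, server_cost_per_hour=3.00)
print(f"CPU: ${cpu_cost:.2f}, GPU: ${gpu_cost:.2f} per 1M inferences")
```

Even at a 3x higher hourly price, the higher-throughput server comes out far cheaper per inference, which is the cost argument the slide is making.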

  7. NVIDIA DEEP LEARNING SOFTWARE PLATFORM
     Training: Data Management → Training → Model Assessment → Trained Neural Network
     Inference: Data center (TensorRT), Embedded (JetPack SDK), Automotive (DriveWorks SDK)
     All built on the NVIDIA Deep Learning SDK and CUDA
     developer.nvidia.com/deep-learning-software

  8. NVIDIA TENSORRT: Programmable Inference Accelerator
     Frameworks → TensorRT (Optimizer, Runtime) → GPU platforms: Tesla P4, Jetson TX2, Drive PX 2, NVIDIA DLA, Tesla V100
     developer.nvidia.com/tensorrt

  9. NVIDIA TENSORRT PROGRAMMABLE INFERENCING PLATFORM
     Inputs: UFF, TRT Network API → TensorRT (Optimizer, Runtime) → Tesla P4, Jetson TX2, Drive PX 2, NVIDIA DLA, Tesla V100

  10. TENSORRT PERFORMANCE
     [Charts] 40x faster CNN inference (ResNet50) on V100 vs. CPU-only, under 7 ms latency; 140x faster language translation RNN inference (OpenNMT) on V100 vs. CPU-only.
     ResNet50 throughput (images/sec): CPU-only 140, V100 + TensorFlow 305, V100 + TensorRT 5,700 (6.83 ms / 6.67 ms latency for the V100 + TensorRT bars).
     OpenNMT throughput (sentences/sec): CPU-only + Torch 4, V100 + Torch 25, V100 + TensorRT 550.
     ResNet50 config: V100 + TensorRT (FP16, batch 39, Tesla V100-SXM2-16GB); V100 + TensorFlow (preview of Volta-optimized TensorFlow, FP16, batch 2, Tesla V100-PCIE-16GB); CPU-only (Intel Xeon-D 1587, Intel DL SDK; score doubled to reflect Intel's stated claim of 2x performance improvement on Skylake).
     OpenNMT 692M config: V100 + TensorRT (FP32, batch 64, Tesla V100-PCIE-16GB); V100 + Torch (FP32, batch 4, Tesla V100-PCIE-16GB); CPU-only Torch (FP32, batch 1, Intel E5-2690 v4 with AVX512).
     GPU host CPU in all cases: E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on.
     developer.nvidia.com/tensorrt
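The headline multipliers can be sanity-checked against the per-bar throughput numbers on the slide (images/sec for ResNet50, sentences/sec for OpenNMT):

```python
# Recompute the headline speedups from the per-configuration throughputs
# shown on the performance slide.

resnet50 = {"CPU-Only": 140, "V100 + TensorFlow": 305, "V100 + TensorRT": 5700}
opennmt = {"CPU-Only + Torch": 4, "V100 + Torch": 25, "V100 + TensorRT": 550}

cnn_speedup = resnet50["V100 + TensorRT"] / resnet50["CPU-Only"]
rnn_speedup = opennmt["V100 + TensorRT"] / opennmt["CPU-Only + Torch"]
print(f"CNN: {cnn_speedup:.1f}x, RNN: {rnn_speedup:.1f}x")  # close to the quoted 40x and 140x
```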

  11. TENSORRT DEPLOYMENT WORKFLOW
     Step 1: Optimize trained model. Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3).
     Step 2: Deploy optimized plans with runtime. Optimized Plans → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime to Data center, Automotive, Embedded.
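The two-step workflow above (build and serialize plans once, then deserialize and run at deploy time) can be sketched with a stand-in for the TensorRT engine. Everything below is hypothetical: real TensorRT plans are opaque binary engines produced by the TensorRT API, while this sketch uses JSON only to keep the build-once/deploy-many pattern runnable.

```python
import json

# Hypothetical stand-in for the two-step TensorRT workflow: an "optimizer"
# that turns a trained network into a serialized plan file, and a "runtime"
# step that deserializes the plan for inference.

def optimize(trained_network: dict, plan_path: str) -> None:
    # Step 1: import the model, optimize, and serialize the engine to a plan.
    plan = {"layers": trained_network["layers"], "precision": "fp16"}
    with open(plan_path, "w") as f:
        json.dump(plan, f)

def deploy(plan_path: str) -> dict:
    # Step 2: deserialize the plan and return a ready-to-run engine.
    with open(plan_path) as f:
        return json.load(f)

optimize({"layers": ["conv", "relu", "pool"]}, "model.plan")
engine = deploy("model.plan")
```

The point of the split is that the expensive optimization runs once, offline; deployment targets (data center, automotive, embedded) only pay the cheap deserialization cost.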


  13. MODEL IMPORTING
     Example: importing a TensorFlow model via the Model Importer (Python/C++ API).
     Other frameworks: Network Definition API (Python/C++).
     Runtime inference via the C++ or Python API.
     Audience: AI researchers, data scientists.
     developer.nvidia.com/tensorrt
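A model importer can be pictured as a translation table from framework op names to runtime layers, with unsupported ops falling back to the network-definition path. The op names and builder below are invented for illustration; this is not the TensorRT importer API.

```python
# Hypothetical sketch of a model importer: each framework op name maps to
# a runtime layer implementation; anything unknown must go through the
# network-definition API instead.

LAYER_BUILDERS = {
    "Relu":  lambda x: [max(v, 0.0) for v in x],
    "Scale": lambda x, s=2.0: [v * s for v in x],
}

def import_graph(ops):
    """Translate a list of framework op names into a runnable pipeline."""
    layers = []
    for op in ops:
        if op not in LAYER_BUILDERS:
            raise ValueError(f"Unsupported op {op!r}: use the network-definition API")
        layers.append(LAYER_BUILDERS[op])

    def run(x):
        for layer in layers:
            x = layer(x)
        return x

    return run

model = import_graph(["Scale", "Relu"])
```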

  14. TENSORRT OPTIMIZATIONS
     Layer & Tensor Fusion
     Weights & Activation Precision Calibration
     Kernel Auto-Tuning
     Dynamic Tensor Memory
     Optimizations are completely automatic, performed with a single function call.
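Layer fusion, the first optimization listed, can be illustrated with two elementwise layers (a scale followed by a shift) collapsed into one fused affine pass: one traversal of the tensor instead of two. This is a conceptual sketch of the idea, not how TensorRT implements fusion internally.

```python
# Conceptual sketch of layer fusion: two elementwise layers (scale then
# shift) become a single fused pass, halving traversals of the tensor.

def scale(x, a):                  # layer 1: y = a * x
    return [a * v for v in x]

def shift(x, b):                  # layer 2: y = x + b
    return [v + b for v in x]

def fused_scale_shift(x, a, b):   # fused: y = a * x + b in one traversal
    return [a * v + b for v in x]

x = [1.0, 2.0, 3.0]
unfused = shift(scale(x, 2.0), 1.0)   # two passes over x
fused = fused_scale_shift(x, 2.0, 1.0)  # one pass over x
assert unfused == fused  # same math, fewer kernel launches / memory trips
```

On a GPU the win is fewer kernel launches and fewer round trips through memory for intermediate tensors, which is why fusion helps inference latency.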


  16. NVIDIA TENSORRT 3 NOW AVAILABLE
     Volta TensorCore support: 3.7x faster inference on Tesla V100 vs. Tesla P100 under 7 ms real-time latency.
     TensorFlow importer: optimize and deploy TensorFlow models up to 18x faster vs. the TensorFlow framework.
     Python API: improved productivity with an easy-to-use Python API for data-science workflows.
     Free download for members of the NVIDIA Developer Program: developer.nvidia.com/tensorrt

  17. NVIDIA JETPACK 3.2: SDK for embedded AI computing
     Deep Learning: TensorRT, cuDNN, DIGITS workflow
     Computer Vision: VisionWorks, OpenCV
     GPU Compute: CUDA, CUDA libraries
     Multimedia: ISP support, camera imaging, video CODEC
     Also includes ROS compatibility, OpenGL, advanced developer tools, and much more

  18. DEMO: Jetson TX2, AI Computer on a Module
     Advanced tech for intelligent machines; unmatched performance under 10 W; smaller than a credit card

  19. LEARN MORE
     Jetson: developer.nvidia.com/embedded-computing
     Success Stories: developer.nvidia.com/embedded/learn/success-stories
     Partners and Ecosystem: developer.nvidia.com/embedded/community
     Deep Learning Institute: www.nvidia.com/object/deep-learning-institute.html
     Two Days To A Demo: developer.nvidia.com/embedded/twodaystoademo
     Inception Program: www.nvidia.com/inception
