DEEP INTO TRTIS: BERT PRACTICAL DEPLOYMENT ON NVIDIA GPU
Xu Tianhao, Deep Learning Solution Architect, NVIDIA
DEEP INTO TRTIS: AGENDA
• TensorRT Hyperscale Inference Platform overview
• TensorRT Inference Server
  • Overview and deep dive: key features
  • Deployment possibilities: generic deployment ecosystem
• Hands-on
  • NVIDIA BERT overview
  • FasterTransformer and TensorRT-optimized BERT inference
  • Deploy BERT TensorFlow model with custom op
  • Deploy BERT TensorRT model with plugins
  • Benchmarking
• Open discussion
TENSORRT HYPERSCALE INFERENCE PLATFORM
• World's most advanced scale-out GPU
• TensorRT: integrated into TensorFlow & ONNX support
• TensorRT Inference Server
ANNOUNCING TESLA T4
World's most advanced inference GPU: universal inference acceleration
• 320 Turing Tensor Cores
• 2,560 CUDA cores
• 65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
• 16 GB | 320 GB/s
NEW TURING TENSOR CORE
Multi-precision for AI inference: 65 TFLOPS FP16 | 130 TOPS INT8 | 260 TOPS INT4
WORLD'S MOST PERFORMANT INFERENCE PLATFORM
Up to 36x faster than CPUs | Accelerates all AI workloads
[Charts: peak performance in TFLOPS/TOPS for Tesla P4 vs. Tesla T4 at FP16/INT8/INT4, plus speedups vs. a CPU server of 21x to 36x for speech inference (DeepSpeech 2), video inference (ResNet-50, 7 ms latency limit), and natural language processing inference (GNMT) on Tesla P4 and Tesla T4.]
NVIDIA TENSORRT OVERVIEW
From every framework, optimized for each target platform: NVIDIA T4, Tesla V100, Jetson TX2, DRIVE PX 2, NVIDIA DLA
NVIDIA TENSORRT OVERVIEW
From every framework, optimized for each target platform
• Quantized INT8 (precision optimization): significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss
• Layer fusion (graph optimization): improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node, for single-kernel execution
• Kernel auto-tuning: optimizes execution time by choosing the best data layers and best parallel algorithms for the target Jetson, Tesla, or DRIVE PX GPU platform
• Dynamic tensor memory (memory optimization): reduces memory footprint and improves memory reuse by allocating memory for each tensor only for the duration of its usage
• Multi-stream execution: scales to multiple input streams by processing them in parallel using the same model and weights
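Concretely, these optimizations are applied automatically when an engine is built. Below is a minimal sketch of building an FP16 engine from an ONNX model with the TensorRT Python API, assuming a TensorRT 7.x-era API; the model path and workspace size are placeholders, and exact calls differ across versions.

```python
# Minimal sketch: build a TensorRT engine from an ONNX model with FP16 enabled.
# Assumes a TensorRT 7.x-era Python API; details vary by TensorRT version.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path="model.onnx"):  # placeholder path
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("failed to parse ONNX model")
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30      # 1 GB scratch space for tactic selection
    config.set_flag(trt.BuilderFlag.FP16)    # enable reduced-precision kernels
    # Layer fusion, kernel auto-tuning, and memory optimization happen inside build_engine.
    return builder.build_engine(network, config)
```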
GRAPH OPTIMIZATION
• Vertical fusion
• Horizontal fusion
• Layer elimination
[Figure: an un-optimized Inception-style subgraph with separate 1x1/3x3/5x5 convolution, bias, and ReLU nodes feeding a concat, next to the TensorRT-optimized network in which each conv + bias + ReLU is fused into a single CBR node and duplicate 1x1 CBR branches are merged.]
Layers before vs. after optimization:
• VGG19: 43 → 27
• Inception V3: 309 → 113
• ResNet-152: 670 → 159
TENSORRT PERFORMANCE
• 40x faster CNNs on V100 vs. CPU-only under 7 ms latency (ResNet-50): 140 images/sec (CPU-only), 305 images/sec (V100 + TensorFlow), 5,700 images/sec (V100 + TensorRT)
• 140x faster language translation RNNs on V100 vs. CPU-only (OpenNMT): 4 sentences/sec (CPU-only + Torch), 25 sentences/sec (V100 + Torch), 550 sentences/sec (V100 + TensorRT)
Benchmark configuration: ResNet-50 inference throughput (images/sec). V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB. CPU-only: Intel Xeon-D 1587 Broadwell-E CPU with Intel DL SDK; score doubled to reflect Intel's stated claim of a 2x performance improvement on Skylake with AVX-512. OpenNMT 692M inference throughput (sentences/sec). V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB. CPU-only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on. Host CPU for GPU configurations: E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on.
developer.nvidia.com/tensorrt
DEEP INTO TRTIS: AGENDA
• TensorRT Hyperscale Inference Platform overview
• TensorRT Inference Server
  • Overview and deep dive: key features
  • Deployment possibilities: generic deployment ecosystem
• Hands-on
  • NVIDIA BERT overview
  • FasterTransformer and TensorRT-optimized BERT inference
  • Deploy BERT TensorFlow model with custom op
  • Deploy BERT TensorRT model with plugins
  • Benchmarking
• Open discussion
INEFFICIENCY LIMITS INNOVATION
Difficulties with deploying data center inference
• Single model only: some systems are overused while others are underutilized
• Custom development: developers need to reinvent the plumbing for every application
• Single framework only: solutions can only support models from one framework
NVIDIA TENSORRT INFERENCE SERVER
Architected for maximum data center utilization
• Maximize real-time inference performance of GPUs
• Quickly deploy and manage multiple models per GPU per node
• Easily scale to heterogeneous GPUs and multi-GPU nodes
• Integrates with orchestration systems and auto-scalers via latency and health metrics
• Now open source for thorough customization and integration
[Diagram: TensorRT Inference Server instances serving models on Tesla T4, Tesla V100, and Tesla P4 GPUs.]
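As a concrete reference point, launching the server is typically a single container run pointed at a model repository. A minimal sketch, assuming roughly the 19.10 TRTIS release; the image tag and /path/to/model_repository are placeholders, and flag spellings vary slightly across releases.

```
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tensorrtserver:19.10-py3 \
  trtserver --model-repository=/models
# Port 8000: HTTP/REST, 8001: gRPC, 8002: Prometheus metrics
```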
FEATURES
Utilization | Usability | Performance | Customization
• Dynamic batching: inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA
• Concurrent model execution: multiple models (or multiple instances of the same model) may execute on the GPU simultaneously
• Multiple model format support: PyTorch JIT (.pt), TensorFlow GraphDef/SavedModel, TensorFlow+TensorRT GraphDef, ONNX graph (ONNX Runtime), TensorRT Plans, Caffe2 NetDef (ONNX import path)
• CPU model inference execution: framework-native models can execute inference requests on the CPU
• System/CUDA shared memory: inputs/outputs that need to be passed to/from TRTIS are stored in system/CUDA shared memory, reducing HTTP/gRPC overhead
• Streaming API: built-in support for audio streaming input, e.g. for speech recognition
• Metrics: utilization, count, memory, and latency
• Custom backend: allows the user more flexibility by providing their own implementation of an execution engine through the use of a shared library
• Library version: link against libtrtserver.so so that you can include all the inference server functionality directly in your application
• Model ensemble: a pipeline of one or more models and the connection of input and output tensors between those models (can be used with a custom backend)
• Model control API: explicitly load/unload models into and out of TRTIS based on changes made in the model-control configuration
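To make several of these options concrete, each model in the repository carries a config.pbtxt. Below is a minimal sketch for a hypothetical TensorFlow SavedModel named my_bert, running two concurrent instances per GPU; all names, shapes, and sizes here are illustrative placeholders.

```
# models/my_bert/config.pbtxt  (hypothetical model name and tensors)
name: "my_bert"
platform: "tensorflow_savedmodel"
max_batch_size: 8
input [
  { name: "input_ids",  data_type: TYPE_INT32, dims: [ 128 ] },
  { name: "input_mask", data_type: TYPE_INT32, dims: [ 128 ] }
]
output [
  { name: "logits", data_type: TYPE_FP32, dims: [ 2 ] }
]
# Concurrent model execution: two copies of the model share each GPU.
instance_group [
  { count: 2, kind: KIND_GPU }
]
```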
INFERENCE SERVER ARCHITECTURE
Available with monthly updates
Models supported:
• TensorFlow GraphDef/SavedModel
• TensorFlow and TensorRT GraphDef
• TensorRT Plans
• Caffe2 NetDef (ONNX import)
• ONNX graph
• PyTorch JIT (.pt)
Multi-GPU support | Concurrent model execution | HTTP REST API/gRPC | Python/C++ client libraries
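On the client side, a request looks the same regardless of which backend serves the model. A rough sketch using the v1 tensorrtserver Python client library, continuing the hypothetical my_bert example above; the exact API changed in later Triton releases, so treat signatures as approximate.

```python
# Sketch of an HTTP inference request with the v1 TRTIS Python client.
# Model and tensor names are the hypothetical ones from the config above.
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

# Connect to the server's HTTP endpoint; -1 selects the latest model version.
ctx = InferContext("localhost:8000", ProtocolType.HTTP, "my_bert", -1)

input_ids = np.zeros((128,), dtype=np.int32)   # placeholder token IDs
input_mask = np.ones((128,), dtype=np.int32)   # placeholder attention mask

result = ctx.run(
    {"input_ids": (input_ids,), "input_mask": (input_mask,)},  # one array per batch element
    {"logits": InferContext.ResultFormat.RAW},
    batch_size=1)
print(result["logits"][0])
```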
COMMON WAYS TO FULLY UTILIZE GPU
1. Increase computation intensity: increase the batch size
2. Execute multiple tasks concurrently with multiple CUDA streams or MPS (Multi-Process Service)
DYNAMIC BATCHING SCHEDULER
[Diagram: a batch-1 request and a batch-4 request arrive at the TensorRT Inference Server; the dynamic batcher queues them and dispatches batches to the framework backend's runtime execution contexts.]
DYNAMIC BATCHING SCHEDULER
Grouping requests into a single "batch" increases overall GPU throughput.
Preferred batch size and wait time are configuration options; assume a preferred batch size of 4 gives the best utilization in this example, as in the configuration sketch below.
[Diagram: the dynamic batcher grouping incoming requests before dispatching them to the ModelY backend runtime contexts.]
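In the model's config.pbtxt, this behaviour is controlled by a dynamic_batching block. A sketch matching this example's preferred batch size of 4; the queue-delay value is an illustrative placeholder to tune against the latency SLA.

```
# Appended to the model's config.pbtxt (continuing the hypothetical my_bert example)
dynamic_batching {
  preferred_batch_size: [ 4 ]
  max_queue_delay_microseconds: 100
}
```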