April 4-7, 2016 | Silicon Valley
NVIDIA GIE: HIGH-PERFORMANCE GPU INFERENCE ENGINE
Michael Andersch, 7th April 2016
WHAT IS INFERENCE, ANYWAYS?
Building a deep neural network based application:
Step 1: Use data to train the neural network - training
Step 2: Use the neural network to process unseen data - inference
INFERENCE VS TRAINING
How is inference different from training?
1. No backpropagation / static weights: enables graph optimizations, simplifies memory management
2. Tendency towards smaller batch sizes: harder to amortize weight loading and to achieve high GPU utilization
3. Reduced precision requirements: provides opportunities for bandwidth savings and accelerated arithmetic
OPTIMIZING SOFTWARE FOR INFERENCE
Extracting every bit of performance
What's running on the GPU: cuDNN optimizations
- Support for standard tensor layouts and major frameworks
- Available automatically and "for free"
How you use it: framework optimizations
- Every last bit of performance matters
- Challenging due to framework structure
- Changes to one framework don't propagate to others
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Efficient small batch convolutions
- Optimal convolution algorithm depends on convolution layer dimensions
- [Chart: Winograd speedup over GEMM-based convolution (VGG-E layers, N=1): per-layer speedups between 0.73x and 2.26x across conv 1.1 through conv 5.0]
- Meta-parameters (data layouts, texture memory) afford higher performance
- Using texture memory for convolutions: 13% inference speedup (GoogLeNet, batch size 1)
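Because the best algorithm varies per layer, cuDNN lets the caller benchmark the candidates on the actual layer shape. The following is a minimal sketch of that API usage, assuming the tensor, filter, and convolution descriptors for the layer have already been created and configured; it illustrates cuDNN's algorithm selection, not GIE's internal logic.

```cpp
#include <cudnn.h>

// Pick the fastest forward-convolution algorithm for one layer by timing the
// candidates (implicit GEMM, FFT, Winograd, ...) on the layer's actual shape.
// The descriptors are assumed to be created and configured for that layer.
cudnnConvolutionFwdAlgo_t pickConvAlgorithm(cudnnHandle_t handle,
                                            cudnnTensorDescriptor_t xDesc,
                                            cudnnFilterDescriptor_t wDesc,
                                            cudnnConvolutionDescriptor_t convDesc,
                                            cudnnTensorDescriptor_t yDesc) {
    const int kRequested = 8;
    int returned = 0;
    cudnnConvolutionFwdAlgoPerf_t perf[kRequested];
    cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                         kRequested, &returned, perf);
    return perf[0].algo;  // results come back sorted by measured time
}
```

When per-layer benchmarking is too expensive, cudnnGetConvolutionForwardAlgorithm with the CUDNN_CONVOLUTION_FWD_PREFER_FASTEST preference is the cheaper heuristic alternative.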
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Graph optimization
[Diagram: an Inception-style module: the input feeds 1x1, 3x3, and 5x5 convolution branches (with 1x1 convolutions as reductions) and a max pool branch, joined by a tensor concat]
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Graph optimization
[Diagram: the same module as a framework executes it: every convolution is followed by separate bias and ReLU nodes, ending in a concat that produces the next input]
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Vertical fusion
[Diagram: each convolution, bias, ReLU chain is fused into a single CBR node (1x1 CBR, 3x3 CBR, 5x5 CBR); the max pool and concats remain]
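As a rough illustration of what vertical fusion does to the graph, here is a toy sketch. The Node structure, the op names, and the pass itself are illustrative assumptions rather than GIE's internal representation; the point is that each convolution, bias, ReLU chain becomes one node that can run as a single fused kernel without writing intermediates to memory.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy graph node; the fields and op names here are illustrative only.
struct Node {
    std::string op;              // "conv", "bias", "relu", "cbr", "dead", ...
    std::vector<Node*> inputs;   // producers of this node's inputs
    std::vector<Node*> outputs;  // consumers of this node's output
};

// Collapse every conv -> bias -> relu chain into one "cbr" node so the chain
// can execute as a single fused kernel with no intermediate tensors in DRAM.
void fuseVertically(std::vector<std::unique_ptr<Node>>& graph) {
    for (auto& n : graph) {
        if (n->op != "conv" || n->outputs.size() != 1) continue;
        Node* bias = n->outputs[0];
        if (bias->op != "bias" || bias->outputs.size() != 1) continue;
        Node* relu = bias->outputs[0];
        if (relu->op != "relu") continue;

        n->op = "cbr";               // conv absorbs the bias and ReLU
        n->outputs = relu->outputs;  // rewire the fused node to relu's consumers
        for (Node* consumer : n->outputs)
            for (Node*& in : consumer->inputs)
                if (in == relu) in = n.get();
        bias->op = "dead";           // absorbed nodes, removed in a later pass
        relu->op = "dead";
    }
}
```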
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Horizontal fusion
[Diagram: the 1x1 CBR nodes that read the same input are merged into a single, wider 1x1 CBR node feeding the 3x3 and 5x5 branches]
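Horizontal fusion can be pictured as concatenating the filter banks of layers that read the same input, so several small 1x1 convolutions become one wider one. The sketch below is a simplified host-side illustration under an assumed [K x C] filter layout for 1x1 kernels; it is not GIE's implementation.

```cpp
#include <vector>

// Merge several 1x1 convolutions that read the same input tensor into a single
// convolution whose output channels are the concatenation of the originals.
// Assumed (illustrative) layout: each filter bank is K_i * C floats, where
// K_i = output channels of branch i and C = shared input channel count.
std::vector<float> fuseHorizontally(const std::vector<std::vector<float>>& filterBanks) {
    std::vector<float> merged;
    for (const auto& bank : filterBanks)
        merged.insert(merged.end(), bank.begin(), bank.end());
    // One convolution over the shared input now does the work of several small
    // ones; each branch's output is a channel slice of the merged result.
    return merged;
}
```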
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Concat elision
[Diagram: the concat nodes are removed; each branch writes its output directly into the appropriate region of the already concatenated buffer]
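Concat elision means the concatenation never happens as a copy: each branch simply writes into its slice of a pre-allocated output buffer. A minimal sketch for batch size 1 in NCHW layout follows; launchBranch and the channel counts are hypothetical placeholders.

```cpp
#include <cstddef>

// Run the four branches of the module so that each one writes directly into
// its channel slice of the already concatenated output (batch size 1, NCHW).
// launchBranch(...) and the channel counts are hypothetical placeholders.
void runModuleWithConcatElision(const float* input, float* concatOutput,
                                int height, int width,
                                const int branchChannels[4] /* e.g. {64, 128, 32, 32} */) {
    std::size_t offset = 0;
    for (int b = 0; b < 4; ++b) {
        float* branchOutput = concatOutput + offset;  // slice of the shared buffer
        // launchBranch(b, input, branchOutput, branchChannels[b], height, width);
        offset += static_cast<std::size_t>(branchChannels[b]) * height * width;
    }
    // No separate concat kernel or copy is needed: when the branches finish,
    // concatOutput already holds the concatenated tensor.
}
```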
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Concurrency
[Diagram: the independent branches of the module (1x1 CBR, 3x3 CBR, 5x5 CBR, max pool plus 1x1 CBR) can execute concurrently on the GPU]
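One way to realize this concurrency is to issue the independent branches on separate CUDA streams and make the next layer's stream wait on all of them. This is a minimal sketch under that assumption, with the per-branch kernel launches left as hypothetical placeholders.

```cpp
#include <cuda_runtime.h>

// Issue the four independent branches of the module on separate CUDA streams
// so the GPU can overlap them, then make the next layer's stream wait for all
// of them before it starts.
void runBranchesConcurrently() {
    const int kBranches = 4;
    cudaStream_t streams[kBranches];
    cudaEvent_t branchDone[kBranches];

    for (int b = 0; b < kBranches; ++b) {
        cudaStreamCreate(&streams[b]);
        cudaEventCreateWithFlags(&branchDone[b], cudaEventDisableTiming);
        // launchBranch(b, streams[b]);             // enqueue this branch's kernels
        cudaEventRecord(branchDone[b], streams[b]);  // mark the branch as finished
    }

    // The next layer runs on streams[0], but only after every branch completes.
    for (int b = 1; b < kBranches; ++b)
        cudaStreamWaitEvent(streams[0], branchDone[b], 0);
    // launchNextLayer(streams[0]);                  // hypothetical placeholder

    for (int b = 0; b < kBranches; ++b) {
        cudaEventDestroy(branchDone[b]);
        cudaStreamDestroy(streams[b]);
    }
}
```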
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Effective use of cuBLAS
Run GEMV instead of GEMM
- Small batch sizes degrade the N dimension; the B matrix becomes narrow
Pre-transpose weight matrices
- Allows using NN/NT GEMM, where NT > NN > TN in performance
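A minimal sketch of both points for a fully connected layer y = W * x, assuming the weight matrix is already stored on the GPU in a column-major layout so no transposed ('T') operand is needed; the function and parameter names are illustrative.

```cpp
#include <cublas_v2.h>

// Fully connected layer y = W * x for small batches. W is [outputs x inputs],
// stored column-major on the GPU (i.e. pre-transposed relative to a row-major
// framework layout) so neither operand needs a 'T' op.
void fullyConnectedForward(cublasHandle_t handle,
                           const float* W,  // [outputs x inputs], column-major
                           const float* x,  // [inputs x batch] activations
                           float* y,        // [outputs x batch] results
                           int outputs, int inputs, int batch) {
    const float alpha = 1.0f, beta = 0.0f;
    if (batch == 1) {
        // With a single input vector the GEMM degenerates, so issue a
        // matrix-vector product (GEMV) instead.
        cublasSgemv(handle, CUBLAS_OP_N, outputs, inputs,
                    &alpha, W, outputs, x, 1, &beta, y, 1);
    } else {
        // Small-batch GEMM: because the weights were laid out ahead of time,
        // this stays in the faster NN/NT variants rather than TN.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    outputs, batch, inputs,
                    &alpha, W, outputs, x, inputs,
                    &beta, y, outputs);
    }
}
```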
ACCELERATED INFERENCE ON PASCAL
Support for fast mixed precision arithmetic
- Inference products will support a new dedicated vector math instruction: a multi-element dot product with 8-bit integer inputs and a 32-bit accumulator
- 4x the rate of equivalent FP32 operations
- Full-speed FP32 processing for any layers that require higher precision
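The instruction described here matches what CUDA exposes as the __dp4a intrinsic on GPUs with compute capability 6.1, available when building with CUDA 8 or later. A minimal sketch of an int8 dot product using it, with an illustrative kernel shape:

```cpp
// 8-bit integer dot product with a 32-bit accumulator using the __dp4a
// intrinsic (requires compute capability 6.1+ and CUDA 8 or later). Each int
// in a[] and b[] packs four signed 8-bit values.
__global__ void int8DotProduct(const int* a, const int* b, int* result, int n) {
    int acc = 0;  // 32-bit accumulator, as described on the slide
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        // One instruction: four 8-bit multiplies, summed into the accumulator.
        acc = __dp4a(a[i], b[i], acc);
    }
    atomicAdd(result, acc);  // combine per-thread partial sums
}
```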
BUT WHO WILL IMPLEMENT IT?
Introducing NVIDIA GIE: GPU Inference Engine
[Diagram: GIE's components: OPTIMIZATION ENGINE, STRATEGY, EXECUTION ENGINE]
GPU INFERENCE ENGINE WORKFLOW
[Diagram: a network from DIGITS / training tools enters the OPTIMIZATION ENGINE, which produces a STRATEGY consumed by the EXECUTION ENGINE]
SUMMARY
Inference on the GPU
Tesla M4 Hyperscale Accelerator
GPUs are a great platform for inference
- Efficiency: great performance/watt
- Scalability: from 3W to 300W
GPU-based inference affords...
- ...same performance in a much tighter power envelope
- ...freeing up the CPU to do other work
Questions: mandersch@nvidia.com, or find me after the talk!
April 4-7, 2016 | Silicon Valley
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join