April 4-7, 2016 | Silicon Valley
NVIDIA GIE: HIGH-PERFORMANCE GPU INFERENCE ENGINE
Michael Andersch, 7th April 2016
WHAT IS INFERENCE, ANYWAYS?
Building a deep neural network based application:
Step 1: Use data to train the neural network - training
Step 2: Use the neural network to process unseen data - inference
INFERENCE VS TRAINING
How is inference different from training?
1. No backpropagation / static weights: enables graph optimizations, simplifies memory management
2. Tendency towards smaller batch sizes: harder to amortize weight loading and to achieve high GPU utilization
3. Reduced precision requirements: provides opportunities for bandwidth savings and accelerated arithmetic
OPTIMIZING SOFTWARE FOR INFERENCE
Extracting every bit of performance
What's running on the GPU: cuDNN optimizations
- Support for standard tensor layouts and major frameworks
- Available automatically and "for free"
How you use it: framework optimizations
- Every last bit of performance matters
- Challenging due to framework structure
- Changes to one framework don't propagate to others
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Efficient small batch convolutions
- Optimal convolution algorithm depends on convolution layer dimensions
- [Chart: Winograd speedup over GEMM-based convolution (VGG-E layers, N=1): per-layer speedups between 0.73x and 2.26x across conv 1.1 through conv 5.0]
- Meta-parameters (data layouts, texture memory) afford higher performance
- Using texture memory for convolutions: 13% inference speedup (GoogLeNet, batch size 1)
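Because the best algorithm varies per layer, cuDNN lets the caller benchmark the candidates on the actual layer shape. The following is a minimal sketch of that API usage, assuming the tensor, filter, and convolution descriptors for the layer have already been created and configured; it illustrates cuDNN's algorithm selection, not GIE's internal logic.

```cpp
#include <cudnn.h>

// Pick the fastest forward-convolution algorithm for one layer by timing the
// candidates (implicit GEMM, FFT, Winograd, ...) on the layer's actual shape.
// The descriptors are assumed to be created and configured for that layer.
cudnnConvolutionFwdAlgo_t pickConvAlgorithm(cudnnHandle_t handle,
                                            cudnnTensorDescriptor_t xDesc,
                                            cudnnFilterDescriptor_t wDesc,
                                            cudnnConvolutionDescriptor_t convDesc,
                                            cudnnTensorDescriptor_t yDesc) {
    const int kRequested = 8;
    int returned = 0;
    cudnnConvolutionFwdAlgoPerf_t perf[kRequested];
    cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                         kRequested, &returned, perf);
    return perf[0].algo;  // results come back sorted by measured time
}
```

When per-layer benchmarking is too expensive, cudnnGetConvolutionForwardAlgorithm with the CUDNN_CONVOLUTION_FWD_PREFER_FASTEST preference is the cheaper heuristic alternative.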
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Graph optimization
[Diagram: an Inception-style module: the input feeds 1x1, 3x3, and 5x5 convolution branches (with 1x1 convolutions as reductions) and a max pool branch, joined by a tensor concat]
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Graph optimization
[Diagram: the same module as a framework executes it: every convolution is followed by separate bias and ReLU nodes, ending in a concat that produces the next input]
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Vertical fusion
[Diagram: each convolution, bias, ReLU chain is fused into a single CBR node (1x1 CBR, 3x3 CBR, 5x5 CBR); the max pool and concats remain]
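As a rough illustration of what vertical fusion does to the graph, here is a toy sketch. The Node structure, the op names, and the pass itself are illustrative assumptions rather than GIE's internal representation; the point is that each convolution, bias, ReLU chain becomes one node that can run as a single fused kernel without writing intermediates to memory.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy graph node; the fields and op names here are illustrative only.
struct Node {
    std::string op;              // "conv", "bias", "relu", "cbr", "dead", ...
    std::vector<Node*> inputs;   // producers of this node's inputs
    std::vector<Node*> outputs;  // consumers of this node's output
};

// Collapse every conv -> bias -> relu chain into one "cbr" node so the chain
// can execute as a single fused kernel with no intermediate tensors in DRAM.
void fuseVertically(std::vector<std::unique_ptr<Node>>& graph) {
    for (auto& n : graph) {
        if (n->op != "conv" || n->outputs.size() != 1) continue;
        Node* bias = n->outputs[0];
        if (bias->op != "bias" || bias->outputs.size() != 1) continue;
        Node* relu = bias->outputs[0];
        if (relu->op != "relu") continue;

        n->op = "cbr";               // conv absorbs the bias and ReLU
        n->outputs = relu->outputs;  // rewire the fused node to relu's consumers
        for (Node* consumer : n->outputs)
            for (Node*& in : consumer->inputs)
                if (in == relu) in = n.get();
        bias->op = "dead";           // absorbed nodes, removed in a later pass
        relu->op = "dead";
    }
}
```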
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Horizontal fusion
[Diagram: the 1x1 CBR nodes that read the same input are merged into a single, wider 1x1 CBR node feeding the 3x3 and 5x5 branches]
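Horizontal fusion can be pictured as concatenating the filter banks of layers that read the same input, so several small 1x1 convolutions become one wider one. The sketch below is a simplified host-side illustration under an assumed [K x C] filter layout for 1x1 kernels; it is not GIE's implementation.

```cpp
#include <vector>

// Merge several 1x1 convolutions that read the same input tensor into a single
// convolution whose output channels are the concatenation of the originals.
// Assumed (illustrative) layout: each filter bank is K_i * C floats, where
// K_i = output channels of branch i and C = shared input channel count.
std::vector<float> fuseHorizontally(const std::vector<std::vector<float>>& filterBanks) {
    std::vector<float> merged;
    for (const auto& bank : filterBanks)
        merged.insert(merged.end(), bank.begin(), bank.end());
    // One convolution over the shared input now does the work of several small
    // ones; each branch's output is a channel slice of the merged result.
    return merged;
}
```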
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Concat elision
[Diagram: the concat nodes are removed; each branch writes its output directly into the appropriate region of the already concatenated buffer]
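Concat elision means the concatenation never happens as a copy: each branch simply writes into its slice of a pre-allocated output buffer. A minimal sketch for batch size 1 in NCHW layout follows; launchBranch and the channel counts are hypothetical placeholders.

```cpp
#include <cstddef>

// Run the four branches of the module so that each one writes directly into
// its channel slice of the already concatenated output (batch size 1, NCHW).
// launchBranch(...) and the channel counts are hypothetical placeholders.
void runModuleWithConcatElision(const float* input, float* concatOutput,
                                int height, int width,
                                const int branchChannels[4] /* e.g. {64, 128, 32, 32} */) {
    std::size_t offset = 0;
    for (int b = 0; b < 4; ++b) {
        float* branchOutput = concatOutput + offset;  // slice of the shared buffer
        // launchBranch(b, input, branchOutput, branchChannels[b], height, width);
        offset += static_cast<std::size_t>(branchChannels[b]) * height * width;
    }
    // No separate concat kernel or copy is needed: when the branches finish,
    // concatOutput already holds the concatenated tensor.
}
```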
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Concurrency
[Diagram: the independent branches of the module (1x1 CBR, 3x3 CBR, 5x5 CBR, max pool plus 1x1 CBR) can execute concurrently on the GPU]
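One way to realize this concurrency is to issue the independent branches on separate CUDA streams and make the next layer's stream wait on all of them. This is a minimal sketch under that assumption, with the per-branch kernel launches left as hypothetical placeholders.

```cpp
#include <cuda_runtime.h>

// Issue the four independent branches of the module on separate CUDA streams
// so the GPU can overlap them, then make the next layer's stream wait for all
// of them before it starts.
void runBranchesConcurrently() {
    const int kBranches = 4;
    cudaStream_t streams[kBranches];
    cudaEvent_t branchDone[kBranches];

    for (int b = 0; b < kBranches; ++b) {
        cudaStreamCreate(&streams[b]);
        cudaEventCreateWithFlags(&branchDone[b], cudaEventDisableTiming);
        // launchBranch(b, streams[b]);             // enqueue this branch's kernels
        cudaEventRecord(branchDone[b], streams[b]);  // mark the branch as finished
    }

    // The next layer runs on streams[0], but only after every branch completes.
    for (int b = 1; b < kBranches; ++b)
        cudaStreamWaitEvent(streams[0], branchDone[b], 0);
    // launchNextLayer(streams[0]);                  // hypothetical placeholder

    for (int b = 0; b < kBranches; ++b) {
        cudaEventDestroy(branchDone[b]);
        cudaStreamDestroy(streams[b]);
    }
}
```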
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Effective use of cuBLAS
Run GEMV instead of GEMM
- Small batch sizes degrade the N dimension; the B matrix becomes narrow
Pre-transpose weight matrices
- Allows using NN/NT GEMM, where NT > NN > TN in performance
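A minimal sketch of both points for a fully connected layer y = W * x, assuming the weight matrix is already stored on the GPU in a column-major layout so no transposed ('T') operand is needed; the function and parameter names are illustrative.

```cpp
#include <cublas_v2.h>

// Fully connected layer y = W * x for small batches. W is [outputs x inputs],
// stored column-major on the GPU (i.e. pre-transposed relative to a row-major
// framework layout) so neither operand needs a 'T' op.
void fullyConnectedForward(cublasHandle_t handle,
                           const float* W,  // [outputs x inputs], column-major
                           const float* x,  // [inputs x batch] activations
                           float* y,        // [outputs x batch] results
                           int outputs, int inputs, int batch) {
    const float alpha = 1.0f, beta = 0.0f;
    if (batch == 1) {
        // With a single input vector the GEMM degenerates, so issue a
        // matrix-vector product (GEMV) instead.
        cublasSgemv(handle, CUBLAS_OP_N, outputs, inputs,
                    &alpha, W, outputs, x, 1, &beta, y, 1);
    } else {
        // Small-batch GEMM: because the weights were laid out ahead of time,
        // this stays in the faster NN/NT variants rather than TN.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    outputs, batch, inputs,
                    &alpha, W, outputs, x, inputs,
                    &beta, y, outputs);
    }
}
```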
ACCELERATED INFERENCE ON PASCAL
Support for fast mixed precision arithmetic
- Inference products will support a new dedicated vector math instruction: a multi-element dot product with 8-bit integer inputs and a 32-bit accumulator
- 4x the rate of equivalent FP32 operations
- Full-speed FP32 processing for any layers that require higher precision
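The instruction described here matches what CUDA exposes as the __dp4a intrinsic on GPUs with compute capability 6.1, available when building with CUDA 8 or later. A minimal sketch of an int8 dot product using it, with an illustrative kernel shape:

```cpp
// 8-bit integer dot product with a 32-bit accumulator using the __dp4a
// intrinsic (requires compute capability 6.1+ and CUDA 8 or later). Each int
// in a[] and b[] packs four signed 8-bit values.
__global__ void int8DotProduct(const int* a, const int* b, int* result, int n) {
    int acc = 0;  // 32-bit accumulator, as described on the slide
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        // One instruction: four 8-bit multiplies, summed into the accumulator.
        acc = __dp4a(a[i], b[i], acc);
    }
    atomicAdd(result, acc);  // combine per-thread partial sums
}
```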
BUT WHO WILL IMPLEMENT IT?
Introducing NVIDIA GIE: GPU Inference Engine
[Diagram: GIE's components: OPTIMIZATION ENGINE, STRATEGY, EXECUTION ENGINE]
GPU INFERENCE ENGINE WORKFLOW
[Diagram: a network from DIGITS / training tools enters the OPTIMIZATION ENGINE, which produces a STRATEGY consumed by the EXECUTION ENGINE]
SUMMARY
Inference on the GPU
Tesla M4 Hyperscale Accelerator
GPUs are a great platform for inference
- Efficiency: great performance/watt
- Scalability: from 3W to 300W
GPU-based inference affords...
- ...same performance in a much tighter power envelope
- ...freeing up the CPU to do other work
Questions: mandersch@nvidia.com, or find me after the talk!
April 4-7, 2016 | Silicon Valley
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join