  1. BOOZ ALLEN HAMILTON GPU INFERENCE IN THE DATACENTER Drew Farris, Chief Technologist @ Booz | Allen | Hamilton Nvidia GPU Technology Conference, Washington DC NOVEMBER 2017 Eglin AFB, FL

  2. MICROPROCESSORS NO LONGER SCALE AT THE LEVEL OF PERFORMANCE THEY USED TO — THE END OF WHAT YOU WOULD CALL MOORE’S LAW, SEMICONDUCTOR PHYSICS PREVENTS US FROM TAKING DENNARD SCALING ANY FURTHER. - Jen-Hsun Huang, CEO, NVIDIA 1 Booz Allen Hamilton

  3. INTRODUCTION: THE DAYS OF EASY PERFORMANCE GAINS ARE GONE. We need alternatives to general-purpose CPU computation. GRAPHICS PROCESSING UNITS PROVIDE AN ALTERNATIVE - Algebraic Strengths of GPUs - Enable New Algorithms - Adaptable to a Variety of Tasks - Readily Available Libraries (CUDA, CV/DL Frameworks) HOW CAN WE LEVERAGE OUR EXISTING INVESTMENTS? - HPC vs. Commodity Hardware in the Datacenter - Evolution vs. Revolution - Scaling Out vs. Scaling Up

  4. THE REALITY: WE HAVE A PROBLEM. How do we apply complex algorithms as part of our ingest process? - Computationally Expensive Algorithms - Heterogeneous Dataflow - Horizontal / Linear Scalability at the Datacenter Level How do we accommodate this within our existing compute fabric? - Hadoop Clusters: HDFS, YARN, etc. - Commodity Nodes - Small Numbers of Special-Purpose Nodes - 10G Interconnects - Cost, Power, Space and Cooling NO SUPERCOMPUTERS, NO MODEL TRAINING - Power-Efficient GPUs (50-75W) - We could focus on model inference o Application of Models to New Data, e.g. Classification

  5. DATACENTER ARCHITECTURE
  - Single Node: 128G RAM, 12x Drives, 24 Cores / 48 Hyperthreads, PCI Express Slot?
  - Single Rack: 40 Nodes, 10G ToR Switch, 15-16 kW
  - Datacenter: Many Racks, Interconnect, Rows?
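The 15-16 kW rack figure is consistent with simple arithmetic; a quick check, assuming roughly 375 W of draw per node (an assumption for illustration, not a number from the talk):

```java
// Rough rack power budget check: 40 commodity nodes per rack.
// The ~375 W per-node figure is an assumed value, not from the slides.
public class RackBudget {
    // total rack draw in kilowatts for n nodes at wattsPerNode each
    public static double rackKw(int nodes, double wattsPerNode) {
        return nodes * wattsPerNode / 1000.0;
    }
}
```

At 375 W per node, 40 nodes land exactly at the low end of the quoted 15-16 kW rack budget, which is why a 50-75 W inference card per node is a meaningful fraction of the remaining headroom.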

  6. DATA EXTRACTION PIPELINE: DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS. As part of the data ingest pipeline, this system must extract and analyze data in a wide variety of formats and perform normalization to prepare it for indexing. - Unpacking and Uncompressing Archives o Zip, Tar, etc. - Converting Binary Formats to Text o Word, PDF - Extracting Metadata o EXIF Data from Images - Classifying Images / Segmenting Images / Detecting Objects o Search Images Using Text - Optical Character Recognition o Extract Text from Scanned Documents - Detecting Malware o Executables, PDFs, RTF The heterogeneous nature of this data was a problem: complex data and analysis would disrupt latency across all datatypes.
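The identify-then-transform step above amounts to type-based dispatch; a minimal stdlib sketch, with illustrative type names and handlers rather than anything from the production pipeline:

```java
import java.util.Map;
import java.util.function.Function;

// Minimal sketch of the per-document dispatch step in an extraction
// pipeline: route each item to a handler keyed by its detected type.
// The type strings and handler bodies here are stand-ins.
public class ExtractDispatch {
    private final Map<String, Function<byte[], String>> handlers;

    public ExtractDispatch(Map<String, Function<byte[], String>> handlers) {
        this.handlers = handlers;
    }

    // apply the matching handler, or pass the content through as text
    public String process(String type, byte[] content) {
        return handlers.getOrDefault(type, b -> new String(b)).apply(content);
    }
}
```

A real pipeline would detect the type from content (magic bytes, container inspection) and chain handlers, e.g. unpack a Zip and re-dispatch each entry; the dispatch shape stays the same.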

  7. GPU-ACCELERATED DATA EXTRACTION PIPELINE: DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS. Some of these tasks are straightforward to accelerate using GPUs, so we decided to start with the following: - Unpacking and Uncompressing Archives o Zip, Tar, etc. - Converting Binary Formats to Text o Word, PDF - Extracting Metadata o EXIF Data from Images - Classifying Images / Segmenting Images / Detecting Objects o Search Images Using Text - Optical Character Recognition o Extract Text from Scanned Documents - Detecting Malware o Executables, PDFs, RTF

  8. DATA EXTRACTION REQUIREMENTS: CPU, MEMORY AND THROUGHPUT. In order to scale linearly as we add more resources, our system must have the following characteristics: - Shared-Nothing vs. Share-Little - Stateless vs. Minimally Stateful - CPU-Bound (No Disk IO, Network IO, Bus or Memory Bottlenecks) - Uniformly Fast: Individual Document Processing in 10-100ms - RAM-Frugal: No Large Models in Memory - Something That Plays Well with Java
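The shared-nothing, stateless requirement can be sketched with a plain fixed thread pool in which each per-document task touches no shared state and does no IO, so throughput grows roughly linearly with threads and nodes. The per-document "work" below is a stand-in:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the share-nothing worker model: each task reads only its own
// document and writes only its own result, so tasks never contend.
public class StatelessWorkers {
    public static List<Integer> run(List<String> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String d : docs)
                futures.add(pool.submit(() -> d.length())); // no shared state, no IO
            List<Integer> out = new ArrayList<>();
            for (Future<Integer> f : futures)
                out.add(f.get());
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The same property is what makes the 10-100 ms per-document target meaningful: with no shared bottleneck, latency stays uniform as concurrency rises until the CPU itself saturates.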

  9. INTEGRATION OPTIONS: JAVA VM. What plays well with Java? Or: how do we get the JVM to talk to the CUDA libraries? - Pure Java o No GPU Acceleration - Java Native Interface (or Some Derivative) o Hand-Wrapped API Calls (JNI, JNA) o JavaCPP (Java and Native C++ Bridge) o Deeplearning4j cuDNN Integration (as of 0.9.1) o TensorFlow Java API - External Processes o Forked Executable o Shared Memory o Sockets (TCP, UDP, Raw, etc.) [Diagram: a single node where JVM heap threads reach native memory through a Java library, a JNI-wrapped library, or a forked executable, all sharing the CPU and local storage]
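Of the options above, the forked-executable route is the simplest to sketch with the stdlib alone: fork a process, feed it the work item, read the result from stdout. Here `echo` stands in for a real native analytic binary:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

// Sketch of the "external process" integration option: the JVM forks an
// executable and exchanges data over stdin/stdout rather than via JNI.
// "echo" is a stand-in for an actual native analytic.
public class ForkedExe {
    public static String run(String... cmd) {
        try {
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (InputStream in = p.getInputStream()) {
                in.transferTo(buf); // drain child stdout
            }
            p.waitFor();
            return buf.toString().trim();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The trade-off versus JNI is process isolation (a native crash cannot take down the JVM) against per-call fork and serialization overhead, which is exactly what pushes the design toward a long-lived external service later in the deck.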

  10. NOTIONAL INTEGRATION: JAVA VM. What do we want to be able to do? - Multiple Library or Framework Support o Caffe / TensorFlow / Torch / Others - CUDA-Accelerated OpenCV - TensorRT - Other CUDA Libraries [Diagram: the single-node view extended with a GPU; JVM threads reach Caffe, CUDA-accelerated OpenCV, TensorRT and other CUDA libraries on the GPU through a Java library, JNI library, forked executable, or wrapped library]

  11. SOLUTION. So, what components make up the solution?

  12. NVIDIA TESLA P4 INFERENCE ACCELERATOR: "ULTRA-EFFICIENT DEEP LEARNING IN SCALE-OUT SERVERS" - NVIDIA Pascal Architecture - 5.5 TeraFLOPS Single-Precision (FP32) Performance - 22 Tera-Operations Per Second INT8 Performance - 8 GB GPU Memory - 192 GB/s GPU Memory Bandwidth - Low-Profile PCI Express - 50W/75W Max Power - http://www.nvidia.com/object/accelerate-inference.html

  13. CAFFE: IMAGE CLASSIFICATION WITH ALEXNET USING CAFFE. We used CaffeNet, a pre-trained AlexNet model based on the ILSVRC 2012 dataset. o A Good Stand-In for More Complex Image Models o Evaluated Both CPU-Only and GPU Variants of Caffe to Characterize the Performance Difference o One Image Per Batch o Lightly Modified to Properly Handle Multithreading and CUDA Streams
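For classification, the last step of inference is just an argmax over the network's output scores; a stdlib sketch with a made-up score vector rather than actual CaffeNet output:

```java
// Classification inference ends with picking the class whose output
// score is highest. The scores below would come from the network's
// final (softmax) layer; here they are invented for illustration.
public class TopOne {
    public static int argmax(float[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++)
            if (scores[i] > scores[best]) best = i;
        return best; // index into the ILSVRC 1000-class label list
    }
}
```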

  14. OPENCV: CUDA-ACCELERATED COMPUTER VISION LIBRARY. Images were resized using GPU resources instead of CPU resources; as a result, it is not necessary to copy the resized image data to the input layer. - AlexNet Input Layer Size Is 224px x 224px - Produces a GpuMat Object for Image Data Allocated from GPU Memory - GpuMat Wrapped for Use as the Network's Input Layer, Avoiding an Extra Copy - Custom GpuMat Allocators Introduced in OpenCV 3.2.0
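The zero-copy GpuMat wiring itself lives in C++/CUDA, but the arithmetic of the resize step is easy to sketch on the CPU: every coordinate of the fixed 224x224 input is mapped back to a source pixel. Nearest-neighbor sampling here is a simplification (OpenCV's resize defaults to bilinear interpolation):

```java
// Sketch of the resize-to-network-input step: map each output pixel of
// an out x out grid back to a source pixel by integer scaling.
// Nearest-neighbor is used for brevity; real pipelines interpolate.
public class ResizeToInput {
    public static int[] resize(int[] src, int w, int h, int out) {
        int[] dst = new int[out * out];
        for (int y = 0; y < out; y++)
            for (int x = 0; x < out; x++)
                dst[y * out + x] = src[(y * h / out) * w + (x * w / out)];
        return dst;
    }
}
```

Doing this on the GPU matters less for the arithmetic than for data movement: when the resized result is already a GpuMat in device memory, wrapping it as the input layer removes a host-device copy per image.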

  15. NVIDIA TENSORRT: HIGH-PERFORMANCE DEEP LEARNING INFERENCE OPTIMIZER. TensorRT can load and optimize Caffe or TensorFlow models for high-performance inference. In this case, we used it to host the same Caffe model used for the image classification task. - FP32 to INT8 While Minimizing Accuracy Loss - Better GPU Utilization - Kernel Autotuning - Improved Memory Footprint - Multi-Stream Execution - Used the Unchanged Caffe Model for Image Classification
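TensorRT's INT8 calibration internals aren't spelled out here, but the symmetric quantization it builds on can be sketched: FP32 values are mapped into [-127, 127] with a single scale derived from the observed dynamic range. The single-scale-per-tensor scheme below is a simplification for illustration:

```java
// Sketch of symmetric INT8 quantization: values in [-scale, scale] map
// linearly onto [-127, 127]. TensorRT chooses scales per tensor via
// calibration; the clamping and rounding shape is the same idea.
public class Int8Quant {
    public static byte quantize(float v, float scale) {
        int q = Math.round(v / scale * 127f);
        return (byte) Math.max(-127, Math.min(127, q));
    }

    public static float dequantize(byte q, float scale) {
        return q * scale / 127f;
    }
}
```

This is where the P4's 22 TOPS INT8 figure (versus 5.5 TFLOPS FP32) comes into play: a 4x arithmetic throughput gain, bought with the small, calibrated rounding error the quantizer introduces.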

  16. PYTORCH: MALCONV, MALWARE DETECTION WITH DEEP LEARNING. A convolutional neural network digests entire binaries for malware identification. - A Custom Malware Identification Model - The Current Ingest Framework Leverages an Unaccelerated Predecessor o Can't Use MalConv Because It's Too Computationally Intense - Integration with PyTorch Will Require Some Work o No Great Inference Layer Available for PyTorch o Model Translation with ONNX to Caffe2 (or Other?) o Currently a Work in Progress - How Do the Ergonomics Differ from the Image Classification Task?

  17. NVIDIA GPU REST ENGINE: DEEP LEARNING INFERENCE VIA REST. The GRE provided memory and process isolation and native libraries for hardware access. - Multi-Threaded HTTP Server in Golang - RESTful Interface - Multithreaded Caffe - TensorRT, NVIDIA's Inference Engine - CUDA-Accelerated OpenCV - Containerized in Docker - Framework for Other Inference Engines - https://developer.nvidia.com/gre - https://github.com/NVIDIA/gpu-rest-engine

  18. NVIDIA DOCKER: SIMPLIFIED PACKAGING AND DEPLOYMENT VIA CONTAINERS. Packaging is performed in one environment and rapidly deployed to a large number of nodes. - Docker Image Building / Testing on Amazon Elastic Compute Cloud - Test Environment on an Isolated Network - Install Docker, CUDA Libraries / Drivers and NVIDIA Docker, and Go - Portability Across CentOS 7 Nodes - Supported Laptop Development of New Analytics - Worked Out of the Box for Caffe Models in Caffe / TensorRT - https://github.com/NVIDIA/nvidia-docker

  19. INSTRUMENTATION. We collected telemetry during evaluation with a suite of components we use for tracking system performance on production systems. - CollectD / StatsD API with Various Plugins for CPU, Disk, Memory, IO - nvidia-smi for GPU Information - Timely for Metric Storage and Analysis - Grafana for Visualization, Dashboarding and Analysis - NVIDIA Data Center GPU Manager o Active Health Monitoring, Early Fault Detection (SMART for GPUs) o Power Management o Configuration & Reporting
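StatsD's wire format is a tiny plaintext-over-UDP protocol ("name:value|type"), so JVM-side timers can be emitted with nothing but a datagram socket; the metric name below is illustrative:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Minimal StatsD emitter: format a timer metric in StatsD's plaintext
// protocol and fire it over UDP. Fire-and-forget, so instrumentation
// never blocks the ingest path.
public class StatsdEmit {
    // "name:value|ms" is StatsD's timer format
    public static String format(String name, long millis) {
        return name + ":" + millis + "|ms";
    }

    public static void send(String host, int port, String metric) throws Exception {
        byte[] data = metric.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket sock = new DatagramSocket()) {
            sock.send(new DatagramPacket(data, data.length,
                    InetAddress.getByName(host), port));
        }
    }
}
```

UDP's lossy fire-and-forget semantics are the point: per-document timing at ingest rates must never apply backpressure to the pipeline it is measuring.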

  20. FINAL INTEGRATION: JAVA VM. - REST Calls from Java to the GRE - Golang Coordinator o Copy Image to GpuMat o Resize GpuMat in OpenCV o Resized GpuMat Becomes the Input Layer o Calls to the Framework for Inference - Caffe Reference Model Hosted in Caffe or TensorRT [Diagram: JVM threads on a single node make HTTP calls to the GPU REST Engine running under NVIDIA Docker, which drives the OpenCV, Caffe and TensorRT libraries on the P4 GPU]
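The JVM side of this integration can be sketched end to end with a local stub standing in for the containerized engine; the endpoint path and response body below are stand-ins, not GRE's actual API:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Shape of the final integration: the JVM POSTs image bytes to an HTTP
// classification endpoint and reads back a label. The /api/classify
// path and "tabby cat" response are illustrative stand-ins.
public class GreClientSketch {
    public static String classify(int port, byte[] image) throws Exception {
        HttpRequest req = HttpRequest
                .newBuilder(URI.create("http://localhost:" + port + "/api/classify"))
                .POST(HttpRequest.BodyPublishers.ofByteArray(image))
                .build();
        return HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    // in-process stub standing in for the containerized GRE on this node
    public static HttpServer stubServer(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/api/classify", ex -> {
            byte[] body = "tabby cat".getBytes();
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream os = ex.getResponseBody()) { os.write(body); }
        });
        server.start();
        return server;
    }
}
```

Crossing a socket instead of JNI costs an HTTP round trip per image, but buys exactly what the deck asked for: process isolation, a language-neutral boundary, and a server side free to use whichever native framework wins.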

  21. EXPERIMENTS AND RESULTS. What did we evaluate and observe?

  22. BASELINE CONCURRENCY TESTS WITH CAFFE CPU. What effect does concurrency have on the ability to classify images? How quickly can we classify images using only the CPU? We processed 9000 images through the ETL framework, GRE and CPU-only Caffe.
  Java Thread Count            | 10     | 24     | 32
  Total Elapsed Time (s)       | 271.65 | 175.59 | 416
  Minimum Processing Time (ms) | 239    | 465.8  | 619.2
  Mean (ms)                    | 300    | 100.49 | 149.87
  Max (ms)                     | 483    | 880    | 1066
  CPU Max User (%)             | 83.0   | 99.8   | 100.00
  GPU Max Utilization (%)      | 0      | 0      | 0
  [Chart: distribution of per-image processing time in milliseconds for 10, 24 and 32 threads]
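The elapsed-time figures convert directly to throughput: 9000 images divided by total elapsed seconds gives images per second for each thread count. A one-liner to make the conversion explicit:

```java
// Throughput implied by the baseline run: total images over total
// elapsed wall-clock seconds.
public class Throughput {
    public static double imagesPerSecond(int images, double elapsedSeconds) {
        return images / elapsedSeconds;
    }
}
```

For example, 9000 images in 271.65 s is roughly 33 images per second, the CPU-only baseline that the GPU-backed configurations are measured against.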
