  1. BOOZ ALLEN HAMILTON GPU INFERENCE IN THE DATACENTER Drew Farris, Chief Technologist @ Booz | Allen | Hamilton Nvidia GPU Technology Conference, Washington DC NOVEMBER 2017 Eglin AFB, FL

  2. MICROPROCESSORS NO LONGER SCALE AT THE LEVEL OF PERFORMANCE THEY USED TO — THE END OF WHAT YOU WOULD CALL MOORE’S LAW, SEMICONDUCTOR PHYSICS PREVENTS US FROM TAKING DENNARD SCALING ANY FURTHER. - Jen-Hsun Huang, CEO, NVIDIA 1 Booz Allen Hamilton

  3. INTRODUCTION: THE DAYS OF EASY PERFORMANCE GAINS ARE GONE. We need alternatives to general-purpose CPU computation. GRAPHICS PROCESSING UNITS PROVIDE AN ALTERNATIVE - Algebraic Strengths of GPUs - Enable New Algorithms - Adaptable to a Variety of Tasks - Readily Available Libraries (CUDA, CV/DL Frameworks) HOW CAN WE LEVERAGE OUR EXISTING INVESTMENTS? - HPC vs. Commodity Hardware in the Datacenter - Evolution vs. Revolution - Scaling Out vs. Scaling Up

  4. THE REALITY: WE HAVE A PROBLEM. How do we apply complex algorithms as part of our ingest process? - Computationally Expensive Algorithms - Heterogeneous Dataflow - Horizontal / Linear Scalability at the Datacenter Level How do we accommodate this within our existing compute fabric? - Hadoop Clusters: HDFS, YARN, etc. - Commodity Nodes - Small Numbers of Special-Purpose Nodes - 10G Interconnects - Cost, Power, Space and Cooling NO SUPERCOMPUTERS, NO MODEL TRAINING - Power-Efficient GPUs (50-75W) - We could focus on model inference o Application of Models to New Data, e.g. Classification

  5. DATACENTER ARCHITECTURE
  - Single Node: 128G RAM, 12x Drives, 24 Cores / 48 Hyperthreads, PCI Express Slot?
  - Single Rack: 40 Nodes, 10G ToR Switch, 15-16 kW
  - Datacenter: Many Racks, Interconnect, Rows?
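The 15-16 kW rack figure is consistent with simple arithmetic; a quick check, assuming roughly 375 W of draw per node (an assumption for illustration, not a number from the talk):

```java
// Rough rack power budget check: 40 commodity nodes per rack.
// The ~375 W per-node figure is an assumed value, not from the slides.
public class RackBudget {
    // total rack draw in kilowatts for n nodes at wattsPerNode each
    public static double rackKw(int nodes, double wattsPerNode) {
        return nodes * wattsPerNode / 1000.0;
    }
}
```

At 375 W per node, 40 nodes land exactly at the low end of the quoted 15-16 kW rack budget, which is why a 50-75 W inference card per node is a meaningful fraction of the remaining headroom.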

  6. DATA EXTRACTION PIPELINE: DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS. As part of the data ingest pipeline, this system must extract and analyze data in a wide variety of formats and perform normalization to prepare it for indexing. - Unpacking and Uncompressing Archives o Zip, Tar, etc. - Converting Binary Formats to Text o Word, PDF - Extracting Metadata o EXIF Data from Images - Classifying Images / Segmenting Images / Detecting Objects o Search Images Using Text - Optical Character Recognition o Extract Text from Scanned Documents - Detecting Malware o Executables, PDFs, RTF The heterogeneous nature of this data was a problem: complex data and analysis would disrupt latency across all datatypes.
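The identify-then-transform step above amounts to type-based dispatch; a minimal stdlib sketch, with illustrative type names and handlers rather than anything from the production pipeline:

```java
import java.util.Map;
import java.util.function.Function;

// Minimal sketch of the per-document dispatch step in an extraction
// pipeline: route each item to a handler keyed by its detected type.
// The type strings and handler bodies here are stand-ins.
public class ExtractDispatch {
    private final Map<String, Function<byte[], String>> handlers;

    public ExtractDispatch(Map<String, Function<byte[], String>> handlers) {
        this.handlers = handlers;
    }

    // apply the matching handler, or pass the content through as text
    public String process(String type, byte[] content) {
        return handlers.getOrDefault(type, b -> new String(b)).apply(content);
    }
}
```

A real pipeline would detect the type from content (magic bytes, container inspection) and chain handlers, e.g. unpack a Zip and re-dispatch each entry; the dispatch shape stays the same.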

  7. GPU-ACCELERATED DATA EXTRACTION PIPELINE: DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS. Some of these tasks are straightforward to accelerate using GPUs, so we decided to start with the following: - Unpacking and Uncompressing Archives o Zip, Tar, etc. - Converting Binary Formats to Text o Word, PDF - Extracting Metadata o EXIF Data from Images - Classifying Images / Segmenting Images / Detecting Objects o Search Images Using Text - Optical Character Recognition o Extract Text from Scanned Documents - Detecting Malware o Executables, PDFs, RTF

  8. DATA EXTRACTION REQUIREMENTS: CPU, MEMORY AND THROUGHPUT. In order to scale linearly as we add more resources, our system must have the following characteristics: - Shared-Nothing vs. Share-Little - Stateless vs. Minimally Stateful - CPU-Bound (No Disk IO, Network IO, Bus or Memory Bottlenecks) - Uniformly Fast: Individual Document Processing in 10-100ms - RAM-Frugal: No Large Models in Memory - Something That Plays Well with Java
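The shared-nothing, stateless requirement can be sketched with a plain fixed thread pool in which each per-document task touches no shared state and does no IO, so throughput grows roughly linearly with threads and nodes. The per-document "work" below is a stand-in:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the share-nothing worker model: each task reads only its own
// document and writes only its own result, so tasks never contend.
public class StatelessWorkers {
    public static List<Integer> run(List<String> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String d : docs)
                futures.add(pool.submit(() -> d.length())); // no shared state, no IO
            List<Integer> out = new ArrayList<>();
            for (Future<Integer> f : futures)
                out.add(f.get());
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The same property is what makes the 10-100 ms per-document target meaningful: with no shared bottleneck, latency stays uniform as concurrency rises until the CPU itself saturates.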

  9. INTEGRATION OPTIONS: JAVA VM. What plays well with Java? Or: how do we get the JVM to talk to the CUDA libraries? - Pure Java o No GPU Acceleration - Java Native Interface (or Some Derivative) o Hand-Wrapped API Calls (JNI, JNA) o JavaCPP (Java and Native C++ Bridge) o Deeplearning4j cuDNN Integration (as of 0.9.1) o TensorFlow Java API - External Processes o Forked Executable o Shared Memory o Sockets (TCP, UDP, Raw, etc.) [Diagram: a single node where JVM heap threads reach native memory through a Java library, a JNI-wrapped library, or a forked executable, all sharing the CPU and local storage]
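Of the options above, the forked-executable route is the simplest to sketch with the stdlib alone: fork a process, feed it the work item, read the result from stdout. Here `echo` stands in for a real native analytic binary:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

// Sketch of the "external process" integration option: the JVM forks an
// executable and exchanges data over stdin/stdout rather than via JNI.
// "echo" is a stand-in for an actual native analytic.
public class ForkedExe {
    public static String run(String... cmd) {
        try {
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (InputStream in = p.getInputStream()) {
                in.transferTo(buf); // drain child stdout
            }
            p.waitFor();
            return buf.toString().trim();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The trade-off versus JNI is process isolation (a native crash cannot take down the JVM) against per-call fork and serialization overhead, which is exactly what pushes the design toward a long-lived external service later in the deck.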

  10. NOTIONAL INTEGRATION: JAVA VM. What do we want to be able to do? - Multiple Library or Framework Support o Caffe / TensorFlow / Torch / Others - CUDA-Accelerated OpenCV - TensorRT - Other CUDA Libraries [Diagram: the single-node view extended with a GPU; JVM threads reach Caffe, CUDA-accelerated OpenCV, TensorRT and other CUDA libraries on the GPU through a Java library, JNI library, forked executable, or wrapped library]

  11. SOLUTION. So, what components make up the solution?

  12. NVIDIA TESLA P4 INFERENCE ACCELERATOR: "ULTRA-EFFICIENT DEEP LEARNING IN SCALE-OUT SERVERS" - NVIDIA Pascal Architecture - 5.5 TeraFLOPS Single-Precision (FP32) Performance - 22 Tera-Operations Per Second INT8 Performance - 8 GB GPU Memory - 192 GB/s GPU Memory Bandwidth - Low-Profile PCI Express - 50W/75W Max Power - http://www.nvidia.com/object/accelerate-inference.html

  13. CAFFE: IMAGE CLASSIFICATION WITH ALEXNET USING CAFFE. We used CaffeNet, a pre-trained AlexNet model based on the ILSVRC 2012 dataset. o A Good Stand-In for More Complex Image Models o Evaluated Both CPU-Only and GPU Variants of Caffe to Characterize the Performance Difference o One Image Per Batch o Lightly Modified to Properly Handle Multithreading and CUDA Streams
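For classification, the last step of inference is just an argmax over the network's output scores; a stdlib sketch with a made-up score vector rather than actual CaffeNet output:

```java
// Classification inference ends with picking the class whose output
// score is highest. The scores below would come from the network's
// final (softmax) layer; here they are invented for illustration.
public class TopOne {
    public static int argmax(float[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++)
            if (scores[i] > scores[best]) best = i;
        return best; // index into the ILSVRC 1000-class label list
    }
}
```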

  14. OPENCV: CUDA-ACCELERATED COMPUTER VISION LIBRARY. Images were resized using GPU resources instead of CPU resources; as a result, it is not necessary to copy the resized image data to the input layer. - AlexNet Input Layer Size Is 224px x 224px - Produces a GpuMat Object for Image Data Allocated from GPU Memory - GpuMat Wrapped for Use as the Network's Input Layer, Avoiding an Extra Copy - Custom GpuMat Allocators Introduced in OpenCV 3.2.0
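The zero-copy GpuMat wiring itself lives in C++/CUDA, but the arithmetic of the resize step is easy to sketch on the CPU: every coordinate of the fixed 224x224 input is mapped back to a source pixel. Nearest-neighbor sampling here is a simplification (OpenCV's resize defaults to bilinear interpolation):

```java
// Sketch of the resize-to-network-input step: map each output pixel of
// an out x out grid back to a source pixel by integer scaling.
// Nearest-neighbor is used for brevity; real pipelines interpolate.
public class ResizeToInput {
    public static int[] resize(int[] src, int w, int h, int out) {
        int[] dst = new int[out * out];
        for (int y = 0; y < out; y++)
            for (int x = 0; x < out; x++)
                dst[y * out + x] = src[(y * h / out) * w + (x * w / out)];
        return dst;
    }
}
```

Doing this on the GPU matters less for the arithmetic than for data movement: when the resized result is already a GpuMat in device memory, wrapping it as the input layer removes a host-device copy per image.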

  15. NVIDIA TENSORRT: HIGH-PERFORMANCE DEEP LEARNING INFERENCE OPTIMIZER. TensorRT can load and optimize Caffe or TensorFlow models for high-performance inference. In this case, we used it to host the same Caffe model used for the image classification task. - FP32 to INT8 While Minimizing Accuracy Loss - Better GPU Utilization - Kernel Autotuning - Improved Memory Footprint - Multi-Stream Execution - Used the Unchanged Caffe Model for Image Classification
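TensorRT's INT8 calibration internals aren't spelled out here, but the symmetric quantization it builds on can be sketched: FP32 values are mapped into [-127, 127] with a single scale derived from the observed dynamic range. The single-scale-per-tensor scheme below is a simplification for illustration:

```java
// Sketch of symmetric INT8 quantization: values in [-scale, scale] map
// linearly onto [-127, 127]. TensorRT chooses scales per tensor via
// calibration; the clamping and rounding shape is the same idea.
public class Int8Quant {
    public static byte quantize(float v, float scale) {
        int q = Math.round(v / scale * 127f);
        return (byte) Math.max(-127, Math.min(127, q));
    }

    public static float dequantize(byte q, float scale) {
        return q * scale / 127f;
    }
}
```

This is where the P4's 22 TOPS INT8 figure (versus 5.5 TFLOPS FP32) comes into play: a 4x arithmetic throughput gain, bought with the small, calibrated rounding error the quantizer introduces.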

  16. PYTORCH: MALCONV, MALWARE DETECTION WITH DEEP LEARNING. A convolutional neural network digests entire binaries for malware identification. - A Custom Malware Identification Model - The Current Ingest Framework Leverages an Unaccelerated Predecessor o Can't Use MalConv Because It's Too Computationally Intense - Integration with PyTorch Will Require Some Work o No Great Inference Layer Available for PyTorch o Model Translation with ONNX to Caffe2 (or Other?) o Currently a Work in Progress - How Do the Ergonomics Differ from the Image Classification Task?

  17. NVIDIA GPU REST ENGINE: DEEP LEARNING INFERENCE VIA REST. The GRE provided memory and process isolation and native libraries for hardware access. - Multi-Threaded HTTP Server in Golang - RESTful Interface - Multithreaded Caffe - TensorRT, NVIDIA's Inference Engine - CUDA-Accelerated OpenCV - Containerized in Docker - Framework for Other Inference Engines - https://developer.nvidia.com/gre - https://github.com/NVIDIA/gpu-rest-engine

  18. NVIDIA DOCKER: SIMPLIFIED PACKAGING AND DEPLOYMENT VIA CONTAINERS. Packaging is performed in one environment and rapidly deployed to a large number of nodes. - Docker Image Building / Testing on Amazon Elastic Compute Cloud - Test Environment on an Isolated Network - Install Docker, CUDA Libraries / Drivers and NVIDIA Docker, and Go - Portability Across CentOS 7 Nodes - Supported Laptop Development of New Analytics - Worked Out of the Box for Caffe Models in Caffe / TensorRT - https://github.com/NVIDIA/nvidia-docker

  19. INSTRUMENTATION. We collected telemetry during evaluation with a suite of components we use for tracking system performance on production systems. - CollectD / StatsD API with Various Plugins for CPU, Disk, Memory, IO - nvidia-smi for GPU Information - Timely for Metric Storage and Analysis - Grafana for Visualization, Dashboarding and Analysis - NVIDIA Data Center GPU Manager o Active Health Monitoring, Early Fault Detection (SMART for GPUs) o Power Management o Configuration & Reporting
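StatsD's wire format is a tiny plaintext-over-UDP protocol ("name:value|type"), so JVM-side timers can be emitted with nothing but a datagram socket; the metric name below is illustrative:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Minimal StatsD emitter: format a timer metric in StatsD's plaintext
// protocol and fire it over UDP. Fire-and-forget, so instrumentation
// never blocks the ingest path.
public class StatsdEmit {
    // "name:value|ms" is StatsD's timer format
    public static String format(String name, long millis) {
        return name + ":" + millis + "|ms";
    }

    public static void send(String host, int port, String metric) throws Exception {
        byte[] data = metric.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket sock = new DatagramSocket()) {
            sock.send(new DatagramPacket(data, data.length,
                    InetAddress.getByName(host), port));
        }
    }
}
```

UDP's lossy fire-and-forget semantics are the point: per-document timing at ingest rates must never apply backpressure to the pipeline it is measuring.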

  20. FINAL INTEGRATION: JAVA VM. - REST Calls from Java to the GRE - Golang Coordinator o Copy Image to GpuMat o Resize GpuMat in OpenCV o Resized GpuMat Becomes the Input Layer o Calls to the Framework for Inference - Caffe Reference Model Hosted in Caffe or TensorRT [Diagram: JVM threads on a single node make HTTP calls to the GPU REST Engine running under NVIDIA Docker, which drives the OpenCV, Caffe and TensorRT libraries on the P4 GPU]
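The JVM side of this integration can be sketched end to end with a local stub standing in for the containerized engine; the endpoint path and response body below are stand-ins, not GRE's actual API:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Shape of the final integration: the JVM POSTs image bytes to an HTTP
// classification endpoint and reads back a label. The /api/classify
// path and "tabby cat" response are illustrative stand-ins.
public class GreClientSketch {
    public static String classify(int port, byte[] image) throws Exception {
        HttpRequest req = HttpRequest
                .newBuilder(URI.create("http://localhost:" + port + "/api/classify"))
                .POST(HttpRequest.BodyPublishers.ofByteArray(image))
                .build();
        return HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    // in-process stub standing in for the containerized GRE on this node
    public static HttpServer stubServer(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/api/classify", ex -> {
            byte[] body = "tabby cat".getBytes();
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream os = ex.getResponseBody()) { os.write(body); }
        });
        server.start();
        return server;
    }
}
```

Crossing a socket instead of JNI costs an HTTP round trip per image, but buys exactly what the deck asked for: process isolation, a language-neutral boundary, and a server side free to use whichever native framework wins.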

  21. EXPERIMENTS AND RESULTS. What did we evaluate and observe?

  22. BASELINE CONCURRENCY TESTS WITH CAFFE CPU. What effect does concurrency have on the ability to classify images? How quickly can we classify images using only the CPU? We processed 9000 images through the ETL framework, GRE and CPU-only Caffe.
  Java Thread Count            | 10     | 24     | 32
  Total Elapsed Time (s)       | 271.65 | 175.59 | 416
  Minimum Processing Time (ms) | 239    | 465.8  | 619.2
  Mean (ms)                    | 300    | 100.49 | 149.87
  Max (ms)                     | 483    | 880    | 1066
  CPU Max User (%)             | 83.0   | 99.8   | 100.00
  GPU Max Utilization (%)      | 0      | 0      | 0
  [Chart: distribution of per-image processing time in milliseconds for 10, 24 and 32 threads]
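The elapsed-time figures convert directly to throughput: 9000 images divided by total elapsed seconds gives images per second for each thread count. A one-liner to make the conversion explicit:

```java
// Throughput implied by the baseline run: total images over total
// elapsed wall-clock seconds.
public class Throughput {
    public static double imagesPerSecond(int images, double elapsedSeconds) {
        return images / elapsedSeconds;
    }
}
```

For example, 9000 images in 271.65 s is roughly 33 images per second, the CPU-only baseline that the GPU-backed configurations are measured against.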
