Deep Learning/AI Lifecycle with Dell EMC and Bitfusion


  1. Deep Learning/AI Lifecycle with Dell EMC and Bitfusion. Bhavesh Patel, Dell EMC Server Advanced Engineering; Mazhar Memon, CTO, Bitfusion

  2. Abstract: This talk gives an overview of the end-to-end application life cycle of deep learning in the enterprise, along with numerous use cases, and summarizes studies done by Bitfusion and Dell on a high-performance, heterogeneous, elastic rack of Dell EMC PowerEdge C4130s with NVIDIA GPUs. Use cases covered in detail include bringing on-demand GPU acceleration beyond the rack and across the enterprise with easily attachable elastic GPUs for deep learning development, as well as creating a cost-effective, software-defined, high-performance elastic multi-GPU system that combines multiple Dell EMC C4130 servers at runtime for deep learning training.

  3. Deep Learning and AI are being adopted across a wide range of market segments

  4. Industry/Function: AI Revolution
  - ROBOTICS: Computer Vision & Speech, Drones, Droids
  - ENTERTAINMENT: Interactive Virtual & Mixed Reality
  - AUTOMOTIVE: Self-Driving Cars, Co-Pilot Advisor
  - FINANCE: Predictive Price Analysis, Dynamic Decision Support
  - PHARMA: Drug Discovery, Protein Simulation
  - HEALTHCARE: Predictive Diagnosis, Wearable Intelligence
  - ENERGY: Geo-Seismic Resource Discovery
  - EDUCATION: Adaptive Learning Courses
  - SALES: Adaptive Product Recommendations
  - SUPPLY CHAIN: Dynamic Routing Optimization
  - CUSTOMER SERVICE: Bots and Fully-Automated Service
  - MAINTENANCE: Dynamic Risk Mitigation and Yield Optimization

  5. ...but few people have the time, knowledge, or resources to even get started

  6. PROBLEM 1: HARDWARE INFRASTRUCTURE LIMITATIONS
  ● Increased cost with dense servers
  ● TOR bottleneck, limited scalability
  ● Limited multi-tenancy on GPU servers (limited CPU and memory per user)
  ● Limited to 8-GPU applications
  ● Does not support GPU apps with:
  ○ High storage, CPU, or memory requirements

  7. PROBLEM 2: SOFTWARE COMPLEXITY OVERLOAD
  Software Management: GPU Driver Management; Framework & Library Installation; Deep Learning Framework Configuration; Package Manager; Jupyter Server or IDE Setup; Continuous Integration
  Model Management: Code Version Management; Hyperparameter Optimization; Experiment Tracking; Deployment Automation; Deployment
  Infrastructure Management: Cloud or Server Orchestration; GPU Hardware Setup; GPU Resource Allocation; Container Orchestration; Networking
  Data Management: Data Uploader; Shared Local File System; Data Volume Management; Data Integrations & Pipelining; Direct Bypass MPI / RDMA / RPI / gRPC
  Workload Management: Job Scheduler; Log Management; User & Group Management; Inference Autoscaling; Monitoring

  8. Need to Simplify and Scale

  9. SOLUTION 1/2: CONVERGED RACK SOLUTION
  ● Up to 64 GPUs per application
  ● GPU applications with varied storage, memory, and CPU requirements
  ● 30-50% less cost per GPU
  ● More cores and memory per GPU
  ● Much greater intra-rack networking bandwidth
  ● Less inter-rack load
  ● Composable: add-as-you-go compute bundles

  10. SOLUTION 2/2: COMPLETE, STREAMLINED AI DEVELOPMENT
  [Diagram: a shared pool of GPUs spanning the develop, train, and deploy stages]
  Develop on pre-installed, quick-start deep learning containers.
  ● Get to work quickly with workspaces with optimized pre-configured drivers, frameworks, libraries, and notebooks.
  ● Start with CPUs, and attach Elastic GPUs on-demand.
  ● All your code and data is saved automatically and sharable with others.
  Transition from development to training with multiple GPUs.
  ● Seamlessly scale out to more GPUs on a shared training cluster to train larger models quickly and cost-effectively.
  ● Support and manage multiple users, teams, and projects.
  ● Train multiple models in parallel for massive productivity improvements.
  Push trained, finalized models into production.
  ● Deploy a trained neural network into production and perform real-time inference across different hardware.
  ● Manage multiple AI applications and inference endpoints corresponding to different trained models.

  11. C4130 DEEP LEARNING SERVER
  [Server photos, front and back, with callouts: CPU sockets (under heat sinks), 8 fans, 4 GPU accelerators, dual SSD boot drives, 2x 1Gb NICs, iDRAC NIC (optional), redundant power supplies]

  12. GPU DEEP LEARNING RACK SOLUTION
  - Pre-Built App Containers
  - GPU and Workspace Management
  - Elastic GPUs across the Datacenter
  - Software-defined scaled-out GPU servers
  Configuration details (R730 / C4130):
  - CPU: E5-2669 v3 @ 2.1GHz / E5-2630 v3 @ 2.4GHz
  - Memory: 4GB / 1TB per node, 64GB DIMMs
  - Storage: Intel PCIe NVMe (both)
  - Networking IO: CX3 FDR InfiniBand (both)
  - GPU: NA / M40-24GB
  - TOR Switch: Mellanox SX6036 FDR switch
  - Cables: FDR 56G DCA cables

  13. GPU DEEP LEARNING RACK SOLUTION: End-to-End Deep Learning Application Life Cycle
  1 Develop → 2 Train → 3 Deploy
  - Pre-Built App Containers
  - GPU and Workspace Management
  - Elastic GPUs across the Datacenter
  - Software-defined scaled-out GPU servers
  [Rack diagram: CPU nodes (R730) and GPU nodes (C4130 #1-#4) connected by an InfiniBand switch, presenting a shared pool of GPUs]

  14. PRODUCT ARCHITECTURE
  [Diagram: a local environment and a shared cluster environment (master node, CPU nodes, GPU nodes) with elastic GPU attachment, batch scheduling & parallel training, and an inference server]
  1. Manage code and development: get started quickly with pre-built deep learning containers or create your own. Start initial development locally or on shared CPUs with interactive workspaces.
  2. Attach one or many GPUs on-demand for accelerated training.
  3. Perform batch scheduling for maximum resource efficiency and parallel training for ultimate development speed.
  4. Expose finalized models for production inference.
  5. Manage cluster resources, containers, and users.

  15. VALUE PROPOSITION: Deep Learning with “State of the Art” vs. Deep Learning with “Streamlined Flow and Converged Infra”

  16. …but wait, ‘converged compute’ requires network-attached GPUs... [Diagram: R730 and C4130]

  17. BITFUSION CORE VIRTUALIZATION
  GPU Device Virtualization: allows dynamic GPU attach on a per-application basis.
  Features:
  ● APIs: CUDA, OpenCL
  ● Distribution: scale out to remote GPUs
  ● Pooling: oversubscribe GPUs
  ● Resource Provisioning: fractional vGPUs
  ● High Availability: automatic DMR
  ● Manageability: remote nvidia-smi
  ● Distributed CUDA Unified Memory
  ● Native support for IB, GPUDirect RDMA
  ● Feature complete with CUDA 8.0

  18. USE AND MANAGE GPUs IN EXACTLY THE SAME WAY
  ● Use your favorite tools:
  ○ All common tools, e.g. nvidia-smi, work across virtual clusters (a sketch of such a query follows)
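
To make the "tools work the same way" claim concrete, here is a minimal sketch of the kind of device query nvidia-smi performs, written against NVML (the library nvidia-smi is built on). It is a generic illustration, not Bitfusion's code; the deck's claim is that such a query would run unchanged across a virtual cluster. Build line is an assumption: nvcc nvml_query.cu -lnvidia-ml

```cpp
// Minimal NVML sketch: list each GPU with its name and memory usage,
// the same information nvidia-smi reports.
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML init failed\n");
        return 1;
    }
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        nvmlMemory_t mem;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetMemoryInfo(dev, &mem);
        // Under GPU virtualization, remote/virtual GPUs would appear here too.
        printf("GPU %u: %s, %llu / %llu MiB used\n", i, name,
               (unsigned long long)(mem.used >> 20),
               (unsigned long long)(mem.total >> 20));
    }
    nvmlShutdown();
    return 0;
}
```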

  19. PUTTING IT ALL TOGETHER
  [Diagram: the CLIENT runs Bitfusion Flex managed containers with the Bitfusion Client Library; each GPU SERVER runs the Bitfusion Service Daemon]

  20. NATIVE VS. REMOTE GPUs
  [Diagram: GPU 0 and GPU 1 attached to a local CPU over PCIe, versus GPU 0 and GPU 1 on a remote node reached through PCIe-attached HCAs]
  Completely transparent: all CUDA apps see local and remote GPUs as if directly connected.
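
As an illustration of this transparency claim, the sketch below is a plain CUDA runtime enumeration loop. Under a virtualized setup like the one described, the expectation is that such unmodified code would list network-attached GPUs as ordinary local devices; this is a generic example, not Bitfusion's implementation.

```cpp
// Enumerate CUDA devices exactly as any CUDA application would.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        // A remote GPU presented by the virtualization layer is
        // indistinguishable from a local one at this level.
        printf("Device %d: %s, %zu MiB global memory, PCI %02x:%02x\n",
               i, p.name, p.totalGlobalMem >> 20, p.pciBusID, p.pciDeviceID);
    }
    return 0;
}
```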

  21. Results

  22. REMOTE GPUs: LATENCY AND BANDWIDTH
  • Data movement overhead is the primary scaling limiter
  • Measurements done at the application level, via cudaMemcpy (see the sketch below)
  [Charts: fast local GPU copies; PCIe intra-node copies]
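
A minimal sketch of an application-level cudaMemcpy measurement of the kind described above: time host-to-device copies with CUDA events and report effective bandwidth. The 64 MiB transfer size and iteration count are arbitrary assumptions, not the values used in the study.

```cpp
// Time repeated host-to-device copies and compute effective bandwidth.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;  // 64 MiB per transfer (assumption)
    const int iters = 20;
    void *h, *d;
    cudaMallocHost(&h, bytes);      // pinned host buffer
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (double)bytes * iters / (ms / 1e3) / 1e9;
    printf("H2D: %.2f ms/copy, %.1f GB/s\n", ms / iters, gbps);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```

Run against a local GPU versus a network-attached one, the same measurement exposes the local/intra-node/remote copy tiers shown in the charts.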

  23. 16-GPU VIRTUAL SYSTEM: NAIVE IMPLEMENTATION WITH TCP/IP
  [Chart, nodes 0-3 (C4130): fast local GPU copies; intra-node copies via PCIe; low-bandwidth, high-latency remote copies]
  OS bypass is needed to avoid the primary TCP/IP overheads.
  AI apps are very latency sensitive (see the latency sketch below).
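
To see why latency, not just bandwidth, dominates for small transfers, here is a hedged sketch that times many small synchronous copies. Over a naive TCP/IP transport, each such call would additionally pay a network round trip, which is the overhead OS bypass (RDMA) removes. Message size and iteration count are assumptions.

```cpp
// Measure average per-call latency of small synchronous copies.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 4096;  // small message (assumption)
    const int iters = 1000;
    void *h, *d;
    cudaMallocHost(&h, bytes);
    cudaMalloc(&d, bytes);

    // Warm up so the first timed copy doesn't include setup cost.
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // each call pays full latency
    auto t1 = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("avg latency per %zu-byte copy: %.2f us\n", bytes, us);

    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```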

  24. 16-GPU VIRTUAL SYSTEM: BITFUSION OPTIMIZED TRANSPORT AND RUNTIME
  [Chart: remote GPU copies perform approximately equal to native local GPUs, with minimal NUMA effects]

  25. SLICE & DICE: MORE THAN ONE WAY TO GET 4 GPUs
  Native GPU performance with network-attached GPUs: multiple ways to create a virtual 4-GPU node, with native efficiency.
  [Charts, R730/C4130: run-time comparison, lower is better (seconds to train Caffe GoogleNet, batch size 128), for TensorFlow, Caffe GoogleNet, and Pixel-CNN]

  26. TRAINING PERFORMANCE
  [Charts, R730/C4130: continued strong scaling for Caffe GoogleNet up to the PCIe host bridge limit; weak scaling for Caffe GoogleNet and TensorFlow 1.0 with Pixel-CNN; accelerated hyperparameter optimization. Plotted values at 1, 2, 4, 8, 16 GPUs: 53%, 55%, 73%, 74%, 86% (native vs. remote)]

  27. Other PCIe GPU Configurations Available
  Currently testing: Config ‘G’
  Further reading:
  http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2016/11/11/deep-learning-performance-with-p100-gpus
  http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2017/03/22/deep-learning-inference-on-p40-gpus

  28. NvLink Configuration: Config ‘K’
  • 4x P100-16GB SXM2 GPUs
  • 2 CPUs
  • PCIe switch
  • 1 PCIe slot: EDR IB
  • Memory: 256GB, 16GB DIMMs @ 2133
  • OS: Ubuntu 16.04
  • CUDA: 8.1
  [Diagram: SXM2 GPUs #1-#4]

  29. NvLink Configuration: Config ‘L’
  • 4x P100-16GB SXM2 GPUs
  • 2 CPUs
  • PCIe switch
  • 1 PCIe slot: EDR IB
  • Memory: 256GB, 16GB DIMMs @ 2133
  • OS: Ubuntu 16.04
  • CUDA: 8.1
  [Diagram: SXM2 GPUs #1-#4 behind a PCIe switch]
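
As a generic way to inspect which GPU pairs in configurations like ‘K’ and ‘L’ are directly connected (e.g. over NVLink), the sketch below queries CUDA peer-to-peer accessibility between all device pairs. This is a standard CUDA runtime query, not vendor-specific tooling, and says only whether peer access is possible, not which link carries it.

```cpp
// Query peer-to-peer accessibility between every pair of GPUs.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("GPU %d -> GPU %d: peer access %s\n",
                   i, j, can ? "yes" : "no");
        }
    }
    return 0;
}
```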

  30. Come visit us
  Dell Booth #110 | Bitfusion Booth #103
  Request access or schedule a demo of Bitfusion Flex at bitfusion.io
  Scheduled live demos: 12-12:30, Dell Booth; 5-7, Dell Booth; ongoing, Bitfusion Booth
