Deep Learning/AI Lifecycle with Dell EMC and Bitfusion


  1. Deep Learning/AI Lifecycle with Dell EMC and Bitfusion. Bhavesh Patel, Dell EMC Server Advanced Engineering; Mazhar Memon, CTO, Bitfusion

  2. Abstract: This talk gives an overview of the end-to-end application life cycle of deep learning in the enterprise, along with numerous use cases, and summarizes studies done by Bitfusion and Dell on a high-performance, heterogeneous, elastic rack of Dell EMC PowerEdge C4130s with NVIDIA GPUs. Use cases covered in detail include bringing on-demand GPU acceleration beyond the rack and across the enterprise with easily attachable elastic GPUs for deep learning development, as well as creating a cost-effective, software-defined, high-performance elastic multi-GPU system that combines multiple Dell EMC C4130 servers at runtime for deep learning training.

  3. Deep Learning and AI are being adopted across a wide range of market segments

  4. Industry/Function: AI Revolution
  - ROBOTICS: Computer Vision & Speech, Drones, Droids
  - ENTERTAINMENT: Interactive Virtual & Mixed Reality
  - AUTOMOTIVE: Self-Driving Cars, Co-Pilot Advisor
  - FINANCE: Predictive Price Analysis, Dynamic Decision Support
  - PHARMA: Drug Discovery, Protein Simulation
  - HEALTHCARE: Predictive Diagnosis, Wearable Intelligence
  - ENERGY: Geo-Seismic Resource Discovery
  - EDUCATION: Adaptive Learning Courses
  - SALES: Adaptive Product Recommendations
  - SUPPLY CHAIN: Dynamic Routing Optimization
  - CUSTOMER SERVICE: Bots and Fully-Automated Service
  - MAINTENANCE: Dynamic Risk Mitigation and Yield Optimization

  5. ...but few people have the time, knowledge, or resources to even get started

  6. PROBLEM 1: HARDWARE INFRASTRUCTURE LIMITATIONS
  ● Increased cost with dense servers
  ● TOR bottleneck, limited scalability
  ● Limited multi-tenancy on GPU servers (limited CPU and memory per user)
  ● Limited to 8-GPU applications
  ● Does not support GPU apps with:
  ○ High storage, CPU, or memory requirements

  7. PROBLEM 2: SOFTWARE COMPLEXITY OVERLOAD
  Software Management: GPU Driver Management; Framework & Library Installation; Deep Learning Framework Configuration; Package Manager; Jupyter Server or IDE Setup; Continuous Integration
  Model Management: Code Version Management; Hyperparameter Optimization; Experiment Tracking; Deployment Automation; Deployment
  Infrastructure Management: Cloud or Server Orchestration; GPU Hardware Setup; GPU Resource Allocation; Container Orchestration; Networking
  Data Management: Data Uploader; Shared Local File System; Data Volume Management; Data Integrations & Pipelining; Direct Bypass MPI / RDMA / RPI / gRPC
  Workload Management: Job Scheduler; Log Management; User & Group Management; Inference Autoscaling; Monitoring

  8. Need to Simplify and Scale

  9. SOLUTION 1/2: CONVERGED RACK SOLUTION
  ● Up to 64 GPUs per application
  ● GPU applications with varied storage, memory, and CPU requirements
  ● 30-50% less cost per GPU
  ● More cores and memory per GPU
  ● Much greater intra-rack networking bandwidth
  ● Less inter-rack load
  ● Composable: add-as-you-go compute bundles

  10. SOLUTION 2/2: COMPLETE, STREAMLINED AI DEVELOPMENT
  [Diagram: a shared pool of GPUs spanning the develop, train, and deploy stages]
  Develop on pre-installed, quick-start deep learning containers.
  ● Get to work quickly with workspaces with optimized pre-configured drivers, frameworks, libraries, and notebooks.
  ● Start with CPUs, and attach Elastic GPUs on-demand.
  ● All your code and data is saved automatically and sharable with others.
  Transition from development to training with multiple GPUs.
  ● Seamlessly scale out to more GPUs on a shared training cluster to train larger models quickly and cost-effectively.
  ● Support and manage multiple users, teams, and projects.
  ● Train multiple models in parallel for massive productivity improvements.
  Push trained, finalized models into production.
  ● Deploy a trained neural network into production and perform real-time inference across different hardware.
  ● Manage multiple AI applications and inference endpoints corresponding to different trained models.

  11. C4130 DEEP LEARNING SERVER
  [Server photos, front and back, with callouts: CPU sockets (under heat sinks), 8 fans, 4 GPU accelerators, dual SSD boot drives, 2x 1Gb NICs, iDRAC NIC (optional), redundant power supplies]

  12. GPU DEEP LEARNING RACK SOLUTION
  - Pre-Built App Containers
  - GPU and Workspace Management
  - Elastic GPUs across the Datacenter
  - Software-defined scaled-out GPU servers
  Configuration details (R730 / C4130):
  - CPU: E5-2669 v3 @ 2.1GHz / E5-2630 v3 @ 2.4GHz
  - Memory: 4GB / 1TB per node, 64GB DIMMs
  - Storage: Intel PCIe NVMe (both)
  - Networking IO: CX3 FDR InfiniBand (both)
  - GPU: NA / M40-24GB
  - TOR Switch: Mellanox SX6036 FDR switch
  - Cables: FDR 56G DCA cables

  13. GPU DEEP LEARNING RACK SOLUTION: End-to-End Deep Learning Application Life Cycle
  1 Develop → 2 Train → 3 Deploy
  - Pre-Built App Containers
  - GPU and Workspace Management
  - Elastic GPUs across the Datacenter
  - Software-defined scaled-out GPU servers
  [Rack diagram: CPU nodes (R730) and GPU nodes (C4130 #1-#4) connected by an InfiniBand switch, presenting a shared pool of GPUs]

  14. PRODUCT ARCHITECTURE
  [Diagram: a local environment and a shared cluster environment (master node, CPU nodes, GPU nodes) with elastic GPU attachment, batch scheduling & parallel training, and an inference server]
  1. Manage code and development: get started quickly with pre-built deep learning containers or create your own. Start initial development locally or on shared CPUs with interactive workspaces.
  2. Attach one or many GPUs on-demand for accelerated training.
  3. Perform batch scheduling for maximum resource efficiency and parallel training for ultimate development speed.
  4. Expose finalized models for production inference.
  5. Manage cluster resources, containers, and users.

  15. VALUE PROPOSITION: Deep Learning with “State of the Art” vs. Deep Learning with “Streamlined Flow and Converged Infra”

  16. …but wait, ‘converged compute’ requires network-attached GPUs... [Diagram: R730 and C4130]

  17. BITFUSION CORE VIRTUALIZATION
  GPU Device Virtualization: allows dynamic GPU attach on a per-application basis.
  Features:
  ● APIs: CUDA, OpenCL
  ● Distribution: scale out to remote GPUs
  ● Pooling: oversubscribe GPUs
  ● Resource Provisioning: fractional vGPUs
  ● High Availability: automatic DMR
  ● Manageability: remote nvidia-smi
  ● Distributed CUDA Unified Memory
  ● Native support for IB, GPUDirect RDMA
  ● Feature complete with CUDA 8.0

  18. USE AND MANAGE GPUs IN EXACTLY THE SAME WAY
  ● Use your favorite tools:
  ○ All common tools, e.g. nvidia-smi, work across virtual clusters (a sketch of such a query follows)
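
To make the "tools work the same way" claim concrete, here is a minimal sketch of the kind of device query nvidia-smi performs, written against NVML (the library nvidia-smi is built on). It is a generic illustration, not Bitfusion's code; the deck's claim is that such a query would run unchanged across a virtual cluster. Build line is an assumption: nvcc nvml_query.cu -lnvidia-ml

```cpp
// Minimal NVML sketch: list each GPU with its name and memory usage,
// the same information nvidia-smi reports.
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML init failed\n");
        return 1;
    }
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        nvmlMemory_t mem;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetMemoryInfo(dev, &mem);
        // Under GPU virtualization, remote/virtual GPUs would appear here too.
        printf("GPU %u: %s, %llu / %llu MiB used\n", i, name,
               (unsigned long long)(mem.used >> 20),
               (unsigned long long)(mem.total >> 20));
    }
    nvmlShutdown();
    return 0;
}
```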

  19. PUTTING IT ALL TOGETHER
  [Diagram: the CLIENT runs Bitfusion Flex managed containers with the Bitfusion Client Library; each GPU SERVER runs the Bitfusion Service Daemon]

  20. NATIVE VS. REMOTE GPUs
  [Diagram: GPU 0 and GPU 1 attached to a local CPU over PCIe, versus GPU 0 and GPU 1 on a remote node reached through PCIe-attached HCAs]
  Completely transparent: all CUDA apps see local and remote GPUs as if directly connected.
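
As an illustration of this transparency claim, the sketch below is a plain CUDA runtime enumeration loop. Under a virtualized setup like the one described, the expectation is that such unmodified code would list network-attached GPUs as ordinary local devices; this is a generic example, not Bitfusion's implementation.

```cpp
// Enumerate CUDA devices exactly as any CUDA application would.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        // A remote GPU presented by the virtualization layer is
        // indistinguishable from a local one at this level.
        printf("Device %d: %s, %zu MiB global memory, PCI %02x:%02x\n",
               i, p.name, p.totalGlobalMem >> 20, p.pciBusID, p.pciDeviceID);
    }
    return 0;
}
```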

  21. Results

  22. REMOTE GPUs: LATENCY AND BANDWIDTH
  • Data movement overhead is the primary scaling limiter
  • Measurements done at the application level, via cudaMemcpy (see the sketch below)
  [Charts: fast local GPU copies; PCIe intra-node copies]
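
A minimal sketch of an application-level cudaMemcpy measurement of the kind described above: time host-to-device copies with CUDA events and report effective bandwidth. The 64 MiB transfer size and iteration count are arbitrary assumptions, not the values used in the study.

```cpp
// Time repeated host-to-device copies and compute effective bandwidth.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;  // 64 MiB per transfer (assumption)
    const int iters = 20;
    void *h, *d;
    cudaMallocHost(&h, bytes);      // pinned host buffer
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (double)bytes * iters / (ms / 1e3) / 1e9;
    printf("H2D: %.2f ms/copy, %.1f GB/s\n", ms / iters, gbps);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```

Run against a local GPU versus a network-attached one, the same measurement exposes the local/intra-node/remote copy tiers shown in the charts.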

  23. 16-GPU VIRTUAL SYSTEM: NAIVE IMPLEMENTATION WITH TCP/IP
  [Chart, nodes 0-3 (C4130): fast local GPU copies; intra-node copies via PCIe; low-bandwidth, high-latency remote copies]
  OS bypass is needed to avoid the primary TCP/IP overheads.
  AI apps are very latency sensitive (see the latency sketch below).
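
To see why latency, not just bandwidth, dominates for small transfers, here is a hedged sketch that times many small synchronous copies. Over a naive TCP/IP transport, each such call would additionally pay a network round trip, which is the overhead OS bypass (RDMA) removes. Message size and iteration count are assumptions.

```cpp
// Measure average per-call latency of small synchronous copies.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 4096;  // small message (assumption)
    const int iters = 1000;
    void *h, *d;
    cudaMallocHost(&h, bytes);
    cudaMalloc(&d, bytes);

    // Warm up so the first timed copy doesn't include setup cost.
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // each call pays full latency
    auto t1 = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("avg latency per %zu-byte copy: %.2f us\n", bytes, us);

    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```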

  24. 16-GPU VIRTUAL SYSTEM: BITFUSION OPTIMIZED TRANSPORT AND RUNTIME
  [Chart: remote GPU copies perform approximately equal to native local GPUs, with minimal NUMA effects]

  25. SLICE & DICE: MORE THAN ONE WAY TO GET 4 GPUs
  Native GPU performance with network-attached GPUs: multiple ways to create a virtual 4-GPU node, with native efficiency.
  [Charts, R730/C4130: run-time comparison, lower is better (seconds to train Caffe GoogleNet, batch size 128), for TensorFlow, Caffe GoogleNet, and Pixel-CNN]

  26. TRAINING PERFORMANCE
  [Charts, R730/C4130: continued strong scaling for Caffe GoogleNet up to the PCIe host bridge limit; weak scaling for Caffe GoogleNet and TensorFlow 1.0 with Pixel-CNN; accelerated hyperparameter optimization. Plotted values at 1, 2, 4, 8, 16 GPUs: 53%, 55%, 73%, 74%, 86% (native vs. remote)]

  27. Other PCIe GPU Configurations Available
  Currently testing: Config ‘G’
  Further reading:
  http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2016/11/11/deep-learning-performance-with-p100-gpus
  http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2017/03/22/deep-learning-inference-on-p40-gpus

  28. NvLink Configuration: Config ‘K’
  • 4x P100-16GB SXM2 GPUs
  • 2 CPUs
  • PCIe switch
  • 1 PCIe slot: EDR IB
  • Memory: 256GB, 16GB DIMMs @ 2133
  • OS: Ubuntu 16.04
  • CUDA: 8.1
  [Diagram: SXM2 GPUs #1-#4]

  29. NvLink Configuration: Config ‘L’
  • 4x P100-16GB SXM2 GPUs
  • 2 CPUs
  • PCIe switch
  • 1 PCIe slot: EDR IB
  • Memory: 256GB, 16GB DIMMs @ 2133
  • OS: Ubuntu 16.04
  • CUDA: 8.1
  [Diagram: SXM2 GPUs #1-#4 behind a PCIe switch]
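
As a generic way to inspect which GPU pairs in configurations like ‘K’ and ‘L’ are directly connected (e.g. over NVLink), the sketch below queries CUDA peer-to-peer accessibility between all device pairs. This is a standard CUDA runtime query, not vendor-specific tooling, and says only whether peer access is possible, not which link carries it.

```cpp
// Query peer-to-peer accessibility between every pair of GPUs.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("GPU %d -> GPU %d: peer access %s\n",
                   i, j, can ? "yes" : "no");
        }
    }
    return 0;
}
```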

  30. Come visit us
  Dell Booth #110 | Bitfusion Booth #103
  Request access or schedule a demo of Bitfusion Flex at bitfusion.io
  Scheduled live demos: 12-12:30, Dell Booth; 5-7, Dell Booth; ongoing, Bitfusion Booth
