
S9334: Building And Managing Scalable AI Infrastructure With NVIDIA DGX Pod And DGX Pod Management Software



  1. S9334: Building And Managing Scalable AI Infrastructure With NVIDIA DGX Pod And DGX Pod Management Software

  2. Agenda
     • Building your AI Data Center with DGX Reference Architectures
     • Creating Network Topologies
     • DGX POD Management Software (DeepOps)

  3. Why Infrastructure Matters to AI

  4. Considerations For On-prem Vs Cloud: Keep Compute Where the Data Lives
     Train closest to where your data lives.
     Cloud:
     • Early exploration
     • Small datasets in cloud
     • Few experiments
     • Careful prep for each run to save costs
     On-prem:
     • Deep Learning Enterprise
     • Large datasets on-premises
     • Frequent, rapid experimentation
     • Creative exploration, frequent training runs
     • Fixed cost infrastructure = experiment freely
     ✓ Data Sovereignty and Security
     ✓ Lowest Cost per Training Run
     ✓ Fail fast, learn FASTER

  5. AI Adopters Impeded By Infrastructure
     • AI boosts profit margins up to 15%
     • 40% see infrastructure as impeding AI
     Source: 2018 CTA Market Research

  6. Considerations When Selecting An AI Platform

  7. AI Platform Considerations: Factors impacting deep learning platform decisions
     • Total cost of ownership: “I have limited budget, need lowest up-front cost possible”
     • Developer productivity: “Must get started now, line of business wants to deliver results yesterday”
     • Scaling performance: “I want the most GPU bang for the buck”

  8. Comparing AI Compute Alternatives: Looking beyond the “spec sheet”
     Evaluation criteria:
     • AI/DL Expertise & Innovation
     • AI/DL Software Stack
     • Operating System Image
     • Hardware Architecture

  9. The Value Of AI Infrastructure With DGX Reference Architectures
     • Scalable performance: Reference architectures from NVIDIA and leading storage partners
     • Faster, simplified deployment: Simplified, validated, converged infrastructure offers
     • Trusted expertise and support: Available through select NPN partners as a turnkey solution

  10. Simplifying Deployment

  11. AI Success Delayed By Deployment Complexity
      Time and budget spent on things other than data science.
      [Figure: “DIY” TCO timeline with CAPEX and OPEX labels; axis marks: Day 1, Month 3. Stage labels: Study & exploration, Platform Design, HW & SW Integration, Troubleshooting, Software eng’g, Software optimization, Design and Build for Scale, Software re-optimization, Productive Experimentation, Training at Scale, Insights.]
      Designing, Building and Supporting an AI Infrastructure, from Scratch

  12. The Impact Of DGX R/A Solutions On Timeline
      Wasted time/effort eliminated; DGX deployment cycle shortened.
      [Figure: “DIY” TCO timeline with CAPEX label; axis marks: Day 1, Month 3. Stage labels: Study & exploration, Install and Deploy DGX RA Solution, Platform Design, Troubleshooting, Software eng’g, Software optimization, Design and Build for Scale, Software re-optimization, Productive Experimentation, Training at Scale, Insights.]
      2. Deploying an Integrated, Full-Stack AI Solution using a DGX Reference Architecture

  13. The Impact Of DGX R/A Solutions On Timeline
      [Figure: “DIY” TCO and DGX TCO timeline with CAPEX label; axis marks: Day 1, Week 1. Stage labels: Study & exploration, Install and Deploy DGX RA Solution, Productive Experimentation, Training at Scale, Insights.]
      2. Deploying an Integrated, Full-Stack AI Solution using a DGX Reference Architecture

  14. Supporting AI Infrastructure

  15. Supporting AI: Alternative Approaches
      • “My PyTorch CNN model is running 30% slower than yesterday!”
      • IT Admin: “OK let me look into it”
      • Status: Installed/running, then Problem!

  16. Supporting AI: Alternative Approaches
      Multiple paths to problem resolution:
      • Framework? Libraries? (open source / forum)
      • O/S? (open source / forum)
      • GPU? Drivers?
      • Server? Network? Storage? (server, storage & network solution providers)
      Status: Installed/running, then Problem!

  17. Supporting AI With DGX Reference Architecture Solutions
      • “My PyTorch CNN model is running 30% slower than yesterday!”
      • IT Admin and NPN Partner AI Expertise: “Update to PyTorch container XX.XX”
      • DGX RA Solution status: Problem!, then Running!

  18. Creating Networking Topologies

  19. DGX-1 POD Storage Partner Solutions

  20. NetApp ONTAP AI
      Simplify, Accelerate, and Scale the Data Pipeline for Deep Learning
      HARDWARE
      • NVIDIA DGX-1 | 5 x DGX-1 Systems | 5 PFLOPS
      • NETAPP AFF A800 | HA Pair | 364TB | 1M IOPS
      • CISCO | 2x 100Gb Ethernet Switches with RDMA
      SOFTWARE
      • NVIDIA GPU CLOUD DEEP LEARNING STACK | NVIDIA Optimized Frameworks
      • NETAPP ONTAP 9 | Simplified Data Management
      • TRIDENT | Provision Persistent Storage for DL
      SUPPORT
      • Single point of contact support
      • Proven support model

  21. NetApp Network Switch Port Configuration

  25. NetApp VLAN Connectivity for DGX-1 Servers and Storage System Ports

  26. NetApp Storage System Configuration

  27. NetApp Host Configuration

  28. AIRI: AI-Ready Infrastructure
      Extending the power of DGX-1 at-scale in every enterprise
      HARDWARE
      • NVIDIA DGX-1 | 4x DGX-1 Systems | 4 PFLOPS
      • PURE FLASHBLADE™ | 15x 17TB Blades | 1.5M IOPS
      • CISCO or ARISTA | 2x 100Gb Ethernet Switches with RDMA
      SOFTWARE
      • NVIDIA GPU CLOUD DEEP LEARNING STACK | NVIDIA Optimized Frameworks
      • AIRI SCALING TOOLKIT | Multi-node Training Made Simple

  29. Pure Storage Network Topology

  30. Rack Design & Builds

  31. DDN A3I with DGX-1
      Making AI-Powered Innovation Easier
      HARDWARE
      • NVIDIA DGX-1 | 4 x DGX-1 Systems | 4 PFLOPS
      • DDN AI200, AI7990 | 20GB/s | from 30TB | 350K IOPS
      • NETWORK: 2 x EDR IB or 100GbE Switches with RDMA
      SOFTWARE
      • NVIDIA GPU CLOUD DEEP LEARNING STACK | NVIDIA Optimized Frameworks
      • DDN: High performance, low latency, parallel file system
      • DDN: In-container client for easy deployment, efficiency, performance and reliability

  32. DDN A3I Reference Architecture 9:1 Configuration

  33. Network Diagram of DDN A3I Benchmark Testing Environment

  34. Optimized Data Delivery for DGX-1 server with DDN A3I

  35. DDN Network Diagram of Port-Level Connectivity 1:1 configuration

  36. DDN Network Diagram of Port-Level Connectivity 4:1 configuration

  37. DDN Network Diagram of Port-Level Connectivity 9:1 configuration

  38. DGX POD Management Software (DeepOps)

  39. You've Got A Shiny New DGX POD! What now?
      Cluster Deployment & Maintenance:
      • OS
      • Firmware
      • Security / Access
      • Job Scheduling
      • Monitoring
      Other important considerations:
      • Network
      • NGC Containers
      • Airgap
      • Storage

  40. DeepOps: What is it?
      • Open-source project to facilitate deployment of multi-node GPU clusters for Deep Learning and HPC environments, in an on-premise, optionally air-gapped datacenter or in the cloud
      • DeepOps is also recognized as the DGX POD Management Software
      • The modular nature of the project also allows more experienced administrators to pick and choose items that may be useful, making the process compatible with their existing software or infrastructure
      • GitHub: https://github.com/NVIDIA/deepops
      Note: You can use DeepOps to configure any NVIDIA GPU-Accelerated platform (and not just DGX servers).

  41. Building out your GPU cluster: DeepOps Components
      Automated Provisioning
      ● PXE Server for OS installation across cluster
      ● Automated configuration management
      Firmware management
      ● Automated, cluster-wide firmware management
      Package repository
      ● Deployment of internal Apt-repository
      ● Mirror packages for air-gapped environments
      Docker registry
      ● Deployment of internal registry
      ● Automated mirroring of NGC containers
      Job Scheduling
      ● Kubernetes
      ● Slurm
      Logging
      ● Filebeat
      ● Elasticsearch
      ● Kibana
      Monitoring
      ● DCGM
      ● Prometheus
      ● Grafana

  42. Here’s What We’ll Build Today
      To cluster and beyond!
      Build stack, listed top to bottom as on the slide:
      ● Deploy Kubeflow
      ● Run GPU-Accelerated jobs
      ● Deploy Kubernetes on compute node(s)
      ● Deploy additional services on management node(s)
      ● Provision compute node(s)
      ● Deploy basic services on management node(s)
      ● Deploy Kubernetes on management node(s)
      ● Prepare management node(s)
      ● Prepare provisioning node
      Node roles:
      ● GPU Compute node(s): For high-performance compute workloads
      ● Management node(s): Used for cluster management
      ● Provisioning node: Orchestrates the initial setup of the cluster
      Diagram labels: DeepOps, Network, Storage
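The build stack on slide 42 can be read as an ordered plan. The sketch below is only an illustration in Python: it assumes the stack is built from the bottom of the slide upward, and the `Step` type, the role names, and `steps_for` are hypothetical helpers, not DeepOps commands or APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str  # step label taken from the slide
    role: str  # node role the step acts on (hypothetical grouping)


# Assumption: the slide's stack is built bottom-up, so the list starts
# with the provisioning node and ends with Kubeflow.
BUILD_PLAN = [
    Step("Prepare provisioning node", "provisioning"),
    Step("Prepare management node(s)", "management"),
    Step("Deploy Kubernetes on management node(s)", "management"),
    Step("Deploy basic services on management node(s)", "management"),
    Step("Provision compute node(s)", "compute"),
    Step("Deploy additional services on management node(s)", "management"),
    Step("Deploy Kubernetes on compute node(s)", "compute"),
    Step("Run GPU-accelerated jobs", "compute"),
    Step("Deploy Kubeflow", "cluster"),
]


def steps_for(role: str) -> list[str]:
    """Return the step names that act on the given node role, in build order."""
    return [step.name for step in BUILD_PLAN if step.role == role]


if __name__ == "__main__":
    for index, step in enumerate(BUILD_PLAN, start=1):
        print(f"{index}. [{step.role}] {step.name}")
```

Calling `steps_for("management")` lists the four management-node steps in the order they would run under this assumed reading.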

  43. Architectural Considerations

  44. Architecture: Building Multi-node GPU Clusters with DeepOps
      ● 1x CPU-only login node
      ● Odd number of CPU-only management nodes required for etcd key-value store
        ○ Management / Communication
      ● 1/10Gb Ethernet control & management networks
        ○ Management, connectivity, command & control
      ● Fully non-blocking fat-tree 100Gb EDR InfiniBand topology
        ○ Use the biggest EDR IB core switch that fits
      Diagram labels: Datacenter Network, Login node(s), Mgmt. node(s), 1/10Gb Ethernet, Compute node(s) (Kubernetes Nodes, Slurm Nodes), Storage, 100Gb EDR InfiniBand / RoCE
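The odd-number requirement for management nodes on slide 44 follows from majority quorum: etcd needs more than half of its members available to accept writes. A minimal Python sketch of that arithmetic follows; the function names are illustrative and are not part of etcd or DeepOps.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} management node(s): quorum={quorum(n)}, "
              f"tolerated failures={fault_tolerance(n)}")
```

Going from 3 to 4 management nodes raises the quorum from 2 to 3 while the number of tolerated failures stays at 1, so an even-sized cluster adds a node without adding resilience.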
