Red Hat and the NVIDIA DGX: Tried, Tested, Trusted (NVIDIA GTC 2019)

  1. Red Hat and the NVIDIA DGX: Tried, Tested, Trusted. NVIDIA GTC 2019. Jeremy Eder and Andre Beausoleil, Red Hat

  2. Agenda
     ● Red Hat + NVIDIA Partnership Overview
     ● Announcements / What’s New
     ● OpenShift + GPU Integration Details

  3. Where Red Hat Partners with NVIDIA
     ● GPU-accelerated workloads in the enterprise
       ○ AI/ML and HPC
     ● Deploy and manage NGC containers
       ○ On-prem or public cloud
     ● Managing virtualized resources in the data center
       ○ vGPU for technical workstations
     ● Fast deployment of GPU resources with Red Hat
       ○ Easy-to-use driver framework

  4. Red Hat/NVIDIA Technology Partnership Timeline (May ’17 to Mar ’19)
     ● NVIDIA GTC 2017: OpenShift Partner Theatre & RH AI/ML strategy sessions
     ● SC2017: Red Hat/NVIDIA booth demos and talks, RH sponsorship
     ● Joint webinar: vGPU & Kubernetes; Kubernetes working group meeting
     ● NVIDIA GTC 2018 & OpenShift Commons; 2018 Rice Oil & Gas HPC Conf
     ● Red Hat Summit: AI booth, vGPU/RHV, Oil & Gas use case sessions
     ● RHV 4.2 / vGPU 6.1 & CUDA 9.2 announcement (vGPU on RHV); RH vGPU roadmap update
     ● KubeCon: Deep Learning on OpenShift with GPUs
     ● STAC-A2 benchmark (NVIDIA/HPE/RHEL; STAC Conf NYC, RH & NVIDIA blogs)
     ● NVIDIA GTC DC; LSF & MM Summit; Nouveau driver demo
     ● RHEL & OpenShift certification on DGX-1
     ● NVIDIA GTC 2019: RHEL & OpenShift certification on DGX-2 / T4 GPU server configs; RH sponsorship

  5. Red Hat + NVIDIA: What’s New?
     ● Red Hat Enterprise Linux certification on DGX-1 & DGX-2 systems
       ○ Support for the Kubernetes-based OpenShift Container Platform
       ○ NVIDIA GPU Cloud (NGC) containers run on RHEL and OpenShift
     ● Red Hat’s OpenShift provides advanced ways of managing hardware to best leverage GPUs in container environments
     ● NVIDIA developed precompiled driver packages to simplify GPU deployments on Red Hat products
     ● NVIDIA’s latest T4 GPUs are available on Red Hat Enterprise Linux
       ○ T4 servers with RHEL support from most major OEM server vendors
       ○ T4 servers are “NGC-Ready” to run GPU containers

  6. Red Hat + NVIDIA: Open Source Collaboration
     ● Heterogeneous Memory Management (HMM)
       ○ Memory management between device and CPU
     ● Nouveau driver
       ○ Graphics device driver for NVIDIA GPUs
     ● Mediated Devices (mdev)
       ○ Enabling vGPU through the Linux kernel framework
     ● Kubernetes Device Plugins
       ○ Fast and direct access to GPU hardware
       ○ Run GPU-enabled containers in a Kubernetes cluster

  7. Red Hat OpenShift Container Platform

  8. OPENSHIFT - CONTAINER PLATFORM FOR AI
     Enable Kubernetes clusters to seamlessly run accelerated AI workloads in containers. Red Hat is delivering the required functionality to efficiently run AI/ML workloads on OpenShift.
     ● OpenShift 3.10, 3.11
       ○ Device Plugins provide access to FPGAs, GPGPUs, SoCs and other specialized hardware for applications running in containers (example below)
       ○ CPU Manager provides containers with exclusive access to compute resources, like CPU cores, for better utilization
       ○ Huge Pages support enables containers with large memory requirements to run more efficiently
     ● OpenShift 4.0
       ○ Multi-network feature allows more than one network interface per container for better traffic management
     (Diagram: GPU-enabled servers running Red Hat Enterprise Linux and OpenShift Container Platform (OCP); the OCP master provides API/authentication, data store, scheduler, and health/scaling for containers running on the OCP nodes.)
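
As a rough illustration of the Device Plugins feature above, the sketch below shows a pod that requests a GPU as an extended resource; the scheduler places it on a node where the NVIDIA device plugin advertises nvidia.com/gpu. The pod name and image tag are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.1-base  # placeholder CUDA image
    command: ["nvidia-smi"]       # prints the GPU the kubelet assigned to the container
    resources:
      limits:
        nvidia.com/gpu: 1         # extended resource advertised by the NVIDIA device plugin
```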

  9. One Platform to...
     OpenShift is the single platform to run any application:
     ● Old or new
     ● Monolithic/Microservice
     (Surrounding word cloud: NFV, Machine Learning, FSI, HPC, ISVs, Big Data, Animation.)

  10. Data Scientist User Experience (Service Catalog)

  11. Upstream First: Kubernetes Working Groups
     ● Resource Management Working Group
       ○ Features delivered:
         ■ Device Plugins (GPU/Bypass/FPGA)
         ■ CPU Manager (exclusive cores)
         ■ Huge Pages support
       ○ Extensive roadmap
       ○ Participants: Intel, IBM, Google, NVIDIA, Red Hat, and many more

  12. Upstream First: Kubernetes Working Groups
     ● Network Plumbing Working Group
       ○ Formalized Dec 2017
       ○ Implemented a multi-network specification: https://github.com/K8sNetworkPlumbingWG/multi-net-spec (a collection of CRDs for multiple networks, owned by sig-network); see the sketch below
       ○ Reference design implemented in Multus CNI by Red Hat
       ○ Separate control and data planes, overlapping IPs, fast data plane
       ○ Participants: IBM, Intel, Red Hat, Huawei, Cisco, Tigera... at least
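
A minimal sketch of the multi-network CRD in practice: a NetworkAttachmentDefinition wrapping a macvlan CNI configuration. The object name, master interface, and subnet are assumptions for illustration.

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-data              # hypothetical secondary network
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "mode": "bridge",
      "ipam": { "type": "host-local", "subnet": "192.168.10.0/24" }
    }
```

A pod opts into this network through an annotation; see the example after slide 25.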

  13. GPU Cluster Topology

  14. What does an OpenShift (OCP) Cluster look like?
     (Diagram: a control plane of three nodes, each running a master and etcd; an infrastructure tier with a load balancer plus a registry and router on each infra node; and a pool of compute and GPU nodes.)

  15. OpenShift Cluster Topology
     ● How to enable software to take advantage of “special” hardware
     ● Create node pools (see the sketch below)
       ○ MachineSets
       ○ Mark them as “special”
       ○ Taints/Tolerations
       ○ Priority/Preemption
       ○ ExtendedResourceToleration
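
A minimal sketch of the taints/tolerations idea, assuming the GPU MachineSet stamps its nodes with a NoSchedule taint (key nvidia.com/gpu) and a node-role label; both names are assumptions, not fixed OpenShift defaults.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                        # hypothetical name
spec:
  nodeSelector:
    node-role.kubernetes.io/gpu: ""         # assumed label applied to the GPU node pool
  tolerations:
  - key: "nvidia.com/gpu"                   # assumed taint carried by the GPU nodes
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: app
    image: registry.example.com/app:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```

Pods without the toleration stay off the tainted GPU nodes, so the “special” hardware remains free for workloads that actually need it.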

  16. OpenShift Cluster Topology
     ● How to enable software to take advantage of “special” hardware
     ● Tune/configure the OS
       ○ Tuned profiles
       ○ CPU isolation
       ○ sysctls

  17. OpenShift Cluster Topology
     ● How to enable software to take advantage of “special” hardware
     ● Optimize your workload (see the sketch below)
       ○ Dedicate CPU cores
       ○ Consume hugepages
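
A sketch of a workload that pins whole CPU cores and consumes hugepages. Requests equal limits and the CPU count is an integer, so the pod lands in the Guaranteed QoS class and the static CPU Manager assigns it exclusive cores; all sizes are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-hugepages-app                # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest  # placeholder image
    resources:
      requests:
        cpu: "4"                            # whole cores; requests == limits => Guaranteed QoS
        memory: 8Gi
        hugepages-2Mi: 1Gi
      limits:
        cpu: "4"
        memory: 8Gi
        hugepages-2Mi: 1Gi
    volumeMounts:
    - name: hugepages
      mountPath: /dev/hugepages             # where the app maps its huge pages
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages
```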

  18. OpenShift Cluster Topology
     ● How to enable software to take advantage of “special” hardware
     ● Enable the hardware (see the sketch below)
       ○ Install drivers
       ○ Deploy the device plugin
       ○ Deploy monitoring
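
A hedged sketch of deploying the device plugin as a DaemonSet, steered onto GPU nodes with the NFD PCI label shown on slide 22. The object name, image tag, and taint key are placeholders, not the exact manifest NVIDIA ships.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin                # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      nodeSelector:
        feature.node.kubernetes.io/pci-0300_10de.present: "true"  # NFD label: NVIDIA PCI device found
      tolerations:
      - key: "nvidia.com/gpu"               # assumed taint on the GPU node pool
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: device-plugin
        image: nvidia/k8s-device-plugin:1.0.0-beta   # placeholder image/tag
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins # socket directory the kubelet watches
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```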

  19. OpenShift Cluster Topology
     ● How to enable software to take advantage of “special” hardware
     ● Consume the device (see the sketch below)
       ○ KubeFlow template deployment
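
In place of a full KubeFlow template, a simplified stand-in: a batch Job that runs an NGC TensorFlow container and requests one GPU. The image tag and test command are placeholders; an actual deployment would come from the Service Catalog or KubeFlow templates.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tf-gpu-check                        # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: tensorflow
        image: nvcr.io/nvidia/tensorflow:19.03-py3   # NGC image; tag is a placeholder
        command: ["python", "-c", "import tensorflow as tf; print(tf.test.is_gpu_available())"]
        resources:
          limits:
            nvidia.com/gpu: 1               # one GPU from the device plugin
```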

  20. Support Components

  21. Cluster Node Tuning Operator (tuned)
     ● OpenShift node-level tuning operator (example below)
     ● Consolidates/centralizes node-level tuning (openshift-ansible)
     ● Sets tunings for Elastic/Router/SDN
     ● Adds flexibility for custom tuning specified by customers
     ● NVIDIA DGX-1 & DGX-2 Tuned profiles
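
A hedged sketch of custom tuning delivered through the operator: a Tuned custom resource whose profile is applied to nodes matching a label. The API group and field names follow the OpenShift 4.x Node Tuning Operator and may differ by release; the label, sysctl, and isolcpus values are purely illustrative, and real DGX profiles would differ.

```yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: gpu-node-performance                # hypothetical custom profile
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: gpu-node-performance
    data: |
      [main]
      include=openshift-node                # build on the stock node profile
      summary=Illustrative extra tuning for GPU nodes
      [sysctl]
      vm.swappiness=10
      [bootloader]
      cmdline=isolcpus=4-79                 # example CPU isolation; machine-specific
  recommend:
  - match:
    - label: node-role.kubernetes.io/gpu    # assumed label on the GPU node pool
    priority: 20
    profile: gpu-node-performance
```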

  22. Node Feature Discovery Operator (NFD)
     ● Steers workloads based on infrastructure capabilities (example below)
     ● Git repos: upstream and downstream
     ● Client/server model
     ● Customize with “hooks”
     ● Example node labels:
       feature.node.kubernetes.io/cpu-hardware_multithreading=true
       feature.node.kubernetes.io/cpuid-AVX2=true
       feature.node.kubernetes.io/cpuid-SSE4.2=true
       feature.node.kubernetes.io/kernel-selinux.enabled=true
       feature.node.kubernetes.io/kernel-version.full=3.10.0-957.5.1.el7.x86_64
       feature.node.kubernetes.io/pci-0300_10de.present=true
       feature.node.kubernetes.io/storage-nonrotationaldisk=true
       feature.node.kubernetes.io/system-os_release.ID=rhcos
       feature.node.kubernetes.io/system-os_release.VERSION_ID=4.0
       feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
       feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=0
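
A minimal sketch of steering a workload with an NFD label: node affinity requires the AVX2 CPU feature discovered above. The pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: avx2-workload                        # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: feature.node.kubernetes.io/cpuid-AVX2   # label published by NFD
            operator: In
            values: ["true"]
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
```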

  23. NFV Partner Engineering, along with the Network Plumbing Working Group, is using Multus as part of a reference implementation. Multus CNI (https://github.com/intel/multus-cni) is a “meta plugin” for Kubernetes CNI that enables attaching multiple network interfaces to each pod and allows a CNI plugin to be assigned to each interface created in the pod.

  24. THE PROBLEM (Today)
     ● #1: Each pod has only one network interface
     ● #2: Each master/node has only one static CNI configuration
     (Diagram: a pod with a single eth0 interface on the flannel network, defined by one static CNI configuration per Kubernetes master/node.)

  25. THE SOLUTION (Today)
     ● The static CNI configuration points to Multus
     ● Each subsequent CNI plugin called by Multus has its configuration defined in CRD objects (see the sketch below)
     (Diagram: a pod annotation asks for a flannel interface and a macvlan interface; Multus pulls the configurations stored in CRD objects and attaches both, giving the pod eth0 on flannel and net0 on macvlan.)
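
A minimal sketch of the flow on this slide: the pod annotation names the additional network (the macvlan-data NetworkAttachmentDefinition sketched after slide 12); Multus keeps the default cluster network on eth0 and attaches the extra macvlan interface. The pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-homed-pod                      # hypothetical name
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-data  # extra interface from the CRD object
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
```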
