Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
NVIDIA GTC 2019
Jeremy Eder, Andre Beausoleil, Red Hat
Agenda
● Red Hat + NVIDIA Partnership Overview
● Announcements / What’s New
● OpenShift + GPU Integration Details
Where Red Hat Partners with NVIDIA
● GPU-accelerated workloads in the enterprise
  ○ AI/ML and HPC
● Deploy and manage NGC containers
  ○ On-prem or public cloud
● Managing virtualized resources in the data center
  ○ vGPU for technical workstations
● Fast deployment of GPU resources with Red Hat
  ○ Easy-to-use driver framework
Red Hat/NVIDIA Technology Partnership Timeline
[Timeline graphic, May ’17 – Mar ’19. Milestones include: NVIDIA GTC 2017 (OpenShift partner theatre, Red Hat AI/ML strategy sessions); SC2017 (deep learning on OpenShift with GPUs: booth demos, talks, Nouveau driver demo); Rice Oil & Gas HPC Conf (vGPU/RHV); NVIDIA GTC 2018 and OpenShift Commons/Kubernetes WG meeting; joint vGPU & Kubernetes webinar; Red Hat Summit (AI booth, vGPU/RHV); RHV 4.2/vGPU 6.1 and CUDA 9.2 announcements; LSF & MM Summit (Red Hat vGPU roadmap update); STAC-A2 benchmark on NVIDIA/HPE/RHEL (STAC Conf NYC, Red Hat sponsorship); NVIDIA GTC DC; KubeCon (oil & gas use-case sessions, Red Hat sponsorship); RHEL & OpenShift certification on DGX-1; NVIDIA GTC 2019 (RHEL & OpenShift certification on DGX-2 / T4 GPU server configs).]
Red Hat + NVIDIA: What’s New?
● Red Hat Enterprise Linux certification on DGX-1 & DGX-2 systems
  ○ Support for the Kubernetes-based OpenShift Container Platform
  ○ NVIDIA GPU Cloud (NGC) containers run on RHEL and OpenShift
● Red Hat’s OpenShift provides advanced ways of managing hardware to best leverage GPUs in container environments
● NVIDIA developed precompiled driver packages to simplify GPU deployments on Red Hat products
● NVIDIA’s latest T4 GPUs are available on Red Hat Enterprise Linux
  ○ T4 servers with RHEL support from most major OEM server vendors
  ○ T4 servers are “NGC-Ready” to run GPU containers
Red Hat + NVIDIA: Open Source Collaboration
Open Source Projects
● Heterogeneous Memory Management (HMM)
  ○ Memory management between device and CPU
● Nouveau driver
  ○ Graphics device driver for NVIDIA GPUs
● Mediated Devices (mdev)
  ○ Enabling vGPU through the Linux kernel framework
● Kubernetes device plugins
  ○ Fast and direct access to GPU hardware
  ○ Run GPU-enabled containers in a Kubernetes cluster
Red Hat OpenShift Container Platform
OPENSHIFT - CONTAINER PLATFORM FOR AI
Enable Kubernetes clusters to seamlessly run accelerated AI workloads in containers. Red Hat is delivering the required functionality to efficiently run AI/ML workloads on OpenShift:
● 3.10, 3.11
  ○ Device plugins provide access to FPGAs, GPGPUs, SoCs and other specialized hardware for applications running in containers
  ○ CPU Manager provides containers with exclusive access to compute resources, like CPU cores, for better utilization (see the sketch below)
  ○ Huge pages support enables containers with large memory requirements to run more efficiently
● 4.0
  ○ Multi-network feature allows more than one network interface per container for better traffic management
[Diagram: GPU-enabled servers running Red Hat Enterprise Linux and OpenShift Container Platform (OCP); an OCP master providing API/authentication, data store, scheduler, and health/scaling services for containers on OCP nodes]
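To illustrate the CPU Manager feature above, here is a minimal sketch of a pod that would receive exclusive cores on a node where the kubelet’s CPU Manager static policy is enabled; the pod name and image are placeholders. Guaranteed QoS (requests equal to limits) with an integer CPU count is what makes a container eligible for core pinning:

apiVersion: v1
kind: Pod
metadata:
  name: exclusive-cores-demo        # hypothetical name
spec:
  containers:
  - name: worker
    image: registry.example.com/ai/worker:latest   # placeholder image
    resources:
      requests:
        cpu: "4"          # integer CPU count -> eligible for exclusive cores
        memory: "8Gi"
      limits:
        cpu: "4"          # limits must equal requests for Guaranteed QoS
        memory: "8Gi"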
One Platform to...
OpenShift is the single platform to run any application:
● Old or new
● Monolithic/Microservice
[Diagram: surrounding workload labels — NFV, Machine Learning, FSI, HPC, ISVs, Big Data, Animation]
Data Scientist User Experience (Service Catalog)
Upstream First: Kubernetes Working Groups
● Resource Management Working Group
  ○ Features delivered:
    ■ Device Plugins (GPU/Bypass/FPGA)
    ■ CPU Manager (exclusive cores)
    ■ Huge Pages support
  ○ Extensive roadmap
  ○ Intel, IBM, Google, NVIDIA, Red Hat, many more...
Upstream First: Kubernetes Working Groups
● Network Plumbing Working Group
  ○ Formalized Dec 2017
  ○ Implemented a multi-network specification: https://github.com/K8sNetworkPlumbingWG/multi-net-spec (a collection of CRDs for multiple networks, owned by sig-network)
  ○ Reference design implemented in Multus CNI by Red Hat
  ○ Separate control and data planes, overlapping IPs, fast data plane
  ○ IBM, Intel, Red Hat, Huawei, Cisco, Tigera... at least
GPU Cluster Topology
What does an OpenShift (OCP) Cluster look like?
[Diagram: Control Plane — three nodes each running a master and etcd; Infrastructure — a load balancer in front of three nodes each running a registry and router; Compute and GPU Nodes — a pool of GPU-equipped worker nodes]
OpenShift Cluster Topology
● How to enable software to take advantage of “special” hardware
● Create node pools (see the sketch below)
  ○ MachineSets
  ○ Mark them as “special”
  ○ Taints/Tolerations
  ○ Priority/Preemption
  ○ ExtendedResourceToleration
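A minimal sketch of the taint/toleration pattern, with hypothetical node and taint names: taint the GPU nodes so that only workloads which explicitly tolerate the taint land on them, then give GPU pods the matching toleration.

# Taint a GPU node (hypothetical node name):
#   oc adm taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

apiVersion: v1
kind: Pod
metadata:
  name: gpu-tolerating-pod          # hypothetical name
spec:
  tolerations:
  - key: "nvidia.com/gpu"           # matches the node taint above
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:10.1-base   # NGC image; tag is an assumption
    resources:
      limits:
        nvidia.com/gpu: 1

The ExtendedResourceToleration admission plugin automates this pattern: it adds such tolerations to any pod that requests an extended resource like nvidia.com/gpu.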
OpenShift Cluster Topology
● How to enable software to take advantage of “special” hardware
● Tune/Configure the OS
  ○ Tuned profiles
  ○ CPU isolation
  ○ sysctls
OpenShift Cluster Topology
● How to enable software to take advantage of “special” hardware
● Optimize your workload (see the sketch below)
  ○ Dedicate CPU cores
  ○ Consume hugepages
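A minimal sketch of a workload that dedicates CPU cores (Guaranteed QoS with integer CPUs) and consumes pre-allocated 2 MiB huge pages; the names and sizes are illustrative.

apiVersion: v1
kind: Pod
metadata:
  name: hugepages-demo              # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/ai/app:latest    # placeholder image
    resources:
      requests:
        cpu: "2"                    # integer CPUs -> exclusive cores
        memory: "2Gi"
        hugepages-2Mi: "512Mi"      # huge pages must be pre-allocated on the node
      limits:
        cpu: "2"
        memory: "2Gi"
        hugepages-2Mi: "512Mi"
    volumeMounts:
    - name: hugepage
      mountPath: /dev/hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages             # hugetlbfs-backed volume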
OpenShift Cluster Topology
● How to enable software to take advantage of “special” hardware
● Enable the hardware
  ○ Install drivers
  ○ Deploy the device plugin (see the sketch below)
  ○ Deploy monitoring
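A sketch of deploying the NVIDIA device plugin as a DaemonSet, modeled on the upstream nvidia/k8s-device-plugin manifest; the image tag and namespace are assumptions, so check the project’s README for current values. The toleration lets the plugin run on the tainted GPU nodes.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system            # assumed namespace
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu         # matches the GPU node taint
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-device-plugin
        image: nvidia/k8s-device-plugin:1.0.0-beta   # assumed tag
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins  # kubelet plugin socket dir
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins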
OpenShift Cluster Topology
● How to enable software to take advantage of “special” hardware
● Consume the device
  ○ KubeFlow template deployment (a pod-level sketch follows)
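At the pod level, consuming the device comes down to requesting the nvidia.com/gpu extended resource that the device plugin advertises. A minimal sketch with an NGC TensorFlow image; the tag and command are assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu-smoke-test           # hypothetical name
spec:
  restartPolicy: OnFailure
  containers:
  - name: tensorflow
    image: nvcr.io/nvidia/tensorflow:19.03-py3    # NGC container; assumed tag
    command: ["python", "-c",
              "import tensorflow as tf; print(tf.test.is_gpu_available())"]
    resources:
      limits:
        nvidia.com/gpu: 1           # one GPU, scheduled via the device plugin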
Support Components
Cluster Node Tuning Operator (tuned)
● OpenShift node-level tuning operator
● Consolidate/centralize node-level tuning (openshift-ansible)
● Set tunings for Elastic/Router/SDN
● Add more flexibility for custom tuning specified by customers
● NVIDIA DGX-1 & DGX-2 Tuned profiles (a sketch of a custom Tuned resource follows)
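A minimal sketch of a custom profile handed to the operator; the profile name, label match and sysctl value are illustrative, and the exact apiVersion varies by OpenShift release.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: gpu-node-tuning             # hypothetical name
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: gpu-node-tuning
    data: |
      [main]
      summary=Example tuning for GPU nodes
      include=openshift-node
      [sysctl]
      vm.swappiness=10
  recommend:
  - match:
    - label: node-role.kubernetes.io/gpu   # hypothetical node label
    priority: 20
    profile: gpu-node-tuning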
Node Feature Discovery Operator (NFD)
Steer workloads based on infrastructure capabilities
● Git repos: upstream and downstream
● Client/server model
● Customize with “hooks”
Example labels:
feature.node.kubernetes.io/cpu-hardware_multithreading=true
feature.node.kubernetes.io/cpuid-AVX2=true
feature.node.kubernetes.io/cpuid-SSE4.2=true
feature.node.kubernetes.io/kernel-selinux.enabled=true
feature.node.kubernetes.io/kernel-version.full=3.10.0-957.5.1.el7.x86_64
feature.node.kubernetes.io/pci-0300_10de.present=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
feature.node.kubernetes.io/system-os_release.ID=rhcos
feature.node.kubernetes.io/system-os_release.VERSION_ID=4.0
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=0
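These labels are consumed through ordinary node selectors. A minimal sketch that steers a pod onto nodes where NFD found an NVIDIA PCI device (display class 0300, vendor ID 10de), using the pci label shown above; the pod name and image are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: needs-nvidia-gpu            # hypothetical name
spec:
  nodeSelector:
    feature.node.kubernetes.io/pci-0300_10de.present: "true"   # NFD-published label
  containers:
  - name: app
    image: registry.example.com/ai/app:latest   # placeholder image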
NFV Partner Engineering, along with the Network Plumbing Working Group, is using Multus as part of a reference implementation. Multus CNI is a “meta plugin” for Kubernetes CNI that enables attaching multiple network interfaces to each pod, and allows a CNI plugin to be assigned to each interface created in the pod.
https://github.com/intel/multus-cni
THE PROBLEM (Today)
#1 Each pod has only one network interface
#2 Each master/node has only one static CNI configuration
[Diagram: Kubernetes master/node with a single static flannel configuration wired to Pod A’s lone eth0 interface — “so. static.”]
THE SOLUTION (Today)
A static CNI configuration points to Multus; each subsequent CNI plugin, as called by Multus, has its configuration defined in CRD objects.
[Diagram: Pod C’s annotation asks, “I’d like a flannel interface, and a macvlan interface please”; Multus replies, “Sure thing bud, I’ll pull up the configurations stored in CRD objects,” and attaches eth0 (flannel) and net0 (macvlan). A sketch of the objects follows.]
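A minimal sketch of the pattern in the diagram: a NetworkAttachmentDefinition CRD object carrying a macvlan CNI configuration, and a pod annotation asking Multus for that extra interface. The master device, subnet and names are assumptions.

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-net                 # hypothetical name
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "mode": "bridge",
      "ipam": { "type": "host-local", "subnet": "192.168.1.0/24" }
    }'
---
apiVersion: v1
kind: Pod
metadata:
  name: multi-net-pod               # hypothetical name
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-net   # Multus attaches the extra interface
spec:
  containers:
  - name: app
    image: registry.example.com/ai/app:latest  # placeholder image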