Red Hat and the NVIDIA DGX: Tried, Tested, Trusted (NVIDIA GTC 2019)

  1. Red Hat and the NVIDIA DGX: Tried, Tested, Trusted. NVIDIA GTC 2019. Jeremy Eder and Andre Beausoleil, Red Hat

  2. Agenda
     ● Red Hat + NVIDIA Partnership Overview
     ● Announcements / What’s New
     ● OpenShift + GPU Integration Details

  3. Where Red Hat Partners with NVIDIA
     ● GPU-accelerated workloads in the enterprise
       ○ AI/ML and HPC
     ● Deploy and manage NGC containers
       ○ On-prem or public cloud
     ● Managing virtualized resources in the data center
       ○ vGPU for technical workstations
     ● Fast deployment of GPU resources with Red Hat
       ○ Easy-to-use driver framework

  4. Red Hat/NVIDIA Technology Partnership Timeline (May ’17 to Mar ’19)
     ● NVIDIA GTC 2017: OpenShift Partner Theatre & RH AI/ML strategy sessions
     ● SC2017: Red Hat/NVIDIA booth demos and talks, RH sponsorship
     ● Joint webinar: vGPU & Kubernetes; Kubernetes working group meeting
     ● NVIDIA GTC 2018 & OpenShift Commons; 2018 Rice Oil & Gas HPC Conf
     ● Red Hat Summit: AI booth, vGPU/RHV, Oil & Gas use case sessions
     ● RHV 4.2 / vGPU 6.1 & CUDA 9.2 announcement (vGPU on RHV); RH vGPU roadmap update
     ● KubeCon: Deep Learning on OpenShift with GPUs
     ● STAC-A2 benchmark (NVIDIA/HPE/RHEL; STAC Conf NYC, RH & NVIDIA blogs)
     ● NVIDIA GTC DC; LSF & MM Summit; Nouveau driver demo
     ● RHEL & OpenShift certification on DGX-1
     ● NVIDIA GTC 2019: RHEL & OpenShift certification on DGX-2 / T4 GPU server configs; RH sponsorship

  5. Red Hat + NVIDIA: What’s New?
     ● Red Hat Enterprise Linux certification on DGX-1 & DGX-2 systems
       ○ Support for the Kubernetes-based OpenShift Container Platform
       ○ NVIDIA GPU Cloud (NGC) containers run on RHEL and OpenShift
     ● Red Hat’s OpenShift provides advanced ways of managing hardware to best leverage GPUs in container environments
     ● NVIDIA developed precompiled driver packages to simplify GPU deployments on Red Hat products
     ● NVIDIA’s latest T4 GPUs are available on Red Hat Enterprise Linux
       ○ T4 servers with RHEL support from most major OEM server vendors
       ○ T4 servers are “NGC-Ready” to run GPU containers

  6. Red Hat + NVIDIA: Open Source Collaboration
     ● Heterogeneous Memory Management (HMM)
       ○ Memory management between device and CPU
     ● Nouveau driver
       ○ Graphics device driver for NVIDIA GPUs
     ● Mediated Devices (mdev)
       ○ Enabling vGPU through the Linux kernel framework
     ● Kubernetes Device Plugins
       ○ Fast and direct access to GPU hardware
       ○ Run GPU-enabled containers in a Kubernetes cluster

  7. Red Hat OpenShift Container Platform

  8. OPENSHIFT - CONTAINER PLATFORM FOR AI
     Enable Kubernetes clusters to seamlessly run accelerated AI workloads in containers. Red Hat is delivering the required functionality to efficiently run AI/ML workloads on OpenShift.
     ● OpenShift 3.10, 3.11
       ○ Device Plugins provide access to FPGAs, GPGPUs, SoCs and other specialized hardware for applications running in containers (example below)
       ○ CPU Manager provides containers with exclusive access to compute resources, like CPU cores, for better utilization
       ○ Huge Pages support enables containers with large memory requirements to run more efficiently
     ● OpenShift 4.0
       ○ Multi-network feature allows more than one network interface per container for better traffic management
     (Diagram: GPU-enabled servers running Red Hat Enterprise Linux and OpenShift Container Platform (OCP); the OCP master provides API/authentication, data store, scheduler, and health/scaling for containers running on the OCP nodes.)
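
As a rough illustration of the Device Plugins feature above, the sketch below shows a pod that requests a GPU as an extended resource; the scheduler places it on a node where the NVIDIA device plugin advertises nvidia.com/gpu. The pod name and image tag are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.1-base  # placeholder CUDA image
    command: ["nvidia-smi"]       # prints the GPU the kubelet assigned to the container
    resources:
      limits:
        nvidia.com/gpu: 1         # extended resource advertised by the NVIDIA device plugin
```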

  9. One Platform to...
     OpenShift is the single platform to run any application:
     ● Old or new
     ● Monolithic/Microservice
     (Surrounding word cloud: NFV, Machine Learning, FSI, HPC, ISVs, Big Data, Animation.)

  10. Data Scientist User Experience (Service Catalog)

  11. Upstream First: Kubernetes Working Groups
     ● Resource Management Working Group
       ○ Features delivered:
         ■ Device Plugins (GPU/Bypass/FPGA)
         ■ CPU Manager (exclusive cores)
         ■ Huge Pages support
       ○ Extensive roadmap
       ○ Participants: Intel, IBM, Google, NVIDIA, Red Hat, and many more

  12. Upstream First: Kubernetes Working Groups
     ● Network Plumbing Working Group
       ○ Formalized Dec 2017
       ○ Implemented a multi-network specification: https://github.com/K8sNetworkPlumbingWG/multi-net-spec (a collection of CRDs for multiple networks, owned by sig-network); see the sketch below
       ○ Reference design implemented in Multus CNI by Red Hat
       ○ Separate control and data planes, overlapping IPs, fast data plane
       ○ Participants: IBM, Intel, Red Hat, Huawei, Cisco, Tigera... at least
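
A minimal sketch of the multi-network CRD in practice: a NetworkAttachmentDefinition wrapping a macvlan CNI configuration. The object name, master interface, and subnet are assumptions for illustration.

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-data              # hypothetical secondary network
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "mode": "bridge",
      "ipam": { "type": "host-local", "subnet": "192.168.10.0/24" }
    }
```

A pod opts into this network through an annotation; see the example after slide 25.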

  13. GPU Cluster Topology

  14. What does an OpenShift (OCP) Cluster look like?
     (Diagram: a control plane of three nodes, each running a master and etcd; an infrastructure tier with a load balancer plus a registry and router on each infra node; and a pool of compute and GPU nodes.)

  15. OpenShift Cluster Topology
     ● How to enable software to take advantage of “special” hardware
     ● Create node pools (see the sketch below)
       ○ MachineSets
       ○ Mark them as “special”
       ○ Taints/Tolerations
       ○ Priority/Preemption
       ○ ExtendedResourceToleration
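
A minimal sketch of the taints/tolerations idea, assuming the GPU MachineSet stamps its nodes with a NoSchedule taint (key nvidia.com/gpu) and a node-role label; both names are assumptions, not fixed OpenShift defaults.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                        # hypothetical name
spec:
  nodeSelector:
    node-role.kubernetes.io/gpu: ""         # assumed label applied to the GPU node pool
  tolerations:
  - key: "nvidia.com/gpu"                   # assumed taint carried by the GPU nodes
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: app
    image: registry.example.com/app:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```

Pods without the toleration stay off the tainted GPU nodes, so the “special” hardware remains free for workloads that actually need it.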

  16. OpenShift Cluster Topology
     ● How to enable software to take advantage of “special” hardware
     ● Tune/configure the OS
       ○ Tuned profiles
       ○ CPU isolation
       ○ sysctls

  17. OpenShift Cluster Topology
     ● How to enable software to take advantage of “special” hardware
     ● Optimize your workload (see the sketch below)
       ○ Dedicate CPU cores
       ○ Consume hugepages
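
A sketch of a workload that pins whole CPU cores and consumes hugepages. Requests equal limits and the CPU count is an integer, so the pod lands in the Guaranteed QoS class and the static CPU Manager assigns it exclusive cores; all sizes are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-hugepages-app                # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest  # placeholder image
    resources:
      requests:
        cpu: "4"                            # whole cores; requests == limits => Guaranteed QoS
        memory: 8Gi
        hugepages-2Mi: 1Gi
      limits:
        cpu: "4"
        memory: 8Gi
        hugepages-2Mi: 1Gi
    volumeMounts:
    - name: hugepages
      mountPath: /dev/hugepages             # where the app maps its huge pages
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages
```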

  18. OpenShift Cluster Topology
     ● How to enable software to take advantage of “special” hardware
     ● Enable the hardware (see the sketch below)
       ○ Install drivers
       ○ Deploy the device plugin
       ○ Deploy monitoring
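
A hedged sketch of deploying the device plugin as a DaemonSet, steered onto GPU nodes with the NFD PCI label shown on slide 22. The object name, image tag, and taint key are placeholders, not the exact manifest NVIDIA ships.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin                # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      nodeSelector:
        feature.node.kubernetes.io/pci-0300_10de.present: "true"  # NFD label: NVIDIA PCI device found
      tolerations:
      - key: "nvidia.com/gpu"               # assumed taint on the GPU node pool
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: device-plugin
        image: nvidia/k8s-device-plugin:1.0.0-beta   # placeholder image/tag
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins # socket directory the kubelet watches
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```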

  19. OpenShift Cluster Topology
     ● How to enable software to take advantage of “special” hardware
     ● Consume the device (see the sketch below)
       ○ KubeFlow template deployment
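
In place of a full KubeFlow template, a simplified stand-in: a batch Job that runs an NGC TensorFlow container and requests one GPU. The image tag and test command are placeholders; an actual deployment would come from the Service Catalog or KubeFlow templates.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tf-gpu-check                        # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: tensorflow
        image: nvcr.io/nvidia/tensorflow:19.03-py3   # NGC image; tag is a placeholder
        command: ["python", "-c", "import tensorflow as tf; print(tf.test.is_gpu_available())"]
        resources:
          limits:
            nvidia.com/gpu: 1               # one GPU from the device plugin
```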

  20. Support Components

  21. Cluster Node Tuning Operator (tuned)
     ● OpenShift node-level tuning operator (example below)
     ● Consolidates/centralizes node-level tuning (openshift-ansible)
     ● Sets tunings for Elastic/Router/SDN
     ● Adds flexibility for custom tuning specified by customers
     ● NVIDIA DGX-1 & DGX-2 Tuned profiles
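
A hedged sketch of custom tuning delivered through the operator: a Tuned custom resource whose profile is applied to nodes matching a label. The API group and field names follow the OpenShift 4.x Node Tuning Operator and may differ by release; the label, sysctl, and isolcpus values are purely illustrative, and real DGX profiles would differ.

```yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: gpu-node-performance                # hypothetical custom profile
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: gpu-node-performance
    data: |
      [main]
      include=openshift-node                # build on the stock node profile
      summary=Illustrative extra tuning for GPU nodes
      [sysctl]
      vm.swappiness=10
      [bootloader]
      cmdline=isolcpus=4-79                 # example CPU isolation; machine-specific
  recommend:
  - match:
    - label: node-role.kubernetes.io/gpu    # assumed label on the GPU node pool
    priority: 20
    profile: gpu-node-performance
```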

  22. Node Feature Discovery Operator (NFD)
     ● Steers workloads based on infrastructure capabilities (example below)
     ● Git repos: upstream and downstream
     ● Client/server model
     ● Customize with “hooks”
     ● Example node labels:
       feature.node.kubernetes.io/cpu-hardware_multithreading=true
       feature.node.kubernetes.io/cpuid-AVX2=true
       feature.node.kubernetes.io/cpuid-SSE4.2=true
       feature.node.kubernetes.io/kernel-selinux.enabled=true
       feature.node.kubernetes.io/kernel-version.full=3.10.0-957.5.1.el7.x86_64
       feature.node.kubernetes.io/pci-0300_10de.present=true
       feature.node.kubernetes.io/storage-nonrotationaldisk=true
       feature.node.kubernetes.io/system-os_release.ID=rhcos
       feature.node.kubernetes.io/system-os_release.VERSION_ID=4.0
       feature.node.kubernetes.io/system-os_release.VERSION_ID.major=4
       feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=0
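
A minimal sketch of steering a workload with an NFD label: node affinity requires the AVX2 CPU feature discovered above. The pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: avx2-workload                        # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: feature.node.kubernetes.io/cpuid-AVX2   # label published by NFD
            operator: In
            values: ["true"]
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
```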

  23. NFV Partner Engineering, along with the Network Plumbing Working Group, is using Multus as part of a reference implementation. Multus CNI (https://github.com/intel/multus-cni) is a “meta plugin” for Kubernetes CNI that enables attaching multiple network interfaces to each pod and allows a CNI plugin to be assigned to each interface created in the pod.

  24. THE PROBLEM (Today)
     ● #1: Each pod has only one network interface
     ● #2: Each master/node has only one static CNI configuration
     (Diagram: a pod with a single eth0 interface on the flannel network, defined by one static CNI configuration per Kubernetes master/node.)

  25. THE SOLUTION (Today)
     ● The static CNI configuration points to Multus
     ● Each subsequent CNI plugin called by Multus has its configuration defined in CRD objects (see the sketch below)
     (Diagram: a pod annotation asks for a flannel interface and a macvlan interface; Multus pulls the configurations stored in CRD objects and attaches both, giving the pod eth0 on flannel and net0 on macvlan.)
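
A minimal sketch of the flow on this slide: the pod annotation names the additional network (the macvlan-data NetworkAttachmentDefinition sketched after slide 12); Multus keeps the default cluster network on eth0 and attaches the extra macvlan interface. The pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-homed-pod                      # hypothetical name
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-data  # extra interface from the CRD object
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
```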
