DATA CENTER GPU MANAGER
Brent Stolle and David Beer
March 2018
TOOLS FOR MANAGING GPUs

Out-of-Band
▶ GPU metrics and monitoring via BMC (SMBPBI)
▶ Provide metrics (thermals, power, etc.) without the NVIDIA driver
▶ Typically used at public CSPs (i.e. multi-tenant environments)

In-Band
▶ Tools use the NVIDIA driver to provide GPU and NVSwitch metrics
▶ DCGM, NVML (smi) are in-band tools
▶ Typically used at single-tenant environments
NVIDIA IN-BAND TOOLS ECOSYSTEM

3rd Party Tools ▶ Cluster managers, job schedulers, TSDBs, visualization tools
DCGM ▶ Customers integrating DCGM; CSPs for system validation
NVML ▶ Customers building their own GPU metrics/monitoring stack using NVML
HOW SHOULD I MANAGE MY GPUS?

NVML
▶ Stateless queries. Can only query current data
▶ Low overhead while running, high overhead to develop
▶ Low-level control of GPUs
▶ Management app must run on same box as GPUs

DCGM
▶ Can query a few hours of metrics
▶ Provides health checks and diagnostics
▶ Can batch queries/operations to groups of GPUs
▶ Can be remote or local

3RD PARTY TOOLS
▶ Provide database, graphs, and a nice UI
▶ Need management node(s)
▶ Development already done. You just have to configure the tools.
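To make the "stateless queries" point about NVML concrete, here is a minimal sketch using the `pynvml` Python bindings. It assumes `pynvml` is installed and that the code runs on the same machine as the GPUs; the `snapshot` helper is a hypothetical name introduced for illustration. Each call returns only the current readings, so any history has to be kept by the caller.

```python
def snapshot(nvml):
    """Return one point-in-time reading per GPU; nothing is kept between calls.

    `nvml` is expected to be the pynvml module (or an object with the same
    functions), passed in so the function can be exercised without a GPU.
    """
    nvml.nvmlInit()
    try:
        readings = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            readings.append({
                "gpu": index,
                # Temperature in degrees Celsius from the GPU sensor.
                "temp_c": nvml.nvmlDeviceGetTemperature(
                    handle, nvml.NVML_TEMPERATURE_GPU),
                # NVML reports power in milliwatts; convert to watts.
                "power_w": nvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,
            })
        return readings
    finally:
        nvml.nvmlShutdown()


if __name__ == "__main__":
    try:
        import pynvml
    except ImportError:
        print("pynvml is not installed; skipping the live query")
    else:
        for reading in snapshot(pynvml):
            print(reading)
```

Building trends, alerts, or job-level accounting on top of this means writing the sampling loop and storage yourself, which is the development overhead the slide refers to.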
DATA CENTER GPU MANAGER (DCGM)

GPU DIAGNOSTICS
▶ Software Deployment Tests
▶ Stress Tests
▶ Hardware Issues and Interface Tests (PCIe, NVLink)

ACTIVE HEALTH MONITORING
▶ Runtime Health Checks
▶ Prologue Checks
▶ Epilogue Checks

POLICY AND ALERTING
▶ Pre-configured Policies
▶ Job Level Statistics
▶ Stateful Configuration

CONFIGURATION MANAGEMENT
▶ Dynamic Power Capping
▶ Synchronous Clock Boost
▶ Fixed Clocks
DCGM OVERVIEW
GPU Management in the Accelerated Data Center

Supported NVIDIA Hardware
● Fully supported on Tesla GPUs (Kepler+)
● Supported on Quadro, GeForce, and Titan GPUs (Maxwell+)
● Supports NVSwitch and DGX-2
● Driver R384 or later (Linux only)

SDK Installer Packages
● .deb and .rpm packages
● Includes binaries: CLI (dcgmi) and daemon (nv-hostengine)
● Libraries and headers (includes NVML)
● C and Python bindings and code samples
● Documentation: user guides and API docs

https://developer.nvidia.com/data-center-gpu-manager-dcgm
Latest Release: v1.3.3 (Jan 2018)
AVAILABLE NVIDIA MANAGEMENT TOOLS
Software Stack

Data Center GPU Manager (DCGM)
▶ Additional diagnostics (aka NVVS) and active health monitoring
▶ Policy management and more

NVIDIA Management Library (NVML)
▶ Low level control of GPUs
▶ Included as part of driver
▶ Header is part of CUDA Toolkit / DCGM

[Stack diagram, top to bottom: DCGM-based 3rd party tools and DCGMI, each on the client library; DCGM daemon with GPU Diagnostics (NVVS); CUDA and NVML; NVIDIA Driver]
ACTIVE HEALTH MONITORING & ANALYSIS

NON INVASIVE CHECKS
▶ Real-time monitoring & aggregated health indicator
▶ Checks health of all GPUs and NVSwitch subsystems
  • PCIe, ECC, Inforom, Power, Thermal, NVLink

Run Health Check: Healthy System

    dcgmi health --check -g 1

    Health Monitor Report
    +------------------+---------------------------------------------------------+
    | Overall Health: Healthy                                                    |
    +==================+=========================================================+

Run Health Check: System with problems

    dcgmi health -g 1 -c

    Health Monitor Report
    +----------------------------------------------------------------------------+
    | Group 1          | Overall Health: Warning                                 |
    +==================+=========================================================+
    | GPU ID: 0        | Warning                                                 |
    |                  | PCIe system: Warning - Detected more than 8 PCIe        |
    |                  | replays per minute for GPU 0: 13                        |
    +------------------+---------------------------------------------------------+
    | GPU ID: 1        | Warning                                                 |
    |                  | InfoROM system: Warning - A corrupt InfoROM has been    |
    |                  | detected in GPU 1.                                      |
    +------------------+---------------------------------------------------------+
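A scheduler prologue or epilogue script usually needs the health result in machine-readable form. The sketch below parses the text report shown on this slide; `parse_health_report` and `SAMPLE_REPORT` are hypothetical names, and the parser assumes the report keeps the two-column layout shown here, which may differ between DCGM versions.

```python
import re

# Report text copied from the "System with problems" example on this slide.
SAMPLE_REPORT = """\
Health Monitor Report
+----------------------------------------------------------------------------+
| Group 1          | Overall Health: Warning                                 |
+==================+=========================================================+
| GPU ID: 0        | Warning                                                 |
|                  | PCIe system: Warning - Detected more than 8 PCIe        |
|                  | replays per minute for GPU 0: 13                        |
+------------------+---------------------------------------------------------+
| GPU ID: 1        | Warning                                                 |
|                  | InfoROM system: Warning - A corrupt InfoROM has been    |
|                  | detected in GPU 1.                                      |
+------------------+---------------------------------------------------------+
"""


def parse_health_report(text):
    """Return (overall_health, {gpu_id: {"status": ..., "details": ...}})."""
    overall = None
    gpus = {}
    current = None
    for line in text.splitlines():
        if not line.startswith("|"):
            continue  # skip the title and the +---+ / +===+ border lines
        match = re.search(r"Overall Health:\s*(\w+)", line)
        if match:
            overall = match.group(1)
            continue
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue
        left, right = cells
        match = re.match(r"GPU ID:\s*(\d+)", left)
        if match:
            current = int(match.group(1))
            gpus[current] = {"status": right, "details": []}
        elif current is not None and right:
            # Continuation row: the message wraps across several lines.
            gpus[current]["details"].append(right)
    for entry in gpus.values():
        entry["details"] = " ".join(entry["details"])
    return overall, gpus


if __name__ == "__main__":
    overall, gpus = parse_health_report(SAMPLE_REPORT)
    print("overall:", overall)
    for gpu_id, entry in sorted(gpus.items()):
        print(gpu_id, entry["status"], "-", entry["details"])
```

In practice the text would come from running `dcgmi health --check -g 1`, for example through `subprocess.run(..., capture_output=True, text=True)`, rather than from a constant.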
Demo: Health Checks
GPU DIAGNOSTICS (NVVS) – COVERAGE AREAS

DEPLOYMENT AND SOFTWARE ISSUES
▶ NVML library access and versioning
▶ CUDA library access and versioning
▶ Software conflicts

HARDWARE ISSUES AND DIAGNOSTICS
▶ PCIe and NVLink interface checks
▶ Framebuffer and memory checks
▶ Compute engine checks

INTEGRATION ISSUES
▶ PCIe and NVLink replay counter checks
▶ Topological limitations
▶ Permissions, driver and cgroups checks
▶ Basic power and thermal constraint checks

STRESS CHECKS
▶ Power and thermal stress
▶ Throughput stress
▶ Constant relative system performance
▶ Maximum relative system performance
COMPREHENSIVE DIAGNOSTICS

ACTIVE HEALTH CHECKS
▶ Identification, recovery & isolation of failed GPUs and NVSwitches.
▶ Diagnostics to root cause failures, pre & post job GPU health checks
▶ System sanity to stress performance, bandwidth, power and thermal characteristics
▶ Multi-level diagnostic options from few seconds to minutes

    dcgmi diag -r 3

    +---------------------------+-------------+
    | Diagnostic                | Result      |
    +===========================+=============+
    |-----  Deployment  --------+-------------|
    | Blacklist                 | Pass        |
    | NVML Library              | Pass        |
    | CUDA Main Library         | Pass        |
    | CUDA Toolkit Library      | Pass        |
    | Permissions and OS Blocks | Pass        |
    | Persistence Mode          | Pass        |
    | Environment Variables     | Pass        |
    | Page Retirement           | Pass        |
    | Graphics Processes        | Pass        |
    | Inforom                   | Pass        |
    +-----  Hardware  ----------+-------------+
    | GPU Memory                | Pass - All  |
    | Diagnostic                | Pass - All  |
    +-----  Integration  -------+-------------+
    | PCIe                      | Pass - All  |
    +-----  Stress  ------------+-------------+
    | SM Stress                 | Pass - All  |
    | Targeted Stress           | Pass - All  |
    | Targeted Power            | Warn - All  |
    | Memory Bandwidth          | Pass - All  |
    +---------------------------+-------------+
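For pre and post job checks it helps to reduce the diagnostic table to the entries that did not pass. The following is a rough sketch that parses the table layout shown on this slide; `parse_diag_report`, `not_passing`, and `SAMPLE_DIAG` are hypothetical names, the sample is an abbreviated copy of the slide output, and the layout is assumed rather than guaranteed across DCGM versions.

```python
import re

# Abbreviated copy of the `dcgmi diag -r 3` output from this slide.
SAMPLE_DIAG = """\
+---------------------------+-------------+
| Diagnostic                | Result      |
+===========================+=============+
|-----  Deployment  --------+-------------|
| Blacklist                 | Pass        |
| NVML Library              | Pass        |
+-----  Hardware  ----------+-------------+
| GPU Memory                | Pass - All  |
| Diagnostic                | Pass - All  |
+-----  Integration  -------+-------------+
| PCIe                      | Pass - All  |
+-----  Stress  ------------+-------------+
| SM Stress                 | Pass - All  |
| Targeted Power            | Warn - All  |
+---------------------------+-------------+
"""

# Section separators look like "+-----  Hardware  ------+-----+".
SECTION_RE = re.compile(r"^[|+]-+\s+(\w+)\s+-+\+")


def parse_diag_report(text):
    """Return a list of (section, test_name, result) tuples."""
    section = None
    rows = []
    for line in text.splitlines():
        match = SECTION_RE.match(line)
        if match:
            section = match.group(1)
            continue
        if not line.startswith("|"):
            continue  # plain border lines
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 2 or cells[1] == "Result":
            continue  # skip the column header row
        rows.append((section, cells[0], cells[1]))
    return rows


def not_passing(rows):
    """Keep only the rows whose result does not start with 'Pass'."""
    return [row for row in rows if not row[2].startswith("Pass")]


if __name__ == "__main__":
    for section, name, result in not_passing(parse_diag_report(SAMPLE_DIAG)):
        print(f"{section}: {name} -> {result}")
```

A job prologue could refuse to start work on a node when `not_passing` returns anything, or only when a row reports a failure rather than a warning, depending on site policy.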
FLEXIBLE GPU GOVERNANCE POLICIES

With Existing Tools
▶ Continuous monitoring by the user
▶ Identify GPUs with double bit errors
▶ Manually perform GPU reset to correct problems

Using DCGM
Condition → Notification → Action
▶ Condition: Watch for DBE
▶ Action: Page retirement
▶ Notification: Callback
▶ Auto-detects double bit errors, performs page retirement, and notifies the user
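The condition, action, and notification pattern on this slide can be modeled in a few lines. The sketch below is a toy model in plain Python and does not call the DCGM API: `Policy`, `evaluate`, and the `dbe_count` counter are hypothetical names, and the "page retirement" action is just a string standing in for what DCGM would do.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Policy:
    # Condition: decides from a GPU's counters whether the policy fires.
    condition: Callable[[Dict[str, int]], bool]
    # Action: what to do to the GPU; returns a short description of the outcome.
    action: Callable[[int], str]
    # Notification: callback that tells the user what happened.
    notify: Callable[[int, str], None]


def evaluate(policy: Policy, samples: Dict[int, Dict[str, int]]) -> List[int]:
    """Apply the policy to every GPU sample; return the GPU ids that triggered."""
    triggered = []
    for gpu_id, counters in sorted(samples.items()):
        if policy.condition(counters):
            outcome = policy.action(gpu_id)
            policy.notify(gpu_id, outcome)
            triggered.append(gpu_id)
    return triggered


if __name__ == "__main__":
    dbe_policy = Policy(
        condition=lambda counters: counters.get("dbe_count", 0) > 0,
        action=lambda gpu_id: "page retirement requested",  # placeholder only
        notify=lambda gpu_id, outcome: print(f"GPU {gpu_id}: {outcome}"),
    )
    # Hypothetical counter values: GPU 1 has seen one double bit error.
    evaluate(dbe_policy, {0: {"dbe_count": 0}, 1: {"dbe_count": 1}})
```

With existing tools the user writes and runs this loop; the point of the slide is that DCGM watches the condition itself and invokes a registered callback.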
Demo: Policy Alerting
MANAGING JOB LIFECYCLE

Create GPU group and check health → Start Job Stats → Run Job → Stop Job Stats → Display Job Stats

▶ Which GPUs did my job run on?
▶ How much of the GPUs did my job use?
▶ Any error or warning conditions during my job (ECC errors, clock throttling, etc.)?
▶ Are the GPUs healthy and ready for the next job?
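The lifecycle above maps naturally onto a wrapper around the job command. This is a sketch under several assumptions: the GPU group already exists and has health watches enabled, the `dcgmi stats` flags (`-e`, `-s`, `-x`, `-v -j`) are as I recall them from the DCGM user guide and should be confirmed with `dcgmi stats --help`, and `lifecycle_commands` and `run_job_with_stats` are hypothetical helper names.

```python
import subprocess


def lifecycle_commands(group_id, job_id):
    """Build the dcgmi command lines for one job; nothing is executed here."""
    group = str(group_id)
    return {
        "health": ["dcgmi", "health", "--check", "-g", group],
        "enable": ["dcgmi", "stats", "-g", group, "-e"],        # assumed flag
        "start": ["dcgmi", "stats", "-g", group, "-s", job_id],  # assumed flag
        "stop": ["dcgmi", "stats", "-x", job_id],                # assumed flag
        "show": ["dcgmi", "stats", "-v", "-j", job_id],          # assumed flags
    }


def run_job_with_stats(group_id, job_id, job_argv, runner=subprocess.run):
    """Check health, record job stats around the job, then check health again.

    `runner` defaults to subprocess.run and can be replaced for dry runs.
    """
    cmds = lifecycle_commands(group_id, job_id)
    runner(cmds["health"], check=True)   # are the GPUs ready?
    runner(cmds["enable"], check=True)   # turn on stats watches for the group
    runner(cmds["start"], check=True)    # start recording for this job id
    try:
        runner(job_argv, check=True)     # the actual workload
    finally:
        runner(cmds["stop"], check=True)  # always stop recording
    runner(cmds["show"], check=True)     # display the job statistics
    runner(cmds["health"], check=True)   # healthy and ready for the next job?


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    run_job_with_stats(1, "demo-job", ["./my_job"],
                       runner=lambda argv, check: print(" ".join(argv)))
```

Passing a printing `runner`, as in the dry run, is a cheap way to review the exact command sequence before wiring it into a scheduler's prologue and epilogue hooks.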