April 4-7, 2016 | Silicon Valley

DATA CENTER GPU MANAGER (DCGM)
Brent Stolle and Rajat Phull, 4/5/2016
DATA CENTER INFRASTRUCTURE CHALLENGES

• Resource Availability & Uptime
• Under-utilized Resources & Efficiency
• Administrative Overhead
DATA CENTER GPU MANAGER

Existing Tools (all GPUs supported)
• Device Identification
• Configuration & Monitoring
• Clock Management

DCGM (Tesla GPUs only)
• Policy & Configuration Management: per-GPU configuration, lower admin overhead
• Active Diagnostics and Health Checks: increases reliability
• Enhanced Clock & Power Management: increases efficiency
• Stateful: group operations, maintains historical info, ease of use
NVIDIA DATA CENTER GPU MANAGER (DCGM)
Comprehensive GPU Management for the Accelerated Data Center

Capabilities
• Health Monitoring
• Active Diagnostics
• Policy Governance
• Power & Clock Mgmt.

Benefits
• Maximize GPU Reliability & Uptime
• Streamline GPU Administration & TCO
• Boost Performance & Resource Efficiency
Maximize GPU Reliability & Availability

• Active Health Monitoring & Analysis
• Comprehensive Diagnostics
Maximize GPU Reliability & Availability
Active Health Monitoring & Analysis

NON-INVASIVE: performed during job execution. Reports overall health for the
GPU subsystems (PCIe, SM, MCU, PMU, InfoROM, power and thermal system).

Create Group

dcgmi group --create all_gpus_grp --default
Successfully created group "all_gpus_grp" group id: 1

Set Watches

dcgmi health -g 1 --set pmi
Health monitor systems set successfully

Get Watches

dcgmi health -g 1 -f
+----------------------+
| Group Health Watches |
+=========+============+
| PCIe    | On         |
| NVLINK  | Off        |
| PMU     | Off        |
| MCU     | Off        |
| Memory  | On         |
| SM      | Off        |
| InfoROM | On         |
| Thermal | Off        |
| Power   | Off        |
| Driver  | Off        |
+---------+------------+
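As a sketch, the commands above could be wrapped in a one-time node bring-up
script. Everything beyond the three dcgmi invocations shown on this slide is
an assumption: in particular, the script assumes the new group receives id 1,
as in the sample output, and skips itself when dcgmi is not installed.

```shell
#!/bin/sh
# Hypothetical node bring-up script wrapping the commands above.
# Assumption: the new group is assigned id 1, as in the sample output;
# a production script would parse the id from the create command's output.

setup_watches() {
  dcgmi group --create all_gpus_grp --default || return 1
  # "pmi" selects the PCIe, memory, and InfoROM watches: the three
  # subsystems reported "On" in the watch table above
  dcgmi health -g 1 --set pmi || return 1
  dcgmi health -g 1 -f        # confirm which watches are enabled
}

if command -v dcgmi >/dev/null 2>&1; then
  setup_watches || echo "watch setup failed" >&2
else
  status="skipped: dcgmi not installed"
  echo "$status"
fi
```

A scheduler's node-init hook would be a natural place to run this once per boot.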
Maximize GPU Reliability & Availability
Active Health Monitoring & Analysis

NON-INVASIVE: performed during job execution. Reports overall health for the
GPU subsystems (PCIe, SM, MCU, PMU, InfoROM, power and thermal system).

Run Health Check: Healthy System

dcgmi health --check -g 1
Health Monitor Report
+----------------------------------------------------------------------------+
| Overall Health: Healthy                                                    |
+==================+=========================================================+

Run Health Check: System with Problems

dcgmi health --check -g 1
Health Monitor Report
+------------------+---------------------------------------------------------+
| Group 1          | Overall Health: Warning                                 |
+==================+=========================================================+
| GPU ID: 0        | Warning                                                 |
|                  | PCIe system: Warning - Detected more than 8 PCIe        |
|                  | replays per minute for GPU 0: 13                        |
+------------------+---------------------------------------------------------+
| GPU ID: 1        | Warning                                                 |
|                  | InfoROM system: Warning - A corrupt InfoROM has been    |
|                  | detected in GPU 1.                                      |
+------------------+---------------------------------------------------------+
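A job epilogue could gate on this report. The sketch below is an assumption
about workflow, not a DCGM feature: it parses the "Overall Health" field from
the report text (field name taken from the sample output above) and fails the
node when it is not Healthy. The demo runs against a canned report line so it
works without a GPU.

```shell
#!/bin/sh
# Sketch of an epilogue-style gate on the health report.
# check_health takes report text and succeeds only if the
# "Overall Health" field reads "Healthy".

check_health() {
  overall=$(printf '%s\n' "$1" |
    sed -n 's/.*Overall Health: *\([A-Za-z]*\).*/\1/p' | head -n 1)
  [ "$overall" = "Healthy" ]
}

# Live usage would be: report=$(dcgmi health --check -g 1)
# Demo against a line from the sample report above:
report='| Group 1          | Overall Health: Warning                        |'
if check_health "$report"; then
  echo "node healthy"
else
  echo "node flagged: $overall"   # prints: node flagged: Warning
fi
```

An epilogue like this lets the scheduler drain a node before the next job
lands on a GPU that is throwing PCIe replays or has a corrupt InfoROM.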
Maximize GPU Reliability & Availability
Comprehensive Diagnostics

Quick Diagnostics (~secs)

INVASIVE: performed at job epilogue/prologue, or when a job fails. Validates
device sub-components, interconnect bandwidth, memory/ECC state, and
deployment software integrity. Several diagnostic levels are available:
levels 1-3, selected with -r.

dcgmi diag -g 1 -r 1
+---------------------------+-------------+
| Diagnostic                | Result      |
+===========================+=============+
+-----  Deployment  --------+-------------+
| Blacklist                 | Pass        |
| NVML Library              | Pass        |
| CUDA Main Library         | Pass        |
| CUDA Toolkit Libraries    | Pass        |
| Permissions and OS Blocks | Pass        |
| Persistence Mode          | Pass        |
| Environment Variables     | Pass        |
| Page Retirement           | Pass        |
| Graphics Processes        | Pass        |
+---------------------------+-------------+
Maximize GPU Reliability & Availability
Comprehensive Diagnostics

Extended Diagnostics (~mins)

INVASIVE: performed at job epilogue/prologue, or when a job fails. Validates
device sub-components, interconnect bandwidth, memory/ECC state, and
deployment software integrity. Several diagnostic levels are available:
levels 1-3, selected with -r.

dcgmi diag -g 1 -r 2
+---------------------------+-------------+
| Diagnostic                | Result      |
+===========================+=============+
+-----  Deployment  --------+-------------+
| Blacklist                 | Pass        |
| NVML Library              | Pass        |
| CUDA Main Library         | Pass        |
| CUDA Toolkit Libraries    | Pass        |
| Permissions and OS Blocks | Pass        |
| Persistence Mode          | Pass        |
| Environment Variables     | Pass        |
| Page Retirement           | Pass        |
| Graphics Processes        | Pass        |
+-----  Performance  -------+-------------+
| SM Performance            | Pass - All  |
| Targeted Performance      | Pass - All  |
| Targeted Power            | Warn - All  |
+---------------------------+-------------+
Maximize GPU Reliability & Availability
Comprehensive Diagnostics

Hardware Diagnostics

INVASIVE: performed at job epilogue/prologue, or when a job fails. Validates
device sub-components, interconnect bandwidth, memory/ECC state, and
deployment software integrity. Several diagnostic levels are available:
levels 1-3, selected with -r.

dcgmi diag -r 3
+---------------------------+-------------+
| Diagnostic                | Result      |
+===========================+=============+
+-----  Deployment  --------+-------------+
| Blacklist                 | Pass        |
| NVML Library              | Pass        |
| CUDA Main Library         | Pass        |
| CUDA Toolkit Libraries    | Pass        |
| Permissions and OS Blocks | Pass        |
| Persistence Mode          | Pass        |
| Environment Variables     | Pass        |
| Page Retirement           | Pass        |
| Graphics Processes        | Pass        |
+-----  Hardware  ----------+-------------+
| GPU Memory                | Pass - All  |
| Diagnostic                | Pass - All  |
+-----  Integration  -------+-------------+
| PCIe                      | Pass - All  |
+-----  Performance  -------+-------------+
| SM Performance            | Pass - All  |
| Targeted Performance      | Pass - All  |
| Targeted Power            | Warn - All  |
+---------------------------+-------------+
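The three diagnostic levels could be wired into a scheduler prologue hook
along these lines. Only the dcgmi diag invocation comes from these slides;
the DIAG_LEVEL knob, the exit codes, and the drain-on-failure idea are
illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical prologue hook choosing a diagnostic depth per queue:
#   1: quick deployment checks (~secs)
#   2: adds SM/targeted performance tests (~mins)
#   3: adds GPU memory, hardware, and PCIe integration tests
DIAG_LEVEL=${DIAG_LEVEL:-1}

case "$DIAG_LEVEL" in
  1|2|3) : ;;
  *) echo "invalid diagnostic level: $DIAG_LEVEL" >&2; exit 2 ;;
esac

if command -v dcgmi >/dev/null 2>&1; then
  if ! dcgmi diag -g 1 -r "$DIAG_LEVEL"; then
    echo "diagnostics failed at level $DIAG_LEVEL" >&2
    exit 1   # a real hook would also drain the node here
  fi
else
  echo "dcgmi not installed; skipping level $DIAG_LEVEL diagnostics"
fi
```

A short queue might run level 1 before every job, while a weekly maintenance
window runs level 3, matching the ~secs vs. ~mins costs noted on the slides.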
Streamline GPU Administration & TCO

• Flexible GPU Governance Policies
• Manage GPU Group Configuration
• Job Statistics
Streamline GPU Administration & TCO
Flexible GPU Governance Policies

Example: handling a GPU double-bit ECC error (DBE).

With Existing Tools
• Continuous monitoring by the user
• Identify GPUs with double-bit errors
• Perform a GPU reset

Using DCGM
• Condition: watch for DBEs
• Action: page retirement
• Notification: callback
DCGM auto-detects double-bit errors, performs page retirement, and notifies
the user to correct problems.
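DCGM performs the condition/action/notification cycle natively, with a
callback for notification. For comparison, the "existing tools" side amounts
to hand-rolling a loop like the sketch below, built only from the health
command shown earlier; the notify hook and the polling structure are
assumptions for illustration, not DCGM features.

```shell
#!/bin/sh
# Hand-rolled condition/action/notification sketch: roughly what DCGM's
# policy engine replaces. notify is a stand-in for a real mail/pager hook.

notify() { echo "ALERT: $1"; }

poll_once() {
  # $1: report text; live usage: poll_once "$(dcgmi health --check -g 1)"
  case "$1" in
    *"Overall Health: Healthy"*) return 0 ;;            # condition holds
    *) notify "GPU group needs attention"; return 1 ;;  # otherwise notify
  esac
}

poll_once '| Overall Health: Healthy |' && echo "no action needed"
```

The manual loop still leaves the reset/retirement action to the operator,
which is exactly the administrative overhead the policy engine removes.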
Streamline GPU Administration & TCO
Manage GPU Group Configuration

MAINTAINS CONFIGURATION
• Initialization: configure all GPUs (global group)
• Per-job basis: individual partitioned group settings
• Maintains settings across driver restarts, GPU resets, or at job start
• Supports SET, GET, and ENFORCE; DCGM maintains the target configuration
  across resets

Get the configuration for the group of GPUs:

dcgmi config -g 1 --get
+--------------------------+------------------------+------------------------+
| all_gpu_group            |                        |                        |
| Group of 2 GPUs          | TARGET CONFIGURATION   | CURRENT CONFIGURATION  |
+==========================+========================+========================+
| Sync Boost               | Disabled               | Disabled               |
| SM Application Clock     | 705                    | 705                    |
| Memory Application Clock | 2600                   | 2600                   |
| ECC Mode                 | Enabled                | Enabled                |
| Power Limit              | 225                    | 225                    |
| Compute Mode             | E. Process             | E. Process             |
+--------------------------+------------------------+------------------------+
(Clocks in MHz; power limit in watts.)
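Since the slide names an ENFORCE operation alongside SET and GET, a job-start
hook could re-assert the target configuration roughly like this sketch. The
hook placement is an assumption, and group id 1 is carried over from the
earlier "all_gpus_grp" example.

```shell
#!/bin/sh
# Sketch of a job-start hook re-asserting the group's target config.
# Assumption: group id 1, as created on the earlier slide.

if command -v dcgmi >/dev/null 2>&1; then
  dcgmi config -g 1 --get     || echo "config get failed" >&2
  # ENFORCE pushes the TARGET column back onto the devices, so a job
  # always starts from the clocks/ECC/power limits shown above.
  dcgmi config -g 1 --enforce || echo "config enforce failed" >&2
else
  status="skipped: dcgmi not installed"
  echo "$status"
fi
```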