ZEN AND THE ART OF VGPU SELECTION
Jeremy Main - Lead Solution Architect, NVIDIA GRID, Japan
jmain@nvidia.com
“The real purpose of the scientific method is to make sure nature hasn’t misled you into thinking you know something you actually don’t know.”
Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values
FUNCTIONAL VIEWPOINTS WITH APPLICATION OF RATIONAL ANALYSIS
FUNDAMENTALS: FRAME RATE
3D APPLICATION: CATIA V5
FRAMERATE
[Figure: timeline of rendered frames; labels: 4 seconds, 1 second]
• 4 Frames / Second = 250 ms / Frame
• 8 Frames / Second = 125 ms / Frame
• 16 Frames / Second = 62 ms / Frame
• 30 Frames / Second = 33 ms / Frame
• 60 Frames / Second = 16 ms / Frame
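The frame times on these slides follow from dividing one second by the frame rate. A minimal sketch of that arithmetic (the function name `frame_time_ms` is made up for illustration):

```python
def frame_time_ms(fps: float) -> float:
    """Return the time budget for a single frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Frame rates taken from the slides.
for fps in (4, 8, 16, 30, 60):
    print(f"{fps:>2} FPS -> {frame_time_ms(fps):.1f} ms/frame")
```

The slides truncate to whole milliseconds; the unrounded values for 16, 30 and 60 FPS are 62.5, 33.3 and 16.7 ms.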
AND SO?
FRAMERATE (grossly simplified for illustrative purposes only)
IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough…
Max FPS = 60 FPS
GPU Utilization = 100%
[Figure: 1-second timeline]
FUNDAMENTALS: GPU UTILIZATION
GPU UTILIZATION (grossly simplified for illustrative purposes only)
[Figure: 1-second timelines with each frame interval marked BUSY and IDLE]
• IF the application can construct 3D data fast enough (efficient geometry representation) AND the GPU is powerful enough: Max FPS = 60 FPS, GPU Utilization = 50% (BUSY, then IDLE)
• IF the application can’t construct 3D data fast enough (inefficient geometry representation) AND the GPU is powerful enough: Max FPS = 15 FPS, GPU Utilization = 20% (BUSY, then IDLE)
• IF the application can construct 3D data fast enough (efficient geometry representation) AND the GPU is NOT powerful enough: Max FPS = 15 FPS, GPU Utilization = 100% (BUSY)
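The three GPU utilization scenarios can be reproduced with a toy model. This is only a sketch under an assumed model that is not stated in the slides: the frame interval is set by whichever is slowest (the application building the frame, the GPU rendering it, or a 60 FPS cap), and the GPU is busy for its render time within each interval. The millisecond inputs are hypothetical values chosen to match the slide figures.

```python
def simulate(app_ms: float, gpu_ms: float, cap_fps: float = 60.0):
    """Return (fps, gpu_utilization) for a simplified, pipelined render loop.

    Assumed model: the frame interval equals the slowest of the application
    time, the GPU time, and the interval implied by the FPS cap.
    """
    interval_ms = max(app_ms, gpu_ms, 1000.0 / cap_fps)
    return 1000.0 / interval_ms, gpu_ms / interval_ms


# Hypothetical per-frame costs (ms) chosen to match the three slides.
scenarios = {
    "fast app, strong GPU": (5.0, 1000.0 / 120),
    "slow app, strong GPU": (1000.0 / 15, 0.2 * 1000.0 / 15),
    "fast app, weak GPU": (5.0, 1000.0 / 15),
}

for name, (app_ms, gpu_ms) in scenarios.items():
    fps, util = simulate(app_ms, gpu_ms)
    print(f"{name}: {fps:.0f} FPS, GPU utilization {util:.0%}")
```

Under these assumptions the model gives 60 FPS at 50%, 15 FPS at 20%, and 15 FPS at 100%. A low frame rate with low GPU utilization points at the application or CPU, while a low frame rate with a saturated GPU points at the GPU.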
FUNDAMENTALS: VSYNC
VSYNC (grossly simplified for illustrative purposes only)
IF the application can construct 3D data fast enough (efficient geometry representation) AND the GPU is powerful enough:
• VSYNC = ON: frames are presented in step with the display’s vertical sync. Ex: 60 Hz == 16 ms/frame. Max FPS = 60 FPS, GPU Utilization = 50% (BUSY, then IDLE)
• VSYNC = ON (Half Display Refresh): frames are presented on every second vertical sync. Ex: (60 Hz / 2) == 33 ms/frame. Max FPS = 30 FPS, GPU Utilization = 25% (BUSY, then IDLE)
FUNDAMENTALS: FRAME RATE LIMITER
FRAME RATE LIMITER (grossly simplified for illustrative purposes only)
Frame Rate Limiter = ON: <= ~60 potential frames rendered / second
IF the application can construct 3D data fast enough (efficient geometry representation) AND the GPU is powerful enough: Max FPS = 60 FPS, GPU Utilization = 50% (BUSY, then IDLE)
FUNDAMENTALS: GOING (OR NOT GOING) FASTER
GOING (OR NOT GOING) FASTER (grossly simplified for illustrative purposes only)
IF the application can construct 3D data fast enough (efficient geometry representation) AND the GPU is powerful enough:
• Frame Rate Limiter = OFF, VSYNC = ON: Max FPS = 60 FPS, GPU Utilization = 50% (BUSY, then IDLE)
• Frame Rate Limiter = ON, VSYNC = OFF: Max FPS = 60 FPS, GPU Utilization = 50% (BUSY, then IDLE)
• Frame Rate Limiter = OFF, VSYNC = OFF: Max FPS = (until CPU or GPU bottleneck), GPU Utilization = 100% (BUSY)
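How VSYNC and the frame rate limiter interact can be sketched as taking the lowest of the active caps. This is an illustrative simplification, not a description of the driver's behavior; the function `max_fps` and its parameters are hypothetical, and effects such as frame pacing are ignored.

```python
from typing import Optional


def max_fps(uncapped_fps: float,
            vsync_hz: Optional[float] = None,
            vsync_divisor: int = 1,
            limiter_fps: Optional[float] = None) -> float:
    """Return the achievable frame rate under the active caps.

    uncapped_fps  -- what the application and GPU could deliver with no cap
    vsync_hz      -- display refresh rate if VSYNC is ON, else None
    vsync_divisor -- 2 models a "half display refresh" policy
    limiter_fps   -- frame rate limiter target if it is ON, else None
    """
    caps = [uncapped_fps]
    if vsync_hz is not None:
        caps.append(vsync_hz / vsync_divisor)
    if limiter_fps is not None:
        caps.append(limiter_fps)
    return min(caps)


# 200 FPS is a made-up uncapped rate for a lightly loaded GPU.
print(max_fps(200, vsync_hz=60))                   # limiter OFF, VSYNC ON
print(max_fps(200, limiter_fps=60))                # limiter ON, VSYNC OFF
print(max_fps(200))                                # both OFF
print(max_fps(200, vsync_hz=60, vsync_divisor=2))  # half display refresh
```

With both caps off, the only remaining limit is the CPU or GPU bottleneck itself, which is why utilization climbs to 100% in that case.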
FUNDAMENTALS: RENDERED VS. SAMPLED
[Figure: rendering and sampling timelines]
• Rendering: 60 Frames / Second, 16 ms / Frame
• Sampling: 20 Frames / Second, 50 ms / Sample
• Rendered but… unused frames
@VIRTUALIZED_RESOURCE: “A BAD, VERY BAD WASTE OF SHARED RESOURCES”
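The size of the waste is simple to quantify: any frame rendered but never sampled by the remoting protocol is discarded. A minimal sketch using the 60 FPS rendering and 20 FPS sampling figures from the slides (the function name is hypothetical):

```python
def unused_frames(rendered_fps: float, sampled_fps: float):
    """Return (unused frames per second, fraction of rendered frames unused)."""
    delivered = min(rendered_fps, sampled_fps)
    unused = rendered_fps - delivered
    return unused, unused / rendered_fps


for rendered in (60, 30):
    unused, fraction = unused_frames(rendered, sampled_fps=20)
    print(f"render {rendered} FPS, sample 20 FPS: "
          f"{unused:.0f} unused frames/s ({fraction:.0%} of GPU work)")
```

At 60 FPS rendered and 20 FPS sampled, two of every three rendered frames are thrown away; capping rendering at 30 FPS cuts that to one in three.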
RENDERED VS. SAMPLED
Options…
• If your sample framerate is < 30 FPS, consider changing the VSYNC policy to “Adaptive Half Refresh” to lock max FPS @ 30 FPS and reduce “waste”
Cautions
• May lead to additional input/output latency due to the longer period between frame updates
• CPU-based image compression can limit the actual delivered framerate, depending on quality settings, percentage of display changed, and number of displays
• Network bandwidth deficiencies and network quality affect the delivered framerate
• Endpoint performance (ability to decode the compressed stream) affects the displayable framerate
FUNDAMENTALS: CPU
CPU / VCPU UTILIZATION
More is not always better
• 1 of 1 vCPUs @ 100% utilization = 100% reported utilization
• 1 of 2 vCPUs @ 100% utilization = 50% reported utilization
• 1 of 4 vCPUs @ 100% utilization = 25% reported utilization
• 1 of 8 vCPUs @ 100% utilization = 13% reported utilization
• Virtual environments using CPU-based image compression with full-screen updates can expect the compressor process to consume a single vCPU
• Adding more vCPU cores can negatively impact VM performance due to pCPU scheduling contention by the hypervisor
• Know how much CPU resource your application and workload require
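The reported figures above are the busy vCPU count divided by the total vCPU count, which is why a single saturated thread looks progressively less alarming as vCPUs are added. A small sketch of that calculation (the function name is hypothetical):

```python
def reported_utilization(busy_vcpus: float, total_vcpus: int) -> float:
    """Return overall CPU utilization (%) when busy_vcpus run at 100%."""
    if total_vcpus <= 0:
        raise ValueError("total_vcpus must be positive")
    return 100.0 * busy_vcpus / total_vcpus


for total in (1, 2, 4, 8):
    pct = reported_utilization(1, total)
    print(f"1 of {total} vCPUs @ 100% -> {pct:.1f}% reported")
```

The 8 vCPU case works out to 12.5%, which the slide rounds to 13%. When sizing, it is therefore worth looking at per-core counters as well as the total.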
FUNDAMENTALS: SYSTEM MEMORY
SYSTEM MEMORY
Locked in (like a time-share contract)
• vDGA and vGPU VMs require all VM memory to be locked on startup
• Important consideration during PoC phase as well as production
• Be aware of VM memory exceeding the per-socket capacity (NUMA traversal)
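The per-socket check can be done on paper before a VM is created. The sketch below assumes host memory is populated evenly across sockets, that each socket is one NUMA node, and it ignores hypervisor overhead; the host and VM sizes are made-up examples, not recommendations.

```python
def exceeds_numa_node(vm_memory_gb: float,
                      host_memory_gb: float,
                      sockets: int) -> bool:
    """Return True if the VM's memory cannot fit within one socket's memory.

    Assumes memory is split evenly across sockets (one NUMA node each).
    """
    per_socket_gb = host_memory_gb / sockets
    return vm_memory_gb > per_socket_gb


# Hypothetical host: 2 sockets, 256 GB total, so 128 GB per socket.
for vm_gb in (64, 160):
    spans = exceeds_numa_node(vm_gb, host_memory_gb=256, sockets=2)
    print(f"{vm_gb} GB VM spans NUMA nodes: {spans}")
```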
FUNDAMENTALS: FRAMEBUFFER
FRAMEBUFFER
I own thee… until shutdown
• It is yours for the duration, so ensure you get the correct “size”, i.e. Profile
• Cannot use another GPU’s framebuffer
• Does not support dynamic resizing
• Cannot use excess “unused” capacity of other VM framebuffers on the same GPU
• Applications may efficiently represent geometry but will fall back to legacy methods when the framebuffer is exhausted, which leads to reduced rendering performance
FUNDAMENTALS: DECODE
DECODE
For most of your video playback needs
• Stream must be H.264, VP8, HEVC Main Profile, or VP9 Profile 0
• Complete details in NVIDIA Video Codec SDK Application Notes – Decoder
• Application must support GPU decode capability for supported streams
• YouTube playback on Chrome uses VP9 (Caution) -> VP9 decode not verified
• Firefox, Edge will play back with hardware decode
• Splash player with GPU decode enabled will play back with hardware decode
• Other video players natively support available GPU decode as well
FUNDAMENTALS: ENCODE
ENCODER
Free a vCPU to do other special things
• Dedicated silicon for encode on each GPU
• Out-of-band encoding, does not impact rendering performance
• NVENC support added from Citrix XenDesktop 7.11 and VMware Horizon 7.0 Blast Extreme
• Confirm endpoints can perform H.264 decode and that it is enabled in client settings
• Up-to-date endpoint software required
• Ensure policies or settings do not override GPU encoder use; i.e. “build to lossless”
MEASUREMENT
MEASUREMENT PRINCIPLES
Not all possible data points!
• Clarify and document the context(s) being measured
• Select metrics that will help explain different points of resource contention
• Capture workstation, PC data for pre-PoC sizing investigation
• (Optional) Capture screenshots @ 1FPS -> PNG -> ffmpeg -> MP4 file
• Capture VM, Endpoint and host metrics (nvidia-smi) for PoC
• Save data in a consistent manner, document testing procedures
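Two of the capture steps above map onto command lines that can be scripted. The sketch below only builds and prints the commands; it does not run them. The file names (`shot_%05d.png`, `session.mp4`, `gpu_metrics.csv`) and the choice of queried fields are assumptions for illustration, and the ffmpeg line assumes a build that includes the libx264 encoder.

```python
import shlex


def ffmpeg_command(pattern: str = "shot_%05d.png",
                   output: str = "session.mp4") -> list:
    """Build an ffmpeg command that turns 1 FPS PNG screenshots into an MP4."""
    return ["ffmpeg",
            "-framerate", "1",       # input images were captured at 1 FPS
            "-i", pattern,           # numbered PNG sequence (hypothetical name)
            "-c:v", "libx264",       # H.264 encode
            "-pix_fmt", "yuv420p",   # widely playable pixel format
            output]


def nvidia_smi_command(output: str = "gpu_metrics.csv",
                       interval_s: int = 1) -> list:
    """Build an nvidia-smi command that logs GPU metrics to CSV in a loop."""
    fields = "timestamp,utilization.gpu,utilization.memory,memory.used"
    return ["nvidia-smi",
            f"--query-gpu={fields}",
            "--format=csv",
            "-l", str(interval_s),   # repeat every interval_s seconds
            "-f", output]            # write to a file instead of stdout


print(shlex.join(ffmpeg_command()))
print(shlex.join(nvidia_smi_command()))
```

Keeping the commands in a script alongside the results is one way to satisfy the "save data in a consistent manner, document testing procedures" principle.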
TOOLS
TOOLS: MSINFO32
Available in all Windows Environments
• Use msinfo32 (“System Information”) to capture the measurement context
• CPU model, Clocks, Logical Cores
• Operating System
• Display Adapters
• Lots of ‘other’ information that surely must be interesting to someone?
TOOLS: PERFMON
Available in all Windows Environments
• A large variety of counters!
• Very powerful for local or remote collection
• Some counters only exist in WMI, sadly
• Export hundreds of data points to CSV for endless sorting
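Once counters are exported to CSV, a short script can stand in for the "endless sorting". The sketch below summarizes each counter column from a perfmon-style CSV, where the first column is the timestamp and each remaining header is a counter path. The sample rows, the host name `VM01`, and the counter selection are invented for illustration; a real export would be read from a file instead of the embedded string.

```python
import csv
import io
from statistics import mean

# Hypothetical sample in the layout perfmon writes for CSV logs.
SAMPLE = r'''"(PDH-CSV 4.0) (Tokyo Standard Time)(-540)","\\VM01\Processor(_Total)\% Processor Time","\\VM01\Memory\Available MBytes"
"01/15/2017 10:00:01.000","42.5","2048"
"01/15/2017 10:00:02.000","57.5","2040"
"01/15/2017 10:00:03.000"," ","2032"
'''


def summarize(csv_text: str) -> dict:
    """Return {counter path: (mean, max)} for every counter column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = {name: [] for name in header[1:]}  # skip the timestamp column
    for row in reader:
        for name, cell in zip(header[1:], row[1:]):
            cell = cell.strip()
            if cell:  # blank cells mean the sample was not collected
                columns[name].append(float(cell))
    return {name: (mean(vals), max(vals))
            for name, vals in columns.items() if vals}


for counter, (avg, peak) in summarize(SAMPLE).items():
    print(f"{counter}: mean={avg:.1f} max={peak:.1f}")
```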
COUNTER CREATION AND USAGE
Create new “User Defined” collector
• Start “perfmon”
• Expand “Data Collector Sets”
• Select “User Defined” -> “New” -> “Data Collector Set”
Set base collector properties
• Enter a name for the collector
• Select “Create Manually”
• Click “Next”
Configuration (continued 1)
• Select “Performance Counter”
Configuration (continued 2)
• Change sample interval to 1 second
• Click “Add” to add counters
Recommended counters
More recommended counters