April 4-7, 2016 | Silicon Valley
GET TO KNOW THE NVIDIA GRID™ SDK
Shounak Deshpande, NVIDIA
AGENDA
- Background
- NVIDIA GRID SDK
- Measuring Performance
- Maximizing Performance
- Interactive Question-Answer Session
CLOUD/REMOTE GRAPHICS
- VDI: Enterprise, Remote Workstation (VMware, Citrix, Dassault, and more)
- Game streaming: GeForce NOW
- Windows: DirectX / OpenGL
- Linux: OpenGL
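REMOTE GRAPHICS ECOSYSTEM
[Diagram: the client sends user input over the IP network to the remote graphics server. The server renders, captures, and encodes each frame and sends it back over the network. The client decodes and renders it. Labeled components: NIC, CPU.]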
GRID SW AND HW STACK COMPONENTS
Streaming
- Capture (pixel grabbing)
- HW accelerated video compression
- HW accelerated video decoding
Virtualization
- Graphics shim layers (app streaming)
- Platform virtualization (VDI)
- Hypervisors (VDI)
- Full virtualization (VDI)
HW Platforms
- Server: AWS G2 Instance; GRID K520, M30 GPU; Tesla M60 GPU; NVIDIA Quadro GPUs
- Client: Anything
NVIDIA GRID SDK
NVIDIA CAPTURE SDK (formerly known as NVIDIA GRID SDK)
- Goal: Enable low-latency remote graphics solutions by harnessing NVIDIA GPUs
- OS: Windows 7+, Linux (CentOS, Debian, RedHat, and more)
- Download: https://developer.nvidia.com/grid-app-game-streaming
- Support: GRID-devtech-support@nvidia.com
NVIDIA CAPTURE SDK COMPONENTS
- SDK package: Interface Definitions, Sample Code, Documentation
- NVIFR API: low-latency render target capture
- NVFBC API: low-latency desktop capture
- NVENC: low-latency hardware encoder
- GPU driver: NVFBC library, NVIFR library
NVIDIA CAPTURE SDK: THE “CAPTURE” PART
NVFBC
- Brute force, captures everything on screen
- Orthogonal to graphics APIs
- Easy to integrate with the NVENC API
- Easy onboarding, no process injection
- More efficient than GDI-based screen scraping
- One session per display
NVIFR
- No-frills RenderTarget capture
- Supports DirectX 9, 10, 11 and OpenGL APIs
- Easy to integrate with the NVENC API
- Needs to be injected into the target process
- One session per target window
- Enables higher density of streamed apps
NVIDIA CAPTURE SDK: INTERFACES
NVFBC: NVIDIA Frame Buffer Capture
NVIFR: NVIDIA In-band Frame Render
Windows
- NVFBC: NVFBCToSys, NVFBCCuda, NVFBCToDX9Vid, NVFBCToHWEnc
- NVIFR - DirectX: NVIFRToSys, NVIFRToHWEnc
Linux
- NVFBC: NVFBCToCuda, NVFBCToSys, NVFBCToHWEnc
- NVIFR - OpenGL: NVIFRToSys, NVIFRToHWEnc
The -ToHWEnc interfaces internally invoke the NVENC API (part of the NVIDIA Video Codec SDK).
EVOLUTION OF NVIDIA CAPTURE SDK
SDK releases: 2.3 (Legacy), 3.0 (2014), 4.0 (2015), 5.0 (2016)

Windows
- GRID K340, K520, K1, K2, Quadro 4000+ support
- H.264 encode support
- Windows 7, 8, 8.1 support
- NVIFR: separate thread capture+encode for DX10/DX11 applications
- GRID M30 limited support
- Maxwell NVENC enhancements: quarter-res first pass; lossless encoding; 4:4:4 encoding
- GRID M30 full support
- NVENC RC 2.0
- HEVC support
- Tesla M60 support
- New unified codec-agnostic interface for HW encoder
- Driver support for H.264 YUV 4:4:4
- Enable NVFBC without driver reload
- Windows 10 support
- New NVFBC interface to capture desktop to DirectX 9 video memory surface, along with diffmap support
- Timeout API for NVFBC blocking mode capture
- Mouse capture for all NVFBC interfaces
- Propagate frame timestamp through NvIFRHWEncode

Linux
- GRID K340, K520, K1, K2, Quadro 4000+ support
- H.264 encode support
- GRID M30 full support
- NvIFR full parity for NVENC features with Windows
- HEVC support
- Tesla M60 support
- New unified codec-agnostic interface for HW encoder
- NVENC RC 2.0

HW
- GRID K340, K520, K1, K2, Quadro K2000+
- GRID M30, Quadro M6000
- Tesla M60
USING NVFBC API

USING NVFBC FOR DESKTOP CAPTURE
1. Enable NVFBC
2. Create NVFBC capture session object
3. Set up NVFBC capture session object
4. Capture
5. Release NVFBC capture session object
CAPTURING A SCREENSHOT WITH NVFBC
[Code listing; annotations in order:]
- Create NVFBC session object
- Set up NVFBC session
- “Capture” starts here
- Read grabbed buffer
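CAPTURING USING NVFBC
[Flowchart]
1. Begin: check NVFBC status with NvFBCGetStatusEx().
   - NVFBC not enabled: enable NVFBC with NvFBCEnable(). On success, continue; on failure, exit.
   - NVFBC already in use: exit.
   - NVFBC enabled, not in use: continue.
2. Create the NVFBC session with NvFBCCreateEx(). On failure, exit.
3. Set up the NVFBC session. On failure, release the NVFBC session and exit.
4. Grab(). On success, grab again. On failure or terminate, release the NVFBC session and exit.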
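The control flow on this slide can be sketched in C++ without the SDK. This is only an illustration of the ordering and the cleanup rule: `FbcStatus`, `FbcHooks`, and `capture_once` are hypothetical names, and the callback signatures are NOT the SDK's. Each hook stands in for the entry point named in its comment, and the sketch assumes that a successful enable proceeds directly to session creation.

```cpp
#include <cassert>
#include <functional>

// Hypothetical status values mirroring the three outcomes of the status check.
enum class FbcStatus { NotEnabled, EnabledIdle, InUse };

// Hypothetical hooks. In a real program each would wrap the SDK call named
// in its comment; these signatures are placeholders, not the SDK's.
struct FbcHooks {
    std::function<FbcStatus()> get_status;  // stands in for NvFBCGetStatusEx()
    std::function<bool()> enable;           // stands in for NvFBCEnable()
    std::function<bool()> create_session;   // stands in for NvFBCCreateEx()
    std::function<bool()> setup_session;    // session setup step
    std::function<bool()> grab;             // stands in for Grab()
    std::function<void()> release_session;  // session release step
};

// Walks the flow once and returns true if a frame was grabbed.
inline bool capture_once(const FbcHooks& h) {
    const FbcStatus status = h.get_status();
    if (status == FbcStatus::InUse) {
        return false;  // another client owns NVFBC: exit
    }
    if (status == FbcStatus::NotEnabled && !h.enable()) {
        return false;  // could not enable: exit
    }
    if (!h.create_session()) {
        return false;  // nothing to release yet
    }
    // Short-circuit: grab only runs if setup succeeded.
    const bool ok = h.setup_session() && h.grab();
    h.release_session();  // release on success and on failure
    return ok;
}
```

The point of the sketch is that once a session exists, every exit path goes through the release step.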
DESKTOP REMOTING USING NVFBC + NVENC HW ENCODER
[Diagram: Desktop Composition (system process) feeds the NVFBC capture process. The capture thread receives the captured buffer as an IDirect3DSurface9* and passes it to the encode thread, which calls the NVENC API and outputs a video bitstream packet. Below the process: NV GPU driver (NVFBC) and NV GPU (3D HW, NVENC HW).]
Stage latencies shown: ~2 ms, < 1 ms, ~4 ms (approximate, for a 1080p desktop streamed as 720p video).
USING NVIFR API

USING NVIFR FOR APPLICATION STREAMING
1. Write a shim layer to host NVIFR
2. Inject the shim layer into the target application
3. Fetch the rendering graphics context
4. Create the NVIFR session object using the context
5. Set up the NVIFR session object
6. Capture
7. Release the NVIFR session object
APP STREAMING USING HW ENCODER
NVIFR is injected into the application before the graphics runtime, using an app-level shim layer.
[Diagram: App Render() or Present() -> Streaming Shim Component (NVIFR) -> DX/OGL Runtime -> NV GPU (3D HW, NVENC HW) -> compressed video bitstream.]
DIRECTX APP STREAMING USING NVIFR HW ENCODER
[Code listing; annotations in order:]
- The application allocates output buffers and event handles
- Select the rate control mode and encoder preset according to the use case

DIRECTX APP STREAMING USING NVIFR HW ENCODER (continued)
The event handles passed to NvIFRSetupHWEncoder are signaled when NVENC has finished the work submitted by the NvIFRTransferRenderTargetToHWEncoder API.
OPENGL APP STREAMING USING NVIFR HW ENCODER
[Code listing; annotations in order:]
- Create session
- Create TransferObject

OPENGL APP STREAMING USING NVIFR HW ENCODER (continued)
[Code listing; annotations in order:]
- Capture + encode
- Retrieve output bitstream
- Release buffers for re-use
MEASURING PERFORMANCE

MEASURING PERFORMANCE
Guidelines
- Use high-precision timers.
- In-process performance measurement is suitable only for generating average numbers.
- Measure GPU utilization (GPU-Z, NVIDIA SMI, etc.).
- Note GPU clock values during measurement.
MEASURING PERFORMANCE
Use the High Performance Multimedia Timer for accuracy.

MEASURING PERFORMANCE
[Code listing; annotations in order:]
- Start measurement before the capture loop
- Run through the capture/encode loop
- Stop measurement here
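The measurement pattern above can be sketched portably as follows. This is a minimal illustration, not the listing from the slide: it uses `std::chrono::steady_clock` as a stand-in for the Windows multimedia timer, and `LoopTiming`, `measure_loop`, and the `capture_and_encode` callback are hypothetical names. Only an average is reported, in line with the guideline that in-process measurement suits averages.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Result of timing a capture/encode loop.
struct LoopTiming {
    double total_ms;      // wall-clock time for the whole loop
    double avg_frame_ms;  // total_ms divided by the number of frames
};

// Times `frames` iterations of `capture_and_encode`, a hypothetical callback
// standing in for the real capture and encode calls.
inline LoopTiming measure_loop(std::size_t frames,
                               const std::function<void()>& capture_and_encode) {
    assert(frames > 0);
    using clock = std::chrono::steady_clock;  // monotonic clock

    const auto start = clock::now();  // start measurement before the loop
    for (std::size_t i = 0; i < frames; ++i) {
        capture_and_encode();
    }
    const auto stop = clock::now();   // stop measurement after the loop

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return {elapsed.count(), elapsed.count() / static_cast<double>(frames)};
}
```

Timing the whole loop rather than each call keeps timer overhead out of the per-frame figure.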
MAXIMIZING QUALITY & PERFORMANCE

MAXIMIZING QUALITY & PERFORMANCE
Goals & Challenges
Goals:
- Low latency
- Smooth playback of streamed video
- Minimum impact on target application/system performance
Challenge:
- Finding the right balance to get maximum CPU-GPU utilization without negative impact
NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
MAXIMIZING QUALITY & PERFORMANCE
Guidelines
- Know the system's limits.
- Memory management: ensure no time is lost to paging.
- Resource utilization: GPU-intensive applications need frame rate throttling, while lightweight applications need pipelining and multithreading of the capture and encode/post-process tasks.
- Timing: ensure the capture rate matches the display rate.
- Impact on target: use parallelism.
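The throttling and timing guidelines above can be illustrated with a fixed-interval capture loop. This is a sketch under assumptions: `run_paced_capture` and the `grab_frame` callback are hypothetical names, and a real NVFBC client would more likely pace itself on display refresh than on a software timer.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Calls `grab_frame` (hypothetical stand-in for a capture call) `frames`
// times, spaced at the interval implied by `target_fps`.
// Returns how many captures finished after their deadline.
inline int run_paced_capture(int frames, double target_fps,
                             const std::function<void()>& grab_frame) {
    assert(frames >= 0 && target_fps > 0.0);
    using clock = std::chrono::steady_clock;
    const auto interval = std::chrono::duration_cast<clock::duration>(
        std::chrono::duration<double>(1.0 / target_fps));

    int late = 0;
    auto next = clock::now() + interval;  // deadline for the first frame
    for (int i = 0; i < frames; ++i) {
        grab_frame();
        if (clock::now() > next) {
            ++late;                            // capture overran its slot
            next = clock::now() + interval;    // resynchronize, do not burst
        } else {
            std::this_thread::sleep_until(next);  // throttle to the target rate
            next += interval;
        }
    }
    return late;
}
```

Counting late frames gives a quick signal that the chosen rate is beyond what the system can sustain.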
MAXIMIZING QUALITY & PERFORMANCE
Memory management
Ensure no paging.
Loss due to paging (insufficient video memory):
- Choose optimal rendering quality settings
- Choose optimal desktop or application window resolution
[Diagram: encoder timeline showing paging, work, and idle periods.]
MAXIMIZING QUALITY & PERFORMANCE
Resource Utilization: Multithreading
Capture and encode/post-process should run on different threads.
Constraints:
- Multiple threads must not concurrently access the same DirectX context
- The NVIFR capture thread should never stall
- The NVFBC capture thread should never miss a display refresh
MAXIMIZING QUALITY & PERFORMANCE
Resource Utilization: Pipelining
- Goal: minimize the time the encode thread spends waiting for capture to complete, and vice versa
- Benefit: control over the timing of capture calls, less impact on application rendering performance
- Triple buffering is sufficient in most cases
[Diagram: the capture thread writes to buffer #i of the buffer queue while the encode/post-process thread reads from buffer #(i-1)%N.]
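A capture/encode pipeline over a small ring of buffers can be sketched with standard C++ threads. `FrameRing` and `run_pipeline` are hypothetical names, and plain integers stand in for frame buffers. Note one simplification: `push` blocks when all slots are full, whereas a real capture thread may prefer to drop or overwrite a frame rather than stall.

```cpp
#include <array>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Fixed ring of N slots shared by one capture (producer) thread and one
// encode (consumer) thread. N = 3 corresponds to triple buffering.
template <typename Frame, std::size_t N = 3>
class FrameRing {
public:
    // Blocks while all N slots are still waiting to be encoded.
    void push(Frame f) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return count_ < N; });
        slots_[(head_ + count_) % N] = std::move(f);
        ++count_;
        not_empty_.notify_one();
    }

    // Blocks until a captured frame is available, then returns the oldest.
    Frame pop() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return count_ > 0; });
        Frame f = std::move(slots_[head_]);
        head_ = (head_ + 1) % N;
        --count_;
        not_full_.notify_one();
        return f;
    }

private:
    std::array<Frame, N> slots_{};
    std::size_t head_ = 0;   // index of the oldest unread slot
    std::size_t count_ = 0;  // number of filled slots
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
};

// Runs a capture thread and an encode thread over `frames` integer frame ids
// and returns the ids in the order the encode side received them.
inline std::vector<int> run_pipeline(int frames) {
    FrameRing<int> ring;
    std::vector<int> encoded;
    std::thread capture([&] {
        for (int i = 0; i < frames; ++i) ring.push(i);
    });
    std::thread encode([&] {
        for (int i = 0; i < frames; ++i) encoded.push_back(ring.pop());
    });
    capture.join();
    encode.join();
    return encoded;
}
```

With three slots, the capture side can run up to three frames ahead, which absorbs short encode hiccups without unbounded latency growth.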
MAXIMIZING PERFORMANCE
Resource Utilization: Multiple Contexts with NVIFR
Why use multiple contexts?
- NVIFR capture happens in-band and shares the DirectX/OGL context used by the target application.
- Any GPU work scheduled by NVIFR on this context shows up as a drop in rendering frame rate.
Solution:
- Use shared buffers to hold captured output, for processing through a separate DirectX/OGL context running on a separate thread.
[Diagram: Game's D3D Context -> Shared Surface -> Encoder's D3D Context]
Game's D3D Context (copy to the shared surface):
- NvIFRCopyToSharedSurface for DX9
- StretchRect to a shared surface for DX9Ex
- ResourceCopyRegion to a shared surface for Dx1x
Encoder's D3D Context (copy from the shared surface):
- NvIFRCopyFromSharedSurface for DX9
- StretchRect from a shared surface for DX9Ex
- ResourceCopyRegion from a shared surface for Dx1x
MAXIMIZING QUALITY & PERFORMANCE
NUMA
NUMA: Non-Uniform Memory Access
Create resources in memory attached to the same NUMA node as the bus that hosts the GPU. This reduces contention for bus bandwidth.
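On Linux, one way to find which NUMA node a GPU sits on is to read the `numa_node` file that sysfs exposes for each PCI device. The sketch below assumes that approach. `parse_numa_node` and `gpu_numa_node` are hypothetical helper names, and the PCI address is supplied by the caller. Binding allocations to the reported node is a separate step that is not shown.

```cpp
#include <cassert>
#include <exception>
#include <fstream>
#include <string>

// Parses the contents of a sysfs `numa_node` file. Returns -1 when the text
// is empty or not a number; the kernel also reports -1 for "unknown node".
inline int parse_numa_node(const std::string& text) {
    try {
        return std::stoi(text);
    } catch (const std::exception&) {
        return -1;
    }
}

// Reads the NUMA node of the PCI device at `pci_address`
// (format "dddd:bb:dd.f"). Returns -1 if the file cannot be read.
inline int gpu_numa_node(const std::string& pci_address) {
    std::ifstream file("/sys/bus/pci/devices/" + pci_address + "/numa_node");
    std::string line;
    if (!file || !std::getline(file, line)) {
        return -1;
    }
    return parse_numa_node(line);
}
```

A result of -1 means the node is unknown, in which case there is no placement hint to act on.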