nvidia video technologies

NVIDIA VIDEO TECHNOLOGIES Abhijit Patait, 3/26/2018


  1. NVIDIA VIDEO TECHNOLOGIES Abhijit Patait, 3/26/2018

  2. AGENDA
      ➢ NVIDIA Video Technologies Overview
      ➢ Video Codec SDK Updates
      ➢ Perf/Quality Optimization
      ➢ Benchmarks
      ➢ Roadmap

  3. NVIDIA VIDEO TECHNOLOGIES

  4. VIDEO CODEC SDK
      A comprehensive set of APIs for GPU-accelerated video encode and decode
      ➢ NVENCODE API for video encode acceleration
      ➢ NVDECODE API for video & JPEG decode acceleration (formerly called NVCUVID API)
      ➢ Independent of CUDA/3D cores on GPU for pre-/post-processing
      Use cases
      ➢ Gamestream
      ➢ Cloud transcoding
      ➢ Remote desktop & visualization
      ➢ Intelligent video analytics
      ➢ Video archiving
      ➢ Video editing

  5. NVIDIA VIDEO TECHNOLOGIES
      Easy access to GPU video acceleration
      SOFTWARE
      ➢ DeepStream SDK
      ➢ cuDNN, TensorRT, cuBLAS, cuSPARSE
      ➢ VIDEO CODEC SDK: Video Encode and Decode for Windows and Linux; CUDA, DirectX, OpenGL interoperability
      ➢ CUDA TOOLKIT: APIs, libraries, tools, samples
      NVIDIA DRIVER
      HARDWARE
      ➢ NVENC: Video encode
      ➢ NVDEC: Video decode
      ➢ CUDA: High-performance computing on GPU

  6. NVIDIA GPU VIDEO CAPABILITIES
      Decode HW* (NVDEC)
      ➢ Formats: MPEG-2, VC1, VP8, VP9, H.264, H.265, Lossless
      ➢ Bit depth: 8/10/12 bit
      ➢ Color**: YUV 4:2:0
      ➢ Resolution: Up to 8K***
      Encode HW* (NVENC)
      ➢ Formats: H.264, H.265, Lossless
      ➢ Bit depth: 8 bit, 10 bit
      ➢ Color**: YUV 4:4:4, YUV 4:2:0
      ➢ Resolution: Up to 8K***
      [Diagram: CPU, NVENC, NVDEC, Buffer, CUDA Cores]
      * See support diagram for previous NVIDIA HW generations
      ** 4:2:2 is not natively supported on HW
      *** Support is codec dependent

  7. VIDEO CODEC SDK UPDATE

  8. VIDEO CODEC SDK UPDATE
      ➢ SDK 6.0 (2015): ARGB, Quality+, ME-only
      ➢ SDK 7.x (2016): Pascal, 10-bit encode, FFmpeg, ME-only for VR, Quality++
      ➢ SDK 8.0 (2017): 10-bit transcode, 10/12-bit decode, OpenGL, Dec+Enc, WP, AQ, Enc. Quality
      ➢ SDK 8.1 (Q1 2018): B-as-ref, QP/emphasis map, 4K60 HEVC encode, Reusable classes & new sample apps
      ➢ SDK 8.2 (Q2 2018): Decode + inference optimizations, Dec. optimizations

  9. B-FRAMES AS REFERENCE
      [Diagram: non-reference B-frames vs. B-frames as reference, frames I, B1, B2, B3, P]
      ➢ Improved visual quality: up to 0.6 dB PSNR (BD-PSNR = 0.3 dB)
      ➢ Negligible performance penalty
      ➢ Ensure decoder support
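
As a rough illustration of what the PSNR figures on this slide measure, the sketch below computes PSNR between two sequences of 8-bit samples and converts a dB gain into the equivalent change in mean squared error. The function names are hypothetical and not part of the SDK, and BD-PSNR (an average over a rate-distortion curve) is not computed here.

```python
import math


def psnr(reference, decoded, max_value=255):
    """PSNR in dB between two equal-length sequences of pixel samples."""
    if len(reference) != len(decoded) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """MSE after a PSNR gain of gain_db, relative to the MSE before it."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition, the 0.6 dB peak gain quoted above corresponds to roughly 13% lower mean squared error, since `mse_ratio_for_gain(0.6)` is about 0.87.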

  10. WITHOUT B-AS-REF 1080p @ 3 Mbps

  11. WITH B-AS-REF 1080p @ 3 Mbps

  12. WITHOUT B-AS-REF 1080p @ 3 Mbps

  13. WITH B-AS-REF 1080p @ 3 Mbps

  14. DESKTOP CONTENT ENCODING
      Challenges in Preserving Details
      Problem
      ➢ Desktop content is challenging to encode
      ➢ Thin-line text, wireframes, high-detail textures
      ➢ If severely bitrate constrained, recovery is difficult without IDR
      ➢ QP modulation requires knowledge of complexity
      ➢ Rate control in NVENC firmware

  15. Original Image

  16. Encoded (& Decoded) Image

  17. EMPHASIS MAP
      Region of Interest Encoding
      Solution
      ➢ Identify “high-detail” areas within the captured image (NVFBC)
      ➢ Provide feedback to encoder to treat these areas differently (NVENC)

  18. EMPHASIS MAP
      Region of Interest Encoding
      [Figure: grid of per-block emphasis levels (0 to 5) generated by NVFBC, next to the grid of negative ∆QP values as interpreted by NVENC]
      ➢ Generated by NVFBC: 5 = High detail areas, 0 = Low detail areas
      ➢ Interpreted by NVENC as ∆QP: encoder translates each level to a ∆QP
      ➢ ∆QP depends on absolute QP
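
The slide does not give the rule NVENC uses to turn an emphasis level into a ∆QP, only that higher levels mean lower QP and that the amount depends on the absolute QP. The sketch below is one possible illustration under an invented rule: level 5 lowers QP by a fixed fraction of the base QP and lower levels scale linearly. The names `delta_qp`, `apply_emphasis`, and `max_fraction` are hypothetical and are not NVENC API.

```python
MAX_LEVEL = 5            # emphasis levels run from 0 (low detail) to 5 (high detail)
QP_MIN, QP_MAX = 0, 51   # QP range for 8-bit H.264


def delta_qp(level, base_qp, max_fraction=0.25):
    """Hypothetical rule: level 5 lowers QP by max_fraction of base_qp."""
    if not 0 <= level <= MAX_LEVEL:
        raise ValueError("emphasis level must be in 0..5")
    return -round(base_qp * max_fraction * level / MAX_LEVEL)


def apply_emphasis(emphasis_map, base_qp):
    """Return the per-block QP grid for a grid of emphasis levels."""
    return [
        [min(QP_MAX, max(QP_MIN, base_qp + delta_qp(level, base_qp))) for level in row]
        for row in emphasis_map
    ]
```

With a base QP of 40, `apply_emphasis([[5, 3, 0]], 40)` yields `[[30, 34, 40]]` under this assumed rule: the high-detail block gets the largest QP reduction and the low-detail block is left unchanged.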

  19. REDESIGNED SDK SAMPLES
      Reusable Encoder/Decoder Classes
      ➢ Reusable base classes, easy-to-understand, end-user focused
      ➢ Sample apps re-designed
      ➢ Encode base classes: NvEncoderD3D9, NvEncoderD3D11, NvEncoderCUDA, NvEncoderGL
      ➢ Decode base class: NvDecoder
      ➢ Abstraction over low-level enc/dec APIs
      ➢ init(), run(), destroy()
      ➢ FFmpeg demux

  20. REDESIGNED SDK SAMPLES
      Decode Applications
      ➢ AppDec: Basic decoding
      ➢ AppDecLowLatency: Low-latency decode
      ➢ AppDecD3D: Decode and display using D3D9 and D3D11
      ➢ AppDecMem: Decode from memory buffer
      ➢ AppDecGL: Decode and display using OpenGL
      ➢ AppDecMultiInput: Use-case: surveillance, multiple videos on screen
      ➢ AppDecImageProvider: Decoding and color conversion to a specific format (BGRA, BGRA64)
      ➢ AppDecPerf: Multi-threaded, perf measurement

  21. REDESIGNED SDK SAMPLES
      Encode Applications
      ➢ AppEncCUDA: Encoding CUDA surfaces
      ➢ AppEncLowLatency: Low-latency encode, intra-refresh, slices etc.
      ➢ AppEncD3D9: Encoding using D3D9 surfaces
      ➢ AppEncME: ME-only mode
      ➢ AppEncD3D11: Encoding using D3D11 surfaces
      ➢ AppEncPerf: App for encoder performance measurement
      ➢ AppEncDec: Encoding & decoding in different threads, HDR streaming
      ➢ AppEncQual: Encoding & quality measurement (PSNR)

  22. OPTIMIZATION STRATEGIES

  23. OPTIMIZATION STRATEGIES
      General Guidelines
      ➢ Minimize PCIe transfers
        ➢ Eliminate, if possible
        ➢ Use CUDA for video pre-/post-processing
      ➢ Multiple threads/processes to balance enc/dec utilization
        ➢ Monitor using nvidia-smi: nvidia-smi dmon -s uc -i <GPU_index>
        ➢ Analyze using GPUView on Windows
      ➢ Minimize disk I/O
      ➢ Optimize encoder settings for quality/perf balance
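
To act on the monitoring tip above, the output of `nvidia-smi dmon -s uc` can be summarized programmatically. The sketch below is a minimal parser that assumes the tool prints `#`-prefixed header lines followed by whitespace-separated rows; column names are read from the first header line rather than hard-coded. The `SAMPLE` text is illustrative, not captured output, and `parse_dmon` and `mean_utilization` are hypothetical helper names.

```python
# Illustrative text in the assumed dmon layout (not captured from a real GPU).
SAMPLE = """\
# gpu    sm   mem   enc   dec  mclk  pclk
# Idx     %     %     %     %   MHz   MHz
    0    12     5    40    35  3504  1506
    0    14     6    60    45  3504  1506
"""


def parse_dmon(text):
    """Parse dmon-style text into a list of {column: int or None} rows."""
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The first header line names the columns; later ones (units,
            # periodic repeats) are skipped.
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        values = line.split()
        if columns is None or len(values) != len(columns):
            continue
        rows.append({c: (None if v == "-" else int(v)) for c, v in zip(columns, values)})
    return rows


def mean_utilization(rows, column):
    """Average a column over all rows that reported a value."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return sum(values) / len(values) if values else None
```

Comparing `mean_utilization(rows, "enc")` with `mean_utilization(rows, "dec")` over a run gives a quick indication of whether encode and decode load are balanced.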

  24. SW TRANSCODE
      ffmpeg -c:v h264 -i input.mp4 -c:a copy -c:v h264 -b:v 5M output.mp4
      [Diagram: System Memory; Bitstream, SW Decode, YUV, SW Encode, Bitstream]
      32 fps*
      * 1:2 transcode, fps per session; 4 GHz Intel i7-6700K

  25. SW TRANSCODE + SCALE
      ffmpeg -c:v h264 -i input.mp4 -vf scale=1280:720 -c:a copy -c:v h264 -b:v 5M output.mp4
      [Diagram: System Memory; Bitstream, SW Decode, YUV, Preprocess (e.g. scaling), YUV, SW Encode, Bitstream]
      29 fps*
      * 1:2 transcode, fps per session; 4 GHz Intel i7-6700K

  26. GPU UNOPTIMIZED TRANSCODE
      ffmpeg -vsync 0 -c:v h264_cuvid -i input.mp4 -c:a copy -c:v h264_nvenc -b:v 5M output.mp4
      [Diagram: System Memory (Bitstream in and out), PCIe transfers, GP104 GPU with NVDEC Decode and NVENC Encode, YUV in GPU Memory]
      288 fps*
      * 1:2 transcode, fps per session

  27. GPU UNOPTIMIZED TRANSCODE + CPU SCALE
      ffmpeg -vsync 0 -c:v h264_cuvid -i input.mp4 -c:a copy -vf scale=1280:720 -c:v h264_nvenc -b:v 5M output.mp4
      [Diagram: System Memory (Bitstream in and out, Preprocess, e.g. scaling), PCIe transfers, GP104 GPU with NVDEC Decode and NVENC Encode, YUV in GPU Memory]
      76 fps*
      * 1:2 transcode, fps per session

  28. HIGH-PERF GPU OPTIMIZED TRANSCODE
      ffmpeg -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -c:a copy -vf scale_npp=1280:720 -c:v h264_nvenc -b:v 5M output.mp4
      [Diagram: System Memory (Bitstream in and out), GP104 GPU with NVDEC Decode, Preprocess (scaling in CUDA), NVENC Encode, YUV in GPU Memory]
      472 fps*
      * 1:2 transcode, fps per session

  29. HIGH-PERF GPU OPTIMIZED TRANSCODE
      ffmpeg -vsync 0 -hwaccel cuvid -c:v h264_cuvid -resize 1280x720 -i input.mp4 -c:a copy -c:v h264_nvenc -b:v 5M output.mp4
      [Diagram: System Memory (Bitstream in and out), GP104 GPU with NVDEC Decode, Preprocess (scaling in CUDA), NVENC Encode, YUV in GPU Memory]
      490 fps*
      * 1:2 transcode, fps per session

  30. FFMPEG VIDEO TRANSCODING
      Tips
      ➢ Look at the FFmpeg users’ guide in the NVIDIA Video Codec SDK package
      ➢ Use the -hwaccel keyword to keep the entire transcode pipeline on GPU
      ➢ Run multiple 1:N transcode sessions to achieve M:N transcode at high perf
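
One way to picture a 1:N session is a single ffmpeg invocation that decodes once and encodes several outputs. The sketch below only assembles such a command line from the flags shown on the preceding slides; it does not run ffmpeg. The helper name `build_1_to_n_transcode`, the output file names, and the 360p/2M rendition are assumptions for illustration.

```python
import shlex


def build_1_to_n_transcode(input_path, outputs):
    """Build one ffmpeg command that decodes once and encodes N outputs.

    outputs is a list of (width, height, bitrate, output_path) tuples.
    """
    cmd = ["ffmpeg", "-y", "-vsync", "0", "-hwaccel", "cuvid",
           "-c:v", "h264_cuvid", "-i", input_path]
    for width, height, bitrate, path in outputs:
        # Each output gets its own GPU scaler and NVENC encode settings.
        cmd += ["-vf", f"scale_npp={width}:{height}", "-c:a", "copy",
                "-c:v", "h264_nvenc", "-b:v", bitrate, path]
    return cmd


if __name__ == "__main__":
    renditions = [(1280, 720, "5M", "out_720p.mp4"), (640, 360, "2M", "out_360p.mp4")]
    print(shlex.join(build_1_to_n_transcode("input.mp4", renditions)))
```

Launching M such commands as separate processes, one per input, is one way to arrive at the M:N arrangement the tip describes.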

  31. CUDA FILTERS IN FFMPEG
      ➢ -resize option with NVDEC (e.g. -c:v h264_cuvid -resize 1280x720 …)
      ➢ scale_npp: Built-in CUDA library filters
      ➢ Custom CUDA filter examples in FFmpeg
        ➢ scale_cuda
        ➢ thumbnail_cuda
        ➢ Build your own using the above as a guide
      ➢ If you must use CPU and GPU filters, minimize PCIe transfers

  32. MIXING CPU & GPU FILTERS
      Fade (CPU) + Scale (GPU)
      Why doesn’t this work?
      ffmpeg.exe -y -c:v h264_cuvid -i input.264 -vf "fade,scale_npp=1280:720" -c:v h264_nvenc output.264
      This works
      ffmpeg.exe -y -c:v h264_cuvid -i input.264 -vf "fade,hwupload_cuda,scale_npp=1280:720" -c:v h264_nvenc output.264

  33. MIXING CPU & GPU FILTERS
      Scale (GPU) + Fade (CPU)
      Why doesn’t this work?
      ffmpeg.exe -y -c:v h264_cuvid -i input.264 -vf "hwupload_cuda,scale_npp=1280:720,hwdownload,fade" -c:v h264_nvenc output.264
      One solution
      ffmpeg.exe -y -c:v h264_cuvid -i input.264 -vf "hwupload_cuda,scale_npp=1280:720,hwdownload,format=nv12,fade" -c:v h264_nvenc output.264
      Optimal solution
      ffmpeg.exe -y -hwaccel cuvid -c:v h264_cuvid -i input.264 -vf "scale_npp=1280:720,hwdownload,format=nv12,fade" -c:v h264_nvenc output.264
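
The pattern in the two slides above is that frames must be moved explicitly whenever the chain crosses between CPU and GPU filters: `hwupload_cuda` before a GPU filter that receives system-memory frames, and `hwdownload,format=nv12` before a CPU filter that receives GPU frames. The sketch below encodes that rule as a small helper. The function name, the `GPU_FILTERS` set (limited to the filters named in this deck), and the fixed `nv12` format (appropriate for 8-bit content) are assumptions for illustration.

```python
# Filters named in this deck that operate on frames in GPU memory.
GPU_FILTERS = {"scale_npp", "scale_cuda", "thumbnail_cuda"}


def build_filter_chain(filters, frames_start_on_gpu):
    """Insert upload/download steps wherever the chain changes memory domain.

    frames_start_on_gpu is True when decoding with -hwaccel cuvid (frames stay
    in GPU memory) and False when decoded frames land in system memory.
    """
    on_gpu = frames_start_on_gpu
    chain = []
    for spec in filters:
        name = spec.split("=", 1)[0]
        needs_gpu = name in GPU_FILTERS
        if needs_gpu and not on_gpu:
            chain.append("hwupload_cuda")            # system memory -> GPU
        elif not needs_gpu and on_gpu:
            chain += ["hwdownload", "format=nv12"]   # GPU -> system memory
        chain.append(spec)
        on_gpu = needs_gpu
    return ",".join(chain)
```

Called with `["fade", "scale_npp=1280:720"]` and `frames_start_on_gpu=False`, the helper reproduces the working chain from the fade-then-scale slide; called with `["scale_npp=1280:720", "fade"]` and `frames_start_on_gpu=True`, it reproduces the chain labeled as the optimal solution, which avoids the initial upload.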

  34. OPTIMIZATION TIPS
      ➢ Write your own CUDA filters
      ➢ Combine CUDA filters; e.g. scaling + color space conversion in a single filter
      ➢ For systems with multiple CPU sockets, avoid accesses to local sysmem of one CPU from another CPU. Find the local NUMA node and localize the storage per CPU.

  35. BENCHMARKS

  36. P4: 5X MORE H.264 ENCODE THAN 2S CPU SERVER
      Up to 5x more throughput, up to 10x better efficiency at about the same quality
      [Charts: H.264 hq Encode Throughput (Streams), H.264 hq Encode Efficiency (Streams / Watt), H.264 hq Encode Quality (PSNR YUV); each at 720p30, 1080p30, 4K30; Tesla P4 vs. Dual Intel Xeon E5-2660v3 @ 2.6 GHz]
