

  1. POWER EFFICIENT VISUAL COMPUTING ON MOBILE PLATFORMS. Brant Zhao, NVIDIA; Max Lv, NVIDIA

  2. • Performance • Energy Efficiency

  3. Power Efficient GPU Programming - Case Studies & Findings

  4. Case study #1: Image Pyramid Blending

  5. Image Pyramid Blending. [Figure: the blended result is reconstructed level by level: up-sample the coarser level, then add the next band.]

  6. Image Pyramid Blending - A naïve CUDA implementation. [Diagram: the CPU issues a cudaMalloc for each set of pyramids immediately before the GPU stage that needs it (create Laplacian pyramids for the left and right images, create a Gaussian pyramid for the mask, blend the Laplacian pyramids, reconstruct the blended image), so CPU and GPU work interleave over time and the CPU frequency stays high for the whole run.]

  7. Image Pyramid Blending - Power optimized: avoid CPU<->GPU interleaving. [Diagram: all cudaMalloc calls for the pyramids are issued by the CPU up front, then the create/blend/reconstruct stages run back to back on the GPU; the CPU goes idle while the GPU works and its frequency can drop.]
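The win from batching comes from letting the CPU race to idle instead of staying clocked up to service interleaved stages. A toy Python model of the two schedules (all power and timing numbers are hypothetical, chosen only to illustrate the effect, not taken from the slides) makes this concrete:

```python
# Toy energy model for CPU<->GPU interleaving vs. batched GPU work.
# All power/time numbers below are hypothetical illustration values.

CPU_ACTIVE_W = 2.0   # CPU power while busy or held at high frequency
CPU_IDLE_W = 0.2     # CPU power once it can drop to idle
GPU_ACTIVE_W = 3.0   # GPU power while a stage runs
STAGE_S = 0.01       # duration of each GPU stage, in seconds
N_STAGES = 5         # create pyramids x3, blend, reconstruct

def interleaved_energy():
    # The CPU wakes up between every stage, so it never drops to idle:
    # both processors burn active power for the whole run.
    t = N_STAGES * STAGE_S
    return (CPU_ACTIVE_W + GPU_ACTIVE_W) * t

def batched_energy():
    # All stages are enqueued up front; the CPU idles while the GPU works.
    t = N_STAGES * STAGE_S
    return CPU_IDLE_W * t + GPU_ACTIVE_W * t

if __name__ == "__main__":
    print(f"interleaved: {interleaved_energy():.4f} J")
    print(f"batched:     {batched_energy():.4f} J")
```

The GPU term is identical in both schedules; only the CPU term shrinks, which is consistent with the Perf/Watt plot on the next slide showing similar performance at lower total power.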

  8. Image Pyramid Blending - Perf/Watt comparison. [Plot: normalized performance vs. normalized CPU+GPU power; the non-interleaved version reaches comparable performance at lower CPU+GPU power than the CPU<->GPU interleaved version.]

  9. Case study #2: 2D Convolution

  10. 2D Convolution. [Figure: a 3x3 input neighborhood is multiplied element-wise by the 3x3 kernel weights and summed to produce one output pixel.]

  11. 2D Convolution. [Figure: the same example with the weighted sum evaluated for one output pixel, giving 8.]
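As a plain-Python reference for the operation in the figure (the image and kernel values here are illustrative stand-ins, since the slide's exact numbers are not fully recoverable):

```python
# Reference 3x3 2D convolution (really a correlation, as is common in
# image processing): each output pixel is the element-wise product of a
# 3x3 input neighborhood and the kernel weights, summed.

def conv2d_3x3(image, kernel):
    h, w = len(image), len(image[0])
    out = [[0.0] * (w - 2) for _ in range(h - 2)]
    for y in range(h - 2):
        for x in range(w - 2):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            out[y][x] = acc
    return out

if __name__ == "__main__":
    # Hypothetical 4x4 image and diagonal kernel.
    img = [[1, 2, 1, 2],
           [0, 2, 3, 1],
           [2, 0, 1, 10],
           [1, 1, 1, 1]]
    k = [[0.25, 0, 0],
         [0, 0.5, 0],
         [0, 0, 0.75]]
    print(conv2d_3x3(img, k))
```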

  12. 2D Convolution - 3x3 2D convolution with FP16. [Figure: neighboring input pixels are packed into FP16 pairs, pack0 .. pack8.]
  • Basic operations for 2 output pixels: 9 packed FP16 MADs
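The packing trick can be sketched in Python by modeling a half2 register as a pair and a packed MAD as a pairwise fused multiply-add; real code would use CUDA's `__half2` type and `__hfma2()` intrinsic. The image and kernel values below are hypothetical:

```python
# Sketch of the packed-FP16 idea: one "half2" MAD operates on a pair of
# values at once, so two horizontally adjacent output pixels of a 3x3
# convolution cost 9 packed MADs instead of 18 scalar ones.

def hfma2(a, b, c):
    """Packed multiply-add: (a.x*b.x + c.x, a.y*b.y + c.y)."""
    return (a[0] * b[0] + c[0], a[1] * b[1] + c[1])

def conv3x3_pair(image, kernel, y, x):
    """Compute out[y][x] and out[y][x+1] together with 9 packed MADs."""
    acc = (0.0, 0.0)
    n_mads = 0
    for ky in range(3):
        for kx in range(3):
            w = kernel[ky][kx]
            # Two horizontally adjacent input pixels share one packed MAD.
            pair = (image[y + ky][x + kx], image[y + ky][x + kx + 1])
            acc = hfma2(pair, (w, w), acc)
            n_mads += 1
    return acc, n_mads

if __name__ == "__main__":
    # Hypothetical 4x4 image and diagonal kernel.
    img = [[1, 2, 1, 2],
           [0, 2, 3, 1],
           [2, 0, 1, 10],
           [1, 1, 1, 1]]
    k = [[0.25, 0, 0],
         [0, 0.5, 0],
         [0, 0, 0.75]]
    print(conv3x3_pair(img, k, 0, 0))
```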

  13. 2D Convolution - Perf/Watt comparison. [Plot: normalized performance vs. normalized GPU power; FP16 achieves higher performance at lower GPU power than FP32.]

  14. Case study #3: Sparse Lucas-Kanade Optical Flow (SparseLK)

  15. SparseLK. [Figure: a feature in the first frame $I$ is tracked into the second frame $I_{next}$ through iterative updates $\Delta p_0, \Delta p_1, \Delta p_2$.] Each iteration solves
  $$\Delta p = \begin{bmatrix} \sum_{x \in W} I_x^2 & \sum_{x \in W} I_x I_y \\ \sum_{x \in W} I_x I_y & \sum_{x \in W} I_y^2 \end{bmatrix}^{-1} \sum_{x \in W} \big( I(x) - I_{next}(x + \Delta p_{prev}) \big) \begin{bmatrix} I_x \\ I_y \end{bmatrix}$$
  where $I_x$, $I_y$ are the spatial gradients and $W$ is the window around the feature point.
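One Gauss-Newton step of this update, sketched in plain Python for a single feature point (the per-pixel gradient and residual samples are hypothetical; a real kernel would compute them from the image pair with finite differences and bilinear sampling):

```python
# One iteration of the Lucas-Kanade update for a single feature point:
# dp = G^-1 * b, where G is the 2x2 structure tensor of the window and
# b accumulates the temporal residual weighted by the spatial gradient.

def lk_update(samples):
    """samples: list of (Ix, Iy, r) per window pixel, where
    r = I(x) - I_next(x + dp_prev) is the warped temporal residual."""
    g00 = g01 = g11 = b0 = b1 = 0.0
    for ix, iy, r in samples:
        g00 += ix * ix
        g01 += ix * iy
        g11 += iy * iy
        b0 += r * ix
        b1 += r * iy
    det = g00 * g11 - g01 * g01
    if abs(det) < 1e-12:
        return (0.0, 0.0)  # degenerate window: no reliable update
    # Closed-form 2x2 inverse applied to b.
    return ((g11 * b0 - g01 * b1) / det,
            (g00 * b1 - g01 * b0) / det)

if __name__ == "__main__":
    # Hypothetical window samples: (Ix, Iy, residual)
    win = [(1.0, 0.0, 0.5), (0.0, 1.0, -0.25), (1.0, 1.0, 0.25)]
    print(lk_update(win))
```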

  16. SparseLK - Solution #1. [Figure: the update equation from slide 15, with threads T0, T1, ..., T5 cooperating on the window of one feature point.]
  • Multiple threads per feature point
  • Share data via shared memory or shuffle
  • A reduction is needed to get the final results
  • High thread-level parallelism (TLP), but more instructions needed

  17. SparseLK - Solution #2. [Figure: the update equation from slide 15, with each thread (T0, T1, ...) owning one feature point.]
  • Each thread handles one feature point
  • No need to shuffle data
  • No need to do a reduction
  • Needs more registers to hold data
  • High instruction-level parallelism (ILP), but low occupancy
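A schematic instruction count shows why solution #1 issues more instructions even though both mappings do the same arithmetic. The counts below are illustrative, not profiled, and the 6-thread / 36-pixel window sizes are assumptions:

```python
# Toy instruction-count comparison of the two SparseLK mappings.
# Counts are schematic illustrations, not measurements of real kernels.
import math

def multi_thread_insts(window_pixels, threads_per_feature):
    # Solution #1: each thread accumulates an equal share of the window
    # (padded up if the window does not divide evenly), then the five
    # partial sums (g00, g01, g11, b0, b1) are combined with a
    # log2-step shuffle reduction across the cooperating threads.
    per_thread = math.ceil(window_pixels / threads_per_feature)
    accumulate = per_thread * threads_per_feature   # MADs issued in total
    reduction_steps = math.ceil(math.log2(threads_per_feature))
    shuffle = 5 * reduction_steps * threads_per_feature  # shfl+add pairs
    return accumulate + shuffle

def single_thread_insts(window_pixels):
    # Solution #2: one thread walks the whole window; no shuffles,
    # no reduction, at the cost of more live registers per thread.
    return window_pixels

if __name__ == "__main__":
    print(multi_thread_insts(36, 6), single_thread_insts(36))
```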

  18. SparseLK - Instruction# and Perf/Watt
  $$\mathrm{Perf} = \frac{\mathrm{Workload}}{\mathrm{Sec}}, \qquad \mathrm{Watt} = \frac{\mathrm{Energy}}{\mathrm{Sec}} \quad\Rightarrow\quad \frac{\mathrm{Perf}}{\mathrm{Watt}} = \frac{\mathrm{Workload}}{\mathrm{Energy}}$$
  $$\mathrm{Energy} = \mathrm{EnergyPerInst}_{shuffle} \cdot \mathrm{Instructions}_{shuffle} + \mathrm{EnergyPerInst}_{reduction} \cdot \mathrm{Instructions}_{reduction} + \mathrm{EnergyPerInst}_{other} \cdot \mathrm{Instructions}_{other} + \mathrm{Power}_{wasted} \cdot \mathrm{Time}$$
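Plugging illustrative numbers into this model shows how a mapping with fewer shuffle and reduction instructions can come out ahead in Perf/Watt. Every per-instruction energy, instruction count, and power value below is hypothetical:

```python
# Evaluate the slide's energy model:
#   Energy = E_shuffle*N_shuffle + E_reduction*N_reduction
#          + E_other*N_other + P_wasted*T
#   Perf/Watt = Workload / Energy
# All per-instruction energies (joules) and counts are hypothetical.

def energy(n_shuffle, n_reduction, n_other, wasted_w, t_s,
           e_shuffle=2e-9, e_reduction=1.5e-9, e_other=1e-9):
    return (e_shuffle * n_shuffle + e_reduction * n_reduction
            + e_other * n_other + wasted_w * t_s)

def perf_per_watt(workload, e):
    return workload / e

if __name__ == "__main__":
    workload = 1000.0  # feature points processed
    # Multi-thread-per-feature: extra shuffle + reduction instructions.
    e1 = energy(n_shuffle=90000, n_reduction=30000, n_other=36000,
                wasted_w=0.05, t_s=1e-3)
    # Single-thread-per-feature: only the core arithmetic instructions.
    e2 = energy(n_shuffle=0, n_reduction=0, n_other=36000,
                wasted_w=0.05, t_s=1e-3)
    print(perf_per_watt(workload, e1), perf_per_watt(workload, e2))
```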

  19. SparseLK - Perf/Watt comparison. [Plot: normalized performance vs. normalized GPU power, comparing the multiple-threads-per-feature and single-thread-per-feature variants.]

  20. Summary
  • Analyze the whole pipeline at the system level
  • Use energy-efficient features on the target platform
  • Balance between TLP and ILP

  21. THANK YOU brantz@nvidia.com
