POWER EFFICIENT VISUAL COMPUTING ON MOBILE PLATFORMS
Brant Zhao, NVIDIA | Max Lv, NVIDIA
• Performance
• Energy Efficiency
Power Efficient GPU Programming - Case Studies & Findings
Case study #1: Image Pyramid Blending
Image Pyramid Blending (figure): reconstruct, up-sample and add the pyramid levels to form the blended image.
Image Pyramid Blending - A naïve CUDA implementation
The CPU allocates each set of pyramids (cudaMalloc) immediately before the GPU stage that consumes it: create Laplacian pyramids for the left image, create Laplacian pyramids for the right image, create Gaussian pyramids for the mask image, blend the Laplacian pyramids, reconstruct the blended image. Because CPU and GPU work keep interleaving, the CPU frequency stays high for the entire timeline.
Image Pyramid Blending - Power optimized: avoid CPU<->GPU interleaving
Hoist all four cudaMalloc calls for the pyramids to the front, then run the same GPU stages back-to-back: create Laplacian pyramids for the left and right images, create Gaussian pyramids for the mask image, blend the Laplacian pyramids, reconstruct the blended image. With no CPU work in between, the CPU frequency can drop while the GPU runs, as sketched below.
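A minimal CUDA sketch of the power-optimized structure, with hypothetical kernel and buffer names (the slides show only the pipeline, not code): every cudaMalloc happens up front, every GPU stage is enqueued back-to-back, and the CPU touches the GPU again only at one final synchronization point.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for the real pyramid kernels (bodies elided).
__global__ void buildLaplacianPyr(const float* img, float* pyr, int w, int h) {}
__global__ void buildGaussianPyr(const float* msk, float* pyr, int w, int h) {}
__global__ void blendPyr(const float* l, const float* r, const float* m,
                         float* out, int w, int h) {}
__global__ void reconstruct(const float* pyr, float* img, int w, int h) {}

void blendImages(const float* dLeft, const float* dRight, const float* dMask,
                 float* dOut, int w, int h)
{
    // Upper bound for all pyramid levels of one image (~4/3 * w * h floats).
    size_t pyrBytes = 2ull * w * h * sizeof(float);

    // Phase 1: all CPU-side allocations together, before any GPU work.
    float *dLeftPyr, *dRightPyr, *dMaskPyr, *dBlendPyr;
    cudaMalloc(&dLeftPyr,  pyrBytes);
    cudaMalloc(&dRightPyr, pyrBytes);
    cudaMalloc(&dMaskPyr,  pyrBytes);
    cudaMalloc(&dBlendPyr, pyrBytes);

    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);

    // Phase 2: all GPU stages queued back-to-back; the CPU only enqueues
    // work, so it can idle at low frequency until the result is needed.
    buildLaplacianPyr<<<grid, block>>>(dLeft,  dLeftPyr,  w, h);
    buildLaplacianPyr<<<grid, block>>>(dRight, dRightPyr, w, h);
    buildGaussianPyr <<<grid, block>>>(dMask,  dMaskPyr,  w, h);
    blendPyr         <<<grid, block>>>(dLeftPyr, dRightPyr, dMaskPyr,
                                       dBlendPyr, w, h);
    reconstruct      <<<grid, block>>>(dBlendPyr, dOut, w, h);
    cudaDeviceSynchronize();  // single CPU<->GPU sync point at the end

    cudaFree(dLeftPyr); cudaFree(dRightPyr);
    cudaFree(dMaskPyr); cudaFree(dBlendPyr);
}
```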
Image Pyramid Blending - Perf/Watt comparison (chart): normalized performance vs. normalized CPU+GPU power, CPU<->GPU interleaving vs. not interleaving.
Case study #2: 2D Convolution
2D Convolution (figure): a 3x3 kernel slides over the input image; each output pixel is the weighted sum of the 3x3 neighborhood under the kernel.
2D Convolution - 3x3 2D convolution with FP16
Neighboring pixels are packed in pairs (pack0 ... pack8), so one FP16 register pair covers two adjacent output positions.
• Basic operations for 2 output pixels
• 9 packed FP16 MAD
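A minimal sketch of the packed-FP16 idea in a hypothetical kernel (not the presenters' code, and it assumes a GPU with native FP16 arithmetic, sm_53 or newer): one half2 register holds two neighboring pixels, so the nine filter taps update two adjacent output pixels with exactly nine __hfma2 packed MADs.

```cuda
#include <cuda_fp16.h>

__global__ void conv3x3_half2(const __half* src, __half* dst,
                              int w, int h, const float* k /* 9 taps */)
{
    // Each thread produces two horizontally adjacent output pixels (x, x+1).
    // Image borders are left unhandled to keep the sketch short.
    int x = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x + 2 >= w || y < 1 || y + 1 >= h) return;

    __half2 acc = __float2half2_rn(0.f);
    for (int r = -1; r <= 1; ++r) {
        const __half* row = src + (y + r) * w + x;
        // Two-pixel windows for filter columns -1, 0, +1:
        // low half feeds output x, high half feeds output x+1.
        __half2 left   = __halves2half2(row[-1], row[0]);
        __half2 center = __halves2half2(row[ 0], row[1]);
        __half2 right  = __halves2half2(row[ 1], row[2]);
        const float* kr = k + (r + 1) * 3;
        acc = __hfma2(__float2half2_rn(kr[0]), left,   acc);  // packed MAD
        acc = __hfma2(__float2half2_rn(kr[1]), center, acc);  // packed MAD
        acc = __hfma2(__float2half2_rn(kr[2]), right,  acc);  // packed MAD
    }
    dst[y * w + x]     = __low2half(acc);
    dst[y * w + x + 1] = __high2half(acc);
}
```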
2D Convolution - Perf/Watt comparison (chart): normalized performance vs. normalized GPU power, FP32 vs. FP16.
Case study #3: Sparse Lucas-Kanade Optical Flow (SparseLK)
SparseLK
Track each feature point from the first frame $I$ to the second frame $I_{next}$, refining the displacement iteratively ($\Delta p_0, \Delta p_1, \Delta p_2, \ldots$):

$$\Delta p = \left( \sum_{x \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \right)^{-1} \sum_{x \in W} \bigl( I(x) - I_{next}(x + \Delta p_{prev}) \bigr) \begin{bmatrix} I_x \\ I_y \end{bmatrix}$$

where $I_x, I_y$ are the spatial image gradients and $W$ is the integration window around the feature point.
SparseLK - Solution #1
• Multiple threads (T0, T1, ..., T5) cooperate on one feature point
• Share data via shared memory or shuffle
• A reduction is needed to get the final results
• High thread-level parallelism (TLP), but more instructions are needed (see the reduction sketch below)
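A sketch of the reduction step Solution #1 pays for, assuming the threads cooperating on one feature form a full warp (function and variable names are illustrative): the partial Lucas-Kanade sums are combined with warp shuffles, and these extra SHFL and reduction instructions are exactly the added work this solution carries.

```cuda
#include <cuda_runtime.h>

// Combine per-thread partial LK sums across one warp (mask = 0xffffffffu).
// After the loop, lane 0 holds the full window sums and can solve the
// 2x2 system for the displacement update.
__device__ void reduceLKSums(float& sIxx, float& sIxy, float& sIyy,
                             float& bx, float& by, unsigned mask)
{
    for (int off = 16; off > 0; off >>= 1) {
        sIxx += __shfl_down_sync(mask, sIxx, off);
        sIxy += __shfl_down_sync(mask, sIxy, off);
        sIyy += __shfl_down_sync(mask, sIyy, off);
        bx   += __shfl_down_sync(mask, bx,   off);
        by   += __shfl_down_sync(mask, by,   off);
    }
}
```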
SparseLK - Solution #2
• Each thread handles one feature point (T0, T1, ...)
• No need to shuffle data
• No need to do a reduction
• Needs more registers to hold the data
• High instruction-level parallelism (ILP), but low occupancy (see the sketch below)
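A sketch of the one-thread-per-feature structure (hypothetical kernel and buffer names; boundary checks on the window are elided): all partial sums stay in registers and each thread solves its own 2x2 system, with no shuffle or reduction at all.

```cuda
#include <cuda_runtime.h>

__global__ void sparseLKSingleThread(const float* Ix, const float* Iy,
                                     const float* It,  // per-pixel gradients
                                     const int2* feat, float2* dp,
                                     int nFeat, int w, int winRadius)
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= nFeat) return;

    // All partial sums live in this thread's registers.
    float sIxx = 0.f, sIxy = 0.f, sIyy = 0.f, bx = 0.f, by = 0.f;
    int2 c = feat[f];
    for (int dy = -winRadius; dy <= winRadius; ++dy)
        for (int dx = -winRadius; dx <= winRadius; ++dx) {
            int idx = (c.y + dy) * w + (c.x + dx);
            float gx = Ix[idx], gy = Iy[idx], gt = It[idx];
            sIxx += gx * gx; sIxy += gx * gy; sIyy += gy * gy;
            bx   += gx * gt; by   += gy * gt;
        }

    // Solve the 2x2 normal equations for this feature's displacement.
    float det = sIxx * sIyy - sIxy * sIxy;
    if (fabsf(det) > 1e-6f)
        dp[f] = make_float2((sIyy * bx - sIxy * by) / det,
                            (sIxx * by - sIxy * bx) / det);
}
```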
SparseLK - Instruction# and Perf/Watt

$$\mathrm{Perf} = \frac{\mathrm{Workload}}{\mathrm{Sec}}, \qquad \mathrm{Watt} = \frac{\mathrm{Energy}}{\mathrm{Sec}} \quad\Rightarrow\quad \frac{\mathrm{Perf}}{\mathrm{Watt}} = \frac{\mathrm{Workload}}{\mathrm{Energy}}$$

$$\mathrm{Energy} = \mathrm{EnergyPerInst}_{\mathrm{shuffle}} \cdot \mathrm{Instructions}_{\mathrm{shuffle}} + \mathrm{EnergyPerInst}_{\mathrm{reduction}} \cdot \mathrm{Instructions}_{\mathrm{reduction}} + \mathrm{EnergyPerInst}_{\mathrm{other}} \cdot \mathrm{Instructions}_{\mathrm{other}} + \mathrm{Power}_{\mathrm{wasted}} \cdot \mathrm{Time}$$

For the same workload, Solution #2 eliminates the shuffle and reduction terms entirely, so it spends less energy per result even though each thread runs longer.
SparseLK - Perf/Watt comparison (chart): normalized performance vs. normalized GPU power, multiple threads per feature vs. single thread per feature.
Summary
• Analyze the whole pipeline at the system level
• Use energy-efficient features on the target platform
• Balance between TLP and ILP
THANK YOU
brantz@nvidia.com