April 4-7, 2016 | Silicon Valley

HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1

Max Lv, NVIDIA (mlv@nvidia.com, https://github.com/madeye)
Brant Zhao, NVIDIA
April 7
AGENDA

Histogram of Oriented Gradients on GPU
Optimization Opportunities on a Tegra GPU
Optimization #1: Improve ILP (Instruction Level Parallelism)
Optimization #2: Approximation
Optimization #3: Specialization
Final Results
PEDESTRIAN DETECTION: HOG DESCRIPTOR
Histogram of Oriented Gradients

Gradient-based feature descriptor developed for pedestrian detection
Introduced by Navneet Dalal and Bill Triggs (CVPR'05)
Global descriptor for the complete body
Very high-dimensional: typically ~4000 dimensions

Source: Dalal, N.; Triggs, B., "Histograms of oriented gradients for human detection," CVPR 2005.
HOG PIPELINE ON GPU
Four GPU Kernels: Oriented Gradients -> Block Histograms -> Histograms Normalization -> Linear SVM

Oriented Gradients: 3x3 Sobel filter with gamma correction
Block Histograms: pixels vote in proportion to gradient magnitude, with tri-linear interpolation, in each block (16x16 pixels)
Histograms Normalization: normalize each block histogram (36-bin)
Linear SVM: a linear SVM classifier; dot product of each window (7x15 36-bin normalized histograms) with the trained coefficients
OPTIMIZATION OPPORTUNITIES
On a 2-SM Maxwell GPU in Tegra X1

Our goal is to improve the performance further, starting from a well-optimized implementation in VisionWorks:
Trade-offs between ILP (instruction-level parallelism) and DLP (data-level parallelism)
Trade-offs between precision and computation
Trade-offs between generalization and specialization

NVIDIA Tegra X1 Maxwell GPU Specification:
    CUDA Cores: 256
    Texture Units: 16
    ROPs: 16
    GPU Clock: ~1000 MHz
    Memory Clock: 1600 MHz (LPDDR4)
    Memory Bus Width: 64-bit
    FP16 Peak: 1024 GFLOPS
    FP32 Peak: 512 GFLOPS
    Architecture: Maxwell
OPTIMIZATION #1
Improve ILP (Instruction Level Parallelism)

The existing GPU kernels were optimized for large GPUs, improving DLP to saturate the SMs.
For the small GPU on Tegra, it is possible to gain performance with larger ILP but smaller DLP:
Increase the workload in each thread while the total number of threads decreases.
Try different configurations until the best performance is achieved; a sketch of the idea follows below.

[Figure: DLP (thread count) vs. ILP (in-flight ops per thread), contrasting configuration A (high DLP, low ILP) with configuration B (lower DLP, higher ILP)]
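A minimal sketch of the idea, assuming a hypothetical element-wise kernel (not the VisionWorks code): each thread produces ELEMS outputs, and the independent loads can all be in flight at once, raising ILP while the thread count shrinks by the same factor.

    // Hypothetical kernel: trade DLP for ILP by giving each thread ELEMS outputs.
    template <int ELEMS>
    __global__ void scale_ilp(const float *__restrict__ in,
                              float *__restrict__ out, int n, float k)
    {
        // Block-strided indexing keeps global loads coalesced across each warp.
        int base = blockIdx.x * blockDim.x * ELEMS + threadIdx.x;
        float v[ELEMS];
    #pragma unroll
        for (int i = 0; i < ELEMS; i++)        // ELEMS independent loads in flight
            if (base + i * blockDim.x < n) v[i] = in[base + i * blockDim.x];
    #pragma unroll
        for (int i = 0; i < ELEMS; i++)
            if (base + i * blockDim.x < n) out[base + i * blockDim.x] = v[i] * k;
    }

Launched with 1/ELEMS as many threads as a one-element-per-thread version; sweeping ELEMS (1, 2, 4, 8, ...) is the "try different configs" step from the slide.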
OPTIMIZATION #1
Example: Best ILP & DLP Trade-off for Block Histograms

There are various patterns for computing a block of histograms. The best trade-off found: each thread calculates a 3x12-pixel region, with four threads (T1-T4) cooperating on a block.
This pattern does not work well on large GPUs like Titan X, but is a good fit for Tegra X1.

[Figure: a 16x16-pixel block with the 3x12-pixel regions of threads T1-T4 marked]
OPTIMIZATION #2
Approximation

32-bit floating point on the GPU is unnecessary for most computer vision applications.
`--use_fast_math` is enabled by default for our CV projects.
Compute in floating point, but load and store pixels as integers using texture instructions.
Sometimes it is safe to relax the precision even further.
A sketch of the texture path follows below.

[Figure: data path — stored as 8-bit/16-bit integers in memory (0, 128, 255, …); conversion, (de)normalization, and sampling happen in the texture unit; computed as FP16/FP32 in the SM (0, 0.5, 1.0, …)]
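A minimal sketch of the integer-in-memory, float-in-SM pattern, assuming a hypothetical gradient-magnitude kernel and a texture object created elsewhere over an 8-bit image with cudaReadModeNormalizedFloat (none of these names are from the talk):

    // Pixels live in memory as uint8; the texture unit converts them to
    // floats in [0, 1] for free on each load.
    __global__ void gradient_mag(cudaTextureObject_t src, float *mag, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;
        // +0.5f addresses texel centers with unnormalized coordinates.
        float dx = tex2D<float>(src, x + 1.5f, y + 0.5f)
                 - tex2D<float>(src, x - 0.5f, y + 0.5f);
        float dy = tex2D<float>(src, x + 0.5f, y + 1.5f)
                 - tex2D<float>(src, x + 0.5f, y - 0.5f);
        mag[y * w + x] = sqrtf(dx * dx + dy * dy);  // arithmetic stays in FP32
    }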
OPTIMIZATION #2
Example: Fast atan2f() for Oriented Gradients

A fast version of atan2f() using 3rd-order Lagrange polynomial interpolation, without handling corner cases.

Comparison between atan2f() implementations:

                              Native    This work
    FMA/FADD (op)             12        4
    MUFU.RCP (op)             2         1
    Handle corner cases (op)  ~30       ~5
    Avg. error (degree)       0.01      0.05

    float atan2f_lagrange_3rd(const float dy, const float dx)
    {
        float A = 0.0f, B = 0.0f;
        float Offset = copysignf(float(M_PI), dy);
        if (fabsf(dy) < fabsf(dx)) {
            A = dx; B = dy;
            if (dx >= 0.0f) Offset = 0.0f;
        } else {
            A = -dy; B = dx;
            Offset *= 0.5f;
        }
        const float r = B / A;
        const float p = 1.0f - fabsf(r);
        return ((-0.0663f * p + 0.311f) * p + float(M_PI / 4.0)) * r + Offset;
    }
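A quick sanity check of the approximation (a hypothetical host-side harness, not from the talk; compile it as plain C++ together with the function above). The reported maximum error should be on the order of the 0.05 degrees quoted in the table:

    #include <cstdio>
    #include <cmath>

    int main()
    {
        float max_err = 0.0f;  // track worst deviation from the native atan2f
        for (float deg = -179.9f; deg <= 179.9f; deg += 0.1f) {
            float rad = deg * float(M_PI) / 180.0f;
            float dy = sinf(rad), dx = cosf(rad);
            float err = fabsf(atan2f_lagrange_3rd(dy, dx) - atan2f(dy, dx));
            max_err = fmaxf(max_err, err);
        }
        printf("max error: %.4f degrees\n", max_err * 180.0f / float(M_PI));
        return 0;
    }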
OPTIMIZATION #3
Specialization

Specialize parameters of CV applications to enable further optimization:
Unroll loops fully to eliminate index computation and conditional branches.
Allow automatic register blocking by the compiler and better instruction scheduling.
Allow more tricks to reuse on-chip data.
A specialized variant of the kernel below is sketched after it.

    __global__ void kernel(int N)
    {
        ...
        #pragma unroll
        for (int i = 0; i < N; i++) {
            if (i % 3) { ... }
            ...
            tmp[i] += ...
        }
        ...
    }
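One way to realize the idea, as a hedged sketch (hypothetical kernel, not the production code): making N a template parameter turns it into a compile-time constant, so nvcc can unroll the loop fully, fold away the i % 3 test in each unrolled iteration, and keep tmp[] entirely in registers.

    // Hypothetical specialization of the kernel above: N is now compile-time.
    template <int N>
    __global__ void kernel_spec(const float *in, float *out)
    {
        int base = (blockIdx.x * blockDim.x + threadIdx.x) * N;
        float tmp[N];  // constant indices after unrolling -> lives in registers
    #pragma unroll
        for (int i = 0; i < N; i++) {
            // i % 3 and all index arithmetic are evaluated at compile time.
            tmp[i] = (i % 3) ? in[base + i] : 0.0f;
        }
        float sum = 0.0f;
    #pragma unroll
        for (int i = 0; i < N; i++)
            sum += tmp[i];
        out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
    }

    // Usage: instantiate for the specialized parameter, e.g. 36 histogram bins:
    //   kernel_spec<36><<<grid, block>>>(d_in, d_out);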
OPTIMIZATION #3
Example: Transform Linear SVM to 36-layer 7x15 2D Convolutions

Dot products of (7x15x36)-dimensional vectors = sum of 36-layer 7x15 2D convolutions.
Load the whole patch into shared memory.
Uniform loads of coefficients from constant memory, without any bank conflicts.
Reuse our well-optimized 2D convolution kernel (aggressive register blocking; GTC'15, Zhao et al.).
OPTIMIZATION #3
Example: Transform Linear SVM to 36-layer 7x15 2D Convolutions

[Figure: a winPerImgX x winPerImgY score map in which each element is the dot product of one window; it is produced by running a 7x15 2D convolution on each of the 36 layers and adding up the results of all layers with atomic adds]

A sketch of this formulation follows below.
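A hedged sketch of the reformulation (all names hypothetical; the real kernel additionally stages the patch in shared memory and uses register blocking, per the previous slide). The score buffer is assumed zero-initialized; each z-slice of the grid handles one histogram bin and accumulates into the shared score map:

    // The SVM score of a window = sum over the 36 bins ("layers") of a 7x15
    // 2D convolution between that layer and its slice of the SVM weights.
    __constant__ float c_weights[36][15][7];  // trained coefficients: uniform,
                                              // bank-conflict-free loads

    __global__ void svm_as_conv(const float *hist,  // [36][H][W], one plane per bin
                                float *score,       // [winPerImgY][winPerImgX]
                                int W, int H, int winPerImgX, int winPerImgY)
    {
        int wx = blockIdx.x * blockDim.x + threadIdx.x;  // window position
        int wy = blockIdx.y * blockDim.y + threadIdx.y;
        int layer = blockIdx.z;                          // one grid z-slice per bin
        if (wx >= winPerImgX || wy >= winPerImgY) return;

        float acc = 0.0f;                    // 7x15 convolution on this layer
        for (int y = 0; y < 15; y++)
            for (int x = 0; x < 7; x++)
                acc += hist[(layer * H + wy + y) * W + wx + x]
                     * c_weights[layer][y][x];

        atomicAdd(&score[wy * winPerImgX + wx], acc);  // add up all layers
    }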
FINAL RESULTS
214 FPS on Tegra X1

Runtime (ms) for VGA input on Tegra X1, compared to the previous implementation in VisionWorks (https://developer.nvidia.com/embedded/visionworks). Overall runtime drops from 8.73 ms (Base) to 4.67 ms (Optimized), a 1.87x speedup.

[Bar chart: Base vs. Optimized runtime for Oriented Gradients, Block Histograms, Histogram Normalization, Linear SVM, and Overall; per-kernel bar values extracted from the chart (ms): 3.90, 2.48, 2.23, 1.22, 1.01, 0.86, 0.85, 0.29]
THANK YOU

mlv@nvidia.com
https://github.com/madeye
BACKUPS
OPTIMIZATION #2
Example: Fast atan2f() for Oriented Gradients

Employ LOP3 (3-operand logic operations, a new instruction in the Maxwell architecture). LOP3 eliminates the conditional branches of the version on the earlier slide.

    float atan2f_lagrange_3rd(const float dy, const float dx)
    {
        float flag;
        __SET_LT(flag, fabsf(dy), fabsf(dx));      // flag = (|dy| < |dx|), branch-free
        uint32_t m, t1 = 0x80000000;
        float t2 = float(M_PI) / 2.0f;
        __LOP3_0x2e(m, __float_as_int(dx), t1, __float_as_int(t2));
        float w = flag * __int_as_float(m) + float(M_PI) / 2.0f;
        float Offset = copysignf(w, dy);           // quadrant offset, no branches
        float t = fminf(fabsf(dx), fabsf(dy)) / fmaxf(fabsf(dx), fabsf(dy));
        uint32_t r, b = __float_as_int(flag) << 2;
        uint32_t mask = __float_as_int(dx) ^ __float_as_int(dy) ^ (~b);
        __LOP3_0xe2(r, mask, t1, __float_as_int(t));
        const float p = fabsf(__int_as_float(r)) - 1.0f;
        return ((-0.0663f * (-p) + 0.311f) * (-p)
                + float(float(M_PI) / 4.0)) * (*(float *)&r) + Offset;
    }
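The __SET_LT and __LOP3_* helpers are not defined in the deck. One plausible implementation, as a hedged sketch, wraps the Maxwell lop3.b32 instruction in inline PTX; the trailing 8-bit immediate is a per-bit truth table selecting which 3-input logic function the instruction computes:

    // Hypothetical definitions; the talk's actual macros may differ.
    // lop3.b32 d, a, b, c, immLut evaluates an arbitrary 3-input bitwise
    // function of a, b, c, with immLut as its 8-entry truth table.
    #define __LOP3_0x2e(d, a, b, c) \
        asm("lop3.b32 %0, %1, %2, %3, 0x2e;" : "=r"(d) : "r"(a), "r"(b), "r"(c))
    #define __LOP3_0xe2(d, a, b, c) \
        asm("lop3.b32 %0, %1, %2, %3, 0xe2;" : "=r"(d) : "r"(a), "r"(b), "r"(c))
    // set.lt.f32.f32 writes 1.0f when a < b, else 0.0f -- a branch-free compare.
    #define __SET_LT(d, a, b) \
        asm("set.lt.f32.f32 %0, %1, %2;" : "=f"(d) : "f"(a), "f"(b))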