Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms
Onur Ulusel, Christopher Picardo, Christopher Harris, Sherief Reda, R. Iris Bahar, School of Engineering, Brown University
Image Processing
– Input data has changed from words/numbers to images
– Sensors have improved dramatically
– Autonomization relies on image processing
– Real-time computing + limited data bandwidth prefer local computing to offloading to cloud
– BUT image processing can be very computationally intensive and power hungry
Image sources: www.google.com, www.guardiantv.com
– Feature detection and feature description are key building blocks for image retrieval, biometric identification, visual odometry, etc.
– Computationally efficient detection and analysis of image features is critical for performance and energy efficiency
Image sources: http://www.sybernautix.com/, https://blog.pivotal.io, https://vision.in.tum.de
Class   Device                                 Power
FPGA    Xilinx Virtex 6                        15 W
FPGA    Xilinx Zynq 7020                       <5 W
GPU     NVIDIA GeForce GTX 680 (1536 cores)    195 W
GPU     NVIDIA Jetson TK1 (192 cores)          <12 W
Slide adapted from Darya Frolova, Denis Simakov, Weizmann Institute
flat region: no change within block
edge: no change along the edge
corner: change in all directions
– E.g., intensity, orientation
Images from: R. Szeliski, Computer Vision: Algorithms and Applications
– SIFT: Scale-Invariant Feature Transform
– SURF: Speeded Up Robust Features
– BRIEF: Binary Robust Independent Elementary Features
– BRISK: Binary Robust Invariant Scalable Keypoints
[Figure: accuracy percentage (precision and recall) and CPU run-time (ms, log scale) for SIFT, SURF, BRIEF, and BRISK. Precision: % of detected features that are correct. Recall: % of features detected.]
Bresenham Circle sliding window
– 12-pixel continuity? If so, then feature
– Pre-compare pixels 1, 5, 9, and 13 to determine possibility for continuity
– On average, 98.5% of the comparisons fail the continuity test at the pre-compare stage
Rosten and Drummond, ECCV’06
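The segment test with its pre-compare stage can be sketched in software as follows. This is a minimal illustration, not the hardware design from the slides: the function name `is_corner`, the threshold `t`, and the clockwise pixel numbering starting at the top of the circle are assumptions made for the example.

```python
# Offsets (dx, dy) of the 16-pixel Bresenham circle of radius 3.
# Assumed numbering: pixel 1 at the top, then clockwise (index = pixel - 1).
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_corner(img, x, y, t=20, n=12):
    """Segment test: True if n contiguous circle pixels are all brighter
    than img[y][x] + t or all darker than img[y][x] - t.
    The threshold t is a hypothetical default, not a value from the slides."""
    p = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]

    # Pre-compare stage: pixels 1, 5, 9, 13. Any 12-pixel arc covers at
    # least three of them, so fewer than three hits rules out a corner.
    # (This shortcut is only valid for n = 12.)
    compass = [ring[i] for i in (0, 4, 8, 12)]
    brighter = sum(v > p + t for v in compass)
    darker = sum(v < p - t for v in compass)
    if n == 12 and brighter < 3 and darker < 3:
        return False

    # Full continuity test, with wraparound handled by extending the ring.
    for sign in (1, -1):
        flags = [sign * (v - p) > t for v in ring]
        run = 0
        for f in flags + flags[:n - 1]:
            run = run + 1 if f else 0
            if run >= n:
                return True
    return False
```

Because most candidate pixels are rejected by the four-pixel pre-compare, the full 16-pixel test runs only for a small fraction of the image.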
Chosen sampling pattern results in a 512‐bit characterization array
BRISK sampling pattern: red circles represent the standard deviation of the Gaussian smoothing
[Flowchart: FAST detection + BRIEF description]
Start
Read input frame
Feature Detection: for each pixel, p, apply the 7x7 filter of the Bresenham circle. Is p a corner? (False: continue with the next pixel)
BRIEF Feature Description: generate N sampling pairs, Xi and Yi, around p. If Xi > Yi then Di = 1, else (False) Di = 0
Stop
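The description step in the flowchart (N sampling pairs, one bit per comparison) can be sketched as below. This is only an illustration under stated assumptions: the helper names `make_pairs` and `describe`, the random sampling pattern, the default N = 256, and the patch radius are hypothetical, and the smoothing applied before sampling is omitted.

```python
import random


def make_pairs(n=256, radius=3, seed=0):
    """Generate n hypothetical sampling pairs as ((ax, ay), (bx, by)) offsets.
    A fixed seed keeps the pattern identical for every keypoint."""
    rng = random.Random(seed)

    def offset():
        return (rng.randint(-radius, radius), rng.randint(-radius, radius))

    return [(offset(), offset()) for _ in range(n)]


def describe(img, x, y, pairs):
    """Build the binary descriptor for keypoint (x, y):
    bit i is 1 if Xi > Yi, otherwise 0."""
    bits = 0
    for i, ((ax, ay), (bx, by)) in enumerate(pairs):
        xi = img[y + ay][x + ax]
        yi = img[y + by][x + bx]
        if xi > yi:
            bits |= 1 << i
    return bits
```

Packing the comparison results into one integer mirrors the idea of an N-wide comparator producing all descriptor bits for a keypoint at once.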
[Flowchart: FAST detection + BRISK description]
Start
Read input frame
Feature Detection: for each pixel, p, apply the 7x7 filter of the Bresenham circle. Is p a corner? (False: continue with the next pixel)
Feature Description, with orientation compensation performed for BRISK: generate N sampling pairs, Xi and Yi, around p. If Xi > Yi then Di = 1, else (False) Di = 0
Stop
– Orientation compensation requires a significant amount of extra hardware resources
[Figure: system architecture. Labels recoverable from the diagram:
– Processing System (PS): ARM Cortex A9 CPU, central interconnect, 32b GP AXI master port, memory interface (DDR3)
– Programmable Logic (PL): AXI interconnect, control, 10-line word buffer, zig-zag traversing line buffers, buffer address generator and register array, mask size register array, pre-compute units, equality, circle comparator (x12), smoothing & region generation, N-wide comparator, orientation compensation
– Signals: image data, x & y coordinates, enable & sync signals, is_corner, descriptor]
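The role of the line buffers can be illustrated with a small software analogy: pixels arrive as a stream, only the most recent rows are kept, and each full 7x7 window is handed to the detector. This is a rough sketch, not the design in the diagram: it assumes plain raster order rather than the zig-zag traversal, emits windows once per completed row rather than once per pixel, and the name `windows` is hypothetical.

```python
from collections import deque


def windows(pixels, width, k=7):
    """Yield (cx, cy, window) for every full k x k window of a raster-order
    pixel stream, keeping only the k most recent rows in memory."""
    rows = deque(maxlen=k)  # software stand-in for the line buffers
    row = []
    y = 0                   # index of the row currently being filled
    for px in pixels:
        row.append(px)
        if len(row) == width:
            rows.append(row)
            row = []
            if len(rows) == k:
                # Slide the window across the buffered rows.
                for x in range(width - k + 1):
                    win = [r[x:x + k] for r in rows]
                    yield x + k // 2, y - k // 2, win
            y += 1
```

The point of the structure is that each pixel is read from external memory once, while every 7x7 neighborhood is still available to the comparison logic.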
FAST Feature Detection
BRIEF Descriptor
BRISK Descriptor
Feature detection logic
Descriptor logic
Data control logic for detection and description
Run-time (ms)
              Intel i7 CPU   ARM on Jetson   Tegra GPU on Jetson   Zynq FPGA
FAST               4.4            25.8              13.8              13.7
FAST+BRIEF        36.4           103.7              18.2              13.7
FAST+BRISK        50.8           111.7              27.6              13.7
FAST: detection. FAST+BRIEF and FAST+BRISK: detection + description.

Power (W)
              Intel i7 CPU   Embedded CPU   Tegra GPU   Zynq FPGA
FAST              21.5           6.5           4.3         2.20
FAST+BRIEF        22.2           6.2           6.3         2.27
FAST+BRISK        22.4           6.3           6.3         2.31

Energy (mJ)
              Intel i7 CPU   Embedded CPU   Tegra GPU   Zynq FPGA
FAST              94.0          166.8          59.3        30.00
FAST+BRIEF       809.4          696.4         114.1        30.96
FAST+BRISK      1137.7          705.1         174.3        31.58
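The energy figures appear to follow from energy = power x run-time (W x ms = mJ). The sketch below recomputes them from the run-time and power values on the slides and compares against the reported energy. It assumes the "ARM on Jetson" run-time column corresponds to the "Embedded CPU" power column; the function name `energy_mj` is hypothetical.

```python
PLATFORMS = ["Intel i7 CPU", "Embedded CPU", "Tegra GPU", "Zynq FPGA"]

RUNTIME_MS = {
    "FAST":       [4.4, 25.8, 13.8, 13.7],
    "FAST+BRIEF": [36.4, 103.7, 18.2, 13.7],
    "FAST+BRISK": [50.8, 111.7, 27.6, 13.7],
}
POWER_W = {
    "FAST":       [21.5, 6.5, 4.3, 2.20],
    "FAST+BRIEF": [22.2, 6.2, 6.3, 2.27],
    "FAST+BRISK": [22.4, 6.3, 6.3, 2.31],
}
REPORTED_MJ = {
    "FAST":       [94.0, 166.8, 59.3, 30.00],
    "FAST+BRIEF": [809.4, 696.4, 114.1, 30.96],
    "FAST+BRISK": [1137.7, 705.1, 174.3, 31.58],
}


def energy_mj(runtime_ms, power_w):
    """Energy per frame in millijoules: milliseconds times watts."""
    return runtime_ms * power_w


for alg in RUNTIME_MS:
    for i, name in enumerate(PLATFORMS):
        computed = energy_mj(RUNTIME_MS[alg][i], POWER_W[alg][i])
        reported = REPORTED_MJ[alg][i]
        diff = 100 * abs(computed - reported) / reported
        print(f"{alg:11s} {name:13s} computed {computed:8.2f} mJ, "
              f"reported {reported:8.2f} mJ, diff {diff:4.1f}%")
```

With these inputs, eleven of the twelve products land within about 1% of the reported energy. The FAST+BRIEF entry for the embedded CPU differs more (roughly 643 mJ computed vs. 696.4 mJ reported), which may come from rounding of the charted values.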
– Needs better data management
[Figure: Stall Reasons during GPU Computation. Share of stalls (0% to 45%) by reason (Execution Dependency, Pipe Busy, Memory Throttle, Not Selected) for Detection (FAST) and Description (BRIEF).]
[Figure: Instruction Distribution. Share of Load/Store vs. FP/Integer instructions (0% to 100%) for FPGA: FAST, FPGA: FAST + BRIEF, GPU: FAST, and GPU: FAST + BRIEF.]
– FAST + BRIEF has a bump in load/store ops
– Almost 10X more than just FAST
Distribution of resources
              Lookup Tables   Flip Flops   Block RAMs
FAST               10%           14%          27%
FAST+BRIEF         32%           19%          37%
FAST+BRISK         57%           66%          37%

Resource utilization
              Lookup Tables   Flip Flops   Block RAMs
FAST               4564          1551           8
FAST + BRIEF      14398          2093          11
FAST + BRISK      25575          7115          11
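The percentages in the distribution chart appear consistent with each configuration's share of the summed resource counts across the three configurations. The sketch below checks that reading using the counts from the utilization table; the interpretation and the function name `shares` are assumptions, not something stated on the slides.

```python
RESOURCES = ["Lookup Tables", "Flip Flops", "Block RAMs"]

COUNTS = {
    "FAST":       [4564, 1551, 8],
    "FAST+BRIEF": [14398, 2093, 11],
    "FAST+BRISK": [25575, 7115, 11],
}


def shares(counts):
    """For each resource, express every configuration's count as a rounded
    percentage of the total over all configurations."""
    totals = [sum(col) for col in zip(*counts.values())]
    return {
        name: [round(100 * v / t) for v, t in zip(vals, totals)]
        for name, vals in counts.items()
    }


for name, pct in shares(COUNTS).items():
    print(name, dict(zip(RESOURCES, pct)))
```

Under this reading, FAST + BRISK accounts for more than half of the lookup tables and about two thirds of the flip-flops, which matches the remark that orientation compensation is costly in hardware.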
– FAST + BRISK: 36 fps vs. 147 fps
– FPGA amenable to various HW optimizations:
– For GPUs, multiple kernels are tightly bound by the kernel scheduler and memory bottlenecks
– Layer-wise FPGA customization is better suited to operations that span multiple kernels
– Identify the nature of bottlenecks
– Customized FPGA HW can often better manage certain types of bottlenecks