Regionlet Object Detector with Hand-crafted and CNN Feature
Xiaoyu Wang, Snapchat Research
Authors: Xiaoyu Wang, Shenghuo Zhu, Ming Yang, Yuanqing Lin
Affiliations: Snapchat Research, Horizon Robotics, Alibaba Group, Baidu
Snapchat
Overview of this section
• Regionlet Object Detector
• Regionlet Localizer (re-localization)
• Regionlet with Deep CNN Feature
  • CNN Feature Extraction
  • Support Pixel Integral Image
• Application Examples
  • Car Detection for Fine-grained Image Classification
  • Pedestrian, Car, Cyclist Detection for Autonomous Driving
What is the Regionlet Object Detector?
• A significant extension of the traditional boosting object detector
• Together with OverFeat and R-CNN, the Regionlet detector is one of the first detectors to successfully adopt deep CNN features for generic object detection.
How does the Regionlet detector connect to past/future work?
[Figure: timeline centered on the Regionlet detector (2013). Past: RealBoost [1], segmentation as selective search [2], low-level features, deep CNN [3]. Regionlet components: boosting feature selection, object proposals, generalized spatial pyramid feature pooling. Future: CNN-based object detection, spatial pyramid pooling in SPP-Net [4], RoI pooling for CNN in Fast R-CNN [5].]
1. C. Huang, et al. Boosting nested cascade detector for multi-view face detection. ICPR, 2004.
2. K. E. A. Van de Sande, et al. Segmentation as selective search for object recognition. ICCV 2011.
3. Krizhevsky, et al. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
4. He, et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.
5. Ross Girshick. Fast R-CNN. ICCV 2015.
Boosting Object Detector
$f(X) = \sum_{i=1}^{N} \beta_i h_i(x_i)$
[Figure: a detection window containing sub-regions $x_1, x_2, \dots, x_N$. Each weak classifier $h_i$ is built on one sub-region $x_i$ of the detection window.]
Traditional Boosting Detection Framework
• Use multiple components to detect objects with various aspect ratios
• Operate on multiple scales to detect objects of different sizes
[Figure: Model 1 and Model 2.]
• How about a single model that is flexible during testing, with no feature pyramids and no multiple components?
What the Regionlet Detector Proposed
• A boosting classifier that can take inputs of different scales
• A boosting classifier that can take inputs of different viewpoints
• A boosting classifier that learns its feature pooling
Regionlet: Definition
• Region ($R$): feature extraction region
• Regionlet ($r_1$, $r_2$, $r_3$): a sub-region in a feature extraction area whose position/resolution are relative and normalized to a detection window
[Figure: a detection window showing a Region and its Regionlets.]
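Regionlet: Definition (cont.)
• Regionlet coordinates are normalized
  • Traditional: $(l, t, r, b)$, e.g. (50, 50, 180, 180)
  • Normalized: $(l/w,\ t/h,\ r/w,\ b/h)$, e.g. (.25, .25, .90, .90), where $w$ and $h$ are the width and height of the detection window
[Figure: the same regionlet labeled (50, 50, 180, 180) in pixel coordinates and (.25, .25, .90, .90) in normalized coordinates.]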
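The normalization on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names are hypothetical, boxes are assumed to be (left, top, right, bottom) tuples, and the 200×200 window is inferred from the slide's numbers (50 maps to .25 and 180 to .90).

```python
def normalize_box(box, window):
    """Express a pixel box (l, t, r, b) relative to a detection window (l, t, r, b)."""
    wl, wt, wr, wb = window
    w, h = wr - wl, wb - wt
    l, t, r, b = box
    return ((l - wl) / w, (t - wt) / h, (r - wl) / w, (b - wt) / h)


def denormalize_box(norm_box, window):
    """Map a normalized box back to pixels inside an arbitrary window."""
    wl, wt, wr, wb = window
    w, h = wr - wl, wb - wt
    nl, nt, nr, nb = norm_box
    return (wl + nl * w, wt + nt * h, wl + nr * w, wt + nb * h)


# Slide example: a regionlet inside an (assumed) 200x200 detection window.
regionlet = normalize_box((50, 50, 180, 180), (0, 0, 200, 200))

# The same normalized regionlet placed in a hypothetical 400x200 candidate box.
placed = denormalize_box(regionlet, (100, 40, 500, 240))
```

Because only the normalized tuple is stored, the same regionlet can be placed in a candidate box of any size or aspect ratio.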
Regionlet: Definition (cont.)
• Regionlet definition = Generalized Spatial Pyramid
  • Similarity
    • Both use relative coordinates
  • Differences
    • Regionlet: coordinates are relative to the detection window (not the image)
    • Regionlet: coordinates are flexible (they do not have to evenly divide the image/window)
• Regionlet feature extraction = Generalized Spatial Pyramid Pooling
[Figure: rectangles in a Spatial Pyramid vs. rectangles in a Generalized Spatial Pyramid.]
Connection to other methods in pooling design
[Figure: the Regionlet components (object proposals, generalized spatial pyramid feature pooling) related to CNN-based object detection: spatial pyramid pooling in SPP-Net [1] and RoI pooling for CNN in Fast R-CNN [2].]
1. He, et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.
2. Ross Girshick. Fast R-CNN. ICCV 2015.
Regionlet: Feature extraction
• Non-local pooling
• Could be hand-crafted features or deep CNN features, whatever feature you like!
Regionlet Classifier
• Each weak classifier is based on a 1-D feature extracted from a region
[Figure: feature extraction over the regionlets of a region produces the feature $x$.]
Weak classifier: $h(x) = \sum_{o=1}^{n-1} v^{o}\, \mathbb{1}(B(x) = o)$
Strong classifier: $H(X) = \sum_{i=1}^{T} \beta_i h_i(x_i)$
Detection Framework
(a): Input image
(b): Generate object regions [1, 2, 3]
(c): Feature extraction and pooling (Generalized Spatial Pyramid Pooling)
  • Inside a Regionlet: low-level features or CNN features (discussed later)
  • Max-pooling among Regionlets
[Figure: panels (a), (b), (c), with a Region and its Regionlets marked.]
1. K. E. A. Van de Sande, et al. Segmentation as selective search for object recognition. ICCV 2011.
2. B. Alexe, et al. Measuring the objectness of image windows. T-PAMI 2012.
3. S. Ren, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.
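Step (c) can be illustrated with a small NumPy sketch. It is a sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the feature map is a single 2-D channel, and average pooling inside each regionlet is an assumption made for illustration. Only the max-pooling among regionlets is taken from the slide.

```python
import numpy as np


def to_pixels(norm_box, window):
    """Place a normalized (l, t, r, b) box inside a pixel window (l, t, r, b)."""
    wl, wt, wr, wb = window
    w, h = wr - wl, wb - wt
    l = int(round(wl + norm_box[0] * w))
    t = int(round(wt + norm_box[1] * h))
    r = max(int(round(wl + norm_box[2] * w)), l + 1)  # keep at least one column
    b = max(int(round(wt + norm_box[3] * h)), t + 1)  # keep at least one row
    return l, t, r, b


def regionlet_feature(feature_map, window, regionlets):
    """Return one 1-D feature for a region: pool inside each regionlet,
    then max-pool among the regionlets."""
    responses = []
    for norm_box in regionlets:
        l, t, r, b = to_pixels(norm_box, window)
        # Assumption: average pooling inside a regionlet.
        responses.append(float(feature_map[t:b, l:r].mean()))
    return max(responses)


# Toy single-channel feature map with one strongly responding patch.
feature_map = np.zeros((100, 100))
feature_map[20:40, 60:80] = 1.0

regionlets = [(0.1, 0.2, 0.3, 0.4), (0.6, 0.2, 0.8, 0.4)]
value = regionlet_feature(feature_map, (0, 0, 100, 100), regionlets)
```

Taking the maximum over regionlets is what lets the feature tolerate a part that shifts from one regionlet to another.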
Multiple Scale & Viewpoint Handling
[Figure: the Regionlet model is adjusted to a candidate bounding box. Result: not a motorbike.]
Multiple Scale & Viewpoint Handling
[Figure: the Regionlet model is adjusted to a candidate bounding box. Result: motorbike detected.]
Weak Classifier Construction
• Weak learner on each REGION
  • Input: Regionlet feature $x_i$ (after pooling)
• Eight-lot lookup table
  • Lookup table is learned
  • Lot value is learned
• One lot is activated for one feature
Lookup table values (lots 1 to 8): 0.01, -0.2, -0.5, -0.4, 0.02, 0.15, 0.5, 0.3
Assign lot, then read the weak learner output: -0.5
$H(X) = \sum_{i=1}^{N} \mathrm{LUT}_i(x_i)$
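A minimal sketch of the lookup-table weak learner, using the eight lot values shown on the slide. How a feature value is assigned to a lot is not stated here, so the uniform quantization over [0, 1) below is an assumption, and the function names are hypothetical.

```python
# Lot values copied from the slide.
LUT = [0.01, -0.2, -0.5, -0.4, 0.02, 0.15, 0.5, 0.3]


def assign_lot(x, n_lots=8, lo=0.0, hi=1.0):
    """Assumption: uniform quantization of the feature value over [lo, hi)."""
    idx = int((x - lo) / (hi - lo) * n_lots)
    return min(max(idx, 0), n_lots - 1)  # clamp out-of-range features


def weak_learner(x, lut=LUT):
    """Exactly one lot is activated; its learned value is the output."""
    return lut[assign_lot(x, n_lots=len(lut))]


def strong_classifier(features, luts):
    """H(X): sum of the lookup-table outputs of all weak learners."""
    return sum(weak_learner(x, lut) for x, lut in zip(features, luts))


out = weak_learner(0.3)                              # third lot is activated
score = strong_classifier([0.3, 0.95], [LUT, LUT])   # -0.5 + 0.3
```

With this assumed quantization, a feature value of 0.3 activates the third lot and reproduces the weak learner output of -0.5 shown on the slide.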
Regionlet Training
• How to get regions and regionlets
  • Regions
    • Regions are randomly sampled
    • Effective regions are greedily selected to reduce learning cost
  • Regionlets
    • Each region and its regionlet configuration are randomly generated
    • A region and its regionlet configuration are selected simultaneously
    • The region & regionlet pool is fixed for each cascade learning
Regionlet: Training
• Constructing the regions/regionlets pool
  • Small region, fewer regionlets -> fine spatial layout
  • Large region, more regionlets -> robust to deformation
• Learning RealBoost [1] cascades
  • 16K region/regionlets candidates for each cascade
  • Learning of each cascade stops when the target error rates are reached (1% for positives, 37.5% for negatives)
  • The last cascade stops after collecting 5000 weak classifiers
  • Results in 4-7 cascades
  • 2-3 hours to finish training one category on an 8-core machine
1. C. Huang, et al. Boosting nested cascade detector for multi-view face detection. ICPR, 2004.
Regionlet: Testing
• No image resizing
• Any scale, any aspect ratio
• Adapt the model size to the same size as the object candidate bounding box
[Figure: three testing strategies. One model + resized image; multiple models + original image; ours: one model + original image.]
Overview of this section
• Regionlet Object Detector
• Regionlet Localizer
• Regionlet with Deep CNN Feature
  • CNN Feature Extraction
  • Support Pixel Integral Image
• Application Examples
  • Car Detection
  • Pedestrian, Car, Cyclist Detection for Autonomous Driving
Regionlet Localizer (object re-localization)
• Why a localizer is needed (classification & localization precision dilemma)
  • Training: data augmentation to accommodate inaccurate localization
  • VS
  • Testing: as accurate a location as possible
Regionlet Localizer
• Regionlet features can be reused for localization
  • Each Regionlet feature is associated with a spatial location
  • The location is learned during classifier training
Regionlet Localizer
• Regionlet features can be reused for localization
  • Regionlet classifier 1: 0 0 0 0 0 0 1 0
  • ...
  • Regionlet classifier N: 0 0 0 1 0 0 0 0
  • Concatenated: an 8N-dimensional binary vector
Regionlet Localizer
$[\,0\ 0\ 0\ 1\ 0\ 0\ 0\ 0\ \cdots\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 0\,] \times W \rightarrow (\Delta l, \Delta t, \Delta r, \Delta b)$
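The mapping from activated lots to box offsets can be sketched with NumPy. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the weight matrix `W` is assumed to have shape (8N, 4), and the values used for `W` below are placeholders rather than learned weights.

```python
import numpy as np


def activation_vector(lot_indices, n_lots=8):
    """Concatenate one one-hot block per weak classifier into an 8N-dim binary vector."""
    v = np.zeros(len(lot_indices) * n_lots)
    for k, lot in enumerate(lot_indices):
        v[k * n_lots + lot] = 1.0
    return v


def predict_offsets(lot_indices, W):
    """Binary vector times W gives the four box offsets (dl, dt, dr, db)."""
    return activation_vector(lot_indices) @ W


# Two weak classifiers; activated lots as in the one-hot blocks
# 0 0 0 0 0 0 1 0 and 0 0 0 1 0 0 0 0 (0-based indices 6 and 3).
lots = [6, 3]
W = np.arange(64, dtype=float).reshape(16, 4)  # placeholder, not learned
vec = activation_vector(lots)
offsets = predict_offsets(lots, W)
```

Since the vector has exactly one active entry per weak classifier, the product simply adds up one row of `W` per classifier.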
Regionlet Localizer Training
• Randomly sample examples that have > 0.6 overlap with the ground truth
  • Less overlap gives poor results
• The regression task learns the location difference
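The sampling rule and the regression target can be sketched as follows. Two assumptions are made for illustration: "overlap" is taken to be intersection-over-union, and the location difference is normalized by the sample box's width and height. The function names are hypothetical.

```python
def overlap(a, b):
    """Intersection-over-union of two (l, t, r, b) boxes (assumed overlap measure)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def location_difference(sample, truth):
    """Regression target: edge offsets from sample to ground truth,
    normalized by the sample size (an assumption)."""
    w, h = sample[2] - sample[0], sample[3] - sample[1]
    return ((truth[0] - sample[0]) / w, (truth[1] - sample[1]) / h,
            (truth[2] - sample[2]) / w, (truth[3] - sample[3]) / h)


truth = (0, 0, 100, 100)
candidates = [(10, 0, 110, 100), (50, 0, 150, 100)]

# Keep only samples with more than 0.6 overlap, as on the slide.
kept = [c for c in candidates if overlap(c, truth) > 0.6]
targets = [location_difference(c, truth) for c in kept]
```

Here the first candidate overlaps the ground truth by about 0.82 and is kept, while the second overlaps by about 0.33 and is discarded.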
Regionlet Localizer
• Experiment result on our car dataset for autonomous driving
  • 17501 cars for training
  • 12546 cars for testing
Detection performance (% AP)
                            0.5 overlap   0.7 overlap
Regionlet                   62.7%         34.6%
Regionlet + localization    65.3%         43.9%
Improvement                 2.6%          9.1%
Overview of this section
• Regionlet Object Detector
• Regionlet Localizer
• Regionlet with Deep CNN Feature
  • CNN Feature Extraction
  • Support Pixel Integral Image
• Application Examples
  • Car Detection
  • Pedestrian, Car, Cyclist Detection for Autonomous Driving
Regionlet with DCNN
• Deep CNN
  • Deep structure learns high-level information
  • Max-pooling is robust to part misalignment
  • Information is jointly learned
• How to build a bridge between the DCNN and the Regionlet object detection framework?
Regionlet with DCNN
• Deep CNN structure
  • Features from convolution layers retain spatial information
[Figure: the convolutional layers of the network.]
Regionlet with DCNN
• Deep CNN structure
  • Features from convolution layers retain spatial information
[Figure: a feature vector taken from the convolutional feature maps.]