R-CNN minus R
Karel Lenc, Andrea Vedaldi
Visual Geometry Group, Department of Engineering Science
Object detection
Goal: tightly enclose objects of a certain type in a bounding box.
Example classes: “bikes”, “planes”, “horses”, “birds”.
Top performer: Region proposals + CNN
[Figure: a proposal generation stage (WHERE) supplies image regions; a CNN (WHAT) labels each region, e.g. chair, background, potted plant.]
[Girshick et al. 2013, He et al. 2014, 2015]
Top performer: Region proposals + CNN
WHAT: convolutional neural networks, e.g. region classification with AlexNet, VGG VD [Krizhevsky et al. 2012, Simonyan Zisserman 2014]
WHERE: segmentation algorithm, e.g. region proposals from selective search [Uijlings et al. 2013]
Can a CNN understand where as well as what?
[Figure: the WHAT / WHERE diagram again, with a question mark over whether the convolutional neural network can take over the WHERE role of proposal generation.]
Approaches to object detection
Scanning windows
▶ Sliding windows: HOG detector [Dalal Triggs 2005], DPM [Felzenszwalb et al. 2008]
▶ Cascaded windows: AdaBoost [Viola Jones 2004], MKL [Vedaldi et al. 2009], Branch and Bound [Lampert et al. 2009]
▶ Jumping windows [Sivic et al. 2008]
▶ Selective windows [Endres and Hoiem 2010, Uijlings et al. 2011, Alexe et al. 2012, Gu et al. 2012]
Hough voting
▶ Implicit shape models [Amit Geman 1997, Leibe et al. 2003]
▶ Max margin [Maji Berg 2009], Random Forests [Gall Lempitsky 2009]
Classifiers & features
▶ linear SVMs, kernel SVM, Fisher Vectors, … [Cinbis et al. 2013, …]
▶ convolutional neural networks [Sermanet et al. 2014, Girshick et al. 2014, …]
▶ HOG, SIFT, C-SIFT, … [van de Sande et al. 2010, …]
▶ Segmentation cues, … [Shotton et al. 2008, Cinbis et al. 2013, …]
Evolution of object detection
[Figure: mAP [%] on PASCAL VOC 2007 data versus year (2008 to 2015) for DPM [Felzenszwalb et al.], MKL [Vedaldi et al.], DPMv5 [Girshick et al.], Regionlet [Wang et al.], RCNN-Alex [Girshick et al.] and RCNN-VGG [Girshick et al.].]
R-CNN [Girshick et al. 2013]
Pros: simple and effective
Cons: slow, as the CNN is re-evaluated for each tested region
[Figure: each region is passed through the whole network (c1 to c5, f6, f7, f8) and labelled by an SVM, e.g. chair, background, potted plant.]
SPP R-CNN [He et al. 2014]
[Figure: the convolutional layers c1 to c5 produce local features for the whole image; each region is then pooled, passed through f6, f7, f8 and labelled by an SVM, e.g. chair, bowl, potted plant.]
Convolutional features = local features
Region descriptor = pooled local features
Pooling encoder
▶ Spatial pyramid + max pooling [He et al. 2014]
▶ Bag of words, Fisher vector, VLAD, … [Cimpoi et al. 2015]
Orders of magnitude speedup
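To make "region descriptor = pooled local features" concrete, here is a minimal sketch of spatial pyramid max pooling over one region. It is an illustration under assumptions, not the implementation of [He et al. 2014]: the function name `spp_max_pool`, the single-channel feature map, the integer box coordinates in feature-map cells, and the pyramid levels `(1, 2)` are all hypothetical choices.

```python
def spp_max_pool(feature_map, box, levels=(1, 2)):
    """Pool one region of a single-channel feature map into a fixed-length vector.

    feature_map: list of rows (H x W) of numbers.
    box: (x1, y1, x2, y2) in feature-map cells, inclusive (assumed convention).
    levels: pyramid levels; level n splits the region into an n x n grid.
    """
    x1, y1, x2, y2 = box
    width, height = x2 - x1 + 1, y2 - y1 + 1
    descriptor = []
    for n in levels:
        for i in range(n):
            # Integer bin boundaries; every bin keeps at least one row.
            r0 = y1 + (i * height) // n
            r1 = max(r0 + 1, y1 + ((i + 1) * height) // n)
            for j in range(n):
                c0 = x1 + (j * width) // n
                c1 = max(c0 + 1, x1 + ((j + 1) * width) // n)
                # Max pooling inside the bin.
                descriptor.append(
                    max(feature_map[r][c]
                        for r in range(r0, r1)
                        for c in range(c0, c1))
                )
    return descriptor


fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
whole = spp_max_pool(fm, (0, 0, 3, 3))
inner = spp_max_pool(fm, (1, 1, 2, 2))
```

Both regions yield a descriptor of the same length (1 + 4 values here) even though their sizes differ, which is what lets the fully connected layers run on any region while the convolutional features are computed once.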
Computational cost
[Figure: detection time, avg. time per image [ms] (axis 0 to 12000), for R-CNN and SPP-CNN, split into Sel. Search and CNN evaluation.]
SPP-CNN results in a significant test-time speedup.
However, region proposal extraction is the new bottleneck.
R-CNN minus R: can we get rid of region proposal extraction?
Streamlining R-CNN and SPP-CNN Dropping proposal generation
A complex learning pipeline
(SPP) R-CNN training comprises many steps:
1. Pre-train a large CNN (on ImageNet)
2. Extract region proposals (on PASCAL VOC)
3. Use pre-processed regions to:
   1. Fine-tune the CNN
   2. Learn an SVM to rank regions
   3. Learn a bounding-box regressor to refine localization
[Figure: network c1 to c5, f6, f7, f8 with three outputs: label (fine tuning), label (ranking) from an SVM, and linear bounding box regression.]
A complex learning pipeline
(SPP) R-CNN training comprises many steps
[Figure: the same network, with part of it marked frozen.]
With SPP R-CNN of [He et al. 2014], fine-tuning is limited to the fully connected layers.
Streamlining R-CNN
Removing the SVM phase
Fine tuning (mAP 38.1)
  score: $S_c = \exp(\langle w_c, \phi(x) \rangle + b_c)$
  learning loss: $-\log \frac{S_{c_0}}{S_0 + S_1 + S_2 + \dots + S_C}$
Region ranking (mAP 59.8)
  scores: $Q_c = \langle w_c, \phi(x) \rangle + b_c$ for $c = 1, \dots, C$
  learning loss: $\max\{0, 1 - y Q_c\}$ for $c = 1, \dots, C$
Region ranking from fine-tuning (mAP 58.4)
  score: $Q_c = \log \frac{S_c}{S_0}$
Up to a simple transformation, softmax is just as good as hinge loss for box ranking.
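The "simple transformation" is the log-ratio of a class softmax score to the background score. Because the softmax normalisation cancels in that ratio, the ranking score is just the difference between the class and background pre-softmax activations. The sketch below checks this numerically; the function names and the convention that index 0 is the background class are assumptions made for illustration.

```python
import math


def softmax_probs(logits):
    """Softmax over pre-softmax activations (index 0 = background, assumed)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ranking_scores(logits):
    """Log-ratio of each class score to the background score, for classes 1..C.

    The shared normaliser cancels, leaving a difference of activations.
    """
    return [v - logits[0] for v in logits[1:]]


logits = [0.5, 2.0, -1.0]          # hypothetical activations: background, class 1, class 2
probs = softmax_probs(logits)
scores = ranking_scores(logits)
from_probs = [math.log(probs[c] / probs[0]) for c in (1, 2)]
```

Here `scores` and `from_probs` agree up to rounding, so a fine-tuned softmax layer can be reused for region ranking without training a separate SVM.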
Streamlining R-CNN and SPP-CNN
See also [Fast R-CNN and Faster R-CNN]
[Figure: the (SPP) R-CNN pipeline with frozen layers and a separate linear bounding box regressor, next to a single network (c1 to c5, SPP, f6, f7, f8) with a label output and a bounding box regression output (f reg).]
SPP and bounding box regression can be easily implemented in a CNN (with a DAG topology) and trained jointly in one step.
Streamlining R-CNN and SPP-CNN Dropping proposal generation
A constant-time region proposal generator
Algorithm
Preprocessing
▶ Collect all the training bounding boxes (x1, y1, x2, y2)
▶ Use K-means to extract K clusters in (x1, y1, x2, y2) space
Proposal generation
▶ Regardless of the image, return the same K cluster centers
Proposals are now very fast but very inaccurate.
We let the CNN compensate with the bounding box regressor.
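The two stages can be sketched in a few lines. This is a minimal plain-Python version of Lloyd's K-means, assuming all training boxes live in one shared coordinate frame; the function names `kmeans_boxes` and `propose`, the iteration count, and the seeded random initialisation are hypothetical choices, not details taken from the slides.

```python
import random


def kmeans_boxes(boxes, k, iters=20, seed=0):
    """Cluster (x1, y1, x2, y2) boxes with plain Lloyd iterations (sketch)."""
    rng = random.Random(seed)
    centers = [list(b) for b in rng.sample(boxes, k)]  # init from k training boxes
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for b in boxes:
            # Assign the box to the nearest center (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: sum((p - q) ** 2 for p, q in zip(b, centers[i])),
            )
            groups[nearest].append(b)
        for i, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return [tuple(c) for c in centers]


def propose(image, centers):
    """Constant-time generator: the image is ignored on purpose."""
    return centers


train_boxes = [(0, 0, 10, 10), (2, 2, 12, 12),
               (100, 100, 200, 200), (102, 102, 202, 202)]
centers = kmeans_boxes(train_boxes, k=2)
proposals_a = propose("image_a", centers)
proposals_b = propose("image_b", centers)
```

Every image receives the same K boxes, so proposal generation costs nothing at test time; all the per-image adaptation is left to the bounding box regressor.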
Proposal statistics on PASCAL VOC
[Figure: box statistics with panels for selective search, sliding windows, clustering and ground truth; box counts shown: 2K, 7K, 3K.]
Information pathways
[See also Lenc Vedaldi CVPR 2015]
[Figure: shared local features from the convolutional layers feed two paths: the what path (f6, f7, f8, label), an invariant representation, and the where path (linear bounding box regression), an equivariant representation.]
CNN-based bounding box regression
[Figure: example detections.]
Dashed line: proposals
Solid line: corrected by the CNN
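The correction from a proposal to a refined box can be sketched as follows. The slides do not state how the regression outputs are parameterised, so this example assumes the parameterisation commonly used with R-CNN: center shifts scaled by the proposal size, plus log-scale changes of width and height. The function name `apply_box_deltas` and the sample numbers are hypothetical.

```python
import math


def apply_box_deltas(box, deltas):
    """Refine a proposal (x1, y1, x2, y2) with regression outputs (dx, dy, dw, dh).

    Assumed parameterisation: dx, dy shift the center in units of the
    proposal's width and height; dw, dh rescale width and height in log space.
    """
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


proposal = (10.0, 20.0, 50.0, 60.0)                      # dashed box
unchanged = apply_box_deltas(proposal, (0.0, 0.0, 0.0, 0.0))
corrected = apply_box_deltas(proposal, (0.1, 0.0, math.log(2.0), 0.0))  # solid box
```

With zero outputs the proposal is returned as is; the second call moves the center right by a tenth of the width and doubles the width, giving roughly (-6, 20, 74, 60).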
Performance
[Figure: mAP (VOC07), axis 0.42 to 0.6, for Sel. Search (2K boxes), Slid. Win. (7K boxes), Clusters (2K boxes) and Clusters (7K boxes), each with a Baseline and a BBR bar.]
Observations
▶ Selective search is much better than fixed generators
▶ However, bounding box regression almost eliminates the difference
▶ Clustering allows using significantly fewer boxes than sliding windows
Timings
Finding (1): Streamlining accelerates SPP
[Figure: avg. time per image [ms] (axis 0 to 450) for SPP and Streamlined SPP, broken down into GPU↔CPU, Im. Prep., CONV Layers, Spat. Pooling, FC Layers and Bbox Regr.]
Timings
Finding (2): Dropping selective search is a huge benefit
[Figure: avg. time per image [ms] (axis 0 to 3000) for SPP, Streamlined SPP and Minus R, broken down into GPU↔CPU, Sel. Search, Im. Prep., CONV Layers, Spat. Pooling, FC Layers and Bbox Regr.]
Timings
Finding (2): Dropping selective search is a huge benefit
[Figure: the same breakdown with RCNN added (axis 0 to 12000 ms): RCNN, SPP, Streamlined SPP and Minus R.]
Timings
Test-time speedups (times faster than R-CNN)
RCNN: 1.0
SPP: 4.5
Streamlined SPP: 5.0
Minus R: 67.5
Conclusions
Current CNNs can localize objects well
▶ External segmentation cues bring only a minor benefit at a great expense
Benefits of CNN-only solutions
▶ Much faster, particularly at test time
▶ Much simpler and streamlined implementations
Future steps
▶ Eliminate the remaining accuracy gap
  ▶ Essentially achieved in [Faster R-CNN, Ren et al. 2015]
▶ Beyond bounding boxes
▶ Beyond detection