mil ut at ilsvrc2014

MIL-UT at ILSVRC2014



  1. MIL-UT at ILSVRC2014
     Senthil Purushwalkam (IIT Guwahati (undergrad) -> Virginia Tech (intern)), Yuichiro Tsuchiya, Atsushi Kanehira, Asako Kanezaki and *Tatsuya Harada
     The University of Tokyo

  2. Pipeline of CLS-LOC task
     Multiclass object detection
     1-1 Scoring each bounding box by R-CNN with hard negative classes
         Input image -> Extract region proposals -> Extract CNN features (fc7) -> Scoring regions by multiclass PA (averaged multiclass Passive Aggressive with hard negative mining)
     1-2 Scoring whole image by FV as contextual scores
         Whole image -> Extract FV with spatial information -> Scoring whole image by multiclass PA (averaged multiclass Passive Aggressive)
     Late fusion of the scores from 1-1 and 1-2

  3. Region Proposals and Feature Extraction
     • R-CNN
       – R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR, 2014.
     • Region proposals
       – Selective Search
       – J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders. Selective Search for Object Recognition. IJCV, 2013.
     • CNN features
       – Single CNN model (5 conv layers, 2 fully connected layers)
       – Pre-computed ILSVRC13 model
         http://www.cs.berkeley.edu/~rbg/r-cnn-release1-data-ilsvrc2013-caffe-proto-v0.tgz
       – No fine-tuning
       – 4096 dim fc7 features

  4. Multiclass Object Detection
     • Hard negative classes
       – Idea: Create ‘negative’ classes and train on 2K classes
       – A. Kanezaki, S. Inaba, Y. Ushiku, Y. Yamashita, H. Muraoka, Y. Kuniyoshi and T. Harada. Hard Negative Classes for Multiple Object Detection. ICRA, 2014.
       – Minimize detection errors as well as classification errors
     • Passive Aggressive algorithm with hard negative mining

  5. Multiclass object detection (training with negative classes)
     We use Passive Aggressive (PA) [Crammer et al., 2006] to learn multi-class linear classifiers.
     $$W_{t+1} = \arg\min_{W} \frac{1}{2}\|W - W_t\|^2 + C\zeta \quad \text{s.t.} \quad l(x_{i_t}, y_{i_t}; W) \le \zeta,\ \zeta \ge 0$$
     The scores of classes 1 to K are $W x_t = (w_1^T x_t, \dots, w_K^T x_t)$; the loss $l$ is the ERROR computed from these scores.
     • r: positive class
     • s: negative class with the highest score
     $$w_r^{(t+1)} = w_r^{(t)} + \tau_t x_t, \qquad w_s^{(t+1)} = w_s^{(t)} - \tau_t x_t$$
     where
     $$\tau_t = \min\left\{ C,\ \frac{1 - \left(w_r^{(t)T} x_t - w_s^{(t)T} x_t\right)}{2\|x_t\|^2} \right\}$$
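
The PA step on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the team's implementation: the function name `pa_update`, the default `C=1.0`, and the clamp of the loss at zero (the usual hinge form, which the slide's formula leaves implicit) are assumptions, and `x` is assumed to be nonzero.

```python
import numpy as np

def pa_update(W, x, r, s, C=1.0):
    """One multiclass PA step (sketch).

    W: (num_classes, dim) weight matrix, one row per class.
    x: (dim,) nonzero feature vector of the current sample.
    r: index of the positive class; s: index of the highest-scoring negative class.
    Returns the updated copy of W and the step size tau.
    """
    margin = W[r] @ x - W[s] @ x
    loss = max(0.0, 1.0 - margin)          # assumption: hinge loss, clamped at zero
    tau = min(C, loss / (2.0 * (x @ x)))   # step size, capped by C
    W = W.copy()
    W[r] += tau * x                        # raise the score of the positive class
    W[s] -= tau * x                        # lower the score of the competing class
    return W, tau

W0 = np.zeros((3, 2))
x = np.array([1.0, 0.0])
W1, tau1 = pa_update(W0, x, r=0, s=1)
W2, tau2 = pa_update(W1, x, r=0, s=1)
```

In this toy run the first step closes the margin exactly to 1, so the second call leaves the weights unchanged, which is the "passive" half of the algorithm.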

  6. Multiclass object detection (training with negative classes)
     Core idea: hard negative classes. Cf.) a single background class ($w_{bg}$) does not work.
     ERROR: $l(x_{i_t}, y_{i_t}; W)$
     The rows $w_1, \dots, w_K$ of $W$ give the scores of classes 1 to K for $x_t$; the rows $w'_1, \dots, w'_K$ give the scores of negative classes 1 to K.
     $$w_r^{(t+1)} = w_r^{(t)} + \tau_t x_t, \qquad w_s^{(t+1)} = w_s^{(t)} - \tau_t x_t$$
     where
     $$\tau_t = \min\left\{ C,\ \frac{1 - \left(w_r^{(t)T} x_t - w_s^{(t)T} x_t\right)}{2\|x_t\|^2} \right\}$$

  7. Multiclass object detection (training with negative classes)
     Ex.) If a training sample $x_t$ is a positive sample of class 2: classification error.
     ERROR: $l(x_{i_t}, y_{i_t}; W)$
     • r = class 2
     • s: negative class with the highest score
     • Candidates of s: class 1, 3, …, or K, or negative class 2
     The rows $w_1, \dots, w_K$ of $W$ give the scores of classes 1 to K for $x_t$; the rows $w'_1, \dots, w'_K$ give the scores of negative classes 1 to K.
     $$w_r^{(t+1)} = w_r^{(t)} + \tau_t x_t, \qquad w_s^{(t+1)} = w_s^{(t)} - \tau_t x_t$$
     where
     $$\tau_t = \min\left\{ C,\ \frac{1 - \left(w_r^{(t)T} x_t - w_s^{(t)T} x_t\right)}{2\|x_t\|^2} \right\}$$

  8. Multiclass object detection (training with negative classes)
     Ex.) If a training sample $x_t$ is a negative sample of class 2: detection error.
     ERROR: $l(x_{i_t}, y_{i_t}; W)$
     • s = class 2
     • r = negative class 2
     The rows $w_1, \dots, w_K$ of $W$ give the scores of classes 1 to K for $x_t$; the rows $w'_1, \dots, w'_K$ give the scores of negative classes 1 to K.
     $$w_r^{(t+1)} = w_r^{(t)} + \tau_t x_t, \qquad w_s^{(t+1)} = w_s^{(t)} - \tau_t x_t$$
     where
     $$\tau_t = \min\left\{ C,\ \frac{1 - \left(w_r^{(t)T} x_t - w_s^{(t)T} x_t\right)}{2\|x_t\|^2} \right\}$$
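
The two examples above (a positive and a negative sample of class 2) suggest how the pair (r, s) might be chosen before each PA step. The sketch below is one reading of those slides, not the authors' code: the function name `select_r_s`, the zero-based indexing, and the layout of the 2K scores (object classes in positions 0..K-1, their negative classes in positions K..2K-1) are all assumptions made for illustration.

```python
def select_r_s(scores, k, is_positive, K):
    """Pick the class to reinforce (r) and the class to suppress (s).

    scores: sequence of 2K scores; index c < K is object class c,
            index K + c is the negative class of object class c (assumed layout).
    k: zero-based object class of the training sample.
    is_positive: True for a positive sample of class k, False for a negative one.
    """
    if is_positive:
        # Classification error: compete against every other object class
        # and against the negative class of k itself.
        r = k
        candidates = [c for c in range(K) if c != k] + [K + k]
        s = max(candidates, key=lambda c: scores[c])
    else:
        # Detection error: the negative class of k should outscore class k.
        r = K + k
        s = k
    return r, s

K = 3
scores = [0.1, 0.9, 0.3, 0.0, 0.5, 2.0]
pos_pair = select_r_s(scores, k=1, is_positive=True, K=K)
neg_pair = select_r_s(scores, k=1, is_positive=False, K=K)
```

Note that in the positive case the negative classes of other objects are not candidates for s, following the candidate list on slide 7; here index 5 is ignored even though it has the highest score.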

  9. Features for Contextual Scores
     • Improved Fisher Vector
       – F. Perronnin, J. Sanchez, and T. Mensink. Improving the fisher kernel for large-scale image classification. ECCV, 2010.
       – INRIA's Fisher vector implementation
         http://lear.inrialpes.fr/src/inria_fisher/
       – L2 normalization, Power normalization, Spatial pyramid
     • Parameters of IFV for all local features in our system
       – Dimension reduction of local feature (D): 64 dim
       – # of components in GMM (K): 256
       – 5 scales of local patches
       – Spatial pyramid (P): 1x1 + 2x2 + 3x1 = 8
       – Dimension of IFK: 2PKD = 262,144 dim
     • Local Descriptors
       – SIFT
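
The feature dimension quoted on this slide follows directly from the listed parameters. The short check below only reproduces that arithmetic; the helper name `fisher_vector_dim` is made up for illustration.

```python
def fisher_vector_dim(D, K, P):
    """Dimension 2*P*K*D: mean and variance gradients (factor 2) for each of
    K GMM components over D-dimensional local features, in P pyramid cells."""
    return 2 * P * K * D

P = 1 * 1 + 2 * 2 + 3 * 1              # spatial pyramid cells: 1x1 + 2x2 + 3x1
dim = fisher_vector_dim(D=64, K=256, P=P)
print(P, dim)                           # 8 262144
```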

  10. Classifiers for Contextual Scores
      Pipeline step 1-2: scoring the whole image by multiclass PA (averaged multiclass Passive Aggressive) on the FV with spatial information.

  11. Online Learning for Large-Scale Visual Recognition
      • Y. Ushiku, M. Hidaka, T. Harada. Three Guidelines of Online Learning for Large-Scale Visual Recognition. CVPR, 2014.
      • [Equation: arg max rule over labels in $Y$ for sample $x_i$]
      • Averaging: $\bar{\mu} = \frac{1}{T}\left(\mu_1 + \mu_2 + \dots + \mu_T\right)$
      • Three guidelines
        1. Perceptron can compete against the latest methods.
           • Provided that the second guideline is observed.
        2. Averaging is necessary for any algorithm.
           • First-order algorithms w/o averaging cannot compete against second-order algorithms.
           • When averaging is used, the accuracies of all algorithms become very close to each other.
        3. Investigate multiclass learning first.
           • Both one-versus-the-rest learning and multiclass learning achieve similar accuracy.
           • However, one-versus-the-rest takes much longer CPU time to converge than multiclass does.
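
The averaging in the second guideline amounts to keeping the mean of the weights seen over all T online steps. A minimal sketch of an incremental version, assuming NumPy arrays for the weights; the class name `RunningAverage` is hypothetical and not taken from the cited paper:

```python
import numpy as np

class RunningAverage:
    """Incrementally tracks the mean of the weight arrays seen so far."""

    def __init__(self, shape):
        self.mean = np.zeros(shape)
        self.count = 0

    def update(self, W):
        # mean_T = mean_{T-1} + (W_T - mean_{T-1}) / T
        self.count += 1
        self.mean += (np.asarray(W, dtype=float) - self.mean) / self.count
        return self.mean

avg = RunningAverage(shape=(2,))
avg.update(np.array([1.0, 1.0]))
avg.update(np.array([3.0, 3.0]))
```

The incremental form avoids storing every intermediate weight matrix; at test time the averaged weights would be used in place of the final ones.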

  12. Late Fusion
      1-1 Scoring each bounding box by R-CNN
          Input image -> Extract region proposals -> Compute CNN features (fc7) -> Scoring regions by multiclass PA for each class
          Multiclass PA for class 1, …, j, …, 1000 gives $S^{CNN}_{i,1}, \dots, S^{CNN}_{i,j}, \dots, S^{CNN}_{i,1000}$
      1-2 Scoring whole image by FV as contextual scores
          Whole image -> Extract FV with spatial information -> Scoring by linear classifier trained by PA for each class
          Multiclass PA for class 1, …, j, …, 1000 gives $S^{FV}_{1}, \dots, S^{FV}_{j}, \dots, S^{FV}_{1000}$
      2. Rescoring by combining R-CNN feature and FV
         For bounding box $i$, class $j$:
         $$S^{new}_{i,j} = S^{CNN}_{i,j}\, S^{FV}_{j}$$
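
The rescoring step combines a per-box, per-class R-CNN score with a per-class whole-image FV score. The sketch below assumes the two are multiplied, which is how the juxtaposed terms on the slide read; that operator, the function name `late_fusion`, and the array shapes are assumptions rather than confirmed details of the team's system.

```python
import numpy as np

def late_fusion(s_cnn, s_fv):
    """Rescore boxes with contextual scores (sketch, product assumed).

    s_cnn: (num_boxes, num_classes) region scores from the R-CNN branch.
    s_fv:  (num_classes,) whole-image scores from the FV branch.
    Returns a (num_boxes, num_classes) array of fused scores.
    """
    s_cnn = np.asarray(s_cnn, dtype=float)
    s_fv = np.asarray(s_fv, dtype=float)
    return s_cnn * s_fv   # the per-class FV score is broadcast over all boxes

fused = late_fusion([[1.0, 2.0], [3.0, 4.0]], [0.5, 2.0])
```

Because the FV score does not depend on the box, it only reweights classes within an image; the ranking of boxes for a given class is left unchanged as long as that class's FV score is positive.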

  13. Results
      Validation dataset
      Method                                                               Localization error   Classification error
      R-CNN feature + one-vs-all SVMs                                      0.631743             0.460080
      R-CNN feature + multi-class PA                                       0.446121             0.285720
      R-CNN feature + multi-class PA using hard negative classes           0.387516             0.227200
      R-CNN feature + multi-class PA using hard negative classes, and FV   0.341743             0.18768

      Test dataset
      Team name        Localization error   Classification error
      VGG              0.253231             0.07405
      GoogLeNet        0.264414             0.14828
      SYSU_Vision      0.31899              0.14446
      MIL (our team)   0.337414             0.20734
