From Rigid Templates to Grammars: Object Detection with Structured Models

  1. From Rigid Templates to Grammars: Object Detection with Structured Models. Ross B. Girshick. Dissertation defense, April 20, 2012.

  2. The question: What objects are where?

  3. Why it matters
     • Intellectual curiosity: how do we extract this information from the signal?
     • Applications: semantic image and video search; human-computer interaction (e.g., Kinect); automotive safety; camera focus-by-detection; surveillance; semantic image and video editing; assistive technologies; medical imaging; ...

  4. Proxy task: PASCAL VOC Challenge
     • Localize and name (detect) 20 basic-level object categories: airplane, bicycle, bus, cat, car, dog, person, sheep, sofa, monitor, etc.
     • Input: an image; desired output: labeled bounding boxes (e.g., person, motorbike)
     • 11k training images with 500 to 8000 instances per category
     • Evaluation: bounding-box overlap and average precision (AP), where overlap(B_p, B_gt) = |B_p ∩ B_gt| / |B_p ∪ B_gt| (see the sketch below)
     Image credits: PASCAL VOC
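
The overlap criterion is simple to compute. Below is a minimal sketch in Python (not code from the PASCAL toolkit), assuming boxes are (x1, y1, x2, y2) tuples; in the VOC protocol a detection counts as correct when its overlap with a ground-truth box exceeds 0.5.

```python
def box_overlap(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).

    Implements overlap(B_p, B_gt) = |B_p intersect B_gt| / |B_p union B_gt|.
    """
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```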

  5. Challenges
     • Deformation
     • Viewpoint
     • Subcategory
     • Variable structure
     • Occlusion
     • Background clutter
     • Photometric

  6. Challenges: Deformation. Image credit: http://i173.photobucket.com/albums/w78/yahoozy/MultipleExposures2.jpg

  7. Challenges: Viewpoint. Image credits: PASCAL VOC

  8. Challenges: Subcategory ("airplane" images). Image credits: PASCAL VOC

  9. Challenges: Variable structure. Image credits: PASCAL VOC

  10. PASCAL VOC Challenges 2007-2011
     • 2007 Challenge winner: deformable part models and latent SVM [FMR'08], 21% mAP (prior work; the baseline for this dissertation)
     • The winners of the 2008 and 2009 Challenges built on this work
     • Fast forward to the 2011 Challenge:
       - Our system (voc-release4): 34% mAP
       - Top system (NLPR): 41% mAP; the NLPR method is voc-release4 + LBP image features + a richer spatial model (GMM) + more context rescoring
       - Second place (MIT-UCLA) and third place (Oxford) were also based on voc-release4

  11. Contributions, by area
     • Object representation*: mixture models (in PAMI'10); latent orientation; person grammar model
     • Efficient detection algorithms*: cascaded detection for DPM (oral at CVPR'10)
     • Learning*: weak-label structural SVM (spotlight at NIPS'11)
     • Detection post-processing: bounding-box prediction and context rescoring
     • Image representation: enhanced HOG features; features for boundary truncation and small objects
     • Software: voc-release{2,3,4}, currently the "go to" object detection system

  12. Object representation

  13. Model lineage: Dalal & Triggs [Dalal and Triggs '05]
     • Histogram of Oriented Gradients (HOG) features, computed over an image pyramid to give a HOG feature pyramid
     • Scanning window detector: a single linear "root filter" scores each window at position p in image x as score(x, p) = w · ψ(x, p) (see the sketch below)
     • w learned by SVM
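
To make the scoring rule score(x, p) = w · ψ(x, p) concrete, the sketch below scores a linear filter at every position of one level of a feature pyramid by dense cross-correlation. The function and array names are assumptions for illustration, not the Dalal & Triggs implementation.

```python
import numpy as np

def filter_responses(feat, filt):
    """Score a linear filter at every window of one pyramid level.

    feat: (H, W, d) feature map (e.g., a grid of HOG cells)
    filt: (h, w, d) learned filter weights
    Returns an (H-h+1, W-w+1) map whose entry at (y, x) is the dot
    product w . psi(x, p) for the window with top-left cell (y, x).
    """
    fh, fw, _ = filt.shape
    H, W, _ = feat.shape
    out = np.empty((H - fh + 1, W - fw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = float(np.sum(feat[y:y + fh, x:x + fw] * filt))
    return out
```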

  14. Model lineage: Latent SVM DPM [FMR'08]
     • Dalal & Triggs + parts in a deformable configuration z = (p_1, ..., p_n)
     • Score at root location p_0: score(x, p_0) = max_{z ∈ Z(p_0)} w · ψ(x, (p_0, z))
     • Scanning window detection: maximize over z at each p_0 in the HOG feature pyramid (see the sketch below)
     • w learned by latent SVM
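
The max over latent placements z decomposes per part. The sketch below computes it by brute force over a small displacement window with a quadratic deformation cost; this is only an illustration under simplified assumptions: the actual system computes the same maximization in linear time with a generalized distance transform and places parts at twice the root's resolution.

```python
import numpy as np

def dpm_score(root_resp, part_resps, anchors, defs, radius=4):
    """score(x, p0) for every root placement p0 at one pyramid level:

        root(p0) + sum_i max_{dy,dx} [ part_i(p0 + a_i + (dy, dx))
                                       - d_i . (dy^2, dx^2) ]

    root_resp:  (H, W) root filter response map
    part_resps: list of (H, W) part filter response maps (same level
                here, for simplicity)
    anchors:    list of (ay, ax) ideal part offsets relative to the root
    defs:       list of (dy2, dx2) quadratic deformation weights
    """
    H, W = root_resp.shape
    total = root_resp.astype(float).copy()
    for resp, (ay, ax), (dy2, dx2) in zip(part_resps, anchors, defs):
        best = np.full((H, W), -np.inf)  # roots whose part falls outside stay -inf
        for y in range(H):
            for x in range(W):
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        py, px = y + ay + dy, x + ax + dx
                        if 0 <= py < H and 0 <= px < W:
                            s = resp[py, px] - dy2 * dy * dy - dx2 * dx * dx
                            best[y, x] = max(best[y, x], s)
        total += best
    return total
```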

  15. Superposition of views

  16. Mixture of DPMs (examples: person, car)
     • Training (component labels are hidden):
       - Cluster training examples by bounding-box aspect ratio (a sketch follows below)
       - Initialize root filters for each component (cluster) independently
       - Merge components into a mixture model and train with latent SVM
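
A plausible sketch of the first training step, splitting examples into components by bounding-box aspect ratio; the actual voc-release initialization differs in its details.

```python
def split_by_aspect(boxes, n_components=3):
    """Group training boxes into mixture components by aspect ratio:
    sort by width/height and cut into equal-sized groups.

    boxes: list of (x1, y1, x2, y2) ground-truth boxes.
    Returns n_components lists of example indices.
    """
    order = sorted(range(len(boxes)),
                   key=lambda i: (boxes[i][2] - boxes[i][0]) /
                                 float(boxes[i][3] - boxes[i][1]))
    n = len(order)
    return [order[c * n // n_components:(c + 1) * n // n_components]
            for c in range(n_components)]
```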

  17. Mixtures with latent orientation [GFM voc-release4]
     • Learning without latent orientation produces a "pushmi-pullyu" model (a two-headed template) instead of a horse
     • Learning with latent orientation produces a clean right-facing horse model

  18. Unsupervised orientation clustering
     • Online clustering with a hard constraint: starting from a seed, assign the i-th example to the nearest of two clusters; its horizontally flipped copy must go to the other cluster (see the sketch below)
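
A minimal sketch of this constrained online clustering, assuming each example's feature vector and the feature of its mirrored copy are precomputed; the names and the running-mean update rule are assumptions for illustration.

```python
import numpy as np

def orientation_clusters(feats, flipped_feats):
    """Two-cluster online assignment with the hard mirror constraint.

    feats, flipped_feats: (N, d) arrays; row i of flipped_feats is the
    feature of the horizontally flipped copy of example i.
    Each example joins the nearer cluster mean; its flipped copy is
    forced into the other cluster.
    """
    centers = [feats[0].astype(float).copy(),
               flipped_feats[0].astype(float).copy()]  # seed the clusters
    counts = [1, 1]
    labels = [0]
    for i in range(1, len(feats)):
        k = 0 if (np.linalg.norm(feats[i] - centers[0]) <=
                  np.linalg.norm(feats[i] - centers[1])) else 1
        labels.append(k)
        # running-mean updates; the flipped copy goes to the other cluster
        for j, f in ((k, feats[i]), (1 - k, flipped_feats[i])):
            counts[j] += 1
            centers[j] += (f - centers[j]) / counts[j]
    return labels, centers
```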

  19. Latent orientation improves performance (horse models, AP on PASCAL 2007)

      Model type           Components   AP
      Single component     1            42.1
      Mixture model        3            47.3
      Latent orientation   2x3          56.8

  20. Results: mixture models and latent orientation (AP scores using the PASCAL 2007 evaluation)
     • Mixture models boost mAP by 3.7 points
     • Latent orientation boosts mAP by 2.6 points
     • Overall: a 12-point AP improvement (>50% relative) over the baseline

  21. Efficient detection

  22. Cascaded detection for DPM
     • Add in parts one-by-one and prune partial scores
     • Sparse dynamic programming tables (reuse computation!)

  23. Threshold selection & PCA filters
     • Data-driven threshold selection: based on statistics of partial scores on training data; provably safe ("probably approximately admissible" thresholds); empirically effective
     • 2-stage cascade with simplified appearance models: use PCA of HOG features (or model filters); stage 1 places low-dimensional filters, stage 2 places the original filters (a pruning sketch follows below)
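
The core pruning loop can be sketched as follows. This is a simplification: the two-stage structure, PCA filters, and deformation costs are abstracted behind per-part scoring functions, and the learned thresholds are taken as given.

```python
def cascade_score(part_scorers, thresholds, hypothesis):
    """Evaluate parts one-by-one, pruning on partial scores.

    part_scorers: part_scorers[i](hypothesis) -> score contribution of
                  part i (cheap PCA filters first, full filters later)
    thresholds:   thresholds[i] = minimum admissible partial score after
                  placing parts 0..i, chosen from training-set statistics
    Returns the full score, or None if the hypothesis is pruned early.
    """
    partial = 0.0
    for scorer, threshold in zip(part_scorers, thresholds):
        partial += scorer(hypothesis)
        if partial < threshold:
            return None  # cannot become a confident detection; prune
    return partial
```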

  24. Results: 15x average speedup with no loss in mAP
     [Figure: two precision-recall curves, PASCAL 2007 comp3, class motorbike]
     • High-recall setting: baseline AP 48.7 vs. cascade AP 48.9; 23.2x faster (618 ms per image)
     • Lower-recall setting (lower recall ⇒ faster): baseline AP 48.7 vs. cascade AP 41.8; 31.6x faster (454 ms per image)

  25. Towards richer grammar models

  26. People are complicated
     [Example image annotations: helmet; ski cap; no face; pirate hat; long hair; dresses; truncation; holding glass; occluded left side; heavy occlusion]
     • Objects from visually rich categories have diverse structural variation

  27. Compositional models
     • More mixture components? No: there are too many combinations!
     • AP progression so far: [DT'05] 0.12, [FMR'08] 0.27, [FGMR'10] 0.36, [GFM voc-release4] 0.42
     • Instead: compositional models defined by grammars

  28. Object detection grammars [FM'10]
     • A modeling language for building object detectors:
       - Terminals (model image appearance)
       - Nonterminals (objects, parts, ...)
       - Weighted production rules (define composition and variable structure)
     • Composition: objects are recursively composed of other objects (parts)
     • Variable structure: expanding different rules produces different structures, e.g.
       - Person → Head, Torso, Arms, Legs
       - Head → Eye, Eye, Mouth
       - Mouth → Smile  OR  Mouth → Frown

  29. Object detection grammars
     • Object hypothesis = derivation tree T, e.g. T: Person(x, y, l) → Root(x, y, l), Part_1(x_1, y_1, l_1), ..., Part_N(x_N, y_N, l_N), where each placement p = (x, y, l) is a position and pyramid level
     • Linear score function: score(x, T) = w · ψ(x, T)
     • Detection with dynamic programming: T*(x) = argmax_{T ∈ T(x)} w · ψ(x, T) (a toy sketch follows below)
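
To make the grammar machinery concrete, here is a toy sketch (not the dissertation's model) of weighted production rules and the dynamic program that scores the best derivation, using the Person/Head/Mouth rules from the previous slide. Placements are omitted: a real detector also maximizes over positions and scales at every symbol, and the terminal scores below stand in for filter responses against the image.

```python
# OR-structure: each nonterminal lists alternative weighted rules;
# AND-structure: a rule's right-hand side is a composition of symbols.
GRAMMAR = {
    "Person": [(0.0, ("Head", "Torso", "Arms", "Legs"))],
    "Head":   [(0.0, ("Eye", "Eye", "Mouth"))],
    "Mouth":  [(0.4, ("Smile",)), (0.1, ("Frown",))],  # variable structure
}
# Stand-ins for terminal (appearance filter) scores.
TERMINALS = {"Eye": 1.2, "Torso": 2.0, "Arms": 0.7,
             "Legs": 0.9, "Smile": 0.5, "Frown": 0.3}

def best_derivation_score(symbol, memo=None):
    """max over derivation trees T rooted at `symbol` of their score:
    choose the best rule (max over OR) and sum the best scores of the
    chosen rule's children (AND), memoizing shared subproblems."""
    memo = {} if memo is None else memo
    if symbol in TERMINALS:
        return TERMINALS[symbol]
    if symbol not in memo:
        memo[symbol] = max(
            weight + sum(best_derivation_score(s, memo) for s in rhs)
            for weight, rhs in GRAMMAR[symbol])
    return memo[symbol]

print(best_derivation_score("Person"))  # score of the best derivation tree
```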

  30. Build on what works: can we build a better person detector?

  31. Case study: a person detection grammar
     [Figure: two part subtypes, parts 1-6 and an occluder; example detections and derived filters for derivations using parts 1-6 (no occlusion), parts 1-4 & occluder, and parts 1-2 & occluder]
     • Fine-grained occlusion
     • Sharing across all derivations
     • A model of the stuff that causes occlusion
     • Part subtypes and multiple resolutions
     • Parts have subparts (not pictured)

  32. Training models
     • PASCAL data provides bounding-box labels only
     • No derivation trees are given! (weakly-supervised learning)
     • Learn the parameters w

  33. Defining examples
     • Each bounding box is a foreground example
     • All locations in background images are background examples
     • From these examples, learn the prediction rule f_w(x) = argmax_{s ∈ S(x)} w · ψ(x, s): x is the input example, S(x) is the set of possible outputs (derivation trees), ψ is the feature map, and f_w(x) is the predicted output

  34. Parameter learning
     • Richer models bring richer learning problems: one good output... and many bad ones!
     • Which learning framework should we use?

  35. Classification training
     • LSVM objective: E(w) = ½‖w‖² + C Σ_i max(0, 1 - y_i f_w(x_i)), where f_w(x_i) = max_{s ∈ S(x_i)} w · ψ(x_i, s)
     • Training: two different derivations on the same example are each told to "score +1 here"
     • Testing: who wins? Both derivations were trained to score +1, so classification training does not say.

  36. Structured output training
     • Training: the good output must outscore all other outputs by a margin; bad outputs must score lower by a margin
     • Testing: good output vs. bad output; a "good" output should win

  37. Latent structural SVM [Yu and Joachims]
     • Objective: E(w) = ½‖w‖² + C Σ_i L_margin(w, x_i, y_i)
     • Surrogate loss: L_margin(w, x, y) = max_{(ŷ, ẑ) ∈ Y×Z} [w · ψ(x, ŷ, ẑ) + L_margin(y, ŷ)] - max_{ẑ ∈ Z} w · ψ(x, y, ẑ)
     • The objective and the task loss L_margin(y, ŷ) might be inconsistent: many outputs can have zero task loss, yet the LSSVM "requires" the training label y itself (see the sketch below)
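
For concreteness, a sketch of the surrogate loss for one example, with both maximizations done by enumeration; in practice they are inference problems solved with the detection dynamic program, and the helper names here are hypothetical.

```python
import numpy as np

def lssvm_example_loss(w, x, y, outputs, latents, psi, task_loss):
    """L_margin(w, x, y) =
         max_{(yh, zh)} [ w . psi(x, yh, zh) + L_margin(y, yh) ]
       - max_{zh}         w . psi(x, y,  zh)

    outputs, latents: candidate outputs Y and latent values Z, assumed
                      small enough to enumerate in this sketch
    psi:              joint feature map, psi(x, yh, zh) -> vector
    task_loss:        L_margin(y, yh), e.g. 1 minus box overlap
    """
    loss_augmented = max(np.dot(w, psi(x, yh, zh)) + task_loss(y, yh)
                         for yh in outputs for zh in latents)
    label_score = max(np.dot(w, psi(x, y, zh)) for zh in latents)
    return loss_augmented - label_score
```

Note that the second term is anchored to the training label y even when other outputs have zero task loss, which is the inconsistency the weak-label structural SVM contribution addresses.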
