Two-stage object detectors

  1. Two-stage object detectors CV3DST | Prof. Leal-Taixé

  2. Types of object detectors • One-stage detectors: Image → Feature extraction → Classification (class score: cat, dog, person) and Localization (bounding box: x, y, w, h) • Two-stage detectors: Image → Feature extraction → Extraction of object proposals → Classification (class score: cat, dog, person) and Localization (refine bounding box: Δx, Δy, Δw, Δh)

  4. Localization • Bounding box regression: Image → Feature extraction (this time with a Neural Network) → Output: Box coordinates (x,y,w,h) • L2 loss function between the output and the ground truth box coordinates

  5. Localization • Bounding box regression: Image → Convolutional Neural Network → Output: Box coordinates (x,y,w,h) • L2 loss function between the output and the ground truth box coordinates
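The L2 loss on this slide can be written out in a few lines. The following is a minimal sketch, assuming boxes are plain (x, y, w, h) tuples and that the loss is the unnormalized sum of squared differences; the function name `l2_box_loss` and the example numbers are illustrative and not taken from the slides.

```python
def l2_box_loss(pred, target):
    """Sum of squared differences between two (x, y, w, h) boxes."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


# Illustrative values only: a prediction that is slightly off in x, y and h.
predicted = (48.0, 52.0, 100.0, 80.0)
ground_truth = (50.0, 50.0, 100.0, 90.0)
loss = l2_box_loss(predicted, ground_truth)
print(loss)
```

In a real network the same quantity would be computed on tensors and averaged over a batch; the arithmetic per box is unchanged.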

  6. Localization and classification • Bounding box regression: Image → Convolutional Neural Network → Fully connected → Output: Box coordinates (x,y,w,h)

  7. Localization and classification • Image → Convolutional Neural Network → Fully connected, with two outputs: – Bounding box regression: Output: Box coordinates (x,y,w,h), L2 loss – Classification: Output: Class scores, Softmax loss
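The two losses attached to the two outputs can be sketched as follows. This is a hedged illustration in pure Python, assuming raw class scores (logits) as a list and boxes as (x, y, w, h) tuples; the helper names are hypothetical, and the slides do not say whether or how the two losses are combined, so the function simply returns both.

```python
import math


def softmax_cross_entropy(scores, label):
    """Softmax loss for one example: -log softmax(scores)[label]."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[label]


def l2_box_loss(pred, target):
    """Sum of squared differences between two (x, y, w, h) boxes."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def head_losses(class_scores, true_label, pred_box, gt_box):
    """Return (classification loss, regression loss) for one image."""
    return (softmax_cross_entropy(class_scores, true_label),
            l2_box_loss(pred_box, gt_box))


# Illustrative values: three classes (cat, dog, person), true class is index 0.
cls_loss, box_loss = head_losses([2.0, 0.5, -1.0], 0,
                                 (10.0, 10.0, 40.0, 30.0),
                                 (12.0, 10.0, 40.0, 30.0))
print(cls_loss, box_loss)
```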

  8. Localization and classification • Image → Convolutional Neural Network, followed by two heads: – Regression head: Output: Box coordinates (x,y,w,h) – Classification head: Output: Class scores

  9. Localization and classification • It was typical to train the classification head first, then freeze the layers • Then train the regression head • At test time, we use both! Sermanet et al., “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014

  10. Overfeat • Sliding window + box regression + classification • Image (221 x 221 x 3) → Convolutional Neural Network → Feature map (5 x 5 x 1024) → Boxes (1000 x 4) and Class scores (1000) Sermanet et al., “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014

  11. Overfeat • Sliding window + box regression + classification • Image (468 x 356 x 3) Sermanet et al., “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014

  15. Overfeat • Sliding window + box regression + classification • Image (468 x 356 x 3) • We end up with many predictions and we have to combine them for a final detection (in Overfeat they have a greedy method) Sermanet et al., “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
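To make the "combine many predictions" step concrete, here is a sketch of greedy non-maximum suppression (NMS). Note that this is a swapped-in technique: it is not Overfeat's own greedy merging procedure, which the slide does not describe, but a widely used greedy way of reducing overlapping predictions. Boxes are assumed to be (x, y, w, h) with (x, y) the top-left corner, and the 0.5 IoU threshold is an illustrative choice.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes, (x, y) = top-left."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps
    an already kept box by more than iou_threshold. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping predictions and one separate prediction.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 10, 10)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))
```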

  17. Overfeat • In practice: use many sliding window locations and multiple scales • Window positions + score maps → Box regression outputs → Final predictions Sermanet et al., “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
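The idea of many window locations at multiple scales can be sketched by enumerating the windows explicitly. This is only an illustration: the 221-pixel window matches the network input size quoted on the slides, but the stride and the scale factors are assumed values, and the function name is hypothetical. A convolutional implementation would not crop each window separately.

```python
def sliding_windows(img_w, img_h, win=221, stride=36, scales=(1.0,)):
    """List (x, y, w, h) windows in original-image coordinates.

    For each scale the image is (conceptually) resized by that factor and a
    win x win window is slid over it with the given stride.
    """
    windows = []
    for s in scales:
        scaled_w = int(round(img_w * s))
        scaled_h = int(round(img_h * s))
        for y in range(0, scaled_h - win + 1, stride):
            for x in range(0, scaled_w - win + 1, stride):
                # Map the window back to the original image resolution.
                windows.append((x / s, y / s, win / s, win / s))
    return windows


# Window count for the 468 x 356 example image at one and at three scales.
print(len(sliding_windows(468, 356)))
print(len(sliding_windows(468, 356, scales=(0.75, 1.0, 1.5))))
```

Every window produces its own class scores and box, which is why a merging step is needed afterwards.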

  18. Overfeat • Sliding window + box regression + classification • Image (221 x 221 x 3) → Convolutional Neural Network → Feature map (5 x 5 x 1024) → Boxes (1000 x 4) and Class scores (1000) • What prevents us from dealing with any image size? Sermanet et al., “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014

  19. What about multiple objects? • Localization: Regression • How about detection?

  20. What about multiple objects? • Localization: Regression • How about detection? 3 objects means having an output of 12 numbers (3 x 4)

  21. What about multiple objects? • Localization: Regression • How about detection? 14 objects means having an output of 56 numbers (14 x 4)

  22. What about multiple objects? • Localization: Regression • How about detection? • Having a variable-sized output is not optimal for Neural Networks • There are a couple of workarounds: – RNN: Romera-Paredes and Torr. Recurrent Instance Segmentation. ECCV 2016. – Set prediction: Rezatofighi, Kaskman, Motlagh, Shi, Cremers, Leal-Taixé, Reid. Deep Perm-Set Net: Learn to predict sets with unknown permutation and cardinality using deep neural networks. arXiv:1805.00613

  23. Detection as classification? • Localization: Regression • How about detection? Regression • Is this a Flamingo? NO

  25. Detection as classification? • Localization: Regression • How about detection? Regression • Is this a Flamingo? YES!

  26. Detection as classification? • Localization: Regression • How about detection? Classification • Problem: – Expensive to try all possible positions, scales and aspect ratios – How about trying only on a subset of boxes with most potential?
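A quick count shows why trying every box is expensive. The sketch below assumes boxes are axis-aligned with integer pixel corners, so a box is fixed by choosing two distinct x coordinates and two distinct y coordinates on the pixel grid; the function name is illustrative.

```python
def count_boxes(width, height):
    """Number of axis-aligned boxes with integer corners in a width x height image.

    There are width + 1 vertical grid lines, so width * (width + 1) / 2 ways to
    pick a left/right pair, and likewise for top/bottom.
    """
    x_pairs = width * (width + 1) // 2
    y_pairs = height * (height + 1) // 2
    return x_pairs * y_pairs


# The 468 x 356 image used in the Overfeat slides.
print(count_boxes(468, 356))
```

For that image the count is in the billions, against roughly 2000 boxes when only region proposals are classified.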

  27. Region Proposals • We have already seen a method that gives us “interesting” regions in an image that potentially contain an object • Step 1: Obtain region proposals • Step 2: Classify them.
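The two steps can be sketched as a small driver function. Everything here is a placeholder: `propose_regions` and `classify_region` are hypothetical callables standing in for whichever proposal method and classifier are used, and the "background" label and score threshold are assumptions made for the example.

```python
def detect(image, propose_regions, classify_region, score_threshold=0.5):
    """Step 1: obtain region proposals. Step 2: classify each of them."""
    detections = []
    for box in propose_regions(image):
        label, score = classify_region(image, box)
        # Keep only confident, non-background regions.
        if label != "background" and score >= score_threshold:
            detections.append((box, label, score))
    return detections


# Toy stand-ins so the sketch runs end to end.
def toy_proposals(image):
    return [(0, 0, 10, 10), (5, 5, 20, 20)]


def toy_classifier(image, box):
    return ("cat", 0.9) if box[0] == 0 else ("background", 0.8)


print(detect(None, toy_proposals, toy_classifier))
```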

  28. The R-CNN family

  29. R-CNN Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014

  30. R-CNN • Warping to a fixed size 227 x 227 • Extract features • Classification head • Regression head to refine the bounding box location Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014
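Refining a proposal with predicted offsets can be sketched as below. The slides only say that the regression head refines the box; the specific parametrization used here (center shifts scaled by the proposal size, log-space width and height changes) is the one described in the R-CNN paper and is stated as an assumption about what the regression head outputs. Boxes are taken as (center x, center y, width, height).

```python
import math


def apply_deltas(proposal, deltas):
    """Refine a proposal (cx, cy, w, h) with offsets (dx, dy, dw, dh)."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    return (px + pw * dx,        # shift center x by a fraction of the width
            py + ph * dy,        # shift center y by a fraction of the height
            pw * math.exp(dw),   # rescale width in log space
            ph * math.exp(dh))   # rescale height in log space


# Illustrative values: move the box right by half its width, up by its height.
print(apply_deltas((50.0, 50.0, 20.0, 10.0), (0.5, -1.0, 0.0, 0.0)))
```

Zero offsets leave the proposal unchanged, which matches the role of this head as a correction applied on top of the proposal.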

  31. R-CNN • Training scheme: – 1. Pre-train the CNN on ImageNet – 2. Finetune the CNN on the number of classes the detector is aiming to classify (softmax loss) – 3. Train a linear Support Vector Machine classifier to classify image regions. One SVM per class! (hinge loss) – 4. Train the bounding box regressor (L2 loss)
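The hinge loss in step 3, with one binary SVM per class, can be sketched as follows. This shows only the loss evaluation for one region, assuming each per-class SVM has already produced a raw score; the function names and numbers are illustrative, and no regularization term or training loop is shown.

```python
def hinge_loss(score, is_positive):
    """Binary hinge loss max(0, 1 - y * score) with y = +1 or -1."""
    y = 1.0 if is_positive else -1.0
    return max(0.0, 1.0 - y * score)


def per_class_svm_losses(scores, true_class):
    """One binary SVM per class: the region is a positive example for its
    true class and a negative example for every other class."""
    return [hinge_loss(s, c == true_class) for c, s in enumerate(scores)]


# Scores from three per-class SVMs for a region whose true class is index 0.
print(per_class_svm_losses([2.0, -0.5, -3.0], 0))
```

A score beyond the margin on the correct side costs nothing; the second SVM is penalized because its negative score is still inside the margin.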

  32. R-CNN • PROS: – The pipeline of proposals, feature extraction and SVM classification is well-known and tested. Only the features are changed (CNN instead of HOG). – The CNN summarizes each proposal into a 4096-dimensional vector (a much more compact representation compared to HOG) – Leverages transfer learning: the CNN can be pre-trained for image classification with C classes. One only needs to change the FC layers to deal with Z classes.

  33. R-CNN • CONS: – Slow! (Let us try to solve this first.) 47 s/image with a VGG16 backbone. One considers around 2000 proposals per image, and each needs to be warped and forwarded through the CNN. – Training is also slow and complex – The object proposal algorithm is fixed. Feature extraction and the SVM classifier are trained separately → not exploiting learning to its full potential.
