SSD: Single Shot MultiBox Detector
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg
Slides by: Sulabh Shrestha
Receptive Field
▪ Shallow feature maps
  ▪ Larger size
  ▪ Smaller receptive fields
  ▪ May not be able to see larger objects
▪ Deep feature maps
  ▪ Smaller size
  ▪ Larger receptive fields
  ▪ May miss small objects
▪ Use multiple feature maps, each detecting objects matched to its receptive field size
Ref: https://cv-tricks.com/object-detection/single-shot-multibox-detector-ssd/
Architecture
▪ Base network (VGG) + extra feature layers
▪ No FC layers
▪ Specific feature maps are responsive to particular scales of objects
  ▪ Not necessarily the same as the receptive field
  ▪ A hyper-parameter, dependent on the data
[Figure: default boxes on an 8x8 feature map and a 4x4 feature map]
Base Network
▪ VGG-16
▪ Pool5 changed:
  ▪ 3x3 kernel instead of 2x2
  ▪ Stride 1 instead of 2
▪ First two FC layers (fc6, fc7) replaced by convolutions
  ▪ Atrous convolution, as in DeepLab-LargeFOV
▪ Last FC layer removed altogether
▪ No dropout used
▪ Conv4_3 also used for prediction
  ▪ 4th group of convolutions, 3rd kernel
Ref: Very Deep Convolutional Networks for Large-Scale Image Recognition
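A minimal PyTorch sketch of these modifications (layer names and the 1024-channel width are assumptions based on common SSD implementations, not the authors' code):

```python
import torch.nn as nn

# Sketch of the SSD base-network changes (the 1024-channel width follows
# common SSD implementations and is an assumption here).
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # was 2x2, stride 2

# fc6 -> 3x3 atrous (dilated) convolution, in the spirit of DeepLab-LargeFOV
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)

# fc7 -> 1x1 convolution; fc8 is removed and no dropout is used
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
```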
Multiple Default Boxes
▪ Similar to the anchor boxes of Faster R-CNN
▪ Example feature map: m x n, with p channels
▪ For each location (i, j):
  ▪ Multiple default boxes (k)
  ▪ A 3 x 3 x p-channel convolution for each box predicts:
    ▪ Confidence of each class, c_i; i ∈ [1, C]
    ▪ Box offsets x, y, w, h
    ▪ (C + 4) outputs per box
▪ Total outputs for one feature map: m * n * k * (C + 4)
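A minimal sketch of one such prediction head in PyTorch, with illustrative shapes (a conv4_3-like 38 x 38 x 512 map, k = 4, C = 21); the per-box kernels are folded into one convolution here for brevity:

```python
import torch
import torch.nn as nn

# One SSD prediction head as a single 3x3 convolution (illustrative shapes).
p, k, C = 512, 4, 21               # channels, default boxes per location, classes
head = nn.Conv2d(p, k * (C + 4), kernel_size=3, padding=1)

feat = torch.randn(1, p, 38, 38)   # an m x n = 38 x 38 feature map
out = head(feat)                   # shape: (1, k * (C + 4), 38, 38)
# total predictions for this map: m * n * k * (C + 4) = 38 * 38 * 4 * 25
```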
Scale and Aspect Ratio
▪ How many default boxes per location?
▪ Scale
  ▪ Related to, but not exactly the same as, the receptive field
  ▪ If m feature maps are used for prediction:
    s_k = s_min + (s_max − s_min) / (m − 1) * (k − 1), k ∈ [1, m]
    with s_min = 0.2 and s_max = 0.9
  ▪ E.g. s = 0.2, img-size = 300 → default box size = 0.2 * 300 = 60
▪ Aspect ratios (a_r)
  ▪ {1, 2, 3, 1/2, 1/3} → determines k
  ▪ Width: w_k^a = s_k √a_r
  ▪ Height: h_k^a = s_k / √a_r
  ▪ E.g. s = 0.2, img-size = 300:
    ▪ a_r = 1 → w = 0.2 * 300 = 60, h = 0.2 * 300 = 60
    ▪ a_r = 2 → w = 0.2 * √2 * 300 ≈ 85, h = 0.2 / √2 * 300 ≈ 42
    ▪ a_r = 1/2 → w = 0.2 * √(1/2) * 300 ≈ 42, h = 0.2 / √(1/2) * 300 ≈ 85
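The same computation as a small Python sketch (the helper name is made up; the printed numbers reproduce the slide's examples):

```python
import math

def default_box_sizes(k, m=6, s_min=0.2, s_max=0.9, img_size=300,
                      aspect_ratios=(1, 2, 3, 1/2, 1/3)):
    # s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1)
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)
    boxes = []
    for a_r in aspect_ratios:
        w = s_k * math.sqrt(a_r) * img_size   # w_k^a = s_k * sqrt(a_r)
        h = s_k / math.sqrt(a_r) * img_size   # h_k^a = s_k / sqrt(a_r)
        boxes.append((round(w), round(h)))
    return boxes

print(default_box_sizes(k=1))  # [(60, 60), (85, 42), (104, 35), (42, 85), (35, 104)]
```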
Training
• Base network pre-trained on the ImageNet CLS-LOC dataset
• Fine-tuned on the respective detection dataset
• Matching strategy
  • Any default box with IOU(ground truth box, default box) > 0.5 → positive
  • Simplifies the learning problem
  • An object can be detected in multiple overlapping default boxes
• Loss
  • Confidence loss (c)
    • Softmax loss over multiple classes
  • Localization loss (x, y, w, h)
    • Smooth L1 loss between the ground truth box (g) and the predicted box (l)
Ref: https://github.com/rbgirshick/py-faster-rcnn/files/764206/SmoothL1Loss.1.pdf
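A NumPy sketch of the matching rule, assuming corner-format (x1, y1, x2, y2) boxes; the paper additionally matches each ground truth to its best-IOU default box first, which is omitted here:

```python
import numpy as np

def iou_one_to_many(box, boxes):
    # IOU of one (x1, y1, x2, y2) box against an array of such boxes
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def positive_matches(default_boxes, gt_boxes, thresh=0.5):
    # one object may be matched to several overlapping default boxes
    return [np.where(iou_one_to_many(gt, default_boxes) > thresh)[0]
            for gt in gt_boxes]
```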
Results
[Table: PASCAL VOC2007 test detection results]
[Table: PASCAL VOC2012 test detection results]
Inference
• Filter out boxes with low confidence
• NMS with a 0.45 IOU threshold
• Keep the top 200 detections
• Better mAP and faster FPS than competing detectors on VOC2007 test data
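A sketch of this post-processing in NumPy (function names are illustrative; `iou_one_to_many` is the helper from the matching sketch above, and the 0.01 confidence threshold is the value reported in the paper):

```python
import numpy as np

def postprocess(boxes, scores, conf_thresh=0.01, nms_iou=0.45, top_k=200):
    # keep confident boxes, run greedy NMS at 0.45 IOU, return the top 200
    keep = scores > conf_thresh
    boxes, scores = boxes[keep], scores[keep]
    order = scores.argsort()[::-1]            # highest score first
    selected = []
    while order.size and len(selected) < top_k:
        i = order[0]
        selected.append(i)
        rest = order[1:]
        # suppress boxes overlapping the kept one by more than 0.45 IOU
        order = rest[iou_one_to_many(boxes[i], boxes[rest]) <= nms_iou]
    return boxes[selected], scores[selected]
```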
Analysis
• Better than two-stage networks:
  • Single network for both localization and classification
• Better than YOLO:
  • Uses multiple feature maps
  • Uses many more default boxes
  • No FC layers
    • Faster inference
    • Fewer parameters
  • Smaller input size (300 x 300)
    • Faster R-CNN: 600 min. side
    • YOLO: 448 x 448
Ablation Studies - 1
• Data augmentation helps; each training image is one of:
  • The original image
  • A randomly sampled patch
  • A patch sampled so the minimum IOU with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9
    (sketched below)
• More default box shapes help
• Using the FC layers instead of the atrous convolutions:
  • Similar result
  • ~20% slower
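A sketch of the patch-sampling option (the parameter ranges and retry logic are assumptions; the paper also constrains the patch aspect ratio, omitted here):

```python
import random

def pair_iou(a, b):
    # IOU of two (x1, y1, x2, y2) boxes
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def sample_patch(img_w, img_h, gt_boxes, max_tries=50):
    # mode: original image (None), random patch (0.0), or a minimum-IOU patch
    min_iou = random.choice([None, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9])
    if min_iou is None:
        return (0, 0, img_w, img_h)
    for _ in range(max_tries):
        w = random.uniform(0.1, 1.0) * img_w   # patch size in [0.1, 1] of image
        h = random.uniform(0.1, 1.0) * img_h
        x = random.uniform(0, img_w - w)
        y = random.uniform(0, img_h - h)
        patch = (x, y, x + w, y + h)
        if all(pair_iou(patch, b) >= min_iou for b in gt_boxes):
            return patch
    return (0, 0, img_w, img_h)                # fall back to the whole image
```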
Ablation Studies - 2
• Using different numbers of feature maps for prediction
  • Similar total number of default boxes, to make the comparison fair
  • More feature maps is better, up to a certain extent
• Not using boundary default boxes is better
  • Avoid default boxes lying outside the image (see the sketch below)
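A tiny sketch of the boundary filter (normalized corner coordinates assumed):

```python
import numpy as np

def drop_boundary_boxes(boxes):
    # keep only default boxes that lie fully inside the [0, 1] x [0, 1] image
    inside = ((boxes[:, 0] >= 0) & (boxes[:, 1] >= 0) &
              (boxes[:, 2] <= 1) & (boxes[:, 3] <= 1))
    return boxes[inside]
```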
Thank you!
Questions?