CNN Based Object Detection in Large Video Images
WangTao, wtao@qiyi.com
IQIYI ltd. 2016.4
Outline
- Introduction
- Background
- Challenge
- Our approach
- System framework
- Object detection
- Scene recognition
- Body segmentation
- Same style matching
- Experiments
- Conclusion
Background
- Image retrieval
- Video advertising
Video out applications
Challenge
- Real video data vs. image dataset
- Cluttered background
- Multiple objects
- Small objects
- Varying pose/position
- Partial occlusion
Our task
- Problems:
- Content based object retrieval in large video images
- High accuracy for same style matching
- High speed in large video database
- Solution:
- Accurate object detection + scene classification
- Discriminative DNN features and PCA/LDA transformation
- Speed up by parallel indexing and hierarchical filtering
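The hierarchical filtering idea in the solution list can be illustrated with a minimal sketch. It assumes each database entry carries a detected object class and a feature vector, and that filtering means discarding entries of other classes before any distance is computed. The function name, the toy vectors, and the class labels are hypothetical; the slides do not describe the actual implementation.

```python
import numpy as np

def hierarchical_search(query_vec, query_class, db_vecs, db_classes, top_k=5):
    """Filter the database by object class, then rank the survivors by distance."""
    # Stage 1 (coarse): keep only entries whose class matches the query's class.
    candidate_ids = np.flatnonzero(db_classes == query_class)
    if candidate_ids.size == 0:
        return []
    # Stage 2 (fine): exact Euclidean distance on the reduced candidate set.
    diffs = db_vecs[candidate_ids] - query_vec
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    order = np.argsort(dists)[:top_k]
    return [(int(candidate_ids[i]), float(dists[i])) for i in order]

# Toy database: six 2-D feature vectors with hypothetical class labels.
db_vecs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0],
                    [5.0, 5.0], [0.1, 0.0], [3.0, 4.0]])
db_classes = np.array(["bag", "shoe", "bag", "bag", "shoe", "bag"])

results = hierarchical_search(np.array([0.0, 0.0]), "bag",
                              db_vecs, db_classes, top_k=3)
print([idx for idx, _ in results])  # ids 0, 2, 5 in that order
```

Entry 4 is very close to the query but belongs to another class, so it is never scored. The saving comes from computing distances only on the filtered subset.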
System framework
[Figure: system framework diagram with an indexing path and a query path]
- Indexing: video key frame, scene classification, object detection, body segmentation, CNN feature, indexing, database
- Query: query image, Faster-RCNN rect, body segmentation, CNN feature, scene classification, match, distance sort, result
Object detection (I)
- Object detection by Faster R-CNN
- Faster R-CNN: region proposals + object scores [Ren, Shaoqing, et al. NIPS 2015]
- Trained on MS COCO (300k images) + video images (10k images)
- More broadly applicable to images with multiple objects
- Multi-class object detection including
- Clothes (skirt, jacket, trousers)
- Bags (handbag, backpack, trolley case)
- Electronics (mobile, laptop, TV, keyboard, mouse, microwave oven, oven, refrigerator)
- Glasses, necklace, hat
- Shoes
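Detectors of this kind output many overlapping scored rectangles, and a common post-processing step is score thresholding followed by non-maximum suppression (NMS). The sketch below is a generic NumPy version for illustration only. The thresholds and the toy boxes are assumptions, not values from the slides.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Drop low-score boxes, then greedily keep the best non-overlapping ones."""
    order = np.argsort(-scores)                   # highest score first
    order = order[scores[order] >= score_thresh]  # discard weak detections
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        # Suppress boxes that overlap the kept box too strongly.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep

# Toy detections (illustrative values only).
boxes = np.array([[10, 10, 110, 110], [12, 12, 112, 112],
                  [200, 200, 300, 300], [0, 0, 50, 50]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.3])
keep = nms(boxes, scores)
print(keep)  # [0, 2]
```

Box 1 is suppressed because it almost coincides with box 0, and box 3 is dropped by the score threshold.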
Object detection (II)
- Object detection by CNN regression
- Input an image, output the coordinates of the object rectangle [Erhan, Dumitru, et al. CVPR 2014]
- Effective for single-object images where Faster R-CNN fails to detect the object
Body Segmentation
- Constrained by human body parts
- CNN based body segmentation [Long, Jonathan, et al. CVPR 2015]
- Bounding box, body mask, body parsing
[Figure: original image and segmentation image]
Scene classification
- CNN based scene classification [Zhou, Bolei, et al. NIPS 2014]
- Pipeline: video key frame, "Is scene?" (yes/no), CNN based scene classification, tags
- [Figure: non-scene images; scene images of kitchen, office, living room, and bedroom]
- Multi-frame fusion
- Scene classification: precision 65.8%, recall 74%
- With threshold @0.7: precision 83.8%, recall 56.7%
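One simple way to read "multi-frame fusion" together with the 0.7 threshold is to average the per-frame class probabilities of a shot and to reject the result when the best averaged score falls below the threshold. The sketch below makes exactly that assumption. The slides do not say how the fusion is actually done, and `fuse_frames` is a hypothetical name.

```python
import numpy as np

def fuse_frames(frame_probs, threshold=0.7):
    """Average per-frame class probabilities; return (label, confidence).

    The label is None when the best averaged probability is below the
    threshold, i.e. the shot is left untagged rather than tagged wrongly.
    """
    mean_probs = np.asarray(frame_probs, dtype=float).mean(axis=0)
    best = int(np.argmax(mean_probs))
    confidence = float(mean_probs[best])
    if confidence < threshold:
        return None, confidence
    return best, confidence

# Toy softmax outputs over three classes (0 kitchen, 1 dining, 2 bakery).
confident_shot = [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
uncertain_shot = [[0.5, 0.4, 0.1], [0.6, 0.3, 0.1]]

label, conf = fuse_frames(confident_shot)
print(label)                          # 0
print(fuse_frames(uncertain_shot)[0])  # None
```

Rejecting uncertain shots matches the trade-off in the reported numbers: at the 0.7 threshold precision rises to 83.8% while recall drops to 56.7%.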
Scene classes
- 0 kitchen
- 1 dining
- 2 bakery
- 3 ice_cream_parlor
- 4 bathroom
- 5 washing_room
- 6 bedroom
- 7 living_room
- 8 office
- 9 children_room
- 10 nursery
- 11 toyshop
- 12 shoe_shop
- 13 jewelry_shop
- 14 outdoor_ice_world
- 15 indoor_ice_skating_rink
- 16 baseball
- 17 football
- 18 basketball_court
- 19 swimming_pool
- 20 track
- 21 bowling_alley
- 22 billiards
- 23 tennis
- 24 volleyball
- 25 gymnasium
- 26 pleasure_ground
- 27 hospital_room
- 28 dentists
- 29 drugstore
- 30 music_studio
- 31 music_store
- 32 sandbeach
- 33 hairsalon
- 34 bar
- 35 pagoda
- 36 bamboo_forest
- 37 mountain
- 38 coast
- 39 creek
- 40 waterfall
- 41 grass
- 42 other
Same style matching
- SIFT feature matching
- Normalization of SIFT
- Dimension: 128 dim x 400 pts
- MAP 22%
- CNN feature of ImageNet 1k classifier
- Model: VGG19
- Layers: fc7
- Dimension: 4096 600
- MAP 28%
- CNN feature of same style classifier
- Model: VGG19
- Layers: fc7
- Dimension: 4096 600
- MAP 34%
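The MAP figures above are mean average precision scores. The sketch below shows one common way to compute it from ranked retrieval results. The slides do not state the exact evaluation protocol, so this is an assumption for illustration. This variant normalizes by the number of relevant items found in the ranked list, not by the total number of relevant items in the database.

```python
def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance[i] is True if result i is a correct match."""
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(all_queries):
    """Mean of the per-query AP values."""
    return sum(average_precision(q) for q in all_queries) / len(all_queries)

# Two toy queries: matches at ranks 1 and 3, and a single match at rank 2.
queries = [[True, False, True], [False, True]]
print(average_precision(queries[0]))    # (1/1 + 2/3) / 2
print(mean_average_precision(queries))  # (5/6 + 1/2) / 2
```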
Multi-feature fusion
- Same class matching classifier trained on ImageNet 21k classes (15M images)
- Same style matching classifier trained on 1239 queries of 1M images
- Speed
- Nvidia K40 GPU, 10x faster than an i7 CPU
- Faster R-CNN speed: 200 ms/frame, image size 1920x1080
- VGG19 feature speed: 60 ms/frame, image size 256x256
| CNN models | Feature dim | MAP |
| --- | --- | --- |
| Inception_bn1k | 1024 | 24% |
| Inception_21k | 1024 | 34% |
| Vgg19_caffe | 4096 | 34% |
| Inception_21k + vgg19_caffe | 5120 | 43% |
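The fused feature dimension (1024 + 4096 = 5120) is consistent with simple concatenation of the two descriptors. The sketch below assumes that, and also assumes each descriptor is L2-normalized first so that neither model dominates the distance. The normalization step is not stated in the slides, and the random vectors are stand-ins for real CNN features.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero vectors)."""
    return v / (np.linalg.norm(v) + eps)

def fuse_features(inception_feat, vgg_feat):
    """Concatenate two L2-normalized descriptors into one fused descriptor."""
    return np.concatenate([l2_normalize(inception_feat), l2_normalize(vgg_feat)])

rng = np.random.default_rng(0)
inception_feat = rng.standard_normal(1024)  # stand-in for an Inception_21k feature
vgg_feat = rng.standard_normal(4096)        # stand-in for a VGG19 fc7 feature

fused = fuse_features(inception_feat, vgg_feat)
print(fused.shape)  # (5120,)
```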
Experiments
- MAP on 3M test images, trained on 1M images
- Speed up
- Parallel FLANN tree indexing
- Hierarchical filtering by object classes, 10x faster
- Query speed: 1 s/image on 5000 teleplays with 2M images
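| Vgg19 model | Full image | Object rectangle | PCA+LDA | Inception-21k | MAP |
| --- | --- | --- | --- | --- | --- |
| √ | √ | × | × | × | 27.8% |
| √ | × | √ | × | × | 34.2% |
| √ | × | √ | √ | × | 37.3% |
| √ | × | √ | × | √ | 43.1% |
| √ | × | √ | √ | √ | 46.1% |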
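The PCA part of the PCA+LDA column can be sketched with plain NumPy. This is a generic PCA via SVD on random stand-in data, not the authors' pipeline. The LDA step is omitted because it needs class labels, and the dimensions used here are illustrative assumptions.

```python
import numpy as np

def fit_pca(features, n_components):
    """Return the feature mean and the top principal directions (one per row)."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(features, mean, components):
    """Project features onto the principal directions."""
    return (features - mean) @ components.T

rng = np.random.default_rng(0)
train = rng.standard_normal((200, 64))  # stand-in for CNN feature vectors
mean, components = fit_pca(train, n_components=8)
reduced = apply_pca(train, mean, components)
print(reduced.shape)  # (200, 8)
```

Shorter vectors make each distance computation cheaper, which is one plausible reason the conclusion credits PCA and LDA with improving speed as well as accuracy.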
Query system GUI
Query examples on image dataset
Query examples on video dataset
Conclusion
- Object bounding boxes are important for recognizing objects
- Fusing same style matching features with same class matching features gives higher accuracy
- PCA and LDA further improve accuracy and speed
- GPU is faster than CPU for CNN feature extraction
- Query speed-up comes from parallel indexing and hierarchical filtering
References
- Erhan, Dumitru, et al. "Scalable object detection using deep neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
- Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
- Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
- Arandjelović, Relja, and Andrew Zisserman. "Three things everyone should know to improve object retrieval." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2012.
- Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR. 2015. arXiv:1411.4038.
- Zheng, S., S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. "Conditional random fields as recurrent neural networks." ICCV. 2015.
- Shen, Li, Zhouchen Lin, and Qingming Huang. "Learning deep convolutional neural networks for Places2 scene recognition." CoRR (arXiv preprint). 2015.
- Zhou, Bolei, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. "Learning deep features for scene recognition using Places database." NIPS. 2014.
- Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene CNNs." ICLR. 2015.
- Wu, Ruobing, Baoyuan Wang, Wenping Wang, and Yizhou Yu. "Harvesting discriminative meta objects with deep CNN features for scene classification." ICCV. 2015.
- Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. "Rethinking the Inception architecture for computer vision." arXiv:1512.00567. 2015.