
CNN Based Object Detection in Large Video Images. WangTao, wtao@qiyi.com, IQIYI ltd. 2016.4



  1. CNN Based Object Detection in Large Video Images WangTao, wtao@qiyi.com IQIYI ltd. 2016.4

  2. Outline • Introduction • Background • Challenge • Our approach • System framework • Object detection • Scene recognition • Body segmentation • Same style matching • Experiments • Conclusion

  3. Background • Image retrieval • Video advertising • Video Out applications

  4. Challenge
     • Real video data vs. image dataset
       - Cluttered background
       - Multiple objects
       - Small objects
       - Varying pose/position
       - Partial occlusion

  5. Our task
     • Problems:
       - Content based object retrieval in large video images
       - High accuracy for same style matching
       - High speed in large video database
     • Solution:
       - Accurate object detection + scene classification
       - Discriminative DNN features and PCA/LDA transformation
       - Speed up by parallel indexing and hierarchical filtering

  6. System framework
     [Diagram] Indexing path: video → key frame → object detection → body segmentation → CNN feature → indexing → database, alongside scene classification.
     [Diagram] Query path: query image → Faster-RCNN query rect → body segmentation → CNN feature → match → distance sort → result, alongside scene classification.

  7. Object detection (I)
     • Object detection by Faster-RCNN
     • Faster-RCNN: region proposals + object scores [Ren, Shaoqing, et al. NIPS2015]
     • Trained on MS COCO (300k images) + video images (10k images)
     • More widely applicable and general for images with multiple objects

  8. Multi-class object detection, including:
     • Clothes (skirt, jacket, trousers)
     • Bags (handbag, backpack, trolley case ("draw-bar box"))
     • Electronics (mobile, laptop, TV, keyboard, mouse, microwave oven, oven, refrigerator)
     • Glasses, necklace, hat
     • Shoes

  9. Object detection (II)
     • Object detection by CNN regression
     • Input an image, output the coordinates of the object rectangle [Erhan, Dumitru, et al. CVPR2014]
     • Efficient for single-object images that Faster-RCNN fails to detect

  10. Body Segmentation
     • Constrained by human body parts
     • CNN based body segmentation [Jonathan Long, CVPR2015]
     • Bounding box, body mask, body parsing
     [Figure: original image, segmentation image]

  11. Scene classification
     • CNN based scene classification [Bolei Zhou, NIPS2014]
     • Pipeline: video → key frame → "Is scene?" (yes/no) → CNN based scene classification → multi-frame tags fusion
     • Scene classification: Precision 65.8%, Recall 74%
     • Threshold@0.7: Precision 83.8%, Recall 56.7%
     [Figure: non-scene images; scene images of kitchen, office, living room, and bedroom]
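The slide does not say how per-frame tags are fused or where exactly the 0.7 threshold is applied. As a minimal sketch, assuming fusion means averaging per-frame class probabilities and that the threshold gates the fused score, the idea could look like this (the function name `fuse_scene_tags` and the sample probabilities are hypothetical):

```python
from collections import defaultdict

def fuse_scene_tags(frame_probs, threshold=0.7):
    """Average per-frame class probabilities (an assumed fusion rule).

    frame_probs: list of dicts mapping scene label -> probability, one per key frame.
    Returns (label, score); label is None when the fused score is below threshold.
    """
    if not frame_probs:
        return None, 0.0
    totals = defaultdict(float)
    for probs in frame_probs:
        for label, p in probs.items():
            totals[label] += p
    # Pick the label with the highest summed probability, then average it.
    label, total = max(totals.items(), key=lambda kv: kv[1])
    score = total / len(frame_probs)
    return (label if score >= threshold else None), score

# Hypothetical probabilities for three key frames of one shot.
frames = [
    {"kitchen": 0.9, "office": 0.1},
    {"kitchen": 0.8, "office": 0.2},
    {"kitchen": 0.7, "office": 0.3},
]
label, score = fuse_scene_tags(frames)
```

Raising the threshold rejects more uncertain shots, which is consistent with the trade-off on the slide: precision rises while recall drops.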

  12. Scene classes
     0 kitchen, 1 dining, 2 bakery, 3 ice_cream_parlor, 4 bathroom, 5 washing_room, 6 bedroom, 7 living_room, 8 office, 9 children_room, 10 nursery, 11 toyshop, 12 shoe_shop, 13 jewelry_shop
     14 outdoor_ice_world, 15 indoor_ice_skating_rink, 16 baseball, 17 football, 18 basketball_court, 19 swimming_pool, 20 track, 21 bowling_alley, 22 billiards, 23 tennis, 24 volleyball, 25 gymnasium, 26 pleasure_ground, 27 hospital_room
     28 dentists, 29 drugstore, 30 music_studio, 31 music_store, 32 sandbeach, 33 hairsalon, 34 bar, 35 pagoda, 36 bamboo_forest, 37 mountain, 38 coast, 39 creek, 40 waterfall, 41 grass, 42 other

  13. Same style matching
     • SIFT feature matching
       - Normalization of SIFT
       - Dimension: 128 dim x 400 pts
       - MAP 22%
     • CNN feature of ImageNet 1k classifier
       - Model: VGG19
       - Layers: fc7
       - Dimension: 4096 → 600
       - MAP 28%
     • CNN feature of same style classifier
       - Model: VGG19
       - Layers: fc7
       - Dimension: 4096 → 600
       - MAP 34%
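The slides report MAP without defining it. A minimal sketch of the standard mean average precision over ranked retrieval lists follows, assuming each query has a set of relevant (same style) items and each ranked list contains no duplicates; the function names and sample IDs are hypothetical:

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k taken at each rank k where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)

def mean_average_precision(results):
    """results: list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not results:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in results) / len(results)

# Two hypothetical queries with their ranked results and ground truth.
queries = [
    (["a", "x", "b"], {"a", "b"}),  # AP = (1/1 + 2/3) / 2
    (["y", "c"], {"c"}),            # AP = 1/2
]
map_score = mean_average_precision(queries)
```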

  14. Multi-feature fusion
     • Same class matching classifier on ImageNet 21k classes of 15M images
     • Same style matching classifier trained on 1239 queries of 1M images

     CNN Models                     Feature dim   MAP
     Inception_bn1k                 1024          24%
     Inception_21k                  1024          34%
     Vgg19_caffe                    4096          34%
     Inception_21k + vgg19_caffe    5120          43%

     • Speed
       - Nvidia K40 GPU, 10x faster than CPU i7
       - Faster RCNN speed: 200 ms/frame, image size 1920x1080
       - Vgg19 feature speed: 60 ms/frame, image size 256x256
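The fused dimension of 5120 equals 1024 + 4096, which suggests the two features are concatenated. A minimal sketch of that reading follows; the per-feature L2 normalization is an assumption (the slide does not mention it), and the function names and placeholder vectors are hypothetical:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse_features(*features):
    """Concatenate features after normalizing each one (normalization is assumed)."""
    fused = []
    for feature in features:
        fused.extend(l2_normalize(feature))
    return fused

# Placeholder vectors standing in for real network outputs.
inception_feat = [0.5] * 1024   # Inception_21k feature, 1024 dims
vgg_feat = [0.25] * 4096        # Vgg19 fc7 feature, 4096 dims
fused = fuse_features(inception_feat, vgg_feat)
```

Normalizing before concatenation keeps either feature from dominating the distance purely because of its scale or dimensionality.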

  15. Experiments
     • MAP on 3M testing images, trained on 1M images
       Columns: Vgg19 model, Full image, Object rectangle, PCA+LDA, Inception-21k, MAP
       × × × √ √   27.8%
       × × × √ √   34.2%
       × × √ √ √   37.3%
       × × √ √ √   43.1%
       × √ √ √ √   46.1%
     • Speed up
       - Parallel FLANN tree indexing
       - Hierarchical filtering by object classes, 10x faster
       - Query speed: 1 s/image on 5000 teleplays with 2M images
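Hierarchical filtering by object class can be illustrated as a two-stage lookup: first restrict candidates to the detected class, then rank only those by feature distance. The sketch below uses a brute-force Euclidean search in place of the FLANN tree index named on the slide; the class `ClassFilteredIndex`, its methods, and the sample data are hypothetical:

```python
import math
from collections import defaultdict

class ClassFilteredIndex:
    """Toy index: one bucket per object class, brute-force search inside a bucket."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, object_class, item_id, feature):
        self._buckets[object_class].append((item_id, feature))

    def query(self, object_class, feature, top_k=5):
        # Stage 1: keep only items whose detected class matches the query.
        candidates = self._buckets.get(object_class, [])
        # Stage 2: sort the remaining candidates by Euclidean distance.
        scored = sorted((math.dist(feature, f), item_id) for item_id, f in candidates)
        return [item_id for _, item_id in scored[:top_k]]

index = ClassFilteredIndex()
index.add("handbag", "h1", [0.0, 0.0])
index.add("handbag", "h2", [1.0, 1.0])
index.add("shoes", "s1", [0.0, 0.1])
matches = index.query("handbag", [0.1, 0.0], top_k=2)
```

Because each query touches only one class bucket, the number of distance computations shrinks roughly in proportion to the number of classes, which is one plausible source of the reported speed-up.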

  16. Query system GUI

  17. Query examples on image dataset

  18. Query examples on video dataset

  19. Conclusion
     • Bounding boxes are important for recognizing objects
     • Fusing same style matching features with same class matching features gives higher accuracy
     • PCA and LDA further improve accuracy and speed
     • GPU is faster than CPU for CNN feature extraction
     • Query is sped up by parallel indexing and hierarchical filtering

  20. References
     • Erhan, Dumitru, et al. "Scalable object detection using deep neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
     • Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
     • Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
     • Arandjelović, Relja, and Andrew Zisserman. "Three things everyone should know to improve object retrieval." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2012.
     • Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR. 2015. arXiv:1411.4038.
     • Zheng, S., S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. "Conditional random fields as recurrent neural networks." ICCV. 2015.
     • Shen, Li, Zhouchen Lin, and Qingming Huang. "Learning deep convolutional neural networks for places2 scene recognition." Computing Research Repository (CoRR). 2015.
     • Zhou, Bolei, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. "Learning deep features for scene recognition using Places database." NIPS. 2014.
     • Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene CNNs." ICLR. 2015.
     • Wu, Ruobing, Baoyuan Wang, Wenping Wang, and Yizhou Yu. "Harvesting discriminative meta objects with deep CNN features for scene classification." ICCV. 2015.
     • Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. "Rethinking the Inception architecture for computer vision." arXiv:1512.00567. 2015.
