Video Images WangTao, wtao@qiyi.com IQIYI ltd. 2016.4 Outline - - PowerPoint PPT Presentation




slide-1
SLIDE 1

CNN Based Object Detection in Large Video Images

WangTao, wtao@qiyi.com IQIYI ltd. 2016.4

slide-2
SLIDE 2

Outline

  • Introduction
  • Background
  • Challenge
  • Our approach
  • System framework
  • Object detection
  • Scene recognition
  • Body segmentation
  • Same style matching
  • Experiments
  • Conclusion
slide-3
SLIDE 3

Background

  • Image retrieval
  • Video advertising

Video out applications

slide-4
SLIDE 4

Challenge

  • Real video data vs. image dataset
  • Cluttered background
  • Multiple objects
  • Small objects
  • Varying pose/position
  • Partial occlusion
slide-5
SLIDE 5

Our task

  • Problems:
      • Content-based object retrieval in large video images
      • High accuracy for same-style matching
      • High speed on a large video database
  • Solution:
      • Accurate object detection + scene classification
      • Discriminative DNN features with PCA/LDA transformation
      • Speed-up through parallel indexing and hierarchical filtering
slide-6
SLIDE 6

System framework

[Diagram: indexing and query pipelines; components listed as labeled in the diagram]

  • Indexing: video key frame, scene classification, object detection, body segmentation, CNN feature, indexing, database
  • Query: query image, Faster-RCNN rect, body segmentation, CNN feature, scene classification, match, distance sort, result

slide-7
SLIDE 7

Object detection (I)

  • Object detection by Faster-RCNN
  • Faster-RCNN: region proposals + object scores [Ren, Shaoqing, et al. NIPS 2015]
  • Trained on the MS COCO dataset (300k images) + video images (10k images)
  • More widely applicable and general for images with multiple objects
slide-8
SLIDE 8
  • Multi-class object detection, including:
  • Clothes (skirt, jacket, trousers)
  • Bags (handbag, backpack, draw-bar box (trolley suitcase))
  • Electronics (mobile, laptop, TV, keyboard, mouse, microwave oven, oven, refrigerator)
  • Glasses, necklace, hat
  • Shoes
slide-9
SLIDE 9

Object detection (II)

  • Object detection by CNN regression
  • Input: an image; output: the coordinates of the object rectangle [Erhan, Dumitru, et al. CVPR 2014]
  • Efficient for images with a single object that Faster-RCNN does not recognize
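The slide only says that the regression network outputs the rectangle coordinates. As a minimal sketch, assuming the network emits a normalized `(x1, y1, x2, y2)` tuple in the range 0 to 1 (an assumption, not stated in the slides), the output could be mapped back to a pixel rectangle like this. The function name `decode_box` is hypothetical.

```python
def decode_box(pred, width, height):
    """Map a normalized (x1, y1, x2, y2) prediction to a pixel rectangle.

    Assumption: the regressor outputs coordinates relative to the image
    size. The slides do not specify the output encoding.
    """
    x1, y1, x2, y2 = pred

    # Order the corners so the rectangle is well-formed even if the
    # network emits them swapped.
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))

    def clamp(v):
        # Keep predictions that overshoot the image inside [0, 1].
        return min(1.0, max(0.0, v))

    return (
        round(clamp(left) * width),
        round(clamp(top) * height),
        round(clamp(right) * width),
        round(clamp(bottom) * height),
    )


# A prediction that overshoots the right edge of a 1920x1080 key frame.
box = decode_box((0.25, 0.1, 1.2, 0.9), width=1920, height=1080)
```

The resulting rectangle can then be cropped and passed to feature extraction in the same way as a Faster-RCNN detection.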

slide-10
SLIDE 10

Body Segmentation

  • Constrained by human body parts
  • CNN-based body segmentation [Jonathan Long, CVPR 2015]
  • Bounding box, body mask, body parsing
  • [Figure: original image and segmentation image]
slide-11
SLIDE 11

Scene classification

  • CNN based Scene classification [Bolei Zhou, NIPS2014]

Video key frame → Is scene? (yes/no) → CNN-based scene classification → tags

[Figure: non-scene images; scene images of kitchen, office, living room, and bedroom]

Multi-frame fusion

  • Scene classification: precision 65.8%, recall 74%
  • Threshold@0.7: precision 83.8%, recall 56.7%
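The slide does not say how the multi-frame fusion is performed. One plausible reading, sketched here as an assumption, is to average the per-frame class probabilities over the key frames of a shot and keep the top label only when its fused score reaches the threshold (0.7 on the slide). The function name and the input layout are hypothetical.

```python
def fuse_scene_scores(frame_scores, threshold=0.7):
    """Average per-frame scene probabilities and apply a confidence threshold.

    frame_scores: list of dicts mapping scene label -> probability,
                  one dict per key frame (hypothetical input layout).
    Returns (label, fused_score), or (None, fused_score) when the best
    fused score falls below the threshold (treated as "non scene").
    """
    if not frame_scores:
        raise ValueError("need at least one frame")

    # Sum the probability mass each label receives across frames.
    totals = {}
    for scores in frame_scores:
        for label, p in scores.items():
            totals[label] = totals.get(label, 0.0) + p

    # Averaging is an assumed fusion rule; the slides do not name one.
    label, total = max(totals.items(), key=lambda kv: kv[1])
    fused = total / len(frame_scores)
    return (label, fused) if fused >= threshold else (None, fused)


confident = fuse_scene_scores([
    {"kitchen": 0.9, "office": 0.1},
    {"kitchen": 0.6, "office": 0.4},
    {"kitchen": 0.8, "office": 0.2},
])
uncertain = fuse_scene_scores([
    {"kitchen": 0.5, "office": 0.5},
    {"kitchen": 0.6, "office": 0.4},
])
```

Rejecting low-confidence predictions in this way trades recall for precision, which is consistent with the direction of the two precision/recall pairs reported on the slide.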

slide-12
SLIDE 12

Scene classes

  • 0 kitchen
  • 1 dining
  • 2 bakery
  • 3 ice_cream_parlor
  • 4 bathroom
  • 5 washing_room
  • 6 bedroom
  • 7 living_room
  • 8 office
  • 9 children_room
  • 10 nursery
  • 11 toyshop
  • 12 shoe_shop
  • 13 jewelry_shop

  • 14 outdoor_ice_world
  • 15 indoor_ice_skating_rink
  • 16 baseball
  • 17 football
  • 18 basketball_court
  • 19 swimming_pool
  • 20 track
  • 21 bowling_alley
  • 22 billiards
  • 23 tennis
  • 24 volleyball
  • 25 gymnasium
  • 26 pleasure_ground
  • 27 hospital_room
  • 28 dentists
  • 29 drugstore
  • 30 music_studio
  • 31 music_store
  • 32 sandbeach
  • 33 hairsalon
  • 34 bar
  • 35 pagoda
  • 36 bamboo_forest
  • 37 mountain
  • 38 coast
  • 39 creek
  • 40 waterfall
  • 41 grass
  • 42 other

slide-13
SLIDE 13

Same style matching

  • SIFT feature matching
      • Normalization of SIFT
      • Dimension: 128 dim x 400 pts
      • MAP 22%
  • CNN feature of ImageNet 1k classifier
      • Model: VGG19
      • Layers: fc7
      • Dimension: 4096 600
      • MAP 28%
  • CNN feature of same-style classifier
      • Model: VGG19
      • Layers: fc7
      • Dimension: 4096 600
      • MAP 34%
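The MAP figures above are retrieval scores. The slides do not spell out how they were computed, so the following is a sketch of the standard mean average precision definition, assuming each query returns a ranked list of unique item ids and has a known set of relevant (same-style) items. All names are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list.

    Assumes ranked_ids contains no duplicates.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            # Precision at the rank where a relevant item is retrieved.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(results):
    """results: list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not results:
        return 0.0
    return sum(average_precision(r, g) for r, g in results) / len(results)


# Two toy queries: AP = (1/1 + 2/3) / 2 and AP = (1/3) / 1.
score = mean_average_precision([
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y", "z"], {"z"}),
])
```

Under this definition a method is rewarded for ranking same-style items near the top of the list, not only for retrieving them somewhere in the results.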
slide-14
SLIDE 14

Multi-feature fusion

  • Same-class matching classifier trained on ImageNet 21k classes with 15M images
  • Same-style matching classifier trained on 1239 queries with 1M images
  • Speed
      • Nvidia K40 GPU, 10x faster than an i7 CPU
      • Faster-RCNN speed: 200 ms/frame, image size 1920x1080
      • VGG19 feature speed: 60 ms/frame, image size 256x256

CNN models                    Feature dim   MAP
Inception_bn1k                1024          24%
Inception_21k                 1024          34%
Vgg19_caffe                   4096          34%
Inception_21k + vgg19_caffe   5120          43%
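The fused dimension in the table (1024 + 4096 = 5120) is consistent with simple concatenation of the two feature vectors. A minimal sketch of that idea follows; the L2 normalization of each part before concatenation is an assumption added here so that neither feature dominates the distance, and the helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse_features(*features):
    """Concatenate L2-normalized feature vectors into one descriptor."""
    fused = []
    for feature in features:
        fused.extend(l2_normalize(feature))
    return fused


# Dummy stand-ins for an Inception_21k vector and a VGG19 fc7 vector.
inception_feat = [0.5] * 1024
vgg_feat = [2.0] * 4096
fused = fuse_features(inception_feat, vgg_feat)
```

The fused descriptor has 5120 entries, matching the last row of the table.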

slide-15
SLIDE 15

Experiments

  • MAP on 3M testing images, trained on 1M images
  • Speed-up
      • Parallel FLANN tree indexing
      • Hierarchical filtering by object classes, 10x faster
  • Query speed: 1 s/image on 5000 teleplays with 2M images

VGG19 model   Full image   Object rectangle   PCA+LDA   Inception-21k   MAP
√             √            ×                  ×         ×               27.8%
√             ×            √                  ×         ×               34.2%
√             ×            √                  √         ×               37.3%
√             ×            √                  ×         √               43.1%
√             ×            √                  √         √               46.1%
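Hierarchical filtering by object class can be pictured as keeping one bucket of features per detected class, so that a query is only compared against items of its own class. The sketch below illustrates that idea under stated assumptions: a brute-force distance sort stands in for the FLANN tree index mentioned on the slide, and the class `ClassBucketIndex`, its methods, and the sample ids are all hypothetical.

```python
import math
from collections import defaultdict


class ClassBucketIndex:
    """Toy index that filters by object class before the distance sort."""

    def __init__(self):
        # object class -> list of (item_id, feature vector)
        self._buckets = defaultdict(list)

    def add(self, item_id, object_class, feature):
        self._buckets[object_class].append((item_id, feature))

    def query(self, object_class, feature, top_k=5):
        # First level: only candidates detected as the same object class.
        candidates = self._buckets.get(object_class, [])
        # Second level: exhaustive Euclidean distance sort. A real system
        # would use an approximate nearest-neighbor index here instead.
        scored = sorted(
            (math.dist(feature, cand), item_id) for item_id, cand in candidates
        )
        return [(item_id, dist) for dist, item_id in scored[:top_k]]


index = ClassBucketIndex()
index.add("frame_001", "handbag", [0.0, 1.0])
index.add("frame_002", "handbag", [1.0, 1.0])
index.add("frame_003", "shoes", [0.0, 1.0])

hits = index.query("handbag", [0.1, 1.0], top_k=2)
```

Because the "shoes" bucket is never visited for a "handbag" query, the number of distance computations shrinks roughly in proportion to the number of classes, which is the intuition behind the speed-up the slide reports.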

slide-16
SLIDE 16

Query system GUI

slide-17
SLIDE 17

Query examples on image dataset

slide-18
SLIDE 18
slide-19
SLIDE 19

Query examples on video dataset

slide-20
SLIDE 20
slide-21
SLIDE 21

Conclusion

  • Bounding boxes are important for recognizing objects
  • Fusing same-style matching features with same-class matching features gives higher accuracy
  • PCA and LDA further improve accuracy and speed
  • GPU is faster for CNN feature extraction
  • Query is sped up by parallel indexing and hierarchical filtering

slide-22
SLIDE 22

References

  • Erhan, Dumitru, et al. "Scalable object detection using deep neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
  • Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
  • Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
  • Arandjelović, Relja, and Andrew Zisserman. "Three things everyone should know to improve object retrieval." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2012.
  • Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. arXiv:1411.4038.
  • Zheng, S., S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. "Conditional random fields as recurrent neural networks." ICCV. 2015.
  • Shen, Li, Zhouchen Lin, and Qingming Huang. "Learning deep convolutional neural networks for Places2 scene recognition." CoRR (Computing Research Repository). 2015.
  • Zhou, Bolei, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. "Learning deep features for scene recognition using Places database." NIPS. 2014.
  • Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene CNNs." ICLR. 2015.
  • Wu, Ruobing, Baoyuan Wang, Wenping Wang, and Yizhou Yu. "Harvesting discriminative meta objects with deep CNN features for scene classification." ICCV. 2015.
  • Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. "Rethinking the Inception architecture for computer vision." arXiv:1512.00567. 2015.

slide-23
SLIDE 23