
SLIDE 1

CNN Based Object Detection in Large Video Images

WangTao, wtao@qiyi.com IQIYI ltd. 2016.4

SLIDE 2

Outline

Introduction
Background
Challenge
Our approach
System framework
Object detection
Scene recognition
Body segmentation
Same style matching
Experiments
Conclusion

SLIDE 3

Background

Image retrieval
Video advertising

Video out applications

SLIDE 4

Challenge

Real video data vs. image dataset
Cluttered background
Multiple objects
Small objects
Varying pose/position
Partial occlusion

SLIDE 5

Our task

Problems:
Content-based object retrieval in large video images
High accuracy for same style matching
High speed on a large video database
Solution:
Accurate object detection + scene classification
Discriminative DNN features and PCA/LDA transformation
Speed-up by parallel indexing and hierarchical filtering

SLIDE 6

System framework

[Figure: system framework diagram with two paths]
Indexing: video key frame, scene classification, object detection, body segmentation, CNN feature, indexing, database
Query: query image, Faster R-CNN rect, body segmentation, CNN feature, scene classification, match, distance sort, result

SLIDE 7

Object detection (I)

Object detection by Faster R-CNN
Faster R-CNN: region proposals + object scores [Ren, Shaoqing, et al., NIPS 2015]
Trained on MS COCO (300k images) + video images (10k images)
More general; suited to images with multiple objects

SLIDE 8

Multi-class object detection, including:
Clothes (skirt, jacket, trousers)
Bags (handbag, backpack, trolley suitcase)
Electronics (mobile phone, laptop, TV, keyboard, mouse, microwave oven, oven, refrigerator)
Glasses, necklace, hat
Shoes

SLIDE 9

Object detection (II)

Object detection by CNN regression
Input an image, output the coordinates of the object rectangle [Erhan, Dumitru, et al., CVPR 2014]
Efficient for images with a single object that Faster R-CNN fails to detect

SLIDE 10

Body Segmentation

Constrained by human body parts
CNN based body segmentation [Jonathan Long, CVPR 2015]
Bounding box, body mask, body parsing
[Figure: original image and segmentation image]

SLIDE 11

Scene classification

CNN based scene classification [Bolei Zhou, NIPS 2014]

Pipeline: video key frame -> is scene? (yes/no) -> CNN based scene classification -> tags

[Figure: non-scene images; scene images of kitchen, office, living room, and bedroom]

Multi-frame fusion

Scene classification: Precision 65.8%, Recall 74%
With threshold 0.7: Precision 83.8%, Recall 56.7%
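
The slides do not say how multi-frame fusion or the 0.7 threshold is applied, so the following is only a minimal sketch under stated assumptions: per-frame class probabilities are averaged over the key frames of a shot, and the top tag is kept only if its fused score reaches the threshold. The function name `fuse_scene_scores` and the dict-based input format are hypothetical. Raising the threshold rejects more uncertain predictions, which is consistent with the precision rising and the recall falling in the figures above.

```python
def fuse_scene_scores(frame_scores, threshold=0.7):
    """Average per-frame scene probabilities and apply a confidence threshold.

    frame_scores: list of dicts mapping scene tag -> probability, one dict per
    key frame (assumed input format, not taken from the slides).
    Returns (tag, score); tag is None when the fused score is below threshold.
    """
    if not frame_scores:
        return None, 0.0

    # Sum the probability of every tag over all frames.
    totals = {}
    for scores in frame_scores:
        for tag, prob in scores.items():
            totals[tag] = totals.get(tag, 0.0) + prob

    # Averaging is the assumed fusion rule.
    n_frames = len(frame_scores)
    fused = {tag: total / n_frames for tag, total in totals.items()}

    best_tag = max(fused, key=fused.get)
    best_score = fused[best_tag]
    if best_score < threshold:
        return None, best_score  # treated as "no confident scene tag"
    return best_tag, best_score
```

Usage: `fuse_scene_scores([{"kitchen": 0.9, "office": 0.1}, {"kitchen": 0.7, "office": 0.3}])` keeps the tag `"kitchen"`, while a single uncertain frame such as `[{"kitchen": 0.6, "office": 0.4}]` is rejected at the default threshold.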

SLIDE 12

Scene classes

0 kitchen
1 dining
2 bakery
3 ice_cream_parlor
4 bathroom
5 washing_room
6 bedroom
7 living_room
8 office
9 children_room
10 nursery
11 toyshop
12 shoe_shop
13 jewelry_shop

14 outdoor_ice_world
15 indoor_ice_skating_rink
16 baseball
17 football
18 basketball_court
19 swimming_pool
20 track
21 bowling_alley
22 billiards
23 tennis
24 volleyball
25 gymnasium
26 pleasure_ground
27 hospital_room
28 dentists
29 drugstore
30 music_studio
31 music_store
32 sandbeach
33 hairsalon
34 bar
35 pagoda
36 bamboo_forest
37 mountain
38 coast
39 creek
40 waterfall
41 grass
42 other

SLIDE 13

Same style matching

SIFT feature matching
Normalization of SIFT
Dimension: 128 dim x 400 pts
MAP 22%
CNN feature of ImageNet 1k classifier
Model: VGG19
Layer: fc7
Dimension : 4096 600
MAP 28%
CNN feature of same style classifier
Model: VGG19
Layer: fc7
Dimension : 4096 600
MAP 34%
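
The slides report MAP without stating which variant is used. As a hedged illustration only, the sketch below computes the standard mean average precision over ranked retrieval lists; the function names and the boolean relevance-list input are assumptions made for this example.

```python
def average_precision(ranked_relevance, num_relevant):
    """Average precision for one query.

    ranked_relevance: list of booleans in rank order, True where the returned
    item is a correct (same style) match.
    num_relevant: total number of correct matches that exist for the query.
    """
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / num_relevant


def mean_average_precision(queries):
    """queries: list of (ranked_relevance, num_relevant) pairs, one per query."""
    if not queries:
        return 0.0
    return sum(average_precision(r, n) for r, n in queries) / len(queries)
```

For a query whose ranked results are correct, wrong, correct, with two correct matches in total, the average precision is (1/1 + 2/3) / 2, about 0.83.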

SLIDE 14

Multi-feature fusion

Same class matching classifier trained on ImageNet 21k classes (15M images)
Same style matching classifier trained on 1239 queries of 1M images
Speed
Nvidia K40 GPU, 10x faster than an i7 CPU
Faster R-CNN speed: 200 ms/frame, image size 1920x1080
VGG19 feature speed: 60 ms/frame, image size 256x256

CNN model                     Feature dim   MAP
Inception_bn1k                1024          24%
Inception_21k                 1024          34%
Vgg19_caffe                   4096          34%
Inception_21k + vgg19_caffe   5120          43%
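
The fused feature dimension of 5120 equals 1024 + 4096, which suggests that the Inception_21k and vgg19_caffe features are concatenated. The sketch below shows one way this could be done; the L2 normalization of each part before concatenation is an assumption (so that neither feature dominates the distance), and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse_features(inception_feat, vgg_feat):
    """Concatenate two per-image descriptors after normalizing each one.

    With a 1024-dim and a 4096-dim input the result has 5120 dimensions.
    """
    return l2_normalize(inception_feat) + l2_normalize(vgg_feat)
```

Usage: `fuse_features([1.0] * 1024, [2.0] * 4096)` returns a list of length 5120 in which each of the two parts has unit length.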

SLIDE 15

Experiments

MAP on 3M test images, trained on 1M images
Speed-up
Parallel FLANN tree indexing
Hierarchical filtering by object class, 10x faster
Query speed: 1 s/image on 5000 teleplays with 2M images

Vgg19 model   Full image   Object rectangle   PCA+LDA   Inception-21k   MAP
√             √            ×                  ×         ×               27.8%
√             ×            √                  ×         ×               34.2%
√             ×            √                  √         ×               37.3%
√             ×            √                  ×         √               43.1%
√             ×            √                  √         √               46.1%
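
Hierarchical filtering by object class can be pictured as keeping one feature index per detected class and searching only the index that matches the query's class, followed by a distance sort. The sketch below is an illustration under assumptions, not the system's implementation: a brute-force Euclidean scan stands in for the FLANN tree index, and the class `ClassFilteredIndex` and its methods are hypothetical names.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ClassFilteredIndex:
    """One feature list per object class; queries scan only the matching class."""

    def __init__(self):
        self._by_class = defaultdict(list)

    def add(self, object_class, item_id, feature):
        self._by_class[object_class].append((item_id, list(feature)))

    def query(self, object_class, feature, top_k=5):
        # Filtering step: candidates from other classes are never compared.
        candidates = self._by_class.get(object_class, [])
        scored = [(euclidean(feature, f), item_id) for item_id, f in candidates]
        scored.sort()  # distance sort, nearest first
        return [(item_id, dist) for dist, item_id in scored[:top_k]]
```

Usage: after adding items `"a"` at `[0, 0]` and `"b"` at `[3, 4]` under class `"handbag"`, the call `index.query("handbag", [0, 0], top_k=2)` returns `"a"` at distance 0.0 followed by `"b"` at distance 5.0, and items indexed under other classes are skipped entirely.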

SLIDE 16

Query system GUI

SLIDE 17

Query examples on image dataset

SLIDE 18

SLIDE 19

Query examples on video dataset

SLIDE 20

SLIDE 21

Conclusion

Bounding boxes are important for recognizing objects
Fusing same style matching features with same class matching features gives higher accuracy
PCA and LDA further improve accuracy and speed
GPU is faster than CPU for CNN feature extraction
Queries are sped up by parallel indexing and hierarchical filtering

SLIDE 22

References

Erhan, Dumitru, et al. "Scalable object detection using deep neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
Arandjelović, Relja, and Andrew Zisserman. "Three things everyone should know to improve object retrieval." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2012.
Jonathan Long, Evan Shelhamer, Trevor Darrell, Fully Convolutional Networks for Semantic Segmentation, CVPR, 2015, arXiv:1411.4038.
S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P. Torr, Conditional Random Fields as Recurrent Neural Networks, ICCV, 2015.
Li Shen, Zhouchen Lin and Qingming Huang, Learning deep convolutional neural networks for places2 scene recognition, CoRR, 2015.
Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba and Aude Oliva, Learning Deep Features for Scene Recognition using Places Database, NIPS, 2014.
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva and Antonio Torralba, Object detectors emerge in deep scene CNNs, ICLR, 2015.
Ruobing Wu, Baoyuan Wang, Wenping Wang and Yizhou Yu, Harvesting discriminative meta objects with deep CNN features for Scene Classification, ICCV, 2015.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna, Rethinking the Inception Architecture for Computer Vision, arXiv:1512.00567, 2015.

SLIDE 23