Deep Learning Based Semantic Video Indexing and Retrieval Anna Podlesnaya, Sergey Podlesnyy Cinema and Photo Research Institute (NIKFI) This work was funded by Russian Federation Ministry of Culture Contract No. 2214-01-41/06-15
Fast Track

Contribution 1 — Video Segmentation: the feature vector extracted by GoogLeNet contains enough semantic information for segmenting raw video into shots with 0.94 precision compared to MPEG-4 i-frames.
Contribution 2 — Video Indexing: a graph-based database for temporal, spatial and semantic properties is proposed. Cost-efficient pipeline.
Contribution 3 — Search by Examples: video retrieval by sample video clip @0.86 precision. Online learning of new concepts: video retrieval by sample photos @0.64 precision.
Relevance

Archives are Huge
● Russian Documentary Archive: 250K items (dated from 1910)
● Russian TV Archive: 100K items
● Youtube: users uploading 100 hrs of video every minute (as of 2013)

Production Needs
● Everyday need for footage in TV production
● Non-fiction movie production relies on historical and cultural heritage content
● Education, research, art...

MPEG-7 Query Format
● QueryByFreeText
● QueryByMedia
● SpatialQuery
● TemporalQuery
ISO/IEC 15938-5:2003 Information technology -- Multimedia content description interface -- Part 5: Multimedia description schemes
Video Segmentation with Semantic Features
● Semantic feature extraction by a deep neural network
● Shots cut at spikes in vector distance between successive frames
● Temporal pooling to summarize shot semantics
Deep Neural Network
Semantics Feature Vector
Distance Between Frames’ Feature Vectors
Segmenting Algorithm Details
{x_0, x_1, …, x_n} — feature vectors of successive frames
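The shot-cutting rule described above can be sketched in Python: compute the cosine distance between feature vectors of successive frames and cut a shot wherever the distance spikes above a threshold, then average-pool each shot's frames into one descriptor. The threshold value and the helper names are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity between two frame feature vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def segment_shots(frame_fvs, threshold=0.3):
    """Cut shots where the distance between successive frames spikes.
    frame_fvs: list of 1-D arrays (e.g. 1024-D GoogLeNet features).
    Returns (start, end) frame-index pairs, end exclusive.
    The 0.3 threshold is an assumed value for illustration."""
    boundaries = [0]
    for i in range(1, len(frame_fvs)):
        if cosine_distance(frame_fvs[i - 1], frame_fvs[i]) > threshold:
            boundaries.append(i)
    boundaries.append(len(frame_fvs))
    return list(zip(boundaries[:-1], boundaries[1:]))

def pool_shot(frame_fvs, start, end):
    # temporal (average) pooling of frame features into one shot descriptor
    return np.mean(frame_fvs[start:end], axis=0)
```

For example, six frames whose features flip from one direction to another mid-sequence are split into two shots at the flip.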
Robustness to Camera Movement (example clips: Zoom, Pan/Rotate, Pan, Pan, Zoom/Pan)
Video Indexing with Graph Database
● Apache Cassandra storage for feature vectors and thumbnails
● Neo4j graph database for the movie archive
● Structured queries for keyword-based retrieval
Indexing Pipeline
Digitizing → Segmentation → FV Extraction → Indexing → BK-Tree Building
● Starting with film or tape
● Store per-frame timecodes and feature vectors in Cassandra
● Store per-scene structure in Neo4j
● Add edges to the Neo4j graph to speed up nearest-neighbor search
● May use additional classifiers for faces, places, salient objects etc.
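The "store per-scene structure in Neo4j" step could be sketched as composing parameterized Cypher MERGE statements, one per shot plus one per detected category. The Shot and Wordnet labels and the Category relationship follow the retrieval queries shown later in the deck; the shot properties and helper name are assumptions.

```python
def index_shot_cypher(shot_id, duration, categories):
    """Compose parameterized Cypher statements to index one shot.
    categories: dict mapping WordNet synset -> classifier weight.
    Schema beyond Shot/Wordnet/Category is an assumption."""
    statements = [(
        "MERGE (s:Shot {id: $id}) SET s.duration = $duration",
        {"id": shot_id, "duration": duration},
    )]
    for synset, weight in categories.items():
        statements.append((
            "MATCH (s:Shot {id: $id}) "
            "MERGE (w:Wordnet {synset: $synset}) "
            "MERGE (s)-[c:Category]->(w) SET c.weight = $weight",
            {"id": shot_id, "synset": synset, "weight": weight},
        ))
    return statements
```

Each (query, params) pair could then be executed with the official Neo4j Python driver, e.g. `session.run(query, params)`.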
Graph-Based Index
Neo4j Graph
Neo4j Query: Find Scenes with a Zebra

MATCH (s:Shot)-[c:Category]->(w:Wordnet {synset: "zebra"})
WHERE c.weight > 0.1
RETURN s ORDER BY s.duration DESC

ASCII art: (s)-[c]->(w)
Neo4j Query: Find Scenes with a Lion to the Left of a Zebra

MATCH (s:Shot)-->(zebra_obj:Salient_obj)-->(wz:Wordnet {synset: "zebra"})
MATCH (s)-->(lion_obj:Salient_obj)-->(wl:Wordnet {synset: "lion"})
MATCH (zebra_obj)-[:Left]->(lion_obj)
RETURN s ORDER BY s.duration DESC

ASCII art:
(s)-->(zebra_obj)-->(wz)
(s)-->(lion_obj)-->(wl)
(zebra_obj)-[:Left]->(lion_obj)
Search by Example
● Find similar clip
● Find near-duplicates
● Online learning of new concepts
One picture is better than 100 words
Use Case 1: Find Similar Clips
Need: elephant herd, forest, sky
Keyword search for ELEPHANT → Select sample clip → Find similar clips → Found clips with the required characteristics
Find Similar Clip

Quick Search
● 31-bit random projection hash (RPH)
● BK-Tree on RPH
● Hamming distance from the sample clip
● Quick but incomplete search

Exhaustive Search
● Feature vectors pooled by scene (R^1024)
● Cosine distance between the sample clip and every other scene in the archive, sorted descending
● Well... slow

The average precision of search by video sample was 0.86. Precision was evaluated by searching by a keyword and then searching by one of the resulting shots with a cosine-distance threshold of 0.3. A human expert performed the true/false positive counting.
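The quick-search path above can be sketched as follows: hash each 1024-D scene vector to 31 bits via random hyperplane projections, then store the hashes in a BK-tree over the Hamming metric so a radius query prunes most of the archive. The class layout and parameter choices are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def random_projection_hash(fv, planes):
    """One bit per random hyperplane: which side the vector falls on.
    planes: (n_bits, d) matrix, e.g. 31 x 1024 for this pipeline."""
    bits = np.dot(planes, fv) > 0
    return int(sum(1 << i for i, b in enumerate(bits) if b))

def hamming(a, b):
    return bin(a ^ b).count("1")

class BKTree:
    """BK-tree over the Hamming metric for radius queries on hashes."""
    def __init__(self):
        self.root = None  # node = (hash, payload, {distance: child})

    def add(self, h, payload):
        if self.root is None:
            self.root = (h, payload, {})
            return
        node = self.root
        while True:
            d = hamming(h, node[0])
            if d in node[2]:
                node = node[2][d]
            else:
                node[2][d] = (h, payload, {})
                return

    def query(self, h, radius):
        results, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = hamming(h, node[0])
            if d <= radius:
                results.append(node[1])
            # triangle inequality: only children in [d-radius, d+radius] can match
            for cd, child in node[2].items():
                if d - radius <= cd <= d + radius:
                    stack.append(child)
        return results
```

A typical setup would draw `planes = rng.standard_normal((31, 1024))` once and reuse it for every scene, making the search quick but incomplete, exactly as the slide notes.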
Use Case 2: Find Near-Duplicates
Show sample clip → Extract features → Exhaustive search → Sort by cosine distance
Found near-duplicates, robust to resampling, vignetting, hue/sat augmentation etc.
(cosine distances of top matches: 0.0057, 0.0124, 0.0152, 0.0583)
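The exhaustive near-duplicate search reduces to one normalized matrix-vector product: rank every pooled scene vector by cosine distance to the sample and sort ascending. A minimal sketch, assuming scene vectors are stacked row-wise in a matrix:

```python
import numpy as np

def exhaustive_search(sample_fv, archive_fvs):
    """Rank every scene by cosine distance to the sample clip.
    archive_fvs: (n_scenes, d) matrix of pooled scene feature vectors.
    Returns (indices, distances) sorted by ascending distance."""
    a = archive_fvs / np.linalg.norm(archive_fvs, axis=1, keepdims=True)
    s = sample_fv / np.linalg.norm(sample_fv)
    dist = 1.0 - a @ s          # cosine distance to every scene at once
    order = np.argsort(dist)
    return order, dist[order]
```

Near-duplicates surface at the top with distances close to zero, matching the small distances shown on the slide.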
Use Case 3: Online Learning of New Concepts
Show sample images of an unknown concept → Google Image Search → Train linear classifier on feature vectors (Vowpal Wabbit) → Exhaustive search
Found video clips matching the classifier trained on images, AP 0.64
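The online-learning step trains a linear classifier on image feature vectors and scores the archive with it. The pipeline uses Vowpal Wabbit; the numpy logistic regression below is only a minimal stand-in to show the idea, with assumed hyperparameters.

```python
import numpy as np

def train_linear_classifier(pos_fvs, neg_fvs, lr=0.1, epochs=200):
    """Logistic regression on image feature vectors (stand-in for the
    Vowpal Wabbit model used in the actual pipeline).
    pos_fvs: features of sample images of the new concept.
    neg_fvs: features of background/negative images."""
    X = np.vstack([pos_fvs, neg_fvs])
    y = np.concatenate([np.ones(len(pos_fvs)), np.zeros(len(neg_fvs))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        grad = p - y                            # gradient of log-loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def score_archive(archive_fvs, w, b):
    # higher score = scene more likely to show the new concept
    return archive_fvs @ w + b
```

Because frame features and image features live in the same GoogLeNet space, a classifier trained purely on web images can rank video scenes directly.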
Future Work
● Faces
● Places
● Video-to-text annotations

Thank you! Questions welcome