EE 6882: Overview of Statistical Models for Video Indexing
Prof. Shih-Fu Chang, Columbia University. TA: Eric Zavesky. Fall 2007, Lecture 4.
Course web site: http://www.ee.columbia.edu/~sfchang/course/svia

Statistical Paradigm
- Many problems can be posed as pattern recognition
  - Image classification: indoor vs. outdoor? Face present?
  - Shot boundary detection, story segmentation: is the current point a boundary?
- Statistical models handle uncertainty and provide flexibility
- Image processing tools are available (e.g., homework #1)
- Rich tools for learning and prediction (see course web site)
- Increasing data available
  - NIST TRECVID: 300+ hours
  - Consumer and YouTube videos
A Very High-Level Statistical Pattern Recognition Architecture
(From Jain, Duin, and Mao, Statistical Pattern Recognition Review, 1999)

Important Issues (1)
- Image/video processing
  - What is the adequate quality, resolution, etc.?
- Feature extraction
  - Color, texture, motion, region, shape, interest points, audio, speech, text, etc.
- Feature representation
  - Histogram, bag, graph, etc.
  - Invariance to scale, rotation, translation, view, illumination, ...
  - How to reduce dimensions?
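To make the feature-representation bullet concrete, here is a minimal sketch of one of the representations named above, a color histogram. The image, bin count, and normalization scheme are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize each RGB channel into `bins` levels and count pixels.
    The resulting (bins * 3)-dim feature is invariant to translation;
    normalizing by pixel count adds invariance to image size."""
    hist = []
    for c in range(3):  # one histogram per channel
        h, _ = np.histogram(image[..., c], bins=bins, range=(0, 256))
        hist.append(h)
    hist = np.concatenate(hist).astype(float)
    return hist / hist.sum()

# toy 4x4 "image": pure red, so all mass lands in the top bin of channel 0
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[..., 0] = 255
feat = color_histogram(img)
```

Note that a histogram discards all spatial layout, which is exactly the trade-off the "invariance" bullet refers to.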
Important Issues (2)
- Distance measurement
  - How to measure similarity between images/videos?
  - L1, L2, Mahalanobis, Earth Mover's Distance, vector/graph matching
- Classification models
  - Generative vs. discriminative
  - Multi-modal fusion: early fusion vs. late fusion
  - E.g., how to use joint audio-visual features to detect events (dancing, wedding, ...)
- Efficiency issues
  - How to speed up training and testing?
  - How to rapidly build a model for new domains?
- Validation and evaluation
  - How to measure performance?
  - Are models generalizable to new domains?

Three Related Problems
- Retrieval, ranking
  - Given a query image, find relevant ones
  - May apply a rank threshold to decide relevance
- Classification, categorization, detection
  - Given an image x, predict class label y
- Clustering, grouping
  - Group images/videos into clusters of distinct attributes
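The first three distance measures listed above can be sketched in a few lines; the test vectors and identity covariance are illustrative choices, and Mahalanobis reduces to L2 exactly in that case.

```python
import numpy as np

def l1(x, y):
    """L1 (city-block) distance: sum of absolute coordinate differences."""
    return float(np.abs(x - y).sum())

def l2(x, y):
    """L2 (Euclidean) distance."""
    return float(np.sqrt(((x - y) ** 2).sum()))

def mahalanobis(x, y, cov):
    """Mahalanobis distance: L2 after whitening by the feature covariance,
    so correlated or high-variance dimensions are down-weighted."""
    d = x - y
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
d1 = l1(x, y)                          # 2.0
d2 = l2(x, y)                          # sqrt(2)
dm = mahalanobis(x, y, np.eye(2))      # equals L2 when cov = I
```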
An Example: News Story Segmentation Using Multi-Modal, Multi-Scale Features

First Understand Data Types and Explore Unique Characteristics
[Figure: percentage of content by anchor-segment type — (a) regular anchor segment, (b) different anchor, (c) multi-story in an anchor segment, (d) continuous sports briefings, (e) continuous short briefings, (f) separated by music or animation, (g) weather report, (h) anchor lead-in before commercial, (i) commercial after sports. Legend: story, weather, commercial, misc./animation, visual anchors. The reported percentages include 32.0%, 21.3%, 15.0%, and 8.8%; their mapping to categories is not recoverable here.]
News Story Segmentation
- Objective: is there a story boundary at candidate time τ_k?
  - τ_k ∈ {shot boundaries or significant pauses}
  - Observation window around each candidate point: (τ_{k-1}, τ_{k+1})
- Observations {video, audio, text}: an anchor face? motion changes? a change from music to speech? a speech segment? cue words {i} or {j} appearing?
- Need to decide how to formulate features
- Challenge: diverse features

Modality | Raw feature          | Data type | Value type
Video    | shot boundary        | point     | binary
Video    | motion               | segment   | continuous
Video    | face                 | segment   | continuous
Video    | commercial           | segment   | binary
Audio    | pause                | point     | continuous
Audio    | pitch jump           | point     | continuous
Audio    | significant pause    | point     | continuous
Audio    | music/speech disc.   | segment   | binary
Audio    | speech seg./rapidity | segment   | continuous
Text     | ASR cue terms        | point     | binary
Text     | V-OCR cue terms      | point     | binary
Text     | text seg. score      | point     | continuous
Misc.    | combinatorial        | point     | binary
Misc.    | sports               | segment   | binary

- One way to unify these is a binary predicate: if x > threshold, then predict a segment boundary (b = 1)
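The binary-predicate idea above is simple enough to state in code. The raw feature values and the 2.0-second pause threshold below match predicate 3 on the next slide; the text-score threshold is a hypothetical value for illustration.

```python
def to_predicate(raw_value, threshold):
    """Binarize a continuous raw feature into a 0/1 predicate:
    f(x) = 1 if x > threshold, else 0."""
    return 1 if raw_value > threshold else 0

# hypothetical raw features observed at one candidate point
pause_duration = 2.3   # seconds
text_seg_score = 0.4   # hypothetical score in [0, 1]

f_pause = to_predicate(pause_duration, 2.0)  # pause longer than 2.0 s -> fires
f_text = to_predicate(text_seg_score, 0.5)   # below threshold -> does not fire
```

Thresholding throws away the magnitude of the raw feature, but it lets continuous and binary features from every modality feed the same model.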
Example Predicates

No. | Raw feature set                    | Predicate
1   | Anchor face                        | An anchor face segment starts just after the candidate point.
2   | Significant pause & non-commercial | A significant pause within the non-commercial section appears in the surrounding observation window.
3   | Pause                              | An audio pause with duration larger than 2.0 seconds appears after the boundary point.
4   | Significant pause                  | The surrounding observation window has a significant pause with pitch-jump intensity larger than the normalized pitch threshold 1.0 and pause duration larger than 0.5 second.
5   | Speech segment                     | A speech segment occurs before the candidate point.
6   | Speech segment                     | A speech segment starts in the surrounding observation window.
7   | Commercial                         | A commercial starts 15 to 20 seconds after the candidate point.
8   | Speech segment                     | A speech segment ends after the candidate point.
9   | Anchor face                        | An anchor face segment occupies at least 10% of the next window.
10  | Pause                              | The surrounding observation window has a pause with duration larger than 0.25 second.

Collect Features from Training Samples
- Each training sample is a binary vector: the boundary label b plus one entry per predicate f_i (face, motion, significant pause, speech segment, commercial, text segmentation score, ASR cue terms, ...), evaluated at the candidate point.
[Figure: example binary feature matrix — rows are predicates f_i, columns are candidate points; the top row holds the boundary labels b.]
Choose Model
- Maximum entropy model:

  q_λ(b | x) = (1 / Z_λ(x)) · exp( Σ_i λ_i f_i(x, b) ),   where f_i(x, b) ∈ {0, 1}

- For example, with predicates f_1 = 'anchor face' and f_2 = 'significant pause', if the current observation has face = YES and pause = NO:

  q(b = YES | x) = e^{λ_1} / (e^{λ_1} + e^{λ_2})
  q(b = NO | x)  = e^{λ_2} / (e^{λ_1} + e^{λ_2})

- Classification: if q(b = YES | x) > 0.5, then predict YES.

Background: Entropy
- Entropy (bits): H = −Σ_{i=1}^{m} p_i log_2 p_i
- Kullback-Leibler (K-L) distance: a measure of 'distance' between two distributions

  D_KL( p(x), q(x) ) = Σ_x q(x) log( q(x) / p(x) ),  or  ∫_{−∞}^{∞} q(x) log( q(x) / p(x) ) dx

- D_KL ≥ 0, and D_KL = 0 iff p ≡ q
- Not necessarily symmetric; may not satisfy the triangle inequality
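The two-feature example and the background definitions above can be checked numerically. This is a minimal sketch: the weight values λ_1 = 1.5 and λ_2 = 0.5 are hypothetical, chosen only so the worked posterior is concrete.

```python
import math

def entropy(p):
    """Shannon entropy in bits: H = -sum_i p_i log2 p_i."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl(q, p):
    """K-L 'distance' as written on the slide: sum_x q(x) log(q(x)/p(x))."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# two-feature max-ent example: 'anchor face' fires for b = YES,
# 'significant pause' fires for b = NO; weights are hypothetical
lam1, lam2 = 1.5, 0.5
z = math.exp(lam1) + math.exp(lam2)   # partition function Z_lambda(x)
q_yes = math.exp(lam1) / z
q_no = math.exp(lam2) / z
prediction = "YES" if q_yes > 0.5 else "NO"
```

With λ_1 > λ_2 the anchor-face evidence dominates, so the rule q(b = YES | x) > 0.5 predicts a boundary.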
How to Determine the Weights in the Model?
- Estimate q_λ(b | x) from training data T = {(x_k, b_k)} by minimizing the Kullback-Leibler divergence to the empirical distribution p̃ estimated from the data:

  D( p̃ || q_λ ) = Σ_{x,b} p̃(x, b) log( p̃(b | x) / q_λ(b | x) )
                = −Σ_{x,b} p̃(x, b) log q_λ(b | x) + constant(p̃)

- Equivalently, find the λ_i that maximize the log-likelihood of the model (the maximum entropy solution):

  L_p̃(q_λ) ≡ Σ_{x,b} p̃(x, b) log q_λ(b | x)

- Iteratively update each λ_i:

  λ_i' = λ_i + Δλ_i,   Δλ_i = (1/M) log( Σ_{x,b} p̃(x, b) f_i(x, b) / Σ_{x,b} p̃(x) q_λ(b | x) f_i(x, b) )

- The objective function is convex, so the iterative process reaches the optimum.

The Same Model Used to Select Features
- Input: collection of candidate features, training samples, and the desired model size
- Output: optimal subset of features and their corresponding exponential weights
- Current model q augmented with feature h and weight α:

  q_{α,h}(b | x) = q(b | x) · e^{α h(x, b)} / Z_α(x)

- In each iteration, select the candidate that improves the current model the most:

  h* = argmax_{h ∈ C} sup_α { D(p̃ || q) − D(p̃ || q_{α,h}) }     (reduction of divergence)
     = argmax_{h ∈ C} sup_α { L_p̃(q_{α,h}) − L_p̃(q) }           (increase of log-likelihood)
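The iterative update above is generalized iterative scaling (GIS), which assumes the features sum to a constant M for every (x, b). Below is a minimal sketch on a toy problem; the two-context, two-label setup and the empirical distribution p_xb are hypothetical, chosen so the fitted model should match the empirical conditionals exactly.

```python
import numpy as np

def gis(F, p_xb, iters=50):
    """Generalized Iterative Scaling sketch for the conditional max-ent model
    q_lambda(b|x) = exp(sum_i lam_i f_i(x, b)) / Z_lambda(x).

    F[x, b, i] : binary feature values; GIS assumes sum_i F[x, b, :] is the
    same constant M for all (x, b).  p_xb[x, b] : empirical distribution."""
    M = F.sum(axis=2).max()
    lam = np.zeros(F.shape[2])
    p_x = p_xb.sum(axis=1)                           # empirical p(x)
    emp = np.einsum('xb,xbi->i', p_xb, F)            # empirical E[f_i]
    for _ in range(iters):
        q = np.exp(F @ lam)
        q /= q.sum(axis=1, keepdims=True)            # normalize over b
        model = np.einsum('x,xb,xbi->i', p_x, q, F)  # model E[f_i]
        lam += np.log(emp / model) / M               # the slide's update rule
    return lam, q

# toy problem: 2 contexts x, 2 labels b, one indicator feature per (x, b)
# pair, so sum_i f_i(x, b) = 1 = M for every (x, b)
F = np.zeros((2, 2, 4))
for x in range(2):
    for b in range(2):
        F[x, b, 2 * x + b] = 1.0

p_xb = np.array([[0.3, 0.2],    # hypothetical empirical counts:
                 [0.1, 0.4]])   # p(b=0|x=0) = 0.6, p(b=1|x=1) = 0.8
lam, q = gis(F, p_xb)
```

Because the objective is convex, the loop converges to the unique optimum; with indicator features per (x, b) pair, that optimum reproduces the empirical conditionals q(b | x) = p̃(b | x).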