Commercial Detection in Heterogeneous Video Streams Using Fused Multi-Modal and Temporal Features

Masami Mizutani (Fujitsu Labs. Ltd.)
Shahram Ebadollahi (Columbia University)
Shih-Fu Chang (Columbia University)

IEEE ICASSP 2005, Philadelphia, March 22, 2005
Outline
• Motivation & Previous Work
• Our Proposed Method
  – Approach
  – Local and Global Features for Commercial Detection
  – Fusion
• Experiment & Results
• Conclusion
Motivation
• CM (commercial) detection
  – Find CM and PG (program) boundaries in broadcast material
• Applications:
  – CM skip capability on digital PVRs
  – Collecting CMs for marketing use
  – Preprocessing for further content analysis of PGs, etc.

What's the state of the art?
Previous Work
• Dublin City University Group ('01) [Marlow01]
  – Heuristics using blank and silence detectors
• Philips Research ('03) [Dimitrova03]
  – Uses visual features (blank, scene change rate, text box location) from MPEG streams
  – Optimizes the detection thresholds using a Genetic Algorithm
• Carnegie Mellon University Group ('04) [Hauptmann04]
  – Did not use the blank feature; focuses on color and audio
  – Identical CMs are broadcast many times
  – Finds repeated video segments as CM candidates using SVMs in a hierarchical style
Previous Work (cont'd)
• Reasonable performance, but test data were limited and varied.
• Blank frames are proven to be powerful, but are not always present.
• CMs are not repeated in a heterogeneous data set.
• We build a systematic method to fuse diverse features, including blanks.
• We validate the results using a large, diverse data set.

  Method      Accuracy (F1 %)  # Programs (# Genres)  Total data / CM   Fusion Method
  DCU01       92               10 (a few?)            3.5h / 0.4h       Heuristics
  Philips03   89               24 (6 genres)          12h / 2.5h        Genetic Algorithm
  CMU04       91               10 (only news)         5h / 1.2h         Hierarchical SVMs
  Our Method  92               49 (6 genres)          36h / 9h          SVM + Duration HMM
Our Approach
• A classification problem over detected scene change points
  – The scene change detector works well on CM/PG boundaries (mostly hard cuts or fade in/out).
• Use the pattern of multi-modal features in local windows located at scene change points:
  – 15 sec window: half the length of most CM clips
  – 120 sec window: for capturing the start/end of clips having blanks

[Figure: 15 sec and 120 sec windows at a scene change point in a PG/CM/PG stream, covering blank frame rate (1 bin), scene change rate (1 bin), audio (4 bins), color (12 bins), and overlay text location (16x16 = 256 bins)]
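As a rough illustration (not the authors' code), the windowed counting of events such as scene changes and blank frames around a candidate point can be sketched as follows; the timestamps and the helper name are hypothetical:

```python
# Minimal sketch: count events (scene changes, blank frames) inside a
# window centered at a candidate scene change point.
import bisect

def count_in_window(event_times, center, window_sec):
    """Count events in [center - window/2, center + window/2].
    event_times must be sorted (seconds)."""
    lo = bisect.bisect_left(event_times, center - window_sec / 2.0)
    hi = bisect.bisect_right(event_times, center + window_sec / 2.0)
    return hi - lo

# Illustrative event streams (seconds); not real detector output.
scene_changes = [1.0, 12.5, 14.0, 20.0, 27.5, 29.0, 30.5, 45.0]
blank_frames = [19.8, 44.9]

# Features at the scene change point t = 27.5 s:
sc_rate = count_in_window(scene_changes, 27.5, 15.0)   # SCs in the 15 sec window
bf_rate = count_in_window(blank_frames, 27.5, 120.0)   # BFs in the 120 sec window
```

The same windowing also delimits the audio, color, and overlay text features on the following slides.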
Our Approach (cont'd)
• Use not only local features but also a global temporal feature
  – CM and PG are interleaved in each program.
  – The density and locations of CMs in the entire program stream depend on genre and broadcast source; e.g., CMs recur more quickly in sports than in movies.

[Figure: example distributions of the inter-arrival time of CM segments: likelihood curves for (a) all genres, (b) sports, (c) movies]
Problem Formulation
• Define two hidden states (CM, PG) at scene change points
• Model them as a Markov chain with:
  – Duration feature: duration of stay in a state
  – Fused local features: observed content features at a state
• Detection of CM/PG boundaries
  – Formulated as the problem of inferring the optimal state sequence with a Duration Viterbi algorithm

[Figure: two-state (CM/PG) Markov chain over scene changes, with durations of stay d_CM, d_PG and fused local features f observed at each state]
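A duration-explicit Viterbi over two alternating states can be sketched as below. This is a simplified reconstruction, not the paper's implementation: the duration and observation models are passed in as hypothetical log-likelihood functions, and transitions simply alternate CM/PG.

```python
# Simplified "Duration Viterbi" sketch over scene change points.
# Assumptions (not from the paper): two strictly alternating states,
# log-space scoring, caller-supplied duration/observation models.
import math

def duration_viterbi(times, log_obs, log_dur):
    """times: scene change timestamps t_0..t_{N-1} (segment boundaries).
    log_obs(i, j, s): log-likelihood of observations in (t_i, t_j] under state s.
    log_dur(d, s): log-likelihood of staying duration d in state s.
    Returns one state label per elementary segment."""
    states = ("PG", "CM")
    n = len(times)
    # best[j][s]: best score of a segmentation ending a state-s segment at t_j;
    # back[j][s]: index where that segment began.
    best = [{s: -math.inf for s in states} for _ in range(n)]
    back = [{s: None for s in states} for _ in range(n)]
    best[0] = {s: 0.0 for s in states}
    for j in range(1, n):
        for s in states:
            prev = "CM" if s == "PG" else "PG"  # states must alternate
            for i in range(j):
                score = (best[i][prev]
                         + log_dur(times[j] - times[i], s)
                         + log_obs(i, j, s))
                if score > best[j][s]:
                    best[j][s], back[j][s] = score, i
    # Trace back the winning alternating segmentation.
    s = max(states, key=lambda x: best[n - 1][x])
    labels, j = [], n - 1
    while j > 0:
        i = back[j][s]
        labels[:0] = [s] * (j - i)  # prepend one label per covered segment
        s, j = ("CM" if s == "PG" else "PG"), i
    return labels
```

In the paper the duration term comes from the models on the next slide (Erlang mixture for PG, uniform for CM) and the observation term from the fused local features.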
Modeling Duration of Stay
• Duration of PG: Erlang mixture model
  – The Erlang distribution is better for fitting positive-valued samples [Vasconcelos00].
  – The mixture model fits various genres.
  – The fit is confirmed by a Kolmogorov-Smirnov test.
• Duration of CM: a uniform distribution
  – The models are bounded by their max & min in the training data.
  – The normalized actual duration of stay is considered.

[Figure: duration densities: uniform density 1/(max_CM - min_CM) on [min_CM, max_CM] for CM, and an Erlang mixture on [min_PG, max_PG] for PG]

Now, let's see feature extraction and fusion …
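The two duration densities can be written down directly; the parameterization below (integer shape, rate, mixture weights) is an assumed one for illustration, not the paper's fitted values:

```python
# Sketch of the duration-of-stay densities: an Erlang mixture for PG
# durations and a bounded uniform for CM durations.
import math

def erlang_pdf(d, k, lam):
    """Erlang density with integer shape k and rate lam, for d > 0."""
    return (lam ** k) * (d ** (k - 1)) * math.exp(-lam * d) / math.factorial(k - 1)

def pg_duration_pdf(d, components):
    """Erlang mixture: components = [(weight, shape k, rate lam), ...],
    weights summing to 1."""
    return sum(w * erlang_pdf(d, k, lam) for (w, k, lam) in components)

def cm_duration_pdf(d, d_min, d_max):
    """Uniform on [d_min, d_max] (bounds taken from training data),
    zero outside."""
    return 1.0 / (d_max - d_min) if d_min <= d <= d_max else 0.0
```

In practice the mixture weights and component parameters would be fit per genre on training durations.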
Feature Extraction: Scene Change, Blank and Overlay Text
• Use a scene change (SC) detector [Zhong02] and a simple blank frame (BF) detector
  – # of SCs in the 15 sec window and # of BFs in the 120 sec window
• Use an overlay text location detector based on motion vectors and texture energy [Zhang03]
  – Detection results of every 5 frames are mapped onto a 2D grid of 16x16 bins (16 = 352 pix / 22 horizontally, 16 = 240 pix / 15 vertically)
  – Captures the location and frequency of overlay text appearing in 15 sec (256 bins)

[Figure: counting SCs in the 15 sec window and BFs in the 120 sec window around a scene change]
Feature Extraction: Audio & Color
• Audio (4 bins): an HMM-based classifier using MFCC
  – 1 sec of audio → {silence, speech, music, music/speech}
  – The counts of each class in 15 sec
• Color (12 bins): the histogram of the predetermined 12 palette colors of the shots in 15 sec [Wei04]
  – The palette color of each shot is determined from the 3 dominant colors of its keyframe.
  – The 12 palette colors equally divide L*u*v* space.

[Figure: counting 1 sec audio units and per-shot palette colors in the 15 sec window around a scene change]
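The audio feature is just a class-count histogram over the window. A small sketch, with hypothetical class names and an illustrative window (the real labels come from the HMM classifier):

```python
# Sketch: turn per-second audio class decisions in the 15 sec window
# into the 4-bin count feature described on the slide.
AUDIO_CLASSES = ("silence", "speech", "music", "music_speech")

def audio_histogram(labels_in_window):
    """labels_in_window: one class label per 1 sec unit in the 15 sec window."""
    return [labels_in_window.count(c) for c in AUDIO_CLASSES]

# e.g. a window dominated by music, with some speech and silence:
feat = audio_histogram(["music"] * 9 + ["speech"] * 4 + ["silence"] * 2)
```

The 12-bin color feature is built the same way, counting per-shot palette colors instead of per-second audio classes.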
Fusing Multi-Modal Features
• Fuse into a single posterior probability in a late-fusion (2-step) style, due to the great diversity of the features
  – Use a local two-class (CM/PG) classifier per modality:
    Classifier #1: SC rate (1 bin), Poisson, ML
    Classifier #2: BF rate (1 bin), Poisson, ML
    Classifier #3: Audio (4 bins), SVM w/ RBF
    Classifier #4: Color (12 bins), SVM w/ RBF
    Classifier #5: Overlay text (256 bins), SVM w/ RBF
  – Find the posterior of CM using Bayes rule (for the ML classifiers) and a sigmoid function (for the SVMs) [Platt99]:
    Bayes rule for ML: P(CM|o) = 1 / (1 + P(o|PG)P(PG) / (P(o|CM)P(CM)))
    Sigmoid for SVM output x: P(CM|o) ≈ f(x) = 1 / (1 + e^(αx+β))
  – Another SVM (w/ RBF) fuses the posteriors into a final posterior of CM, which is fed to the Markov chain.
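The two conversions to a posterior can be sketched directly from the slide's formulas. The priors and the sigmoid parameters below are illustrative placeholders, not trained values:

```python
# Sketch of the two posterior conversions: Bayes rule for an ML (Poisson)
# classifier, and a Platt-style sigmoid [Platt99] for an SVM margin x.
import math

def ml_posterior(lik_cm, lik_pg, prior_cm=0.5):
    """P(CM|o) = 1 / (1 + P(o|PG)P(PG) / (P(o|CM)P(CM)))."""
    prior_pg = 1.0 - prior_cm
    return 1.0 / (1.0 + (lik_pg * prior_pg) / (lik_cm * prior_cm))

def svm_posterior(x, alpha=-1.0, beta=0.0):
    """P(CM|o) ≈ 1 / (1 + exp(alpha*x + beta)).
    alpha, beta would be fit on held-out data (Platt scaling)."""
    return 1.0 / (1.0 + math.exp(alpha * x + beta))
```

The five per-modality posteriors then form the input vector of the fusing SVM.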
Experimental Data Set
• Heterogeneous data set:
  – 49 programs from 6 US local/national channels
  – Including 6 genres: News, Drama, Animation, Entertainment, Movie, Sports
  – 36 hrs in total, including 9 hrs of commercials
  – Starts of CM and PG segments are labeled manually
• 3-fold cross validation (training, validation, testing)

[Table: recording schedule. Four channels recorded 6:00PM-11:30PM: WB11 (Fri. 3/12/04), UPN9 (Sat. 3/13/04), FOX5 (Sun. 3/14/04), NBC (Tue. 3/16/04); two channels recorded 12:00PM-5:30PM: ABC7 (Mon. 3/15/04), CBS2 (Thurs. 3/18/04). Per-slot genre labels include DRAMA (SitCom), INFO (D/N, Politics/National, Others), MOVIE, ANIME, ENT (Gossip, Talk Show, Quiz), and a SPORTS EVENT (Basketball Tournament).]
Performance Metric
• F1 [Dimitrova03] for counting correctly classified boundaries
  – Each scene change point is a candidate, with a label of positive (CM) or negative (PG).
  – Higher is better; but F1 cannot account for short errors.

  F1 = 2PR / (P + R)
  R = TP / (TP + FN)  … Recall
  P = TP / (TP + FP)  … Precision

[Figure: ground truth vs. detection result timelines, with TN/FP/TP/FN regions around a CM segment]
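The boundary-classification F1 above is straightforward to compute; a minimal sketch over per-scene-change labels:

```python
# Sketch of the slide's metric: each scene change point carries a true and a
# predicted label, with "CM" as the positive class and "PG" as the negative.
def f1_score(truth, pred):
    tp = sum(t == "CM" and p == "CM" for t, p in zip(truth, pred))
    fp = sum(t == "PG" and p == "CM" for t, p in zip(truth, pred))
    fn = sum(t == "CM" and p == "PG" for t, p in zip(truth, pred))
    if tp == 0:
        return 0.0  # degenerate case: no true positives at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```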
Performance Metric (cont'd)
• WindowDiff [Pevzner02] to measure discrepancies between ground truth (ref.) and detection result (hyp.)
  – Widely used for text segmentation.
  – Lower is better.

  WD(ref, hyp) = 1/(N−k) Σ_{i=1}^{N−k} (|b(ref_i, ref_{i+k}) − b(hyp_i, hyp_{i+k})| > 0)

  N: # of shots in the entire stream, k: avg. number of shots in PG and CM segments,
  b(i, j): # of PG and CM boundaries between positions i and j

[Figure: a sliding window of k shots comparing boundary counts between reference and hypothesis]
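A minimal sketch of WindowDiff over per-shot segment labels (the representation as label sequences is an assumption; the formula itself is as on the slide):

```python
# Sketch of WindowDiff [Pevzner02]: slide a window of k shots and penalize
# positions where the number of boundaries inside the window differs
# between the reference and the hypothesis.
def window_diff(ref, hyp, k):
    """ref, hyp: per-shot labels (e.g. "CM"/"PG") of equal length;
    k: window size in shots."""
    def boundaries(labels, i, j):
        # Count label transitions between positions i and j.
        return sum(labels[m] != labels[m + 1] for m in range(i, j))
    n = len(ref)
    errors = sum(boundaries(ref, i, i + k) != boundaries(hyp, i, i + k)
                 for i in range(n - k))
    return errors / (n - k)
```

Unlike F1 on boundary points, a short spurious segment is only penalized within the windows that overlap it, which is why the paper reports both metrics.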