Narrative Theme Navigation for Sitcoms Supported by Fan-generated Scripts Gerald Friedland, Luke Gottlieb, Adam Janin International Computer Science Institute (ICSI) Presented by: Katya Gonina
What? Novel method to generate indexing information for the navigation of TV content
Why? Lots of different ways to watch videos: DVD, Blu-ray, on-demand, Internet. Lots of videos out there! We need better ways to navigate content: show a particular scene, show where a favorite actor talks, support random seek into videos.
Example: Sitcoms, specifically “Seinfeld”. Strict set of rules: every scene transition is marked by music, every punchline is marked by artificial laughter. Video: http://www.youtube.com/watch?v=PaPxSsK6ZQA
Outline: 1. Original Joke-O-Mat (2009): system setup, evaluation, limitations. 2. Enhanced version (2010): system setup, evaluation. 3. Future work.
Joke-O-Mat: Original system (2008-2009). Ability to navigate basic narrative elements: scenes, punchlines, dialog segments, per-actor filter. Ability to skip certain parts and “surf” the episode. “Using Artistic Markers and Speaker Identification for Narrative-Theme Navigation of Seinfeld Episodes,” G. Friedland, L. Gottlieb, and A. Janin, Proceedings of the 11th IEEE International Symposium on Multimedia (ISM2009), San Diego, California, pp. 511-516.
Joke-O-Mat: Two main elements: 1. Pre-processing and analysis step; 2. Online video browser.
Acoustic Event & Speaker Identification. Goal: train GMMs for the different audio classes: Jerry, Kramer, Elaine, George, male & female supporting actors, laughter, music, and non-speech (i.e., other noises). Use a 1-minute audio sample per class, compute 19-dim MFCCs, and train 20-component GMMs.
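As an illustration only, a minimal sketch of this per-class GMM training, assuming one labeled WAV clip per class and the librosa/scikit-learn tooling (file paths and the helper function are hypothetical, not the authors' code):

```python
# Sketch: train one 20-component GMM per audio class on 19-dim MFCCs.
# Assumes ~1 minute of labeled audio per class in WAV files (hypothetical paths).
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, n_mfcc=19):
    # 10 ms hop so one feature vector corresponds to one 10 ms frame
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=int(0.010 * sr))
    return mfcc.T  # shape: (num_frames, 19)

classes = {
    "jerry": "train/jerry_1min.wav",
    "kramer": "train/kramer_1min.wav",
    "elaine": "train/elaine_1min.wav",
    "george": "train/george_1min.wav",
    "laughter": "train/laughter_1min.wav",
    "music": "train/music_1min.wav",
    "nonspeech": "train/nonspeech_1min.wav",
}

models = {}
for name, path in classes.items():
    feats = extract_mfcc(path)
    gmm = GaussianMixture(n_components=20, covariance_type="diag")
    models[name] = gmm.fit(feats)
```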
Audio Segmentation. Given the trained GMMs, process the episode in 2.5 s windows (250 frames at 10 ms per frame). Compute the likelihood of each frame's features under each GMM, then use a majority vote over the window to classify it as one of the speakers or as laughter/music/non-speech.
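A minimal sketch of the per-frame scoring and majority vote, reusing the hypothetical `models` dict and `extract_mfcc` helper from the previous sketch:

```python
# Sketch: classify each 2.5 s window (250 frames of 10 ms) by majority vote
# over per-frame GMM decisions.
import numpy as np
from collections import Counter

def segment_audio(features, models, window=250):
    # features: (num_frames, 19) MFCCs for the whole episode
    names = list(models.keys())
    # per-frame log-likelihood under every model, best model per frame
    scores = np.stack([models[n].score_samples(features) for n in names])
    frame_best = scores.argmax(axis=0)
    labels = []
    for start in range(0, len(features), window):
        votes = Counter(frame_best[start:start + window])
        labels.append(names[votes.most_common(1)[0][0]])
    return labels  # one class label per 2.5 s window
```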
Narrative Theme Analysis. Transforms the acoustic event segmentation and speaker detection into narrative theme segments. Rule-based system: dialog = a single contiguous speech segment; punchline = a dialog segment followed by laughter; top-5 punchlines = the 5 punchlines followed by the longest laughter; scene = a segment of at least 10 s between two music events.
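These rules could be expressed roughly as follows; the window-label input and segment tuples are assumptions for illustration, not the authors' implementation:

```python
# Sketch: turn per-window class labels into narrative-theme segments.
SPEAKERS = {"jerry", "kramer", "elaine", "george"}

def runs(labels, window_sec=2.5):
    """Collapse consecutive identical labels into (label, start_s, end_s) runs."""
    out, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            out.append((labels[start], start * window_sec, i * window_sec))
            start = i
    return out

def narrative_themes(labels):
    segs = runs(labels)
    # dialog = single contiguous speech segment
    dialogs = [s for s in segs if s[0] in SPEAKERS]
    # punchline = dialog immediately followed by laughter, ranked by laughter length
    punchlines = [(seg, nxt[2] - nxt[1]) for seg, nxt in zip(segs, segs[1:])
                  if seg[0] in SPEAKERS and nxt[0] == "laughter"]
    top5 = sorted(punchlines, key=lambda p: -p[1])[:5]
    # scene = stretch of at least 10 s between two music events
    music = [s for s in segs if s[0] == "music"]
    scenes = [(a[2], b[1]) for a, b in zip(music, music[1:]) if b[1] - a[2] >= 10]
    return dialogs, [p[0] for p in punchlines], [p[0] for p in top5], scenes
```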
Narrative Theme Analysis: creates icons for the GUI. Sitcom rule: an actor has to be shown on screen once a certain speaking time is exceeded, so the icon is the median frame of the longest speech segment for each actor. (A visual approach could be used here as well.) The median frame is also used for the other events (scenes, punchlines, dialog).
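A small sketch of extracting the temporally median frame of a segment as the icon, assuming OpenCV and a segment given as (start, end) times in seconds:

```python
# Sketch: grab the middle (median-time) frame of a segment as the GUI icon.
import cv2

def icon_frame(video_path, start_s, end_s):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    mid_frame = int(((start_s + end_s) / 2.0) * fps)
    cap.set(cv2.CAP_PROP_POS_FRAMES, mid_frame)
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None
```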
Online Video Browser. Shows the video; allows play/pause and seeking to random positions. A navigational panel allows browsing directly to: scene, punchline, top-5 punchlines, dialog element; actors can be selected/deselected. http://www.icsi.berkeley.edu/jokeomat/HD/auto/index.html
Evaluation. Performance for a 25 min episode, by phase: Training: 30% real-time (2.7 min); Classification: 10% real-time (2.5 min); Narrative Theme Analysis: 10% real-time (2.5 min); Total: 7.7 min. Diarization Error Rate (DER) = 46% (5% per class). Winner of the ACM Multimedia Grand Challenge 2009.
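For reference (not stated on the slide), DER is the standard NIST diarization metric: DER = (false alarm time + missed speech time + speaker confusion time) / total reference speech time.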
Limitations of the original Joke-O-Mat Requires manual training of speaker models Requires 60 seconds of training data for each speaker Cannot support actors with minor roles Does not take into account what was said
Outline: 1. Original Joke-O-Mat (2009): system setup, evaluation, limitations. 2. Enhanced version (2010): system setup, evaluation. 3. Future work.
Extended System: Enhanced Joke-O-Mat (2010) adds speech recognition and keyword search, plus automatic alignment of speaker ID and ASR with fan-generated scripts and closed captions. Significantly reduces manual intervention.
New Joke-O-Mat System
Context-Augmentation: producing transcripts can be costly. Luckily, we have the Internet! Scripts and closed captions are produced by fans.
Fan-generated data. Fan-sourced scripts tend to be very accurate but don't contain any time information. Closed captions contain time information but no speaker attribution; they are less accurate and often intentionally altered. Normalize and merge them together…
Fan-generated data. Normalize the scripts and the closed captions, then use minimum edit distance to align the two sources, so that start and end words in the script correspond to start and end words in the captions. Use timing from the closed caption and the speaker from the script. If one speaker, the result is a single-speaker segment; if multiple speakers, a multi-speaker segment (37.3%).
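A rough sketch of this alignment using Python's difflib; the word-list data structures are assumptions, not the authors' format:

```python
# Sketch: align normalized script words with normalized caption words,
# then carry caption timing onto the speaker-attributed script words.
from difflib import SequenceMatcher

def align(script_words, caption_words):
    # script_words: [(speaker, word)], caption_words: [(start_s, end_s, word)]
    sm = SequenceMatcher(a=[w for _, w in script_words],
                         b=[w for *_, w in caption_words])
    timed = []
    for a0, b0, size in sm.get_matching_blocks():
        for k in range(size):
            speaker, word = script_words[a0 + k]
            start, end, _ = caption_words[b0 + k]
            timed.append((speaker, word, start, end))
    return timed  # words with both speaker attribution and timing
```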
Forced Alignment: Transcript + Audio = Alignment. Generates detailed timing information for each word. Performs all the steps of a speech recognizer on the audio, but instead of using a language model, uses only the transcript's sequence of words. Also does speaker adaptation over segments, so it is more accurate on speaker-homogeneous segments.
Forced Alignment: run forced alignment on each segment. For the 10 episodes tested, 90% of the segments aligned at the first step, yielding the start and end time of each word plus speaker attribution.
Forced Alignment: pool the segments for each speaker and train speaker models, plus a garbage model trained on the audio that falls between the segments, which is assumed to contain only laughter, music, and other non-speech.
Forced Alignment: for the failed single-speaker segments, still use the segment start and end time; there is no way to index the exact temporal location of each word. For each failed multi-speaker segment, generate an HMM alternating between speaker states and garbage states.
Forced Alignment: at each time step, advance along an arc and collect probability (e.g., moving across the “Patrice” arc invokes the “Patrice” speaker model at that time step). The segmentation is the most probable path through the HMM. The garbage model allows for arbitrary noise between speakers, and a minimum duration is imposed for each speaker (in reality, the system was not sensitive to this duration).
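A toy Viterbi sketch of decoding such an alternating speaker/garbage HMM from per-frame GMM log-likelihoods; this is a simplification without the minimum-duration constraint, not the actual decoder:

```python
# Sketch: Viterbi decoding over an HMM that alternates speaker and garbage
# states for one multi-speaker segment.
import numpy as np

def decode_segment(frame_loglik, state_names, trans_logprob):
    """frame_loglik: (T, S) log-likelihood of each frame under each state's GMM.
    trans_logprob: (S, S) log transition probabilities (speaker <-> garbage).
    Returns the most probable state sequence, i.e. the segmentation."""
    T, S = frame_loglik.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = frame_loglik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + trans_logprob  # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + frame_loglik[t]
    # backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [state_names[s] for s in reversed(path)]
```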
Forced Alignment Multi-speaker segments => many single-speaker segments Run the forced alignment with ASR again
Music & Laughter Segmentation: laughter is decoded using the Shout speech/non-speech decoder; music models are trained separately (same as in the original Joke-O-Mat).
Putting it all together: http://www.icsi.berkeley.edu/jokeomat/HD/auto/index.html
Evaluation: compare to expert-annotated ground truth. 1. DER: false alarms come from closed captions spanning multiple dialog segments; missed speech comes from truncation of words in forced alignment.
Evaluation: compare to expert-annotated ground truth. 2. User study: 25 participants were randomly shown expert- and fan-annotated episodes and asked to state a preference.
Outline: 1. Original Joke-O-Mat (2009): system setup, evaluation, limitations. 2. Enhanced version (2010): system setup, evaluation. 3. Future work.
Limitations & Future Work: laughter and scene-transition music models are still manually trained. The system requires scripts and closed captions (available from show producers). How to handle failed single-speaker segments? Retrain the speaker models, or run an HMM over the whole episode. Look at other genres (dramas, soap operas, lectures?), which would need new rules. Add visual data.
Thanks!