Narrative Theme Navigation for Sitcoms Supported by Fan-generated Scripts



  1. Narrative Theme Navigation for Sitcoms Supported by Fan-generated Scripts Gerald Friedland, Luke Gottlieb, Adam Janin International Computer Science Institute (ICSI) Presented by: Katya Gonina

  2. What? Novel method to generate indexing information for the navigation of TV content

  3. Why?  Lots of different ways to watch videos  DVD, Blu-ray  On-demand  Internet  Lots of videos out there!  Need better ways to navigate content  Show a particular scene  Show where a favorite actor talks  Support random seek into videos

  4. Example: Sitcoms  Specifically “Seinfeld”  Strict set of rules  Every scene transition is marked by music  Every punchline marked by artificial laughter  Video: http://www.youtube.com/watch?v=PaPxSsK6ZQA

  5. Outline
     1. Original Joke-O-Mat (2009) - System setup - Evaluation - Limitations
     2. Enhanced version (2010) - System setup - Evaluation
     3. Future Work

  6. Outline
     1. Original Joke-O-Mat (2009) - System setup - Evaluation - Limitations
     2. Enhanced version (2010) - System setup - Evaluation
     3. Future Work

  7. Joke-O-Mat  Original system (2008-2009)  Ability to navigate basic narrative elements:  Scenes  Punchlines  Dialog segments  Per-actor filter  Ability to skip certain parts  “Surf” the episode "Using Artistic Markers and Speaker Identification for Narrative-Theme Navigation of Seinfeld Episodes” G. Friedland, L. Gottlieb, and A. Janin Proceedings of the 11th IEEE International Symposium on Multimedia (ISM2009), San Diego, California, pp. 511-516

  8. Joke-O-Mat  Two main elements: 1. Pre-processing step 2. Online video browser

  9. Joke-O-Mat  Two main elements: 1. Pre-processing and analysis step

  10. Acoustic Event & Speaker Identification  Goal: train GMMs for different audio events  Jerry, Kramer, Elaine, George  Male & female supporting actors  Laughter  Music  Non-speech (i.e., other noises)  Use a 1-minute audio sample per event  Compute 19-dim MFCCs  Train 20-component GMMs
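A minimal sketch of this training step, assuming librosa for MFCC extraction and scikit-learn for the GMMs (the slides do not name the toolkit actually used, and the training-clip file names are hypothetical):

    # Sketch: train one 20-component GMM per acoustic event (speaker, laughter,
    # music, non-speech) from ~1 minute of labeled audio per event.
    # librosa/scikit-learn are assumptions; the original toolchain is not named.
    import librosa
    from sklearn.mixture import GaussianMixture

    def train_event_gmm(wav_path, sr=16000):
        y, _ = librosa.load(wav_path, sr=sr)
        # 19-dimensional MFCCs at a 10 ms frame step, as on the slide
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=19,
                                    hop_length=int(0.010 * sr))
        gmm = GaussianMixture(n_components=20, covariance_type='diag')
        gmm.fit(mfcc.T)                      # (n_frames, 19) feature matrix
        return gmm

    models = {name: train_event_gmm(path) for name, path in [
        ('jerry', 'jerry_1min.wav'),         # hypothetical training clips
        ('laughter', 'laughter_1min.wav'),
        ('music', 'music_1min.wav'),
    ]}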

  11. Audio Segmentation  Given the trained GMMs  Slide a 2.5 sec window over the audio: at a 10 ms frame step, that is 250 frames per window  Compute the likelihood of each frame's features under each GMM  Use a majority vote over the window to classify it as one of the speakers or as laughter/music/non-speech
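A sketch of the majority-vote classification, continuing the hypothetical models dict from the previous snippet:

    # Sketch: classify each 2.5 s window (250 frames at a 10 ms step) by a
    # majority vote over per-frame GMM log-likelihoods.
    import numpy as np

    def classify_windows(frames, models, win=250):
        names = list(models)
        # per-frame log-likelihood under every model: (n_models, n_frames)
        ll = np.stack([models[n].score_samples(frames) for n in names])
        frame_labels = ll.argmax(axis=0)     # best-scoring model per frame
        labels = []
        for start in range(0, len(frames) - win + 1, win):
            votes = np.bincount(frame_labels[start:start + win],
                                minlength=len(names))
            labels.append(names[votes.argmax()])  # window-level majority vote
        return labels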

  12. Narrative Theme Analysis  Transforms acoustic event segmentation and speaker detection into narrative theme segments  Rule-based system:  Dialog = single contiguous speech segment  Punchline = dialog + laughter  Top-5 punchlines = 5 punchlines followed by the longest laughter  Scene = segment of at least 10 sec between two music events
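These rules translate almost directly into code; a sketch, assuming the segmentation step emits (label, start_sec, end_sec) tuples in time order:

    # Sketch of the rule-based narrative theme analysis.
    def narrative_themes(segments, speakers):
        # dialog = single contiguous speech segment
        dialogs = [s for s in segments if s[0] in speakers]
        # punchline = dialog immediately followed by laughter
        punchlines = [(d, l[2] - l[1])            # keep the laughter length
                      for d, l in zip(segments, segments[1:])
                      if d[0] in speakers and l[0] == 'laughter']
        # top-5 punchlines = the five with the longest following laughter
        top5 = sorted(punchlines, key=lambda p: p[1], reverse=True)[:5]
        # scene = at least 10 s between two music events
        music = [s for s in segments if s[0] == 'music']
        scenes = [(a[2], b[1]) for a, b in zip(music, music[1:])
                  if b[1] - a[2] >= 10.0]
        return dialogs, punchlines, top5, scenes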

  13. Narrative Theme Analysis  Creates icons for the GUI  Sitcom rule: an actor has to be shown once a certain speaking time is exceeded  Use the median frame of the longest speech segment for each actor  A visual approach could be used here…  Use the median frame for other events too (scene, punchlines, dialog)
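For the icon rule, a small sketch using the same segment tuples as above (extracting the actual video frame is left to a hypothetical helper):

    # Sketch: icon timestamp = median (middle) frame of the actor's longest
    # speech segment.
    def icon_timestamp(segments, actor):
        longest = max((s for s in segments if s[0] == actor),
                      key=lambda s: s[2] - s[1])
        return (longest[1] + longest[2]) / 2.0   # midpoint of that segment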

  14. Online Video Browser  Shows video  Allows play/pause and seeking to random positions  Navigation panel allows browsing directly to:  Scene  Punchline  Top-5 punchlines  Dialog element  Select/deselect actors  http://www.icsi.berkeley.edu/jokeomat/HD/auto/index.html

  15. Evaluation  Runtime for a 25-minute episode:

      Phase                      Performance     Time (25-min episode)
      Training                   30% real-time   2.7 min
      Classification             10% real-time   2.5 min
      Narrative Theme Analysis   10% real-time   2.5 min
      Total                                      7.7 min

   Diarization Error Rate (DER) = 46% overall (about 5% per class)  Winner of the ACM Multimedia Grand Challenge 2009

  16. Limitations of the original Joke-O-Mat  Requires manual training of speaker models  Requires 60 seconds of training data for each speaker  Cannot support actors with minor roles  Does not take into account what was said

  17. Outline
      1. Original Joke-O-Mat (2009) - System setup - Evaluation - Limitations
      2. Enhanced version (2010) - System setup - Evaluation
      3. Future Work

  18. Extended System  Enhanced Joke-O-Mat (2010)  + Speech Recognition  Keyword search  Automatic alignment of speaker ID and ASR with:  Fan-generated scripts  Closed captions  Significantly reduces manual intervention

  19. New Joke-O-Mat System

  20. New Joke-O-Mat System

  21. Context-Augmentation  Producing transcripts can be costly  Luckily we have the Internet!  Scripts and closed captions produced by fans

  22. Fan-generated data  Fan-sourced scripts  Tend to be very accurate  However, don’t contain any time information  Closed captions  Contain time information  However, do not contain speaker attribution  Less accurate, often intentionally altered  Normalize and merge them together…

  23. Fan-generated data  Normalize the scripts and the closed captions  Then use minimum edit distance to align the two sources  Start & end words in the script = start & end words in the caption  Use timing from the closed caption, speaker from the script  If one speaker = single-speaker segment  If multiple speakers = multi-speaker segment (37.3%)
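A sketch of the merge, assuming both sources are already normalized to word lists; Python's difflib stands in here for a full minimum-edit-distance aligner:

    # Sketch: align script words to caption words so caption timing can be
    # attached to the speaker-attributed script text.
    from difflib import SequenceMatcher

    def align_script_to_captions(script_words, caption_words, caption_times):
        # caption_times[i] = (start_sec, end_sec) of caption_words[i]
        sm = SequenceMatcher(a=script_words, b=caption_words, autojunk=False)
        word_times = {}
        for m in sm.get_matching_blocks():
            for k in range(m.size):
                # script word index -> timing of the matching caption word
                word_times[m.a + k] = caption_times[m.b + k]
        return word_times

Each merged segment then takes its start/end time from the first and last matched caption word and its speaker label from the script, as described on the slide.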

  24. Forced Alignment  Transcript + Audio = Alignment  Generate detailed timing information for each word  Perform all steps of a speech recognizer on the audio  But, instead of using a language model, use only the transcript's sequence of words  Also does speaker adaptation over segments  Will be more accurate on speaker-homogeneous segments

  25. Forced Alignment  Run forced alignment on each segment  For the 10 episodes tested, 90% of the segments aligned at this first step  Yields the start time & end time of each word  Plus speaker attribution

  26. Forced Alignment  Pool the segments for each speaker and train speaker models  Also train a garbage model  On the audio that falls between the segments  Assumed to contain only laughter, music, and other non-speech

  27. Forced Alignment  For the failed single-speaker segments:  Still use the segment start and end time  No way to index the exact temporal location of each word  For each failed multi-speaker segment:  Generate an HMM alternating:  Speaker states  Garbage states

  28. Forced Alignment  For each time step, advance an arc and collect probability  E.g., moving across the “Patrice” arc invokes the “Patrice” speaker model at that time step  Segmentation = most probable path through the HMM  Garbage model allows for arbitrary noise between speakers  Minimum duration for each speaker  In practice, the system was not sensitive to the duration
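A hedged sketch of the decoding idea: a left-to-right chain that alternates speaker and garbage states, decoded with Viterbi over per-frame log-likelihoods. The transition weights are made-up placeholders, and the minimum-duration constraint is omitted since the slide notes the system was not sensitive to it:

    # Sketch: Viterbi over a left-to-right chain speaker -> garbage -> speaker ...
    # ll[s][t] = log-likelihood of frame t under state s's model.
    import numpy as np

    def viterbi_segment(ll, stay=np.log(0.9), advance=np.log(0.1)):
        n_states, n_frames = ll.shape
        score = np.full((n_states, n_frames), -np.inf)
        back = np.zeros((n_states, n_frames), dtype=int)
        score[0, 0] = ll[0, 0]                   # chain must start in state 0
        for t in range(1, n_frames):
            for s in range(n_states):
                best = score[s, t - 1] + stay    # stay in the same state
                back[s, t] = s
                if s > 0 and score[s - 1, t - 1] + advance > best:
                    best = score[s - 1, t - 1] + advance  # advance along an arc
                    back[s, t] = s - 1
                score[s, t] = best + ll[s, t]
        path = [n_states - 1]                    # chain must end in last state
        for t in range(n_frames - 1, 0, -1):
            path.append(back[path[-1], t])
        return path[::-1]                        # state index per frame

A minimum duration per speaker could be enforced by expanding each state into a short chain of tied sub-states, but as noted above that refinement made little difference in practice.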

  29. Forced Alignment  Multi-speaker segments => many single-speaker segments  Run the forced alignment with ASR again

  30. Music & Laughter Segmentation  Laughter decoded using Shout speech/nonspeech decoder  Music models are trained separately (same as the original Joke-O-Mat)

  31. Putting it all together http://www.icsi.berkeley.edu/jokeomat/HD/auto/index.html

  32. Evaluation  Compare to expert-annotated ground truth  1. DER  False alarms: closed captions spanning multiple dialog segments  Missed speech: truncation of words in forced alignment

  33. Evaluation  Compare to expert-annotated ground truth 2. User Study  25 participants  Randomly showed expert- and fan-annotated episodes  Asked to state preference

  34. Outline
      1. Original Joke-O-Mat (2009) - System setup - Evaluation - Limitations
      2. Enhanced version (2010) - System setup - Evaluation
      3. Future Work

  35. Limitations & Future Work  Laughter and scene-transition music are still manually trained  Requires scripts and closed captions  Available from show producers  Failed single-speaker segments: how to handle them?  Retrain speaker models  HMM for the whole episode  Look at other genres (dramas, soap operas, lectures?)  New rules  Add visual data

  36. Thanks!
