Semantic Multi-modal Analysis, Structuring, and Visualization for Candid Personal Interaction Videos

Alexander Haubold
Department of Computer Science, Columbia University

Thesis Proposal

Abstract

Videos are rich in multimedia content and semantics, which video browsers should use to better present audio-visual information to the viewer. Ubiquitous video players allow content to be scanned linearly, but rarely provide summaries or methods for searching. Through analysis of the audio and video tracks, it is possible to extract text transcripts from audio, displayed text from video, and higher-level semantics through speaker identification and scene analysis. External data sources, when available, can be used to cross-reference the video content and impose a structure for organization. Various research tools have addressed video summarization and browsing using one or more of these modalities; however, most of them assume edited videos as input. We focus our research on genres of personal interaction videos, and collections of such videos, in their unedited form. We present and verify formal models for their structure, and develop methods for their automatic analysis, summarization, and indexing.

We specify the characteristic semantic components of three related genres of candidly captured videos: formal instructions or lectures, student team project presentations, and discussions. For each genre, we design and validate a separate multi-modal approach to the segmentation and structuring of its content. We develop novel user interfaces to support browsing and searching the multi-modal video information, and introduce the tool in a classroom environment with ≈160 students per semester. UI elements are designed according to the underlying video structure to address video browsing in a structured multi-modal space. These user interfaces include image/video browsers, audio/video segmentation browsers, and text/filtered ASR transcript browsers. Through several user studies, we evaluate and refine our indexing methods, browser interface, and the tool's usefulness in the classroom.

We propose a core/module methodology for the analysis, structuring, and visualization of personal interaction videos. Analysis, structure, and visualization techniques in the core are common to all genres. Modular features are characteristic of individual video genres, and are applied selectively. The structure of interactions in each video is derived from the combination of the resulting audio, visual, and textual features. We expect that the framework can be applied to genres not covered here with the addition or replacement of a few characteristic modules.
Contents

1 Introduction
  1.1 Motivation
  1.2 Background
2 Research Approaches
  2.1 Three Genres
  2.2 Common Tools
    2.2.1 Analysis and Structure
    2.2.2 Visualization Techniques
  2.3 Genre-specific Tools
    2.3.1 Lectures
    2.3.2 Presentations
    2.3.3 Discussions
3 Research Progress
  3.1 Structuring Lecture Videos Using Visual Contents
    3.1.1 Classification by Media Type
    3.1.2 Topological Segmentation
  3.2 Structuring Lecture Videos Using Textual Contents
    3.2.1 Data Acquisition
    3.2.2 Analysis
    3.2.3 Results
  3.3 Segmentation and Augmentation for Classroom Presentation Videos
    3.3.1 Audio Segmentation
    3.3.2 Visual Segmentation
    3.3.3 Combined Audio-Visual Segments
    3.3.4 Text Augmentation
    3.3.5 Interface
    3.3.6 User Study
  3.4 Research on Accommodating Sample Size Effects in Symmetric KL in Speaker Clustering
    3.4.1 Empirical Solution
  3.5 Summary of Research Progress
4 Proposed Work
  4.1 High-level Structure Detection
  4.2 Video Structure Comparison
  4.3 Speaker Table of Contents
  4.4 Analysis of Discussion Video and Application of Common Approaches
  4.5 Text Indexing
  4.6 User Interfaces and Tools
  4.7 User Studies
  4.8 Feedback Annotations for Videos (Optional)
5 Conclusion
  5.1 Schedule
1 Introduction

1.1 Motivation

With the advent of ubiquitous high-performance computers, high-speed networks, and inexpensive recording equipment, the use of video as a medium for communication and information dissemination is increasing significantly. What used to be a laborious task of recording with bulky video equipment, transferring footage to computers with expensive hardware, and disseminating video material by means of portable media is now reduced to simple plug-and-play procedures and easy-to-use software for compression and distribution. Due to its simplicity, video has started to play an important and integral role for many organizations: presentations, discussions, lectures, and other events are readily captured, shared, and archived. This trend is particularly strong in the university environment, where lectures are recorded for distance learning programs, student presentations and discussions are documented by instructors to provide feedback, and guest talks are captured for archival purposes.

One of the drawbacks of this intensive use of video is the large accumulation of raw footage. While it is easy to transfer video to a computer, there still exists a need for editing, organization, and effective dissemination. Without manual intervention, the raw footage is merely a serial stream of video data; yet its content is rich in information that should be indexed and searched much like a textbook. Indices and tables of contents for videos require the detection of structure, from which hierarchies of contextual units can be derived. Structure is determined from segmentation of the various media used in video (imagery, audio, text, etc.), with different emphasis placed on each track depending on the type of video content. Research problems include analysis of the types of video and their features, segmentation and clustering of audio by speakers, transcript extraction and filtering from audio, segmentation and clustering of visual content, interactive visualization approaches for structure in video, and effective distribution of content to viewers.

This work contributes new methods and approaches of multi-modal video analysis for personal interaction videos. One of the genres considered (lecture video) has been explored in many prior works, while two additional genres (presentation and discussion video) are newly introduced, compared, and contrasted. The findings and implementation of this work have widespread application, in particular in environments where vigorous personal interaction plays an important role, such as the team-oriented university classroom.

1.2 Background

Personal interaction videos are rich in content but typically lack frequent action events, which are common in the news, sports, and film genres. They are for the most part unedited and often contain long sequences during which a single topic is covered. Determining the structure of their contents relies on content analysis to identify contextually coherent units and their temporally recurring instances.
Figure 1: Classification of content modeling techniques by Bashir and Khokhar [1]. Level I: modeling of raw video data; Level II: representation of derived or logical features; Level III: semantic-level abstractions.

This area of content-based video indexing and retrieval (CBVIR) builds on analysis of multi-modal sources, including imagery, motion in video, audio, text from speech, text from image, and text from other sources, to name a few. Substantial research has been carried out in these fields, for the most part isolated to unique problems in one medium. Bashir and Khokhar [1] provide a hierarchical overview of CBVIR (see Figure 1), in which analysis of a medium falls into one of three levels: low-level features from signal processing, semantic representations from computer vision methodologies, and high-level intelligent reasoning drawing on AI, psychology, and philosophy. While their figure focuses solely on imagery in video, other media such as audio and text share a similar hierarchy. An approach to a complete analysis would draw techniques from all three levels: segmentation of video and audio signals using low-level features, determination of content similarity at the semantic level, and presentation of the summary at the high level.

Segmentation of video into shots tends to be based on low-level features, such as histogram changes, MPEG motion vectors, Gabor energies, textures, etc. The Cornell Lecture Browser [2] uses histograms to detect presentation slide changes; Smith [3] and Yang [4] use them to detect cuts in news and other non-presentation videos; and Haubold [5] uses them to detect presentation changes and significant speaker pose changes (a sketch of this style of detection follows below). Feature vectors built from such low-level features are also used in statistical approaches to segmentation and in machine learning methods for shot classification. Souvannavong [6] applies Latent Semantic Indexing (LSI), a well-known method in text analysis discussed by Landauer [7], to the clustering of shots in news videos (see the second sketch below). Dorai [8] uses low-level feature vectors to train classifiers for video shot types, such as blackboard/whiteboard, narrator, and
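To make the histogram-based approach concrete, the following is a minimal sketch of cut/slide-change detection that compares coarse color histograms of consecutive frames. It is not the implementation of any of the cited systems; the use of OpenCV, the Bhattacharyya distance, and the threshold value are illustrative assumptions.

    # Minimal sketch of histogram-based change detection (assumed setup,
    # not the cited systems' code). Requires: pip install opencv-python
    import cv2

    def detect_changes(video_path, threshold=0.4):
        """Return indices of frames whose histogram differs sharply from the previous frame."""
        cap = cv2.VideoCapture(video_path)
        prev_hist = None
        change_frames = []
        frame_idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # Coarse HSV histogram; coarse bins make the comparison robust to noise.
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
            cv2.normalize(hist, hist)
            if prev_hist is not None:
                # Bhattacharyya distance in [0, 1]; large values mean dissimilar frames.
                dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
                if dist > threshold:
                    change_frames.append(frame_idx)  # candidate cut / slide change
            prev_hist = hist
            frame_idx += 1
        cap.release()
        return change_frames

In practice, the resulting candidate indices would be smoothed or merged into shot boundaries; the cited systems differ mainly in which features they compare and in how thresholds are chosen.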
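The second sketch illustrates LSI-style shot clustering in the spirit of [6]: each shot's description is projected into a low-rank latent semantic space via truncated SVD, and shots are clustered in that space. The toy shot texts, rank, and cluster count below are assumptions for illustration; whether the underlying features are textual or quantized visual terms, the mechanics are the same.

    # Minimal sketch of LSI-based shot clustering (illustrative assumptions,
    # not the method of [6]). Requires: pip install scikit-learn
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    # Hypothetical per-shot "documents" (e.g., transcript fragments).
    shot_texts = [
        "opening remarks agenda introduction",
        "slide agenda schedule introduction",
        "demo source code compiler",
        "compiler optimization code demo",
    ]

    # Term-document matrix -> rank-k latent semantic space -> k-means clusters.
    tfidf = TfidfVectorizer().fit_transform(shot_texts)
    latent = TruncatedSVD(n_components=2).fit_transform(tfidf)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(latent)
    print(labels)  # shots with similar latent topics share a cluster id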