IEEE International Workshop on Content-based Access of Image and Video Databases (ICCV98 - Bombay, India)

Video Skimming and Characterization through the Combination of Image and Language Understanding

Michael A. Smith    Takeo Kanade
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
{msmith, tk}@cs.cmu.edu

Abstract

Digital video is rapidly becoming important for education, entertainment, and a host of multimedia applications. With the size of video collections growing to thousands of hours, technology is needed to effectively browse segments in a short time without losing the content of the video. We propose a method to extract the significant audio and video information and create a "skim" video which represents a very short synopsis of the original. The goal of this work is to show the utility of integrating language and image understanding techniques for video skimming by extraction of significant information, such as specific objects, audio keywords and relevant video structure. The resulting skim video is much shorter, with compaction as high as 20:1, and yet retains the essential content of the original segment. We have conducted a user study to test the content summarization and effectiveness of the skim as a browsing tool.

1 Introduction

With increased computing power and electronic storage capacity, the potential for large digital video libraries is growing rapidly. These libraries, such as the Informedia™ Project at Carnegie Mellon [7], [14], will make thousands of hours of video available to a user. For many users, the video of interest is not always a full-length film. Unlike video-on-demand, video libraries should provide informational access in the form of brief, content-specific segments as well as full-featured videos.

Even with intelligent content-based search algorithms being developed [5], [11], multiple video segments will be returned for a given query to ensure retrieval of pertinent information. Users will often need to view all the segments to make their final selections. Instead, the user will want to "skim" the relevant portions of video for the segments related to their query.

1.1 Browsing Digital Video

Simplistic browsing techniques, such as fast-forward playback and skipping video frames at fixed intervals, reduce video viewing time. However, fast playback perturbs the audio and distorts much of the image information [2], and displaying video sections at fixed intervals merely gives a random sample of the overall content. Another idea is to present a set of "representative" video frames (e.g., keyframes in motion-based encoding) simultaneously on a display screen. While useful and effective, such static displays miss an important aspect of video: video contains audio information. It is critical to use and present audio information, as well as image information, for browsing. Recently, researchers have proposed browsing representations based on information within the video [8], [9], [10], [16]. These systems rely on the motion in a scene, the placement of scene breaks, or image statistics such as color and shape, but they do not make integrated use of image and language understanding.

Research at AT&T Research Laboratories has shown promising results for video summaries when closed-captions are used with statistical visual attributes [15]. A separate group at the University of Mannheim has proposed a system for generating video abstracts [17]. This work uses similar statistics to characterize images, and audio frequency analysis to detect dialogue scenes. These systems analyze the image and audio information, but they do not extract content-specific portions of the video.

An ideal browser would display only the video pertaining to a segment's content, suppressing irrelevant data. It would show less video than the original and could be used to sample many segments without viewing each in its entirety. The amount of content displayed should be adjustable so the user can view as much or as little video as needed, from extremely compact to full-length video. The audio portion of this video should also consist of the significant audio or spoken words, instead of simply using the synchronized portion corresponding to the selected video frames.

[Figure 1 image omitted; it shows per-frame tracks for a sample segment: scene changes, camera motion (pan/zoom/static), object detection, text detection, TF-IDF weight, and transcript.]
Figure 1: Video Characterization: keywords, scene breaks, camera motion, significant objects (faces and text).

1.2 Video Skims

The critical aspect of compacting a video is context understanding, which is the key to choosing the "significant images and words" that should be included in the skim video. We characterize the significance of video through the integration of image and language understanding. Segment breaks produced by image processing can be examined along with boundaries of topics identified by the language processing of the transcript. The relative importance of each scene can be evaluated by 1) the objects that appear in it, 2) the associated words, and 3) the structure of the video scene. The integration of language and image understanding is needed to realize this level of characterization and is essential to skim creation.

In the sections that follow, we describe the technology involved in video characterization from audio and images embedded within the video, and the process of integrating this information for skim creation.

2 Video Characterization

Through techniques in image and language understanding, we can characterize scenes, segments, and individual frames in video. Figure 1 illustrates the characterization of a segment taken from a video titled "Destruction of Species", from WQED Pittsburgh. At the moment, language understanding entails identifying the most significant words in a given scene; for image understanding, it entails segmentation of video into scenes, detection of objects of importance (faces and text), and identification of the structural motion of a scene.

2.1 Audio and Language Characterization

Language analysis works on the transcript to identify important audio regions known as "keywords". We use the well-known technique of TF-IDF (Term Frequency Inverse Document Frequency) to measure the relative importance of words for the video document [5]:

    TF-IDF = f_s / f_c    (1)

The TF-IDF of a word is its frequency in a given scene, f_s, divided by the frequency, f_c, of its appearance in a standard corpus. Words that appear often in a particular segment, but relatively infrequently in a standard corpus, receive the highest TF-IDF weights. A threshold is set to extract keywords, as shown in the bottom rows of Figure 1.

Our experiments have shown that using individual keywords creates an audio skim which is fragmented and incomprehensible for some speakers. To increase comprehension, we use longer audio sequences, "keyphrases", in the audio skim. A keyphrase may be obtained by starting with a keyword and extending its boundaries to areas of silence or neighboring keywords. Another method for extracting significant audio is to segment actual phrases. To detect breaks between utterances we use a modification
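As a concrete illustration of the TF-IDF weighting and thresholding of Eq. (1), the following Python sketch scores each transcript word by its scene frequency over its corpus frequency and keeps those above a cutoff. The word lists, corpus counts, and threshold value are invented for illustration; the paper does not specify them:

```python
from collections import Counter

def tfidf_keywords(scene_words, corpus_freq, threshold=1.0):
    """Score each word by TF-IDF = f_s / f_c (Eq. 1) and keep
    words whose score exceeds the threshold."""
    scene_freq = Counter(scene_words)      # f_s: frequency in the scene
    keywords = {}
    for word, fs in scene_freq.items():
        fc = corpus_freq.get(word, 1)      # f_c: frequency in a standard corpus
        score = fs / fc
        if score > threshold:
            keywords[word] = score
    return keywords

# Made-up counts: "habitat" is rare in the corpus, so it scores high;
# common function words like "the" are filtered out.
transcript = ["the", "habitat", "loss", "the", "habitat", "species"]
corpus = {"the": 1000, "habitat": 1, "loss": 3, "species": 2}
print(tfidf_keywords(transcript, corpus))  # → {'habitat': 2.0}
```

In practice f_c would come from a large standard corpus rather than a hand-built table, and the threshold would be tuned per video.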
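The keyphrase rule above (grow a keyword's boundaries out to areas of silence or to neighboring keywords) could be sketched as follows. The per-word timestamps and the 0.5 s silence gap are hypothetical; the paper does not give its exact boundary criteria:

```python
def extend_keyphrase(words, keyword_idx, keyword_set, silence_gap=0.5):
    """Grow a keyphrase around words[keyword_idx], extending left and right
    until a silence gap or a neighboring keyword is reached.
    `words` is a list of (word, start_sec, end_sec) tuples."""
    lo = hi = keyword_idx
    # Extend left while the previous word is contiguous and not itself a keyword.
    while lo > 0:
        prev_word, _, prev_end = words[lo - 1]
        if words[lo][1] - prev_end >= silence_gap or prev_word in keyword_set:
            break
        lo -= 1
    # Extend right symmetrically.
    while hi < len(words) - 1:
        next_word, next_start, _ = words[hi + 1]
        if next_start - words[hi][2] >= silence_gap or next_word in keyword_set:
            break
        hi += 1
    return [w for w, _, _ in words[lo:hi + 1]]

# Example: the keyphrase around "habitat" grows over contiguous neighbors
# and stops at the 0.8 s pause before "today".
timed = [("the", 0.0, 0.2), ("fragile", 0.25, 0.7), ("habitat", 0.75, 1.2),
         ("is", 1.25, 1.4), ("shrinking", 1.45, 2.0), ("today", 2.8, 3.1)]
print(extend_keyphrase(timed, 2, {"habitat"}))
# → ['the', 'fragile', 'habitat', 'is', 'shrinking']
```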
