Visual Language Perception from Videos
Mohit Gupta
Advisor: Amitabha Mukerjee
Introduction and Motivation
- Humans process and store what they perceive in a highly abstracted, condensed format
  - E.g. …
- Computers, on the other hand, are much less efficient at this
- Possibilities if computers could condense perception
  - Significant drop in information size (lower memory requirement)
  - "Show me who the villain in this movie is and when he enters" becomes a valid question for a computer
- Entirely new; no similar work has been done
Methodology
- Scene segmentation
  - Uses the change-in-histogram method
  - Acts as a heuristic for the start or end of a speech-silence boundary
  - Acts as a strong heuristic for a change in speaker
- Sound segmentation
  - Classifying voice, silence and miscellaneous (music, audience laughter, etc.)
  - Thresholding the signal energy and zero-crossing rate; pitch detection by the YIN algorithm
  - Diarization of voices (separating the voices of different speakers)
    - Voice features such as MFCCs are the most significant for speaker recognition
- Associating faces with speech
  - Detect faces in frames containing speech using Haar-based features
  - Tag a face with the speech stream of a speaker using a majority-first approach
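The change-in-histogram idea for scene segmentation can be sketched as follows. This is only an illustration, not the implementation used in this work: it assumes each frame is already available as a flat sequence of 8-bit grayscale values, and the function names, the bin count and the cut threshold are all hypothetical choices.

```python
from typing import List, Sequence


def gray_histogram(frame: Sequence[int], bins: int = 16) -> List[float]:
    """Normalized intensity histogram of one frame (pixel values 0..255)."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Half the L1 distance between two normalized histograms (range 0..1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def scene_cuts(frames: Sequence[Sequence[int]],
               bins: int = 16,
               threshold: float = 0.5) -> List[int]:
    """Indices of frames whose histogram differs strongly from the previous one.

    The threshold is an assumed value; it would need tuning on real video.
    """
    cuts = []
    previous = None
    for index, frame in enumerate(frames):
        current = gray_histogram(frame, bins)
        if previous is not None and histogram_distance(previous, current) > threshold:
            cuts.append(index)
        previous = current
    return cuts


# Toy example: two dark frames followed by two bright frames.
dark, bright = [10] * 64, [240] * 64
cut_indices = scene_cuts([dark, dark, bright, bright])
```

Each reported index marks the first frame of a new scene, which is then a candidate position for a speech-silence boundary or a change of speaker.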
Methodology: Sound Segmentation
- Classifying voice, silence and miscellaneous (music, audience laughter, etc.)
- Thresholding the signal energy and zero-crossing rate; pitch detection
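A minimal sketch of per-frame thresholding on energy and zero-crossing rate is given below. The slides do not state the decision rule or the threshold values, so both are assumptions made for illustration: low energy is treated as silence, and among the remaining frames a high zero-crossing rate is treated as "miscellaneous". Pitch detection (the YIN step) is not included.

```python
import math
from typing import Sequence


def short_time_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / max(len(frame) - 1, 1)


def classify_frame(frame: Sequence[float],
                   energy_threshold: float = 1e-3,
                   zcr_threshold: float = 0.3) -> str:
    """Label a frame as 'silence', 'voice' or 'miscellaneous'.

    Both thresholds and the rule itself are assumed, not taken from the slides.
    """
    if short_time_energy(frame) < energy_threshold:
        return "silence"
    if zero_crossing_rate(frame) > zcr_threshold:
        return "miscellaneous"
    return "voice"


# Toy frames of 160 samples (20 ms at an assumed 8 kHz sampling rate).
silent_frame = [0.0] * 160
tone_frame = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
noisy_frame = [0.5, -0.5] * 80

labels = [classify_frame(f) for f in (silent_frame, tone_frame, noisy_frame)]
```

In practice the per-frame labels would be smoothed over time before segment boundaries are read off, since single-frame decisions are noisy.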
Methodology: Associating Faces with Speech
- Detect faces in frames containing speech
- Use the acquired speech boundaries and detect faces within each segment
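One way to read the majority-based tagging of faces to speakers is sketched below. It assumes that speech segments already carry a speaker label from diarization and that a face detector has produced face identifiers per frame; the data layout, the names, and the "strongest majority is resolved first" interpretation are all assumptions rather than the method used in this work.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# (start_frame, end_frame_exclusive, speaker_label): a hypothetical layout
Segment = Tuple[int, int, str]


def tag_faces(segments: Sequence[Segment],
              faces_per_frame: Dict[int, List[str]]) -> Dict[str, str]:
    """Assign each speaker the face seen most often while that speaker talks."""
    votes: Dict[str, Counter] = {}
    for start, end, speaker in segments:
        counter = votes.setdefault(speaker, Counter())
        for frame in range(start, end):
            counter.update(faces_per_frame.get(frame, []))

    def majority_share(speaker: str) -> float:
        counter = votes[speaker]
        total = sum(counter.values())
        return max(counter.values()) / total if total else 0.0

    assignment: Dict[str, str] = {}
    taken = set()
    # Resolve the speaker with the clearest majority first, so that an
    # ambiguous speaker cannot claim a face that clearly belongs to another.
    for speaker in sorted(votes, key=majority_share, reverse=True):
        for face, _count in votes[speaker].most_common():
            if face not in taken:
                assignment[speaker] = face
                taken.add(face)
                break
    return assignment


# Toy example: two speakers, two faces, some frames showing both faces.
segments = [(0, 3, "speaker_a"), (3, 6, "speaker_b")]
faces_per_frame = {
    0: ["face_1"], 1: ["face_1"], 2: ["face_1", "face_2"],
    3: ["face_2"], 4: ["face_2", "face_1"], 5: ["face_2"],
}
speaker_to_face = tag_faces(segments, faces_per_frame)
```

A speaker whose segments contain no detected faces is simply left out of the result.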
Subtitles and Speech
- The pitch plot also separates words, with high recall but low precision
- Subtitle alignment in the small-error domain is achieved by maximizing the number of common pitch-subtitle boundaries
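The alignment step can be illustrated as a search over small time shifts for the one that makes the most subtitle boundaries coincide with pitch-derived boundaries. The sketch below is an assumed formulation, not the implementation used here: times are integer milliseconds, and the search window, step, tolerance and tie-breaking rule are hypothetical. Matching is counted from the subtitle side, so spurious pitch boundaries (the low precision noted above) do not hurt the score.

```python
from typing import Sequence, Tuple


def alignment_score(pitch_ms: Sequence[int],
                    subtitle_ms: Sequence[int],
                    offset_ms: int,
                    tolerance_ms: int) -> Tuple[int, int]:
    """Return (matched boundary count, negated total residual) for one shift."""
    matches, residual = 0, 0
    for boundary in subtitle_ms:
        nearest = min(abs(boundary + offset_ms - p) for p in pitch_ms)
        if nearest <= tolerance_ms:
            matches += 1
            residual += nearest
    return matches, -residual


def best_offset(pitch_ms: Sequence[int],
                subtitle_ms: Sequence[int],
                max_shift_ms: int = 2000,
                step_ms: int = 50,
                tolerance_ms: int = 100) -> int:
    """Shift (in ms) that maximizes common boundaries within a small window.

    Ties in the match count are broken in favour of the tighter fit.
    """
    candidates = range(-max_shift_ms, max_shift_ms + 1, step_ms)
    return max(
        candidates,
        key=lambda off: alignment_score(pitch_ms, subtitle_ms, off, tolerance_ms),
    )


# Toy example: subtitles run 500 ms early; 4600 is a spurious pitch boundary.
pitch_boundaries = [1000, 2500, 4000, 4600, 6200]
subtitle_boundaries = [500, 2000, 3500, 5700]
shift = best_offset(pitch_boundaries, subtitle_boundaries)
```

The search is restricted to a small window because, as stated above, the approach targets the small-error domain; a large initial misalignment would need a different strategy.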
Applications
- Surround sound effects
  - Use the knowledge of who is speaking in a frame and the location of the speaker's face
  - Background sounds are separated from speech and attenuated to emphasize the vocals
- Information abstraction and retrieval
  - Efficiency in memory usage
  - Model voice, face and scene; use text to produce speech and video on the fly
  - Ask the computer to seek the video to the instant the villain is first seen