Visual Language Perception from Videos


  1. Visual Language Perception from Videos
     Mohit Gupta
     Advisor: Amitabha Mukerjee

  2. Introduction and Motivation
     - Humans process and store what they perceive in a highly abstracted, condensed format
       - E.g. …
     - Computers, on the other hand, are much less efficient at this
     - Possibilities if computers could condense perception
       - Significant drop in information size (lower memory requirement)
       - "Show me who the villain is in this movie and when he enters" becomes a valid question for a computer
     - Absolutely new; no similar work has been done

  3. Methodology
     - Scene Segmentation
       - Using the change-in-histogram method
       - Heuristic for the start or end of a speech-silence boundary
       - Strong heuristic for a change in speaker
     - Sound Segmentation
       - Classifying voice, silence and miscellaneous (music, audience laughing etc.)
       - Thresholding energy of signal, zero-crossing rate, pitch detection by the YIN algorithm
       - Diarization of voices (separating voices of different speakers)
       - Voice features like MFCCs are most significant for speaker recognition
     - Associating faces with speech
       - Detect faces in frames containing speech using Haar-based features
       - Tag a face with the speech stream for a speaker based on a majority-first approach
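The change-in-histogram idea for scene segmentation can be sketched roughly as below. This is a minimal illustration, not the deck's implementation: frames are assumed to be flat sequences of 8-bit grayscale values, and the function names, the bin count and the `threshold` value are all hypothetical choices.

```python
from typing import List, Sequence


def gray_histogram(frame: Sequence[int], bins: int = 16) -> List[float]:
    """Normalized histogram of 8-bit grayscale pixel values (frame must be non-empty)."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame)
    return [c / total for c in counts]


def scene_cuts(frames: Sequence[Sequence[int]], bins: int = 16,
               threshold: float = 0.5) -> List[int]:
    """Return indices i where frame i is assumed to start a new scene.

    A cut is declared when the L1 distance between the histograms of
    consecutive frames (a value in [0, 2]) exceeds `threshold`.
    The threshold is an assumed value, not one taken from the slides.
    """
    cuts: List[int] = []
    if not frames:
        return cuts
    prev = gray_histogram(frames[0], bins)
    for i in range(1, len(frames)):
        cur = gray_histogram(frames[i], bins)
        distance = sum(abs(a - b) for a, b in zip(prev, cur))
        if distance > threshold:
            cuts.append(i)
        prev = cur
    return cuts


# Toy input: three dark frames followed by two bright ones.
dark = [10] * 64
bright = [240] * 64
frames = [dark, dark, dark, bright, bright]
print(scene_cuts(frames))  # [3]
```

On real video the frames would come from a decoder and the threshold would need tuning per source.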

  4. Methodology
     - Sound Segmentation
       - Classifying voice, silence and miscellaneous (music, audience laughing etc.)
       - Thresholding energy of signal, zero-crossing rate, pitch detection
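A rough sketch of thresholding short-time energy and zero-crossing rate to label audio frames is given below. The decision rule and both thresholds (`energy_floor`, `zcr_ceiling`) are assumptions made for illustration; the slides do not state the actual values, and the pitch-detection step is left out here.

```python
import math
from typing import List, Sequence

SAMPLE_RATE = 8000  # assumed sampling rate for the toy signals below


def short_time_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def classify_frame(frame: Sequence[float], energy_floor: float = 1e-4,
                   zcr_ceiling: float = 0.1) -> str:
    """Label a frame as 'silence', 'voice' or 'misc' with two assumed thresholds."""
    if short_time_energy(frame) < energy_floor:
        return "silence"
    # Voiced speech tends to have a low zero-crossing rate.
    if zero_crossing_rate(frame) <= zcr_ceiling:
        return "voice"
    return "misc"


def tone(freq_hz: float, n: int = 400, amp: float = 0.5) -> List[float]:
    """Synthetic sine tone used as a stand-in for real audio."""
    return [amp * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


silence = [0.0] * 400
labels = [classify_frame(f) for f in (silence, tone(150), tone(3000))]
print(labels)  # ['silence', 'voice', 'misc']
```

Two thresholds alone are coarse (low-pitched music would pass as voice), which is presumably why the slide also lists pitch detection as a third cue.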

  6. Methodology
     - Associating faces with speech
       - Detect faces in frames containing speech
       - Using acquired speech boundaries and detecting faces in each segment
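One possible reading of the majority-first tagging mentioned in the methodology is a greedy assignment: count how often each face appears while each speaker is talking, then commit the strongest speaker-face pairs first. The sketch below follows that assumed interpretation; the function name and the `(speaker, face)` observation format are hypothetical, and the labels are taken as given rather than produced by a real diarizer or face detector.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def tag_faces(observations: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Greedily map each speaker to one face, strongest co-occurrence first.

    `observations` holds one (speaker_label, face_label) pair for every
    frame in which that face was detected while that speaker was talking.
    """
    counts = Counter(observations)
    assignment: Dict[str, str] = {}
    used_faces: Set[str] = set()
    # Highest counts first; ties broken by label order for determinism.
    for (speaker, face), _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if speaker not in assignment and face not in used_faces:
            assignment[speaker] = face
            used_faces.add(face)
    return assignment


observations = [
    ("spk_A", "face_1"), ("spk_A", "face_1"), ("spk_A", "face_2"),
    ("spk_B", "face_1"), ("spk_B", "face_2"), ("spk_B", "face_2"),
]
print(tag_faces(observations))  # {'spk_A': 'face_1', 'spk_B': 'face_2'}
```

Committing the clearest majorities first keeps an occasional reaction shot of a listener from stealing the speaker's tag.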

  7. Subtitles and speech
     - Boundaries in the pitch plot also separate words, with high recall but low precision
     - Subtitle alignment in the small-error domain is achieved by maximizing the number of common pitch-subtitle boundaries
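Maximizing the common pitch-subtitle boundaries can be illustrated with a small offset search. This sketch assumes the subtitle track is off by a single constant shift, that boundaries are given in integer milliseconds, and that two boundaries count as common when they lie within a tolerance; the function names, search range, step and tolerance are all hypothetical.

```python
from typing import Sequence


def count_common(pitch_bounds: Sequence[int], sub_bounds: Sequence[int],
                 offset_ms: int, tol_ms: int) -> int:
    """Number of shifted subtitle boundaries that land near a pitch boundary."""
    return sum(
        1 for s in sub_bounds
        if any(abs(s + offset_ms - p) <= tol_ms for p in pitch_bounds)
    )


def best_offset_ms(pitch_bounds: Sequence[int], sub_bounds: Sequence[int],
                   max_shift_ms: int = 2000, step_ms: int = 100,
                   tol_ms: int = 50) -> int:
    """Search small shifts and return the one with the most common boundaries.

    Ties are resolved in favour of the smaller absolute shift.
    """
    candidates = range(-max_shift_ms, max_shift_ms + 1, step_ms)
    return max(
        candidates,
        key=lambda off: (count_common(pitch_bounds, sub_bounds, off, tol_ms), -abs(off)),
    )


# Pitch boundaries include one spurious extra (high recall, low precision).
pitch_bounds = [1000, 2500, 4200, 6000, 6100]
# Subtitle boundaries run 300 ms early in this toy example.
sub_bounds = [700, 2200, 3900, 5700]
print(best_offset_ms(pitch_bounds, sub_bounds))  # 300
```

Because only subtitle boundaries are scored, extra pitch boundaries do not hurt the match, which fits the high-recall, low-precision behaviour noted on the slide.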

  8. Applications
     - Surround Sound Effects
       - Using the knowledge of who is speaking in a frame and the location of the speaker's face
       - Background sounds separated from speech and attenuated to emphasize the vocals
     - Information abstraction and retrieval
       - Efficiency in memory usage
       - Model voice, face and scene; use text to produce speech and video on the fly
       - Asking the computer to seek the video to the instant the villain is first seen
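As an illustration of the surround-sound application, the speaker's horizontal face position could drive a stereo pan. The sketch below uses a standard constant-power pan law, which is an assumption on my part; the slides do not say how the effect is rendered, and the function name and parameters are hypothetical.

```python
import math
from typing import Tuple


def pan_gains(face_center_x: float, frame_width: float) -> Tuple[float, float]:
    """Return (left_gain, right_gain) for a face at the given x position.

    Uses a constant-power pan law: the squared gains always sum to 1,
    so perceived loudness stays roughly constant as the face moves.
    """
    position = min(max(face_center_x / frame_width, 0.0), 1.0)  # clamp to [0, 1]
    angle = position * math.pi / 2
    return math.cos(angle), math.sin(angle)


left, right = pan_gains(face_center_x=320, frame_width=640)
print(round(left, 4), round(right, 4))  # 0.7071 0.7071
```

A face at the left edge yields gains of (1.0, 0.0), so the speech stream tagged to that face would play only from the left channel.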
