SLIDE 1

Visual Language Perception from Videos

MOHIT GUPTA ADVISOR: AMITABHA MUKERJEE

SLIDE 2

Introduction and Motivation

  • Humans process and store what they perceive in a highly abstracted, condensed format

  • For e.g. …
  • Computers, on the other hand, are much less efficient at this
  • Possibilities if computers could condense perception:
  • Significant drop in information size (lower memory requirement)
  • ‘Show me who the villain is in this movie and when he enters’ will become a valid question for a computer

  • Absolutely new; no similar work has been done
SLIDE 3

Methodology

  • Scene Segmentation
  • Using the change-in-histogram method
  • A scene change is a heuristic for the start or end of a speech-silence boundary
  • It is also a strong heuristic for a change of speaker
  • Sound Segmentation
  • Classifying voice, silence and miscellaneous (music, audience laughing etc.)
  • Thresholding signal energy and zero-crossing rate; pitch detection by the YIN algorithm
  • Diarization of voices (separating voices of different speakers)
  • Voice features such as MFCCs are the most significant for speaker recognition
  • Associating faces with speech
  • Detect faces in frames containing speech using Haar-like features
  • Tag each face with the speech stream of a speaker based on a majority-first approach
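The change-in-histogram idea used for scene segmentation can be sketched as follows. This is a minimal illustration, not the presenter's implementation: it assumes each frame is already available as a flat list of 8-bit grayscale pixel values, and the bin count, the L1 distance, and the threshold of 0.5 are all assumptions chosen for the toy data.

```python
from typing import List, Sequence


def gray_histogram(frame: Sequence[int], bins: int = 16) -> List[float]:
    """Normalized histogram of 8-bit grayscale pixel values."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """L1 distance between two normalized histograms (range 0 to 2)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_scene_cuts(frames, threshold: float = 0.5, bins: int = 16) -> List[int]:
    """Return every index i at which frame i differs sharply from frame i - 1."""
    cuts = []
    prev = None
    for i, frame in enumerate(frames):
        hist = gray_histogram(frame, bins)
        if prev is not None and histogram_distance(prev, hist) > threshold:
            cuts.append(i)
        prev = hist
    return cuts


# Toy data (hypothetical): two dark frames, two bright frames, one dark frame.
dark = [10, 20, 30, 15] * 4
bright = [200, 220, 240, 210] * 4
frames = [dark, dark, bright, bright, dark]
cuts = detect_scene_cuts(frames)
print(cuts)  # [2, 4]
```

On real video the threshold would need tuning, since gradual transitions such as fades change the histogram slowly and can fall below a fixed cut-off.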
SLIDE 4

Methodology

  • Sound Segmentation
  • Classifying voice, silence and miscellaneous (music, audience laughing etc.)
  • Thresholding signal energy, zero-crossing rate, and detected pitch
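A rough sketch of thresholding on energy and zero-crossing rate is shown below. It is an illustration under stated assumptions, not the presenter's classifier: the two threshold values are made up, the pitch feature is left out entirely, and treating "enough energy plus low zero-crossing rate" as voice is a simplification that only captures voiced speech.

```python
import math


def short_time_energy(frame):
    """Mean squared amplitude of one audio frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def classify_frame(frame, energy_thresh=1e-3, zcr_thresh=0.25):
    """Label a frame as 'silence', 'voice', or 'misc' (thresholds are assumed)."""
    if short_time_energy(frame) < energy_thresh:
        return "silence"
    if zero_crossing_rate(frame) < zcr_thresh:
        return "voice"
    return "misc"


# Toy frames (hypothetical), 400 samples each at an assumed 8 kHz sample rate.
silent = [0.0] * 400
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
noisy = [0.5 if n % 2 == 0 else -0.5 for n in range(400)]
labels = [classify_frame(f) for f in (silent, tone, noisy)]
```

Here the 200 Hz tone stands in for voiced speech and the rapidly alternating signal for noise-like content; consecutive frames with the same label would then be merged into segments.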
SLIDE 5

Methodology

  • Sound Segmentation
  • Classifying voice, silence and miscellaneous (music, audience laughing etc.)
  • Thresholding signal energy, zero-crossing rate, and detected pitch
SLIDE 6

Methodology

  • Associating faces with speech
  • Detect faces in frames containing speech
  • Use the acquired speech boundaries and detect faces within each segment
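One possible way to associate faces with speakers, once faces have been detected in each speech segment, is a greedy vote. The sketch below is an assumption about how such an association could work, not the method used in the slides: the speaker labels and face identifiers are hypothetical inputs (for example from diarization and a face detector), and the strongest speaker-face pairing is simply fixed first.

```python
from collections import Counter


def associate_faces(segments):
    """Greedily map each speaker to one face.

    segments: list of (speaker_id, [face_id, ...]) pairs, one per speech
    segment, listing the faces seen in that segment's frames.
    """
    votes = {}
    for speaker, faces in segments:
        votes.setdefault(speaker, Counter()).update(faces)

    assignment = {}
    taken = set()
    remaining = set(votes)
    while remaining:
        best = None  # (share of votes, speaker, face)
        for speaker in sorted(remaining):
            counter = votes[speaker]
            total = sum(counter.values())
            for face, count in sorted(counter.items()):
                if face in taken:
                    continue
                share = count / total
                if best is None or share > best[0]:
                    best = (share, speaker, face)
        if best is None:
            break  # no unassigned face is left for the remaining speakers
        _, speaker, face = best
        assignment[speaker] = face
        taken.add(face)
        remaining.discard(speaker)
    return assignment


# Hypothetical detections: face "A" dominates spk1's segments, "B" spk2's.
segments = [
    ("spk1", ["A", "A", "B"]),
    ("spk2", ["A", "B", "B", "B"]),
    ("spk1", ["A"]),
]
mapping = associate_faces(segments)
```

Fixing the clearest pairing first keeps a face that appears in many shots (such as a listener in frame) from being assigned to more than one speaker.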
SLIDE 7

Subtitles and speech

  • The pitch plot also separates words with high recall but low precision
  • Subtitle alignment in the small-error domain was achieved by maximizing the number of common pitch-subtitle boundaries
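Maximizing the common pitch-subtitle boundaries can be sketched as a small search over time offsets. This is only an illustration under assumptions: boundaries are given as integer milliseconds, the subtitles are off by a single constant shift, and the search range, step, and tolerance are invented values.

```python
def count_matches(pitch_bounds, sub_bounds, offset_ms, tol_ms):
    """Count subtitle boundaries that land within tol_ms of a pitch boundary."""
    return sum(
        1
        for s in sub_bounds
        if any(abs(s + offset_ms - p) <= tol_ms for p in pitch_bounds)
    )


def best_offset(pitch_bounds, sub_bounds, max_shift_ms=2000, step_ms=50, tol_ms=40):
    """Return the shift (ms) that maximizes matched boundaries.

    Ties are broken in favour of the smallest absolute shift.
    """
    candidates = range(-max_shift_ms, max_shift_ms + 1, step_ms)
    return max(
        candidates,
        key=lambda o: (count_matches(pitch_bounds, sub_bounds, o, tol_ms), -abs(o)),
    )


# Hypothetical data: pitch boundaries include extras (high recall, low
# precision); the subtitle boundaries are 300 ms early.
pitch_bounds = [1000, 1400, 2300, 3100, 3900, 4500, 5200]
sub_bounds = [700, 2000, 3600, 4900]
offset = best_offset(pitch_bounds, sub_bounds)
print(offset)  # 300
```

The extra pitch boundaries do not hurt here because only subtitle boundaries are counted, which is why high recall matters more than precision for this step.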

SLIDE 8

Applications

  • Surround Sound Effects
  • Using the knowledge of who is speaking in a frame and the location of the speaker's face
  • Background sounds separated from speech and attenuated to bring out the vocals

  • Information abstraction and retrieval
  • Efficiency in memory usage
  • Model voice, face and scene; use text to produce speech and video on the fly
  • Asking the computer to seek the video to the instant the villain is first seen
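For the surround sound application, one simple way to use the speaker's face location is to pan the speech between two channels. The sketch below uses constant-power stereo panning, which is an assumed technique and not something the slides specify; the function names and the pixel-based interface are hypothetical.

```python
import math


def pan_gains(face_center_x, frame_width):
    """Constant-power (left, right) gains for a face at face_center_x pixels."""
    # Normalize to 0.0 (left edge) .. 1.0 (right edge), clamped to the frame.
    position = min(max(face_center_x / frame_width, 0.0), 1.0)
    angle = position * math.pi / 2
    return math.cos(angle), math.sin(angle)


def pan_samples(mono, face_center_x, frame_width):
    """Turn mono speech samples into (left, right) pairs."""
    left_gain, right_gain = pan_gains(face_center_x, frame_width)
    return [(s * left_gain, s * right_gain) for s in mono]


# A face in the middle of a 640-pixel-wide frame gets equal gains.
center_left, center_right = pan_gains(320, 640)
stereo = pan_samples([0.2, -0.4], 0, 640)
```

Constant-power panning keeps the squared gains summing to one, so the perceived loudness stays roughly steady as the face moves across the frame.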