audio indexing and retrieval

Audio Indexing and Retrieval IT6902; Semester B, 2004/2005; Leung



  1. Audio Indexing and Retrieval IT6902; Semester B, 2004/2005; Leung

  2. Audio Indexing and Retrieval
     • Motivation
     • Main Audio Features
     • Audio Classification
     • Speech Recognition
     • Music Retrieval
     • Using Audio Features for Video Indexing and Retrieval

  3. Scenarios
     • If we have an audio file of a pop singer’s concert, how can we find out when the singer is singing and when he/she is talking to the audience?
     • If we have recorded the phone conversations from many conference meeting sessions, how can we find out when a particular project XYZ was discussed and what was said about it?
     • If we have many songs in digital format, how can we search for a particular song whose title we have forgotten, when all we can do is sing a few words or hum a few notes?
     • If we want to skim a horror movie file, how can we find out where the horror scenes are?

  4. Main Audio Features
     • Time-Domain Features
       – Average Energy
       – Zero Crossing Rate
       – Silence Ratio
     • Frequency-Domain Features
       – Sound Spectrum
       – Bandwidth
       – Energy Distribution
       – Harmonicity
       – Pitch
     • Spectrogram

  5. Time-Domain Features
     • Amplitude-time representation of an audio signal

  6. Time-Domain Features (2)
     • Average Energy
       – Indicates the loudness of the audio signal
         $$E = \frac{\sum_{n=0}^{N-1} x(n)^2}{N}$$
     • Zero Crossing Rate
       – Indicates the frequency of signal amplitude sign change
         $$ZC = \frac{\sum_{n=1}^{N-1} \left| \operatorname{sgn}[x(n)] - \operatorname{sgn}[x(n-1)] \right|}{2N}$$
         $$\operatorname{sgn}(a) = \begin{cases} 1, & a > 0 \\ 0, & a = 0 \\ -1, & a < 0 \end{cases}$$
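
The two time-domain measures above can be computed directly from the samples. The following is a minimal sketch in plain Python, assuming the frame is a list of numbers and that the energy sum runs over all N samples; the function names are illustrative and not taken from the slides.

```python
def sgn(a):
    """Sign function as defined on the slide: 1, 0 or -1."""
    return (a > 0) - (a < 0)


def average_energy(x):
    """Mean of the squared sample values over the N samples of x."""
    return sum(s * s for s in x) / len(x)


def zero_crossing_rate(x):
    """Sum of |sgn x(n) - sgn x(n-1)| for n = 1..N-1, divided by 2N."""
    n_samples = len(x)
    changes = sum(abs(sgn(x[n]) - sgn(x[n - 1])) for n in range(1, n_samples))
    return changes / (2 * n_samples)


# A toy frame that changes sign at every sample.
frame = [0.5, -0.5, 0.5, -0.5]
print(average_energy(frame))      # 0.25
print(zero_crossing_rate(frame))  # 0.75
```

Each sign change contributes 2 to the sum, which is why the rate is normalised by 2N rather than N.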

  7. Time-Domain Features (3)
     • Silence Ratio
       – Indicates the proportion of the sound piece that is silent
       – Silence is a period within which the absolute amplitude values of a certain number of samples are below a certain threshold
       – Silence ratio is calculated as the ratio between the sum of silent periods and the total length of the audio piece
     [Figure: audio signal with four periods labelled "silence"]
     Approaches:
     1. Fixed Threshold
     2. Select Reference Silence Value
     3. Adaptive Silence Thresholds
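
As an illustration of the fixed-threshold approach, the sketch below counts the samples that fall inside silent periods. The parameter names `threshold` and `min_run` (the "certain number of samples") are assumptions for this example; the slides do not give concrete values.

```python
def silence_ratio(x, threshold, min_run):
    """Fraction of samples that lie in silent periods.

    A silent period is a run of at least `min_run` consecutive samples
    whose absolute amplitude is below `threshold`.
    """
    silent = 0
    run = 0
    for s in x:
        if abs(s) < threshold:
            run += 1
        else:
            if run >= min_run:
                silent += run
            run = 0
    if run >= min_run:  # close a run that reaches the end of the signal
        silent += run
    return silent / len(x)


# Hypothetical signal: two quiet runs of 3 samples and one isolated quiet sample.
signal = [0.0, 0.01, 0.0, 0.8, -0.7, 0.0, 0.9, 0.0, 0.0, 0.0]
print(silence_ratio(signal, threshold=0.05, min_run=3))  # 0.6
```

The isolated quiet sample is not counted because it is shorter than `min_run`, which keeps ordinary zero crossings from being mistaken for silence.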

  8. Frequency-Domain Features
     • Sound Spectrum
       Discrete Fourier Transform (DFT)
         $$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N}$$
       Inverse Discrete Fourier Transform (IDFT)
         $$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j 2\pi n k / N}$$
       – For large values of N, the signal is often broken into blocks called frames and the DFT is applied to each frame. This is known as the Short Time Fourier Transform (STFT)
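
The DFT and the frame-by-frame STFT can be sketched as follows. This is a direct O(N²) evaluation of the definition for illustration only (practical systems use an FFT), and it omits the window function that is usually applied to each frame. The parameter names `frame_len` and `hop` are assumptions of this example.

```python
import cmath
import math


def dft(x):
    """X(k) = sum over n of x(n) * exp(-j*2*pi*n*k/N), for k = 0..N-1."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def stft(x, frame_len, hop):
    """Apply the DFT to successive frames of length `frame_len`."""
    return [dft(x[start:start + frame_len])
            for start in range(0, len(x) - frame_len + 1, hop)]


# A cosine with exactly 2 cycles per 8-sample frame.
signal = [math.cos(2 * math.pi * 2 * n / 8) for n in range(16)]
frames = stft(signal, frame_len=8, hop=8)
magnitudes = [round(abs(c), 6) for c in frames[0]]
print(magnitudes)  # the energy sits in bins k=2 and k=6
```

For a real-valued input the spectrum is symmetric, so bin k=6 mirrors bin k=2 and only the first half of the bins carries independent information.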

  9. Frequency-Domain Features (2)
     • Bandwidth
       – indicates the frequency range of a sound
       – can be taken as the difference between the highest frequency and the lowest frequency of the non-zero spectrum components
       – “non-zero” may be defined as at least 3 dB above the silence level
     • Energy distribution
       – Signal distribution across frequency components
       – One important feature derived from the energy distribution is the centroid, which is the mid-point of the spectral energy distribution of a sound. The centroid is also called brightness
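
Both features can be derived from a power spectrum. The sketch below makes two assumptions that the slides do not spell out: "mid-point" is taken to mean the energy-weighted mean frequency (the usual definition of the spectral centroid), and the silence level is supplied by the caller as a power value. The spectrum values are hypothetical.

```python
def bin_frequencies(n_bins, frame_len, sample_rate):
    """Frequency in Hz of DFT bin k is k * sample_rate / frame_len."""
    return [k * sample_rate / frame_len for k in range(n_bins)]


def bandwidth(power, freqs, silence_power, margin_db=3.0):
    """Highest minus lowest frequency whose power is at least
    `margin_db` decibels above the silence level."""
    floor = silence_power * 10 ** (margin_db / 10)
    active = [f for f, p in zip(freqs, power) if p >= floor]
    return max(active) - min(active) if active else 0.0


def centroid(power, freqs):
    """Energy-weighted mean frequency of the spectrum."""
    total = sum(power)
    return sum(f * p for f, p in zip(freqs, power)) / total if total else 0.0


# Hypothetical power values for bins 0..4 of an 8-sample frame at 8 kHz.
power = [0.0, 1.0, 4.0, 1.0, 0.0]
freqs = bin_frequencies(5, frame_len=8, sample_rate=8000)
print(bandwidth(power, freqs, silence_power=0.1))  # 2000.0
print(centroid(power, freqs))                      # 2000.0
```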

  10. Frequency-Domain Features (3)
     • Harmonicity
       – In a harmonic sound, the spectral components are mostly whole-number multiples of the lowest, and most often loudest, frequency
       – The lowest frequency is called the fundamental frequency
       – Music is normally more harmonic than other sounds
     • Pitch
       – the distinctive quality of a sound, dependent primarily on the frequency of the sound waves produced by its source
       – only periodic sounds, such as those produced by musical instruments and the voice, give rise to a sensation of pitch
       – In practice, we use the fundamental frequency as the approximation of the pitch
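
The slides do not say how the fundamental frequency is estimated. One common option, used here purely as an assumption, is to pick the lag that maximises the autocorrelation of the signal within a plausible pitch range. The function name and the `f_min` / `f_max` search limits are illustrative.

```python
import math


def estimate_f0(x, sample_rate, f_min, f_max):
    """Return the frequency whose lag maximises the autocorrelation of x.

    Only lags corresponding to frequencies between f_min and f_max are tried.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# A pure 200 Hz tone sampled at 8 kHz (period of 40 samples).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
print(estimate_f0(tone, sample_rate, f_min=100, f_max=400))  # close to 200 Hz
```

Restricting the lag range matters: without it, multiples of the true period score almost as well and can produce octave errors.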

  11. Spectrogram
     • Time and frequency components are shown in the same representation
     [Figure: spectrogram with time and frequency axes. Source: http://www.visualizationsoftware.com/gram.html]
     Intensity: power of a frequency component at a particular time interval

  12. Audio Classification
     • Goal
       – To classify the audio into speech, music and possibly into other categories/subcategories
     • Motivation
       1. Different audio types require different processing, indexing and retrieval techniques
       2. Different audio types have different significance to different applications
       3. The audio type or class information is itself very useful to some applications
       4. The search space after classification is reduced to a particular audio class during the retrieval process

  13. Speech vs. Music

  14. Audio Classification Framework
     • Step by Step Classification
       – each feature is used individually in different classification steps
       – the order in which different features are used for classification is important, and is normally decided based on the computational complexity and the differentiating power of the different features
     • Feature Vector Based Audio Classification
       – a set of features is used together as a vector to calculate the closeness of the input to the training sets
       – theoretically more effective because multiple features are considered in the classification decision, but more computationally demanding because of the multi-dimensional feature vectors
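
A minimal sketch of the feature-vector approach follows. The slides do not specify how "closeness" is measured; this example assumes the Euclidean distance to the mean vector of each training set. The feature choice and all numeric values are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(features, training_sets):
    """Return the label whose training-set mean is closest to `features`."""
    centroids = {label: mean_vector(vs) for label, vs in training_sets.items()}
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


# Hypothetical vectors: (zero crossing rate, silence ratio, normalised centroid)
training_sets = {
    "speech": [[0.30, 0.40, 0.20], [0.35, 0.45, 0.25]],
    "music": [[0.10, 0.05, 0.50], [0.12, 0.02, 0.55]],
}
print(classify([0.28, 0.38, 0.22], training_sets))  # speech
```

In practice the features would be scaled to comparable ranges first, since otherwise the feature with the largest numeric range dominates the distance.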

  15. Step by Step Classification
     • Lu and Hankinson 1998

  16. Feature Vector Based Audio Classification
     • Scheirer and Slaney 1997
     [Figure: music vs. speech]

  17. Example Audio Classes
     • Liu and Wan 2001

  18. Audio Segmentation
     • a long sound track normally consists of a mixture of speech, music and other sound types
     • the audio piece can be segmented into speech and music intervals based on the classification scheme discussed earlier
     • Approach:
       – divide the audio piece into a number of small windows and then apply the audio classification method to determine whether each window is speech or music
       – consecutive windows are then grouped into a speech or music interval if they are of the same type
     Example window labels (M = music, S = speech): … M M S S S M M M M M … M S M
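
The grouping step can be sketched as below, assuming each window has already been labelled by a classifier. The function name and the `window_seconds` parameter are illustrative.

```python
from itertools import groupby


def group_windows(labels, window_seconds):
    """Merge runs of identical window labels into (label, start, end) intervals."""
    intervals = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        intervals.append((label,
                          position * window_seconds,
                          (position + length) * window_seconds))
        position += length
    return intervals


# Ten one-second windows: M = music, S = speech.
labels = list("MMSSSMMMMM")
print(group_windows(labels, window_seconds=1))
# [('M', 0, 2), ('S', 2, 5), ('M', 5, 10)]
```

A practical system might also smooth the label sequence first, so that a single misclassified window does not split a long interval in two.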

  19. Speech Recognition and Retrieval
     • Apply speech recognition techniques to convert speech signals into text and then apply IR techniques for indexing and retrieval
       – Speech Recognition
         • Basic concepts of Automatic Speech Recognition (ASR)
         • Variations
         • Techniques based on Hidden Markov Model (HMM)
       – Speaker Identification

  20. Basic Concepts of ASR
     • General ASR System: there are two stages of ASR:
       1. Training
          • Features of each speech unit are extracted and stored in the system
       2. Recognition
          • Features of an input speech unit are extracted and compared with each of the stored features, and the speech unit with the best matching features is taken as the recognized unit

  21. Challenges of ASR
     • Variations in different dimensions
       1. Subject
       2. Time
       3. Background or environmental noise
       4. Isolated words vs. continuous speech
       5. Read vs. spontaneous speech
       6. Size of the vocabulary

  22. Speaker Identification
     • Goal
       – find the identity of the speaker
     • can be used to determine the number of speakers in a particular setting, whether the speaker is male or female, adult or child, a person’s mood, emotional state and attitude, etc.

  23. Music Indexing and Retrieval
     • Two types of music
       1. Structured music and sound effects
       2. Sample-based music
     • Common query input form is humming, thus the term query-by-humming
       i. Retrieval based on a set of features
       ii. Retrieval based on pitch

  24. Structured Music
     • Represented by a set of commands or algorithms
     • Most common structured music is MIDI
       – MIDI is a scripting language. It codes “events” that stand for the production of sounds. E.g., a MIDI event might include values for the pitch of a single note, its duration, and its volume
     • MPEG-4 Structured Audio is a new standard for structured audio (music and sound effects)
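
To make the idea of an "event" concrete, the sketch below models a note with the three values named on the slide. This is a simplified, hypothetical structure and not the MIDI wire format: real MIDI sends separate note-on and note-off messages, so the duration is implied by the time between them. The conversion from note number to frequency assumes equal temperament with A4 (note 69) tuned to 440 Hz.

```python
from dataclasses import dataclass


@dataclass
class NoteEvent:
    pitch: int       # MIDI note number, 0-127 (69 is the A above middle C)
    duration: float  # length of the note in seconds (simplification)
    volume: int      # MIDI velocity, 0-127


def note_frequency(pitch):
    """Equal-tempered frequency in Hz, with note 69 tuned to 440 Hz."""
    return 440.0 * 2 ** ((pitch - 69) / 12)


# A three-note fragment: C4, E4, A4.
melody = [NoteEvent(60, 0.5, 90), NoteEvent(64, 0.5, 90), NoteEvent(69, 1.0, 100)]
print([round(note_frequency(e.pitch), 2) for e in melody])
# [261.63, 329.63, 440.0]
```

Because the pitch of every note is stored explicitly, pitch-based retrieval over structured music needs no signal analysis, unlike sample-based music.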
