
  1. 8. Audio databases

About digital audio:
• Advent of the digital audio CD in 1983.
• Order-of-magnitude improvement in overall sound quality and signal-to-noise ratio over the best analog systems.
• Wide bandwidth required in on-line transmission.

Converting an analog signal into digital form:
• Linear Pulse Code Modulation (PCM)
• Two-stage process:
  (a) Sampling: observing the signal amplitude at regular time intervals; typical sampling frequencies: 16-48 kHz.
  (b) Quantization: a discrete scale for the observed amplitudes, typically 16 bits per sample → 65,536 possible values.
• Audio CD: 16-bit samples at a rate of 44.1 kHz, with two (stereo) channels: 2 × 16 × 44,100 ≈ 1.4 Mbit/s.

MMDB-8 J. Teuhola 2012 184
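
The two PCM stages and the CD bit-rate arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, assuming amplitudes normalized to [-1, 1] and a simple symmetric scaling to signed 16-bit codes; the function names are hypothetical and real converters differ in detail.

```python
import math


def sample_sine(freq_hz: float, rate_hz: int, n: int) -> list[float]:
    """Sampling: observe a sine of amplitude 1 every 1/rate_hz seconds."""
    return [math.sin(2 * math.pi * freq_hz * k / rate_hz) for k in range(n)]


def quantize(x: float, bits: int = 16) -> int:
    """Linear quantization of an amplitude in [-1, 1] to a signed integer code."""
    max_code = 2 ** (bits - 1) - 1   # 32767 for 16 bits
    min_code = -(2 ** (bits - 1))    # -32768 for 16 bits
    code = round(x * max_code)
    return max(min_code, min(max_code, code))  # clip out-of-range input


def pcm_bitrate(channels: int, bits: int, rate_hz: int) -> int:
    """Raw (uncompressed) PCM bit rate in bits per second."""
    return channels * bits * rate_hz


codes = [quantize(s) for s in sample_sine(1000.0, 44_100, 8)]
print(codes)                       # first eight 16-bit codes of a 1 kHz tone
print(2 ** 16)                     # 65536 possible values per sample
print(pcm_bitrate(2, 16, 44_100))  # 1411200 bits/s, i.e. about 1.4 Mbit/s
```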

  2. Illustration of audio concepts

[Figure: waveform plotted against time, with amplitude, wavelength and sampling interval marked]

  3. Audio compression techniques

(a) Delta modulation:
• Extremely simple; sometimes used for speech coding.
• 1-bit quantizer for amplitude differences: 0 = -∆, 1 = +∆.

(b) Adaptive Differential Pulse Code Modulation (ADPCM):
• The next sample value is predicted on the basis of recent history; the prediction error is quantized and coded.
• Used mainly for speech coding, e.g. ITU-T G.726.

(c) Subband coding:
• Division of the signal into frequency components (bands).
• Separate encoding of each band.
• E.g. ITU-T Recommendation G.722: high-quality speech at 64 kbit/s.
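
A minimal sketch of delta modulation as described in (a): a fixed step ∆ (the `step` parameter), one bit per sample, with the encoder tracking the decoder's running approximation. The function names and the fixed step are illustrative assumptions; ADPCM would additionally adapt the step size and use a better predictor.

```python
def dm_encode(samples: list[float], step: float) -> list[int]:
    """Delta modulation: one bit per sample, 1 = +step, 0 = -step."""
    bits = []
    approx = 0.0  # the encoder tracks the decoder's reconstruction
    for x in samples:
        bit = 1 if x > approx else 0
        approx += step if bit else -step
        bits.append(bit)
    return bits


def dm_decode(bits: list[int], step: float) -> list[float]:
    """Rebuild the staircase approximation from the bit stream."""
    out = []
    approx = 0.0
    for bit in bits:
        approx += step if bit else -step
        out.append(approx)
    return out


bits = dm_encode([1.0, 1.0, 1.0], step=0.5)
print(bits)                       # [1, 1, 0]
print(dm_decode(bits, step=0.5))  # [0.5, 1.0, 0.5]
```

With a constant input the staircase oscillates around the true value (granular noise), while a step that is too small cannot follow a steep signal (slope overload); choosing ∆ is a trade-off between the two.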

  4. MPEG audio

• Sampling rates 32, 44.1 or 48 kHz (or half of these); samples are processed in frames of 384 (Layer I) or 1152 (Layers II and III) samples.
• Subband coding with a bank of 32 filters, each with a bandwidth of 1/64 of the sampling frequency (the 32 equal bands together cover 0 to fs/2).
• Samples are coded with variable quantization steps.
• A psychoacoustic model exploits the masking properties of the human ear.
• Compressed bitrates range from 32 to 224 kbit/s; compression factor from 2.7 to 24.
• MPEG Layer I: best for bitrates > 128 kbit/s (per channel).
• MPEG Layer II: best for bitrates ≈ 128 kbit/s (per channel).
• MPEG Layer III: best for bitrates ≈ 64 kbit/s (per channel) = MP3 music on the Internet (compression ≈ 12:1). A modified Discrete Cosine Transform (MDCT) is applied to the subband signals.
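
The figures on this slide follow from simple arithmetic, sketched below under the assumption of CD-quality stereo input (16 bits, 44.1 kHz) and equal-width subbands; the helper names are hypothetical.

```python
def subband_width_hz(fs_hz: float, n_bands: int = 32) -> float:
    """Width of each of n_bands equal subbands covering 0..fs/2, i.e. fs/64 for 32 bands."""
    return (fs_hz / 2) / n_bands


def frame_duration_ms(samples_per_frame: int, fs_hz: float) -> float:
    """Playing time of one frame in milliseconds."""
    return 1000.0 * samples_per_frame / fs_hz


def compression_factor(pcm_bps: float, coded_bps: float) -> float:
    """Ratio of the uncompressed PCM rate to the compressed rate."""
    return pcm_bps / coded_bps


cd_stereo_bps = 2 * 16 * 44_100                                 # 1 411 200 bit/s
print(subband_width_hz(48_000))                                 # 750.0 Hz per subband
print(frame_duration_ms(384, 48_000))                           # 8.0 ms (Layer I frame)
print(round(frame_duration_ms(1152, 44_100), 1))                # 26.1 ms (1152-sample frame)
print(round(compression_factor(cd_stereo_bps, 2 * 64_000), 1))  # 11.0
```

At 64 kbit/s per channel the factor comes out at about 11, in line with the rough ≈ 12:1 figure quoted for MP3.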

  5. Audio data retrieval

(a) Based on metadata:
• Additional attributes can be attached to audio data (as with images and video), e.g. speaker, date, duration, composer, orchestra, instrument, ...
• Attributes can be connected to the whole audio sequence or to some parts of it (e.g. parts of a symphony).
• General document retrieval techniques usually apply.
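
Attribute-based retrieval with attributes on both whole recordings and their parts can be sketched as follows. The classes, the matching rule (segment attributes inherit and override the recording's) and the sample data are hypothetical illustrations, not a particular system's schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Segment:
    start_s: float
    end_s: float
    attrs: dict[str, str] = field(default_factory=dict)  # attributes of this part only


@dataclass
class Recording:
    title: str
    attrs: dict[str, str]                                  # attributes of the whole sequence
    segments: list[Segment] = field(default_factory=list)


def search(recordings: list[Recording], **criteria: str) -> list[tuple[str, Optional[Segment]]]:
    """Return (title, segment) pairs matching all criteria; segment is None when the
    whole recording matches. Segment attributes extend and override the recording's."""
    hits: list[tuple[str, Optional[Segment]]] = []
    for rec in recordings:
        if all(rec.attrs.get(k) == v for k, v in criteria.items()):
            hits.append((rec.title, None))
            continue
        for seg in rec.segments:
            merged = {**rec.attrs, **seg.attrs}
            if all(merged.get(k) == v for k, v in criteria.items()):
                hits.append((rec.title, seg))
    return hits


library = [
    Recording("Symphony A", {"composer": "Composer X", "orchestra": "Orchestra Y"},
              [Segment(0.0, 600.0, {"part": "1st movement"}),
               Segment(600.0, 1500.0, {"part": "2nd movement", "instrument": "oboe"})]),
    Recording("Interview B", {"speaker": "Speaker Z", "date": "2012-05-01"}),
]
print(search(library, composer="Composer X"))  # whole recording matches
print(search(library, instrument="oboe"))      # only the 2nd movement matches
```

A real database would index these attributes (e.g. with an inverted index) instead of scanning every recording.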

  6. Audio data retrieval (cont.)

(b) Speech recognition:
• Proximity search on the waveform; features extracted e.g. from the coefficients of the DCT-transformed signal.
• Some fuzziness is involved.
• Simple application:
  - Giving voice commands to a user interface.
• Advanced application:
  - Parsing spoken sentences and converting them e.g. into database queries.
  - Can be coupled with natural language understanding techniques.
• Usually based on a predefined set of patterns and associated phonetic rules.
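
A toy version of proximity search on DCT features is sketched below. It assumes equal-length, time-aligned utterances, 64-sample frames, the first 8 DCT-II coefficients per frame and plain Euclidean distance; the synthetic tones merely stand in for recorded commands. Practical recognizers use richer features (e.g. MFCCs) and alignment methods such as dynamic time warping or statistical models.

```python
import math


def dct_ii(frame: list[float], n_coeffs: int) -> list[float]:
    """First n_coeffs terms of the unnormalized DCT-II: X_k = sum_n x_n cos(pi/N (n + 0.5) k)."""
    n_len = len(frame)
    return [sum(x * math.cos(math.pi / n_len * (n + 0.5) * k) for n, x in enumerate(frame))
            for k in range(n_coeffs)]


def features(signal: list[float], frame_len: int = 64, n_coeffs: int = 8) -> list[float]:
    """Concatenate the leading DCT coefficients of each full frame."""
    feats: list[float] = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        feats.extend(dct_ii(signal[start:start + frame_len], n_coeffs))
    return feats


def recognize(query: list[float], templates: dict[str, list[float]]) -> str:
    """Proximity search: label of the stored template nearest to the query."""
    q = features(query)
    return min(templates, key=lambda label: math.dist(q, features(templates[label])))


def tone(freq_hz: float, n: int = 256, rate_hz: int = 8000) -> list[float]:
    return [math.sin(2 * math.pi * freq_hz * t / rate_hz) for t in range(n)]


# Hypothetical stand-ins for two recorded voice commands.
templates = {"command_low": tone(300), "command_high": tone(1500)}
print(recognize([0.8 * s for s in tone(300)], templates))  # command_low
```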

  7. Audio data retrieval (cont.)

(c) Speaker recognition:
• Application: security systems.
• Sensitive to the physical condition of the speaker (e.g. flu).
• Variations:
  - Text-dependent recognition (simpler): a restricted set of possible words/sentences; comparison of digital waveforms.
  - Text-independent recognition (more difficult): based e.g. on voice pitch recognition. More elaborate sentences from particular users must be stored, and complex verification algorithms are run against the spoken samples.
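
Pitch, mentioned above for text-independent recognition, can be estimated for instance with autocorrelation: the lag at which a voiced frame best matches a shifted copy of itself approximates the pitch period. The sketch below assumes that method, a 60-400 Hz search range and a nearest-average-pitch comparison against hypothetical enrolled speakers; pitch alone is far too weak for a real security system and is used here only to illustrate the idea.

```python
import math


def estimate_pitch_hz(frame: list[float], rate_hz: int,
                      f_min: float = 60.0, f_max: float = 400.0) -> float:
    """Autocorrelation pitch estimate: the lag (within the plausible voice range)
    with the largest autocorrelation is taken as the pitch period."""
    lag_min = int(rate_hz / f_max)
    lag_max = min(int(rate_hz / f_min), len(frame) - 1)
    best_lag, best_r = lag_min, -math.inf
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return rate_hz / best_lag


def identify(pitch_hz: float, profiles: dict[str, float]) -> str:
    """Enrolled speaker whose stored average pitch is closest to the measured one."""
    return min(profiles, key=lambda name: abs(profiles[name] - pitch_hz))


rate = 8000
frame = [math.sin(2 * math.pi * 200 * t / rate) for t in range(400)]  # 50 ms of a 200 Hz tone
pitch = estimate_pitch_hz(frame, rate)
print(pitch)                                                       # 200.0
print(identify(pitch, {"speaker_A": 120.0, "speaker_B": 210.0}))  # speaker_B
```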

  8. Audio data retrieval (cont.)

(d) Recognition and retrieval of songs (recorded music)

Query input alternatives:
• Query-by-humming: succeeds for clearly distinguishable melodies (or themes), in spite of small pitch errors. The similarity measure uses some kind of edit distance.
• Tapping the tempo: complements humming/singing.
• Playing a (virtual) keyboard.

Output:
• Ranked list of candidate songs.

Example search engine:
• Musipedia (http://www.musipedia.org/)
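
One way to realize "some kind of edit distance" for query-by-humming is sketched below, assuming melodies stored as MIDI note numbers. Converting notes to semitone intervals makes the match independent of the key the user hums in, and an approximate substring match (free start and end in the stored melody) lets a short hummed fragment match anywhere in a song. This particular distance is an assumption, not necessarily what Musipedia uses.

```python
def intervals(pitches: list[int]) -> list[int]:
    """Semitone steps between consecutive notes; makes matching independent of the key."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def substring_edit_distance(query: list[int], target: list[int]) -> int:
    """Smallest edit distance between the query and any contiguous part of the target."""
    prev = [0] * (len(target) + 1)  # a match may start at any target position
    for i, q in enumerate(query, 1):
        cur = [i]
        for j, t in enumerate(target, 1):
            cur.append(min(prev[j] + 1,              # query interval skipped
                           cur[j - 1] + 1,           # extra interval in the target
                           prev[j - 1] + (q != t)))  # match (0) or substitution (1)
        prev = cur
    return min(prev)  # a match may end at any target position


def rank(query_pitches: list[int], melodies: dict[str, list[int]]) -> list[str]:
    """Candidate titles, best match first."""
    q = intervals(query_pitches)
    return sorted(melodies, key=lambda title: substring_edit_distance(q, intervals(melodies[title])))


# Opening phrases as MIDI note numbers (60 = middle C).
melodies = {
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64, 62, 60, 60, 62, 64, 64, 62, 62],
    "Twinkle, Twinkle, Little Star": [60, 60, 67, 67, 69, 69, 67, 65, 65, 64, 64, 62, 62, 60],
}
# Hummed five semitones lower, with the fifth note sung a semitone flat.
hummed = [59, 59, 60, 62, 61, 60, 59, 57]
print(rank(hummed, melodies))  # ['Ode to Joy', 'Twinkle, Twinkle, Little Star']
```

A single wrong note changes two intervals, so the best candidate here has distance 2 rather than 0; tolerating small pitch errors means accepting such low but nonzero distances.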

  9. Encoding and retrieval of (synthetic) music

• Music encoding:
  - For digital electronic instruments (no singing!).
  - Timing of note-on/note-off events.
  - Control of instrument and playback parameters (pitch, loudness).
  - Can be played with a synthesizer.
• Encoding formats:
  - MIDI (Musical Instrument Digital Interface)
  - MPEG-4 SA (Structured Audio)
  - MusicXML (notes represented using structured markup)
• Retrieval criteria:
  - Notes: generalization of string matching (but: polyphony!)
  - Time-dependent parameters: instruments, tempo, volume, ...
  - Textual metadata: title, composer, artist, genre, date, ...
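
The sketch below shows note-on/note-off timing and note-based retrieval for a single melodic line. The event tuples (time, "on"/"off", pitch) are a simplified in-memory stand-in for MIDI messages, not the MIDI byte format, and all function names are hypothetical. Matching is done on pitch intervals, so a transposed motif still matches; with polyphony (overlapping notes) the notes no longer form a single string and this simple approach breaks down.

```python
from dataclasses import dataclass


@dataclass
class Note:
    pitch: int       # MIDI note number (60 = middle C, 69 = A4)
    start: float     # seconds
    duration: float  # seconds


def events_to_notes(events: list[tuple[float, str, int]]) -> list[Note]:
    """Pair each note-on with the following note-off of the same pitch.
    Sorting puts 'off' before 'on' at equal times, so repeated notes pair correctly."""
    open_notes: dict[int, float] = {}
    notes = []
    for time, kind, pitch in sorted(events):
        if kind == "on":
            open_notes[pitch] = time
        elif kind == "off" and pitch in open_notes:
            start = open_notes.pop(pitch)
            notes.append(Note(pitch, start, time - start))
    return sorted(notes, key=lambda n: n.start)


def midi_to_hz(pitch: int) -> float:
    """Equal-tempered frequency with A4 (note 69) tuned to 440 Hz."""
    return 440.0 * 2 ** ((pitch - 69) / 12)


def find_motif(notes: list[Note], motif: list[int]) -> list[float]:
    """Start times at which the motif's interval pattern occurs (monophonic line assumed)."""
    steps = [b.pitch - a.pitch for a, b in zip(notes, notes[1:])]
    pattern = [b - a for a, b in zip(motif, motif[1:])]
    k = len(pattern)
    return [notes[i].start for i in range(len(steps) - k + 1) if steps[i:i + k] == pattern]


melody = [60, 62, 64, 60, 62, 64, 65]  # C D E C D E F, half a second each
events = []
for i, p in enumerate(melody):
    events += [(0.5 * i, "on", p), (0.5 * i + 0.5, "off", p)]
notes = events_to_notes(events)
print(find_motif(notes, [67, 69, 71]))  # [0.0, 1.5]: G A B is C D E transposed up
print(midi_to_hz(69))                   # 440.0
```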

  10. Indexing of audio data

• Indexing of metadata (external attributes):
  - As with any other documents: inverted indexes, multi-attribute indexes, signature files, etc.
• Indexing of the audio signal:
  - First split into segments (= frames, windows). Segmentation requires some rules; e.g. 'quiet' zones are possibly good split points.
  - Transformation (e.g. DCT) of each segment into features.
  - A multidimensional index is built from groups of the features (e.g. the main DCT coefficients).
  - Proximity queries (nearest neighbor, or k nearest neighbors of the query sample) should be supported by the index.
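
The signal-indexing steps can be combined into a small sketch: split at quiet frames, turn each segment into a short DCT feature vector, and answer k-nearest-neighbor queries. The energy threshold, frame length and feature choice (first 4 DCT-II coefficients of the segment's first 64 samples) are arbitrary assumptions, and the linear scan merely stands in for a real multidimensional index structure such as an R-tree or k-d tree.

```python
import heapq
import math


def split_at_quiet(signal: list[float], frame_len: int = 100,
                   threshold: float = 1e-3) -> list[tuple[int, int]]:
    """(start, end) sample ranges of non-quiet runs; frames whose mean energy is
    below the threshold are treated as quiet split points."""
    segments: list[tuple[int, int]] = []
    start = None
    n_frames = len(signal) // frame_len
    for f in range(n_frames):
        pos = f * frame_len
        quiet = sum(x * x for x in signal[pos:pos + frame_len]) / frame_len < threshold
        if not quiet and start is None:
            start = pos
        elif quiet and start is not None:
            segments.append((start, pos))
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments


def dct_features(segment: list[float], n_coeffs: int = 4, window: int = 64) -> tuple[float, ...]:
    """First n_coeffs DCT-II coefficients of the segment's first `window` samples."""
    x = segment[:window]
    n = len(x)
    return tuple(sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(x))
                 for k in range(n_coeffs))


def build_index(signal: list[float]) -> list[tuple[tuple[float, ...], tuple[int, int]]]:
    """Stand-in for a multidimensional index: (features, (start, end)) entries."""
    return [(dct_features(signal[s:e]), (s, e)) for s, e in split_at_quiet(signal)]


def knn(index, query_features, k=1):
    """k nearest segments by Euclidean distance (linear scan)."""
    return heapq.nsmallest(k, index, key=lambda entry: math.dist(entry[0], query_features))


def tone(freq_hz: float, n: int, rate_hz: int = 8000) -> list[float]:
    return [math.sin(2 * math.pi * freq_hz * t / rate_hz) for t in range(n)]


signal = tone(250, 300) + [0.0] * 200 + tone(1000, 300) + [0.0] * 100
index = build_index(signal)
print(split_at_quiet(signal))                          # [(0, 300), (500, 800)]
print(knn(index, dct_features(tone(1000, 64)))[0][1])  # (500, 800)
```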
