
The ERBlet transform, auditory time-frequency masking and perceptual sparsity



  1. The ERBlet transform, auditory time-frequency masking and perceptual sparsity. Thibaud Necciari (1), joint work with P. Balazs (1), B. Laback (1), P. Soendergaard (1, 3), R. Kronland-Martinet (2), S. Meunier (2), S. Savel (2), and S. Ystad (2). (1) Acoustics Research Institute, Vienna, Austria; (2) Laboratoire de Mécanique et d'Acoustique, Marseille, France; (3) Technical University of Denmark. 2nd SPLab Workshop, October 24–26, 2012, Brno.

  2. Context: Analysis-Synthesis of Sound Signals. Idea: Integrate aspects of human auditory perception in the signal representation

  3. Goal of the Study. Achieve a perceptually-motivated and invertible TF transform based on: (1) properties of TF transforms: linear, allow perfect reconstruction, adapted to non-stationary signals; (2) results on human auditory perception (psychoacoustics).

  4. Some Aspects of Human Auditory Perception. 1. Spectral Resolution: The Auditory Filters. = Ability to resolve sinusoidal components in complex sounds. Peripheral filtering ≡ bank of bandpass filters = auditory filters

  5. Some Aspects of Human Auditory Perception. 1. Spectral Resolution: The ERB Scale [Moore & Glasberg, 1983]. Each auditory filter is characterized by its ERB = Equivalent Rectangular Bandwidth.

  6. Some Aspects of Human Auditory Perception. 1. Spectral Resolution: The ERB Scale [Moore & Glasberg, 1983]. Each auditory filter is characterized by its ERB = Equivalent Rectangular Bandwidth.
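To make the ERB scale concrete, here is a minimal sketch of how the bandwidth of an auditory filter grows with its center frequency. The slides cite [Moore & Glasberg, 1983] but do not reproduce the formula; the sketch assumes the later and widely used fit by Glasberg & Moore (1990), so the numbers may differ slightly from those used in the talk. The function names are hypothetical.

```python
import math


def erb_bandwidth_hz(f_hz: float) -> float:
    """ERB (in Hz) of the auditory filter centred at f_hz.

    Assumption: Glasberg & Moore (1990) fit, not necessarily the
    formula used in the slides.
    """
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


def erb_number(f_hz: float) -> float:
    """Position of f_hz on the ERB-rate scale (same 1990 fit)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


if __name__ == "__main__":
    for f in (250.0, 1000.0, 4000.0):
        print(f"{f:6.0f} Hz: ERB = {erb_bandwidth_hz(f):6.1f} Hz, "
              f"ERB number = {erb_number(f):5.2f}")
```

The bandwidth increases roughly linearly with frequency, which is why a transform that follows this scale needs narrow filters at low frequencies and wide filters at high frequencies.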

  7. Some Aspects of Human Auditory Perception. 2. Temporal Resolution. = Ability to detect rapid changes in sounds over time. The time axis is partitioned into time windows (analogous to spectral resolution). Window length = temporal resolution. Window length is frequency dependent ≈ “internal” TF analysis [van Schijndel et al., 1999]. Window length ≈ 4 periods of the center frequency, e.g., 4 ms @ 1 kHz and 1 ms @ 4 kHz.
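The rule of thumb above (window length of about 4 periods of the center frequency) can be checked with a one-line computation. This is only an illustration of the arithmetic; the function name and the `n_periods` parameter are hypothetical.

```python
def auditory_window_ms(center_freq_hz: float, n_periods: float = 4.0) -> float:
    """Approximate auditory time-window length in milliseconds.

    One period lasts 1 / center_freq_hz seconds, so n_periods periods
    last n_periods / center_freq_hz seconds.
    """
    return 1000.0 * n_periods / center_freq_hz


if __name__ == "__main__":
    for fc in (1000.0, 4000.0):
        print(f"{fc:.0f} Hz -> {auditory_window_ms(fc):.1f} ms")
```

The temporal resolution therefore improves (windows get shorter) as the center frequency increases, the opposite trend to the spectral resolution.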

  8. Some Aspects of Human Auditory Perception. 3. Auditory Masking. = Increase in the detection threshold of a sound (“target”) in the presence of another sound (“masker”).

  9. Some Aspects of Human Auditory Perception. 3. Auditory Masking. = Increase in the detection threshold of a sound (“target”) in the presence of another sound (“masker”). Measurement: amount of masking (dB) = masked threshold − absolute threshold, where the masked threshold is the detection threshold of the target in the presence of the masker and the absolute threshold is the detection threshold of the target in quiet.

  10. Some Aspects of Human Auditory Perception. 3. Auditory Masking. = Increase in the detection threshold of a sound (“target”) in the presence of another sound (“masker”). Main parameters: time, frequency, stimulus duration, stimulus level, frequency region of the audible spectrum [20 Hz ... 20 kHz].

  11. Some Aspects of Human Auditory Perception. 3. Auditory Masking: Consequence in Signal Representation. $s(t) = \frac{1}{C_g} \iint_{\mathbb{R}^2} \mathrm{STFT}(\tau, \omega)\, g_{\tau,\omega}(t)\, d\tau\, d\omega$, where $1/C_g$ is the normalization and $g_{\tau,\omega}$ is the TF atom.

  12. Some Aspects of Human Auditory Perception. 3. Auditory Masking: Consequence in Signal Representation. $s(t) = \frac{1}{C_g} \iint_{\mathbb{R}^2} \mathrm{STFT}(\tau, \omega)\, g_{\tau,\omega}(t)\, d\tau\, d\omega$, where $1/C_g$ is the normalization and $g_{\tau,\omega}$ is the TF atom.

  13. Some Aspects of Human Auditory Perception. 3. Auditory Masking: Consequence in Signal Representation. $s(t) = \frac{1}{C_g} \iint_{\mathbb{R}^2} \mathrm{STFT}(\tau, \omega)\, g_{\tau,\omega}(t)\, d\tau\, d\omega$, where $1/C_g$ is the normalization and $g_{\tau,\omega}$ is the TF atom. Can we represent only audible atoms? If so, which atoms can be removed?

  14. Proposed Approach. To obtain a perceptually-motivated and invertible TF transform:

  15. Proposed Approach. To obtain a perceptually-motivated and invertible TF transform: (1) Adapt the transform parameters to mimic the auditory TF resolution → a variable-resolution transform is required!

  16. Proposed Approach. To obtain a perceptually-motivated and invertible TF transform: (1) Adapt the transform parameters to mimic the auditory TF resolution → a variable-resolution transform is required! (2) Use a psychoacoustic model of TF masking to represent only the audible components (perceptual sparsity concept).

  17. Outline. (1) Perceptually-based TF transform: the ERBlet. (2) Perceptual sparsity concept: investigating auditory TF masking. (3) Discussion: combination of ERBlet & perceptual sparsity?

  18. Outline. (1) Perceptually-based TF transform: the ERBlet (concept, implementation, example). (2) Perceptual sparsity concept: investigating auditory TF masking. (3) Discussion: combination of ERBlet & perceptual sparsity?

  19. The ERBlet Transform. Concept. The non-stationary Gabor transform (NSGT) [Balazs et al., 2011] allows the resolution to evolve freely over T and/or F. We can adapt both the shape of g(t), either in T or F, and the redundancy. Perfect reconstruction is achieved if the frame inequality is fulfilled. Idea: develop a perceptually-motivated NSGT. Use an NSGT with resolution evolving over frequency to mimic the ERB scale → the ERBlet transform.
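For reference, the frame inequality mentioned above can be written as follows. This is the standard textbook definition of a frame, not a formula taken from the slides; the indexing of the analysis functions as $g_{m,n}$ (frequency index $m$, time index $n$) is an assumed notation.

```latex
A \,\|f\|^2 \;\le\; \sum_{m,n} \left| \langle f, g_{m,n} \rangle \right|^2 \;\le\; B \,\|f\|^2
\qquad \text{for all signals } f, \quad 0 < A \le B < \infty
```

The constants $A$ and $B$ are the frame bounds. The lower bound guarantees that no signal is lost by the analysis, which is what makes inversion possible, and the ratio $B/A$ indicates how well conditioned that inversion is.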

  20. ERBlet Implementation. 1. Analysis Functions. An NSGT with resolution evolving over time is available in LTFAT [Soendergaard, 2010]: function nsdgt.m. Applying nsdgt to the Fourier transform of the signal, $s(t) \mapsto \hat{s}(\nu)$, allows constructing an NSGT with resolution evolving over frequency (= constant-Q NSGT in [Velasco et al., 2011] but with different functions).

  21. ERBlet Implementation. 1. Analysis Functions. An NSGT with resolution evolving over time is available in LTFAT [Soendergaard, 2010]: function nsdgt.m. Applying nsdgt to the Fourier transform of the signal, $s(t) \mapsto \hat{s}(\nu)$, allows constructing an NSGT with resolution evolving over frequency (= constant-Q NSGT in [Velasco et al., 2011] but with different functions). Analysis functions (Gaussian windows): $\hat{h}_m(\nu) = \frac{1}{\sqrt{\Gamma_m}}\, e^{-\pi (\nu / \Gamma_m)^2}$, where $m$ = frequency index and $\Gamma_m = \mathrm{ERB}_m$ (in Hz). [Figure: $\Gamma_m = f(m)$; $\Gamma_m$ (Hz) plotted against the frequency index $m$.]
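The following sketch evaluates the Gaussian analysis window above and checks numerically why choosing Γ_m = ERB_m yields one window per ERB: the area under the window divided by its peak value equals Γ_m. Two assumptions are made here: the window is evaluated with ν measured relative to the center of channel m (the slide formula shows no explicit center frequency), and the equivalent rectangular width is taken on the window amplitude rather than on its squared magnitude. Function names are hypothetical.

```python
import math


def gaussian_window(nu_hz: float, gamma_hz: float) -> float:
    """Gaussian analysis window: exp(-pi * (nu / Gamma)^2) / sqrt(Gamma)."""
    return math.exp(-math.pi * (nu_hz / gamma_hz) ** 2) / math.sqrt(gamma_hz)


def equivalent_rectangular_width(gamma_hz: float, step_hz: float = 0.5) -> float:
    """Area under the window divided by its peak value (Riemann sum).

    The sum runs over +/- 6 * Gamma, where the Gaussian is negligible.
    """
    half_span = 6.0 * gamma_hz
    n_steps = int(round(2.0 * half_span / step_hz))
    area = step_hz * sum(
        gaussian_window(-half_span + k * step_hz, gamma_hz)
        for k in range(n_steps + 1)
    )
    return area / gaussian_window(0.0, gamma_hz)


if __name__ == "__main__":
    for gamma in (50.0, 100.0, 600.0):
        width = equivalent_rectangular_width(gamma)
        print(f"Gamma = {gamma:5.1f} Hz -> equivalent width = {width:7.3f} Hz")
```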

  22. ERBlet Implementation. 2. Spectral Resolution. [Figure: analysis windows and dual windows; amplitude vs. frequency (Hz), 0 to 8000 Hz.] 1 window/ERB (≡ auditory filterbank); 34 channels @ 8 kHz, 49 channels @ 22 kHz.

  23. ERBlet Implementation. 3. Temporal Resolution. [Figure: analysis windows in the time domain; amplitude vs. time index.] 4 kHz: resolution = 1.1 ms (auditory = 1 ms). 1 kHz: resolution = 3.7 ms (auditory = 4 ms).

  24. ERBlet Example. LTFAT Speech Test Signal “greasy”. [Figure: two spectrograms in dB SPL, frequency (Hz) vs. time (s), 0 to 0.3 s: standard Gabor (left) and ERBlet (right).] Standard Gabor: frame bounds ratio = 1.5, redundancy ≈ 4, reconstruction error < $10^{-16}$. ERBlet: frame bounds ratio = 1, redundancy ≈ 4.6, reconstruction error < $10^{-16}$.

  25. Outline. (1) Perceptually-based TF transform: the ERBlet. (2) Perceptual sparsity concept: investigating auditory TF masking (problematic, experimental methods, results). (3) Discussion: combination of ERBlet & perceptual sparsity?

  26. Auditory TF Masking: Problematic. Which atoms can be removed from the signal representation? A representation of TF masking for short and narrowband signals is required.

  27. Auditory TF Masking: Problematic. Current masking data are not suitable for prediction of masking between TF atoms

  28. Auditory TF Masking: Problematic. Current masking data are not suitable for prediction of masking between TF atoms Psychoacoustical studies mostly focused on T OR F

  29. Auditory TF Masking: Problematic. Current masking data are not suitable for prediction of masking between TF atoms Psychoacoustical studies mostly focused on T OR F Very few studies measured TF masking [Fastl, 1979; Kidd & Feth, 1981; Soderquist et al. , 1981; Moore et al. , 2002]

  30. Auditory TF Masking: Problematic. Current masking data are not suitable for prediction of masking between TF atoms Psychoacoustical studies mostly focused on T OR F Very few studies measured TF masking [Fastl, 1979; Kidd & Feth, 1981; Soderquist et al. , 1981; Moore et al. , 2002] These studies used long-duration maskers: not compatible with atomic decomposition

  31. Auditory TF Masking: Problematic. Current masking data are not suitable for prediction of masking between TF atoms Psychoacoustical studies mostly focused on T OR F Very few studies measured TF masking [Fastl, 1979; Kidd & Feth, 1981; Soderquist et al. , 1981; Moore et al. , 2002] These studies used long-duration maskers: not compatible with atomic decomposition

  32. Experimental Methods. 1. Stimuli (Masker & Target). Formula: $s(t) = A \sqrt{\Gamma}\, \sin\!\left(2\pi f_0 t + \frac{\pi}{4}\right) e^{-\pi (\Gamma t)^2}$, where $f_0$ = carrier frequency, the $\pi/4$ phase shift makes the signal energy independent of $f_0$, and $\Gamma$ = shape factor of the Gaussian window.

  33. Experimental Methods. 1. Stimuli (Masker & Target). Formula: $s(t) = A \sqrt{\Gamma}\, \sin\!\left(2\pi f_0 t + \frac{\pi}{4}\right) e^{-\pi (\Gamma t)^2}$, where $f_0$ = carrier frequency, the $\pi/4$ phase shift makes the signal energy independent of $f_0$, and $\Gamma$ = shape factor of the Gaussian window. Spectro-temporal characteristics: ERB ⇔ $\Gamma$ = 600 Hz [van Schijndel et al., 1999]; ERD ⇔ $\Gamma^{-1}$ = 1.7 ms; 0-amplitude duration = 9.6 ms.
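As a sanity check on the stimulus definition, the sketch below samples the Gaussian-windowed tone and integrates its energy numerically for several carrier frequencies. With the π/4 phase shift, the oscillating part of sin² is an odd function of t and integrates to zero against the even Gaussian, so the energy should come out as A²/(2√2) regardless of f0. The function names, the integration step, and the carrier frequencies are illustrative assumptions; Γ = 600 Hz and the 9.6 ms support are the values quoted above.

```python
import math


def gaussian_tone(t_s: float, f0_hz: float, gamma_hz: float,
                  amplitude: float = 1.0) -> float:
    """s(t) = A * sqrt(Gamma) * sin(2*pi*f0*t + pi/4) * exp(-pi*(Gamma*t)^2)."""
    return (amplitude * math.sqrt(gamma_hz)
            * math.sin(2.0 * math.pi * f0_hz * t_s + math.pi / 4.0)
            * math.exp(-math.pi * (gamma_hz * t_s) ** 2))


def energy(f0_hz: float, gamma_hz: float = 600.0,
           half_span_s: float = 4.8e-3, dt_s: float = 1e-6) -> float:
    """Riemann-sum estimate of the integral of s(t)^2 over the 9.6 ms support."""
    n_steps = int(round(2.0 * half_span_s / dt_s))
    return dt_s * sum(
        gaussian_tone(-half_span_s + k * dt_s, f0_hz, gamma_hz) ** 2
        for k in range(n_steps + 1)
    )


if __name__ == "__main__":
    print(f"1 / Gamma = {1000.0 / 600.0:.2f} ms")
    for f0 in (300.0, 1000.0, 4000.0):
        print(f"f0 = {f0:6.0f} Hz -> energy = {energy(f0):.6f}")
```

The printed value of 1/Γ rounds to the 1.7 ms quoted for the ERD.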
