Joint Dereverberation and Noise Reduction Using Beamforming and a Single-Channel Speech Enhancement Scheme

  1. Joint Dereverberation and Noise Reduction Using Beamforming and a Single-Channel Speech Enhancement Scheme
     B. Cauchi, I. Kodrasi, R. Rehr, S. Gerlach, T. Gerkmann, S. Doclo, S. Goetze
     Fraunhofer IDMT, Project Group Hearing, Speech and Audio Technology
     Oldenburg University, Signal Processing Group
     Florence, May 10th 2014
     benjamin.cauchi@idmt.fraunhofer.de, phone 0441 2172-450
     © Fraunhofer IDMT

  2. Introduction
     - Overview of the proposed system
     - Design of the MVDR beamformer
       - DOA estimated using MUSIC
       - Estimated noise covariance
     - Single-channel enhancement scheme
       - Combination and optimization of published estimators
     - Results
       - Objective measures
       - MUSHRA scores
       - WER using a baseline recognizer

  3. 1. Proposed System: Overview
     [Block diagram. Labels: source $s(n)$; room impulse responses $h_1(n), h_2(n), \dots, h_M(n)$; microphone signals $y_1(n), y_2(n), \dots, y_M(n)$; DOA estimation ($\hat{\theta}$); VAD and coherence estimation ($\boldsymbol{\Gamma}(k)$); MVDR beamformer with output $x(n)$; single-channel enhancement with output $\hat{s}(n)$.]
     - Beamformer: steered towards the estimated direction of arrival (DOA)
     - Single-channel enhancement: based on statistical estimators of the
       - Late reverberant spectral variance (LRSV)
       - Noise spectral variance (NSV)
       - Speech spectral variance (SSV)

  4. 2. MVDR Beamformer
     With $Y_m(k,\ell)$ the STFT of the input signal in the $m$-th microphone, we define
     $$\mathbf{Y}(k,\ell) = [Y_1(k,\ell)\; Y_2(k,\ell)\; \dots\; Y_M(k,\ell)]^T$$
     The output $\hat{X}(k,\ell)$ of the beamformer is obtained as
     $$\hat{X}(k,\ell) = \mathbf{W}_\theta^H(k)\,\mathbf{Y}(k,\ell)$$
     where
     $$\mathbf{W}_\theta(k) = \frac{\boldsymbol{\Gamma}^{-1}(k)\,\mathbf{d}_\theta(k)}{\mathbf{d}_\theta^H(k)\,\boldsymbol{\Gamma}^{-1}(k)\,\mathbf{d}_\theta(k)}$$
     - Noise coherence matrix: $\boldsymbol{\Gamma}(k)$, estimated using a VAD.
     - Steering vector: $\mathbf{d}_\theta(k)$, computed from $\hat{\theta}$ using a far-field assumption.
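
The MVDR weights defined on this slide can be sketched for a single frequency bin as follows. This is a minimal illustration, not the authors' implementation: the uniform linear array, the 5 cm spacing, the 1 kHz bin, the random coherence matrix and the helper names `far_field_steering` and `mvdr_weights` are all assumptions made for the example.

```python
import numpy as np

def far_field_steering(mic_x, theta, freq_hz, c=343.0):
    """Far-field steering vector d_theta(k) for a linear array (assumed geometry).

    mic_x: (M,) microphone positions along the array axis, in metres.
    theta: direction of arrival in radians, measured from the array axis.
    """
    # Relative arrival time of a plane wave at each microphone.
    delays = -mic_x * np.cos(theta) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def mvdr_weights(gamma, d):
    """W = Gamma^{-1} d / (d^H Gamma^{-1} d) for one frequency bin."""
    gamma_inv_d = np.linalg.solve(gamma, d)       # avoids an explicit inverse
    return gamma_inv_d / (d.conj() @ gamma_inv_d)

rng = np.random.default_rng(0)
M = 8
mic_x = 0.05 * np.arange(M)                       # hypothetical 5 cm spacing
d = far_field_steering(mic_x, np.deg2rad(60.0), freq_hz=1000.0)

# Hypothetical Hermitian positive-definite matrix standing in for Gamma(k).
a = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
gamma = a @ a.conj().T / M + np.eye(M)

w = mvdr_weights(gamma, d)
y = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # one STFT frame Y(k, l)
x_hat = w.conj() @ y                              # beamformer output W^H Y
```

By construction the weights satisfy the distortionless constraint $\mathbf{W}_\theta^H(k)\,\mathbf{d}_\theta(k) = 1$, which is a convenient sanity check for any implementation.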

  5. 2. MVDR Beamformer: Estimation of noise field coherence
     - Noise periods are identified with a VAD
       - Comparison between the long-term spectral envelope and the average noise spectrum
     - $\boldsymbol{\Gamma}(k)$ is estimated using the detected noise-only frames
     - Alternatively, a theoretically diffuse noise field is used:
     $$\mathbf{W}_\theta(k) = \frac{\left(\boldsymbol{\Gamma}_{\text{diff}}(k) + \varrho(k)\,\mathbf{I}_M\right)^{-1}\mathbf{d}_\theta(k)}{\mathbf{d}_\theta^H(k)\left(\boldsymbol{\Gamma}_{\text{diff}}(k) + \varrho(k)\,\mathbf{I}_M\right)^{-1}\mathbf{d}_\theta(k)}$$
     with $\varrho(k)$ chosen to satisfy the constraint $\mathbf{W}_\theta^H(k)\,\mathbf{W}_\theta(k) \le \mathrm{WNG}_{\max} = 10$ dB.
     Ramirez, J., Segura, J.C., Benitez, C., de la Torre, A., and Rubio, A., Efficient voice activity detection algorithms using long-term speech information, 2003.
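
The diffuse-field variant with the regularization term $\varrho(k)$ can be sketched as below. Several points are assumptions rather than details from the slides: the spherically isotropic coherence model $\sin(x)/x$, the simple doubling search for $\varrho(k)$, the reading of 10 dB as a linear power ratio of 10, the linear array geometry, and the helper names.

```python
import numpy as np

def diffuse_coherence(mic_x, freq_hz, c=343.0):
    """Coherence of a spherically isotropic noise field (assumed model)."""
    dist = np.abs(mic_x[:, None] - mic_x[None, :])
    # np.sinc(x) = sin(pi x) / (pi x), so this equals sin(2 pi f d / c) / (2 pi f d / c).
    return np.sinc(2.0 * freq_hz * dist / c)

def regularized_mvdr(gamma_diff, d, max_norm_db=10.0, rho_start=1e-6,
                     growth=2.0, max_iter=60):
    """Increase the diagonal loading rho until W^H W <= limit (hypothetical search)."""
    limit = 10.0 ** (max_norm_db / 10.0)          # 10 dB read as a power ratio
    eye = np.eye(len(d))
    rho = rho_start
    for _ in range(max_iter):
        v = np.linalg.solve(gamma_diff + rho * eye, d)
        w = v / (d.conj() @ v)
        if np.real(w.conj() @ w) <= limit:
            break
        rho *= growth
    return w, rho

M, c, freq_hz = 8, 343.0, 200.0                   # low bin: unloaded MVDR is ill-conditioned
mic_x = 0.05 * np.arange(M)                       # hypothetical 5 cm spacing
theta = np.deg2rad(60.0)
d = np.exp(2j * np.pi * freq_hz * mic_x * np.cos(theta) / c)

gamma_diff = diffuse_coherence(mic_x, freq_hz, c)
w_reg, rho = regularized_mvdr(gamma_diff, d)
```

For very large $\varrho(k)$ the weights tend to the delay-and-sum solution $\mathbf{d}_\theta(k)/M$, whose squared norm is $1/M$, so the search is guaranteed to stop.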

  6. 2. MVDR Beamformer: DOA Estimation
     $$\hat{\theta} = \arg\max_{\theta}\; \frac{1}{K} \sum_{k=k_{\text{low}}}^{k_{\text{high}}} U_\theta(k,\ell)$$
     where $U_\theta(k,\ell)$ is the MUSIC pseudo-spectrum:
     $$U_\theta(k,\ell) = \frac{1}{\mathbf{d}_\theta^H(k)\,\mathbf{E}(k,\ell)\,\mathbf{E}^H(k,\ell)\,\mathbf{d}_\theta(k)}$$
     $$\mathbf{E}(k,\ell) = [\mathbf{e}_{Q+1}(k,\ell)\; \dots\; \mathbf{e}_M(k,\ell)]$$
     with $\mathbf{e}_m$ denoting the eigenvectors of the covariance matrix of $\mathbf{Y}(k,\ell)$.
     [Figure: scatter plot with both axes running from -3 to 3 and the angle $\theta$ marked.]
     Schmidt, R., Multiple emitter location and signal parameter estimation, 1986.
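
A compact sketch of the MUSIC search over candidate angles is given below. It rests on assumptions not stated on the slide: one source ($Q = 1$), a linear array, a covariance matrix averaged over all available frames rather than tracked per frame $\ell$, and synthetic test data. The function name `music_doa` is hypothetical.

```python
import numpy as np

def music_doa(Y, steering, num_sources=1):
    """Return the index of the candidate angle maximizing the averaged pseudo-spectrum.

    Y: (K, M, L) STFT coefficients for K bins, M microphones, L frames.
    steering: (T, K, M) steering vectors for T candidate angles.
    """
    K, M, L = Y.shape
    score = np.zeros(steering.shape[0])
    for k in range(K):
        R = Y[k] @ Y[k].conj().T / L              # sample covariance, (M, M)
        _, eigvec = np.linalg.eigh(R)             # eigenvalues in ascending order
        E = eigvec[:, : M - num_sources]          # noise subspace (smallest eigenvalues)
        proj = steering[:, k, :].conj() @ E       # rows hold d^H E for each angle
        score += 1.0 / np.sum(np.abs(proj) ** 2, axis=1)
    return int(np.argmax(score / K))

rng = np.random.default_rng(1)
c, M, L = 343.0, 8, 200
mic_x = 0.05 * np.arange(M)                       # hypothetical linear array
freqs = np.linspace(500.0, 2000.0, 16)            # bins between k_low and k_high
angles_deg = np.arange(0, 181)
cos_t = np.cos(np.deg2rad(angles_deg))
steering = np.exp(2j * np.pi * freqs[None, :, None] * mic_x[None, None, :]
                  * cos_t[:, None, None] / c)

# Synthetic data: one source at 60 degrees plus weak sensor noise.
s = rng.standard_normal((len(freqs), L)) + 1j * rng.standard_normal((len(freqs), L))
noise = 0.1 * (rng.standard_normal((len(freqs), M, L))
               + 1j * rng.standard_normal((len(freqs), M, L)))
Y = steering[60][:, :, None] * s[:, None, :] + noise

theta_hat_deg = angles_deg[music_doa(Y, steering)]
```

Because `numpy.linalg.eigh` returns eigenvalues in ascending order, the noise subspace $[\mathbf{e}_{Q+1} \dots \mathbf{e}_M]$ of the slide corresponds to the first $M - Q$ columns here.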

  7. 3. Single-channel Enhancement: Overview
     [Block diagram. Block labels: noise power estimation, $T_{60}$ estimator, reverb power estimation, speech power estimation, speech power re-estimation, gain function. Signal labels: input $\hat{X}(k)$, $\hat{\sigma}^2_{\tilde{v}}(k)$, $\hat{\sigma}^2_r(k) + \hat{\sigma}^2_{\tilde{v}}(k)$, $\hat{\sigma}^2_z(k)$, $\hat{\sigma}^2_s(k)$, $T_{60}$, gain $G(k)$, output $\hat{S}(k)$.]
     - $\sigma^2_{\tilde{v}}(k,\ell)$ estimated using Minimum Statistics
     - $\sigma^2_s(k,\ell)$ estimated using Cepstral Smoothing
     - $\sigma^2_r(k,\ell)$ estimated using Lebart's approach
     Martin, R., Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics, 2001.
     Breithaupt, C., Gerkmann, T., and Martin, R., A Novel A Priori SNR Estimation Approach Based on Selective Cepstro-Temporal Smoothing, 2008.
     Eaton, J., Gaubitch, N.D., and Naylor, P.A., Noise-robust reverberation time estimation using spectral decay distributions with reduced computational cost, 2013.
     Lebart, K., Boucher, J.M., and Denbigh, P., A new method based on spectral subtraction for speech dereverberation, 2001.

  8. 3. Single-channel Enhancement: LRSV estimation
     - RIR modeled as Gaussian noise with decay
     $$\Delta = \frac{3 \ln 10}{T_{60}\, f_s}$$
     [Figure: RIR amplitude decay over time, 0 to 0.5 s.]
     - Representing the variance of the reverberant speech as
     $$\sigma^2_z(k,\ell) = \sigma^2_r(k,\ell) + \sigma^2_s(k,\ell)$$
     - leads to the estimator
     $$\hat{\sigma}^2_r(k,\ell) = e^{-2 \Delta T_d f_s}\; \hat{\sigma}^2_z(k,\ell - T_d/T_s)$$
     Lebart, K., Boucher, J.M., and Denbigh, P., A new method based on spectral subtraction for speech dereverberation, 2001.
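
The estimator on this slide amounts to a delayed, attenuated copy of the reverberant speech variance. The sketch below assumes that $T_d/T_s$ is rounded to a whole number of frames and that the first frames, which have no delayed counterpart, are set to zero; the function name and the toy values are illustrative only.

```python
import numpy as np

def late_reverb_variance(sigma2_z, t60, t_d, frame_shift):
    """Estimate of sigma_r^2(k, l) from sigma_z^2(k, l - T_d / T_s).

    sigma2_z: (K, L) variance of the reverberant speech per bin and frame.
    t60, t_d, frame_shift (T_s): all in seconds.
    """
    delay_frames = int(round(t_d / frame_shift))
    # exp(-2 * Delta * T_d * f_s) with Delta = 3 ln(10) / (T60 * f_s): f_s cancels.
    decay = np.exp(-6.0 * np.log(10.0) * t_d / t60)
    num_frames = sigma2_z.shape[1]
    sigma2_r = np.zeros_like(sigma2_z, dtype=float)
    if delay_frames < num_frames:
        sigma2_r[:, delay_frames:] = decay * sigma2_z[:, : num_frames - delay_frames]
    return sigma2_r

# Toy check: for T_d equal to T60 the attenuation is 10^-6, i.e. 60 dB.
sigma2_z = np.ones((2, 10))
sigma2_r = late_reverb_variance(sigma2_z, t60=0.5, t_d=0.5, frame_shift=0.1)
```

The toy check reflects the definition of $T_{60}$: after a delay equal to $T_{60}$ the energy has decayed by 60 dB, so the scaling factor is $10^{-6}$.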

  9. 3. Single-channel Enhancement: Gain function
     - The output $\hat{X}(k,\ell)$ of the beamformer contains the anechoic speech, remaining noise and spatially filtered reverberation:
     $$\hat{X}(k,\ell) = S(k,\ell) + \tilde{V}(k,\ell) + R(k,\ell)$$
     - We aim to compute a real gain such that
     $$\hat{S}(k,\ell) = G(k,\ell)\,\hat{X}(k,\ell)$$
     - $G(k,\ell)$ is computed using an MMSE estimation of the speech amplitude based on a super-Gaussian speech model.
     Breithaupt, C., Krawczyk, M., and Martin, R., Parameterized MMSE spectral magnitude estimation for the enhancement of noisy speech, 2008.
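
To illustrate how the three variance estimates feed a real-valued gain, the sketch below uses a plain Wiener gain with a gain floor. This is a deliberately simpler stand-in, not the parameterized super-Gaussian MMSE estimator cited on the slide; the function name, the floor `g_min` and the numerical values are assumptions.

```python
import numpy as np

def wiener_gain(sigma2_s, sigma2_v, sigma2_r, g_min=0.1):
    """Real gain from speech, noise and late-reverberation variances (stand-in)."""
    interference = np.maximum(sigma2_v + sigma2_r, 1e-12)  # guard against division by zero
    xi = sigma2_s / interference                           # a priori SNR-like ratio
    return np.maximum(xi / (1.0 + xi), g_min)              # floor limits distortion

sigma2_s = np.array([1.0, 0.0, 9.0])
sigma2_v = np.array([0.5, 1.0, 0.5])
sigma2_r = np.array([0.5, 1.0, 0.5])
G = wiener_gain(sigma2_s, sigma2_v, sigma2_r)

x_hat = np.array([1.0 + 1.0j, 0.2 - 0.1j, -2.0 + 0.5j])   # beamformer output X_hat(k, l)
s_hat = G * x_hat                                          # S_hat = G * X_hat
```

Treating noise and late reverberation as one additive interference term is the key step; the cited estimator replaces only the mapping from this ratio to the gain.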

  10. 4. Objective Measures: SRMR
      [Figure: SRMR [dB] for unprocessed, 1-channel and 8-channel processing. Panel "Simulated": $T_{60}$ of 250 ms, 500 ms and 700 ms, near and far, plus mean. Panel "Real": 700 ms, near and far, plus mean.]
      - Illustrates the dereverberation performance in all conditions
      - Better dereverberation is achieved by the multichannel system, except for $T_{60} = 500$ ms

  11. 4. Objective Measures: FWSSNR
      [Figure: FWSSNR [dB] on simulated data for unprocessed, 1-channel and 8-channel processing; $T_{60}$ of 250 ms, 500 ms and 700 ms, near and far, plus mean.]
      - Illustrates the noise reduction in all conditions
      - The beamforming step is advantageous for the noise reduction

  12. 4. Objective Measures: PESQ
      [Figure: PESQ on simulated data for unprocessed, 1-channel and 8-channel processing; $T_{60}$ of 250 ms, 500 ms and 700 ms, near and far, plus mean.]
      - The improvement of the PESQ score in all conditions illustrates the overall improvement in speech quality

  13. 5. Subjective Tests: MUSHRA test
      Intermediate results of the subjective test run by the organizers:
      - Tests carried out separately for 1 and 8 channels
      [Figure: MUSHRA score [%] for reverberation and overall quality, compared with the unprocessed signal. Panels "Multichannel" and "Single-channel", each covering simulated and real data, near and far.]
      - Improvement for all tested conditions
      - Higher improvement of the overall quality

  14. 6. Preprocessing for ASR: Word Error Rate
      - Baseline recognizer provided by the organizers
      - Using models pre-trained on clean data
      [Figure: WER [%] for the baseline, 1-channel and 8-channel processing. Panel "Simulated": 250, 500 and 700 ms, near and far, plus mean. Panel "Real": 700 ms, near and far, plus mean. Bar annotations, in extraction order: −12.74, −12.11, −11.48, −18.69, −25.93, −25.85, −20.31, −25.76, −30.48, −10.18, −42.53, −15.78, −21.63, −19.17, −11.94, 3.27, 2.83, −24.91, 2.33, 1.65.]

  15. 7. Conclusion
      - System based on the combination of an MVDR beamformer and spectral enhancement
      - All parameters are blindly estimated
      - Speech enhancement achieved in all conditions in terms of:
        - Objective measures
        - Subjective tests
        - Word error rate

  16. Thank you very much for your attention. Questions?
      Fraunhofer IDMT, Project Group Hearing, Speech and Audio Technology
      Oldenburg University, Signal Processing Group
      benjamin.cauchi@idmt.fraunhofer.de
      House of Hearing, Oldenburg
