Speech recognition in the presence of highly non-stationary noise based on spatial, spectral and temporal speech/noise modeling combined with dynamic variance adaptation

M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, A. Ogawa, T. Hori, S. Watanabe, M. Fujimoto, T. Yoshioka, T. Oba, Y. Kubo, M. Souden, S. Hahm, A. Nakamura
Motivation of our system

Speech enhancement
- Deal with highly non-stationary noise, using all available information about speech/noise: spatial, spectral and temporal
- Realized using two complementary enhancement processes

Recognition
- Interconnection of the speech enhancement and the recognizer using dynamic acoustic model adaptation
- Use of state-of-the-art ASR technologies (discriminative training, system combination, ...)

Average accuracy improves from 69% to 91.7%.
Approaches for noise-robust ASR
(compared along: information used, handling of highly non-stationary noise, interconnection with ASR)

- Acoustic model compensation (e.g., VTS): spectral information
- Speech enhancement (e.g., BSS): spatial / spectral / temporal information
- Proposed: (introduced on the following slides)
System overview

Speech enhancement (speech-noise separation: spatial & spectral; example-based enhancement: spectral & temporal) -> ASR word decoding (acoustic model, language model)
- Uses spatial, spectral and temporal information
- Enables removal of highly non-stationary noise
System overview

Dynamic model adaptation connects the speech enhancement front-end to the recognizer's acoustic model
- Good interconnection with the recognizer
Approaches for noise-robust ASR
- Acoustic model compensation (e.g., VTS): spectral information
- Speech enhancement (e.g., BSS): spatial / spectral / temporal information
- Proposed: spatial, spectral & temporal information, handling highly non-stationary noise with a good interconnection with ASR
[System overview diagram repeated; focus: speech-noise separation]
Speech-noise separation [Nakatani, 2011]

Integrates spatial-based and spectral-based separation in a single framework
- Spatial separation: speech/noise spatial models over location features, under a sparseness assumption [Yilmaz, 2004]
- Spectral separation: speech/noise spectral models over spectral features, under the log-max assumption [Roweis, 2003]
- L_k: dominant source index, i.e., indicates whether speech or noise is more dominant at each frequency k
Speech-noise separation [Nakatani, 2011] (cont.)
- The spatial and spectral separators are combined through the shared dominant source index L_k
Speech-noise separation [Nakatani, 2011]

DOLPHIN: dominance-based locational and power-spectral characteristics integration
- Estimates the speech spectral component sequence using an EM algorithm
- Speech components obtained by MMSE estimation from the spatial and spectral models
- Efficiently integrates spatial and spectral information to remove non-stationary noise
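To make the dominance idea concrete, here is a rough sketch, not the actual DOLPHIN EM algorithm: for each frequency bin it combines a spatial likelihood over a location feature with a spectral likelihood over a log-power feature to obtain the posterior that speech is the dominant source. The function names and Gaussian (mean, variance) parameters are invented for the illustration.

    # Rough illustration of dominance-based integration, not the actual
    # DOLPHIN EM algorithm. All model parameters are invented examples.
    import numpy as np

    def gauss_logpdf(x, mean, var):
        # Log density of a univariate Gaussian.
        return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

    def speech_dominance_posterior(loc_feat, logpow_feat,
                                   loc_speech, loc_noise,   # (mean, var) pairs
                                   pow_speech, pow_noise):  # (mean, var) pairs
        # Spatial cue (sparseness assumption: one source dominates per bin)
        # and spectral cue (log-max assumption: the observed log power is
        # approximately the max of the speech and noise log powers).
        log_s = (gauss_logpdf(loc_feat, *loc_speech)
                 + gauss_logpdf(logpow_feat, *pow_speech))
        log_n = (gauss_logpdf(loc_feat, *loc_noise)
                 + gauss_logpdf(logpow_feat, *pow_noise))
        return 1.0 / (1.0 + np.exp(log_n - log_s))  # P(L_k = speech)

In DOLPHIN itself, this dominance information is estimated jointly with the model parameters inside an EM loop, and the speech spectral components are then obtained by MMSE estimation rather than a one-shot mask.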
[System overview diagram repeated; focus: example-based enhancement]
Example-based enhancement [Kinoshita, 2011]

Use a parallel corpus model (clean and processed speech) that represents the fine spectral and temporal structure of speech
- Train a GMM from multi-condition training data processed with DOLPHIN
- Generate the corpus model: GMM component sequences of the processed speech (training data) paired with the corresponding clean speech examples
Example-based enhancement [Kinoshita, 2011]
- Look for the longest example segments in the corpus model that match the test utterance (best-example search over GMM component sequences; see the sketch below)
- Use the corresponding clean speech examples for Wiener filtering
- Using a precise model of the temporal structure of speech removes the remaining highly non-stationary noise and recovers speech precisely
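The sketch below illustrates the search-and-filter step under simplifying assumptions: utterances are encoded as sequences of GMM component indices, spectrograms are power spectra, and a greedy longest-common-substring search stands in for the paper's best-example search. All names are illustrative.

    # Simplified sketch of best-example search + Wiener filtering,
    # not the exact algorithm of [Kinoshita, 2011].
    import numpy as np

    def longest_match(test_seq, corpus_seq, pos):
        # Length and start of the longest corpus segment matching
        # test_seq from position `pos` onwards.
        best_len, best_start = 0, 0
        for s in range(len(corpus_seq)):
            l = 0
            while (pos + l < len(test_seq) and s + l < len(corpus_seq)
                   and test_seq[pos + l] == corpus_seq[s + l]):
                l += 1
            if l > best_len:
                best_len, best_start = l, s
        return best_len, best_start

    def enhance(noisy_spec, test_seq, corpus_seq, clean_spec, eps=1e-10):
        # Greedily cover the test utterance with the longest matching
        # corpus segments; the aligned clean-speech examples define a
        # Wiener gain applied to the noisy power spectrogram.
        out = np.zeros_like(noisy_spec)
        t = 0
        while t < len(test_seq):
            l, s = longest_match(test_seq, corpus_seq, t)
            if l == 0:
                out[t] = noisy_spec[t]       # no match: pass frame through
                t += 1
                continue
            ref = clean_spec[s:s + l]        # aligned clean example segment
            noise_est = np.maximum(noisy_spec[t:t + l] - ref, eps)
            gain = ref / (ref + noise_est)   # Wiener gain S / (S + N)
            out[t:t + l] = gain * noisy_spec[t:t + l]
            t += l
        return out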
[System overview diagram repeated; focus: dynamic model adaptation]
Dynamic model adaptation [Delcroix, 2009]

Compensates the mismatch between the enhanced speech and the acoustic model
- With non-stationary noise and frame-by-frame processing, the mismatch changes frame by frame (it is dynamic)
- Conventional acoustic model compensation techniques (e.g., MLLR) are not sufficient

Dynamic variance compensation (uncertainty decoding) [Deng, 2005]
- Mitigates the mismatch frame by frame by considering the feature variance:

  p(y_t \mid n) = \sum_m p(m) \, \mathcal{N}(y_t ; \mu_{n,m}, \sigma^2_{n,m} + \sigma^2_t)

  where y_t is the enhanced speech feature at frame t, n and m index the HMM state and mixture component, and \sigma^2_t is the dynamic feature variance.
Dynamic feature variance model

Speech -> Enhancement -> Recognizer, with u_t the observed feature and y_t the enhanced feature:

  \sigma^2_t = \alpha (u_t - y_t)^2   (for each feature dimension)

Assumption: the feature variance (feature uncertainty) is proportional to the amount of noise reduction, and hence to the amount of noise
- The more we process the signal, the more uncertainty we introduce
Dynamic feature variance model

  \sigma^2_t = \alpha (u_t - y_t)^2   (for each feature dimension)

- \alpha is optimized for recognition with an ML criterion using adaptation data (dynamic variance adaptation, DVA)
- Can be combined with MLLR for static adaptation of the acoustic model mean parameters
- Gives a good interconnection with the recognizer
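As a minimal sketch of the resulting decoding-time likelihood, assuming diagonal-covariance GMM state models (the shapes and names below are illustrative):

    # Sketch of uncertainty decoding with the dynamic feature variance
    # sigma_t^2 = alpha * (u_t - y_t)^2, per feature dimension.
    import numpy as np

    def log_likelihood_dva(y_t, u_t, alpha, weights, means, variances):
        # log p(y_t | state) = log sum_m w_m N(y_t; mu_m, var_m + sigma_t^2),
        # i.e. the model variance is inflated by the frame-dependent
        # uncertainty. Shapes: y_t, u_t, alpha: (D,); weights: (M,);
        # means, variances: (M, D).
        sigma2_t = alpha * (u_t - y_t) ** 2           # dynamic feature variance
        var = variances + sigma2_t                    # broadcast over mixtures
        log_comp = (np.log(weights)
                    - 0.5 * np.sum(np.log(2.0 * np.pi * var)
                                   + (y_t - means) ** 2 / var, axis=1))
        return np.logaddexp.reduce(log_comp)

In DVA, \alpha itself is re-estimated on adaptation data under the ML criterion; the sketch only evaluates the likelihood for a given \alpha.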
System overview (detailed)

DOLPHIN (spatial model + spectral model -> MMSE estimation of the speech spectral components) -> example-based enhancement (corpus model -> best-example searching -> Wiener filtering) -> dynamic model adaptation of the AM -> ASR word decoding
Multi-condition / discriminative training
- Multi-condition training: add background noise samples to the clean training data (see the sketch below)
- dMMI: differenced maximum mutual information [McDermott, 2010], used for discriminative training of the acoustic model
- Applied to both the clean and multi-condition acoustic models
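The multi-condition data generation can be pictured with a standard noise-mixing recipe; this is a generic sketch, not the exact corpus construction used here:

    # Generic sketch: mix background noise into a clean training sample
    # at a target SNR (dB).
    import numpy as np

    def add_noise(clean, noise, snr_db):
        # Tile/trim the noise to match the clean signal length.
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)[:len(clean)]
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2)
        # Scale the noise so that 10*log10(p_clean / p_noise_scaled) = snr_db.
        scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return clean + scale * noise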
System combination [Evermann, 2000]
- Run multiple complete systems and combine their decoding outputs (lattices) into the final word hypothesis (see the sketch below)
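As a toy picture of system combination, the sketch below does majority voting over already-aligned word hypotheses; the actual system combines lattices via confusion networks as in [Evermann, 2000], which also handles the alignment itself:

    # Toy voting-based combination over pre-aligned hypotheses.
    from collections import Counter

    def combine(aligned_hyps):
        # aligned_hyps: equal-length word sequences, one per system,
        # with '<eps>' marking deletions in the alignment.
        out = []
        for words in zip(*aligned_hyps):
            word, _ = Counter(words).most_common(1)[0]
            if word != '<eps>':
                out.append(word)
        return out

    # combine([['the', 'cat', '<eps>'],
    #          ['the', 'cat', 'sat'],
    #          ['a',   'cat', 'sat']])  ->  ['the', 'cat', 'sat']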
Settings - Enhancement

DOLPHIN
- Spatial model: 4 mixture components
- Spectral model: 256 mixture components, speaker-dependent model
- Models trained in advance using the noise/speech training data
- Long windows (100 ms) to capture reverberation

Example-based
- Corpus model: GMM w/ 4096 mixture components
- Trained on DOLPHIN-processed speech
- Features: 60th-order MFCC w/ log energy
Settings - Recognition

Recognizer: SOLON [Hori, 2007]

Acoustic model: trained with SOLON (ML & discriminative (dMMI))
- Clean: HMM w/ 254 states (including a silence state); each HMM state modeled by a GMM with 7 components
- Multi-condition: 20 components per HMM state; no silence model

Multi-condition data: background noise samples added to the clean training data; 7 noise environments x 6 SNR conditions

Adaptation: unsupervised, speaker-dependent; uses all test data for a given speaker
Development set results (word accuracy, %)

           -6 dB   -3 dB   0 dB    3 dB    6 dB    9 dB    Mean
Baseline   49.75   52.58   64.25   75.08   84.25   90.58   69.42
Proposed   84.33   88.58   90.17   92.33   94.50   95.00   90.82

[Bar chart: accuracy from the clean ML baseline (69.4%) up to the full system with system combination (90.8%); relative improvements per component: multi-condition training 51%, DOLPHIN 45%, SOLON vs. HTK baseline 21%, adaptation 20%, example-based enhancement 18%, dMMI 10%, system combination 7%]
Evaluation set results (word accuracy, %)

           -6 dB   -3 dB   0 dB    3 dB    6 dB    9 dB    Mean
Baseline   45.67   52.67   65.25   75.42   83.33   91.67   69.00
Proposed   85.58   88.33   92.33   93.67   94.17   95.83   91.65

[Bar chart: accuracy from the clean dMMI baseline (69.0%) and multi-condition dMMI baseline (84.6%) up to the full system with system combination (91.7%), with intermediate DOLPHIN, example-based enhancement and adaptation stages between 84.7% and 91.1%]