Speech recognition in the presence of highly non-stationary noise


  1. Speech recognition in the presence of highly non-stationary noise based on spatial, spectral and temporal speech/noise modeling combined with dynamic variance adaptation. M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, A. Ogawa, T. Hori, S. Watanabe, M. Fujimoto, T. Yoshioka, T. Oba, Y. Kubo, M. Souden, S. Hahm, A. Nakamura

  2. Motivation of our system
  Speech enhancement
  - Deal with highly non-stationary noise, using all the information available about speech/noise: spatial, spectral and temporal
  - Realized using two complementary enhancement processes
  Recognition
  - Interconnection of speech enhancement and recognizer using dynamic acoustic model adaptation
  - Use of state-of-the-art ASR technologies (discriminative training, system combination, ...)
  Average accuracy improves from 69 % to 91.7 %

  3. Approaches for noise robust ASR
  Approach                                | Information used             | Handling highly non-stationary noise | Interconnection w/ ASR
  Acoustic model compensation, e.g. VTS   | Spectral                     | Limited                              | Good
  Speech enhancement, e.g. BSS            | Spatial/spectral/temporal    | Good                                 | Limited
  Proposed                                |                              |                                      |

  4. System overview
  [Diagram: speech enhancement = speech-noise separation (spatial & spectral) followed by example-based enhancement (spectral & temporal), feeding ASR word decoding with acoustic and language models]
  Using spatial, spectral and temporal information enables removal of highly non-stationary noise

  5. System overview
  [Diagram: speech enhancement (speech-noise separation, example-based enhancement) followed by ASR word decoding; dynamic model adaptation links the enhancement to the acoustic model]
  Dynamic model adaptation gives a good interconnection with the recognizer

  6. Approaches for noise robust ASR
  Approach                                | Information used             | Handling highly non-stationary noise | Interconnection w/ ASR
  Acoustic model compensation, e.g. VTS   | Spectral                     | Limited                              | Good
  Speech enhancement, e.g. BSS            | Spatial/spectral/temporal    | Good                                 | Limited
  Proposed                                | Spatial, spectral & temporal | Good                                 | Good

  7. System overview
  [Diagram: speech-noise separation -> example-based enhancement -> ASR word decoding, with dynamic model adaptation of the acoustic model]

  8. Speech-noise separation [Nakatani, 2011]
  Integrate spatial-based and spectral-based separation in a single framework
  - Spatial separation: speech and noise spatial models on location features, under the sparseness assumption [Yilmaz, 2004]
  - Spectral separation: speech and noise spectral models on spectral features, under the log-max assumption [Roweis, 2003]
  L_k: dominant source index, i.e. indicates whether speech or noise is more dominant at each frequency k

  9. Speech-noise separation [Nakatani, 2011]
  The spatial and spectral separations are combined using the dominant source index L_k
  L_k: dominant source index, i.e. indicates whether speech or noise is more dominant at each frequency k
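
  To make the role of the dominant source index concrete, here is a minimal numpy sketch of binary masking driven by a per-bin dominance decision. The function names, the direct log-likelihood comparison and the hard 0/1 mask are illustrative assumptions, not the actual DOLPHIN inference.

```python
import numpy as np

def dominant_source_index(speech_loglik, noise_loglik):
    """L_k per time-frequency bin: 1 where the speech model is more likely
    than the noise model, 0 otherwise.

    speech_loglik, noise_loglik: (frames, freq_bins) log-likelihoods of each
    bin under the speech and noise models (spatial and/or spectral).
    """
    return (speech_loglik > noise_loglik).astype(float)

def masked_spectrogram(noisy_spectrogram, dominance):
    """Under the sparseness assumption, keep only the speech-dominant bins."""
    return noisy_spectrogram * dominance
```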

  10. Speech-noise separation [Nakatani, 2011]
  DOLPHIN: dominance-based locational and power-spectral characteristics integration
  - Estimate the speech spectral component sequence using an EM algorithm
  - Estimated speech obtained using MMSE estimation
  Efficiently integrates spatial and spectral information to remove non-stationary noise
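
  A minimal sketch of the MMSE step only: the speech estimate is the posterior-weighted average of the conditional estimates associated with each hidden state (e.g. dominance pattern / mixture component). The array shapes and function name are assumptions for illustration; the EM inference that produces the posteriors is not shown.

```python
import numpy as np

def mmse_speech_estimate(posteriors, conditional_means):
    """MMSE estimate of the speech spectral components for one frame.

    posteriors:        (num_states,) posterior probability of each hidden state
                       given the observation (output of the EM-based inference)
    conditional_means: (num_states, dim) expected clean speech spectrum given
                       the observation and each hidden state
    """
    return posteriors @ conditional_means

# e.g. mmse_speech_estimate(np.array([0.7, 0.3]),
#                           np.array([[1.0, 2.0], [0.5, 0.1]]))  # -> [0.85, 1.43]
```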

  11. System overview
  [Diagram: speech-noise separation -> example-based enhancement -> ASR word decoding, with dynamic model adaptation of the acoustic model]

  12. Example-based enhancement [Kinoshita, 2011]
  Use a parallel corpus model (clean and processed speech) that represents the fine spectral and temporal structure of speech
  - Train a GMM from multi-condition training data processed with DOLPHIN
  - Generate the corpus model: GMM component sequences of the processed training speech paired with the corresponding clean speech examples
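
  A minimal sketch of how such a corpus model could be assembled, assuming parallel lists of DOLPHIN-processed and clean feature sequences per utterance; the use of scikit-learn, the function name and the frame-wise component labelling are illustrative assumptions, not the published recipe.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_corpus_model(processed_utts, clean_utts, n_components=4096):
    """Fit a GMM on DOLPHIN-processed training features and store, for each
    utterance, the frame-wise GMM component sequence paired with the
    time-aligned clean speech example."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(np.vstack(processed_utts))
    corpus = [(gmm.predict(proc), clean)   # (component sequence, clean example)
              for proc, clean in zip(processed_utts, clean_utts)]
    return gmm, corpus
```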

  13. Example-based enhancement [Kinoshita, 2011]
  - Look for the longest matching example segments in the corpus model for the test utterance (best-example searching)
  - Use the corresponding clean speech example for Wiener filtering
  Using a precise model of the temporal structure of speech:
  - removes remaining highly non-stationary noise
  - recovers speech precisely
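
  A minimal sketch of the two steps described above, assuming the test utterance has already been converted to a GMM component sequence: an exhaustive longest-common-run search over the corpus sequences, and a simple clean-power-over-processed-power Wiener gain. Both the search strategy and the gain formula are simplifying assumptions.

```python
import numpy as np

def longest_match(test_seq, corpus_seq):
    """Longest common contiguous run of GMM component labels.
    Returns (test_start, corpus_start, length); simple O(N*M) search."""
    best = (0, 0, 0)
    for i in range(len(test_seq)):
        for j in range(len(corpus_seq)):
            k = 0
            while (i + k < len(test_seq) and j + k < len(corpus_seq)
                   and test_seq[i + k] == corpus_seq[j + k]):
                k += 1
            if k > best[2]:
                best = (i, j, k)
    return best

def wiener_gain(clean_power, processed_power, eps=1e-10):
    """Frame-wise gain built from the matched clean speech example."""
    return np.clip(clean_power / (processed_power + eps), 0.0, 1.0)
```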

  14. System overview
  [Diagram: speech-noise separation -> example-based enhancement -> ASR word decoding, with dynamic model adaptation of the acoustic model]

  15. Dynamic model adaptation [Delcroix, 2009]
  Compensate the mismatch between the enhanced speech and the acoustic model
  - Non-stationary noise & frame-by-frame processing mean the mismatch changes frame by frame (dynamic)
  - Conventional acoustic model compensation techniques (MLLR) are not sufficient
  Dynamic variance compensation (uncertainty decoding) [Deng, 2005]
  - Mitigate the mismatch frame by frame by considering the feature variance:
    p(y_t | n) = Σ_m p(m) N(y_t; μ_{n,m}, σ²_{n,m} + σ²_t)
  where y_t is the enhanced speech feature at frame t, n the HMM state, m the mixture component, and σ²_t the dynamic feature variance added to the model variance σ²_{n,m}
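
  A minimal numpy sketch of the variance-compensated likelihood above, for one frame and one HMM state modeled by a diagonal-covariance GMM; the array shapes and function name are assumptions for illustration.

```python
import numpy as np

def compensated_loglik(y_t, weights, means, variances, dyn_var):
    """log p(y_t | n) with the dynamic feature variance added to each
    Gaussian's variance (uncertainty decoding).

    weights:   (M,)   mixture weights p(m)
    means:     (M, D) Gaussian means mu_{n,m}
    variances: (M, D) diagonal Gaussian variances sigma^2_{n,m}
    dyn_var:   (D,)   dynamic feature variance sigma^2_t for this frame
    """
    var = variances + dyn_var                         # variance compensation
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)
    log_expo = -0.5 * np.sum((y_t - means) ** 2 / var, axis=1)
    log_comp = np.log(weights) + log_norm + log_expo  # per-component log-prob
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))   # log-sum-exp over m
```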

  16. Dynamic feature variance model
  [Diagram: observed feature -> speech enhancement -> enhanced feature -> recognizer]
  For each feature dimension: σ²_t(α) = α (u_t - y_t)², with u_t the observed feature, y_t the enhanced feature and α a scaling factor
  Assumption: the feature variance (feature uncertainty) is proportional to the amount of noise reduction, which is proportional to the amount of noise
  The more we process the signal, the more uncertainty we introduce

  17. Dynamic feature variance model
  For each feature dimension: σ²_t(α) = α (u_t - y_t)²
  - α is optimized for recognition with the ML criterion using adaptation data (dynamic variance adaptation, DVA)
  - Can be combined with MLLR for static adaptation of the acoustic model mean parameters
  This provides a good interconnection with the recognizer
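
  A minimal sketch of the dynamic variance model together with a crude grid-search stand-in for the ML optimization of α on adaptation data; the scalar α shared across frames, the grid search and the loglik_fn callback (e.g. the compensated_loglik sketch above evaluated on the decoded state sequence) are illustrative assumptions.

```python
import numpy as np

def dynamic_variance(observed, enhanced, alpha):
    """sigma^2_t = alpha * (u_t - y_t)^2, per frame and feature dimension."""
    return alpha * (observed - enhanced) ** 2

def adapt_alpha(observed, enhanced, loglik_fn, grid=np.linspace(0.0, 2.0, 21)):
    """Pick the alpha maximizing the total log-likelihood of the adaptation
    data. loglik_fn(enhanced_frame, dyn_var_frame) -> float scores one frame
    under the variance-compensated acoustic model."""
    def total_loglik(alpha):
        dyn = dynamic_variance(observed, enhanced, alpha)
        return sum(loglik_fn(y, v) for y, v in zip(enhanced, dyn))
    return max(grid, key=total_loglik)
```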

  18. System overview
  [Diagram: DOLPHIN (spatial and spectral models, speech spectral components estimation, MMSE) -> example-based enhancement (corpus model, best-example searching, Wiener filtering) -> ASR word decoding, with dynamic model adaptation of the acoustic model]

  19. Multi-condition / discriminative training
  - Multi-condition training: add background noise samples to the clean training data
  - dMMI: differenced maximum mutual information [McDermott, 2010]
  [Diagram: same system as slide 18, with clean/multi-condition acoustic models trained with dMMI]

  20. System combination [Evermann, 2000]
  [Diagram: lattices produced by ASR decoding with the different systems are merged by system combination to produce the final word output]
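
  For intuition only, a deliberately simplified word-level majority-vote combination of already-aligned hypotheses; the actual system combines decoding lattices / confusion networks [Evermann, 2000], which this sketch does not implement.

```python
from collections import Counter

def combine_hypotheses(aligned_hyps):
    """aligned_hyps: list of word sequences of equal length (one per system).
    Returns the majority-vote word at every position."""
    return [Counter(words).most_common(1)[0][0] for words in zip(*aligned_hyps)]

# Example:
# combine_hypotheses([["the", "cat", "sat"],
#                     ["the", "mat", "sat"],
#                     ["the", "cat", "sit"]])  # -> ["the", "cat", "sat"]
```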

  21. Settings - Enhancement
  DOLPHIN
  - Spatial model: 4 mixture components
  - Spectral model: 256 mixture components, speaker dependent
  - Models trained in advance using the noise/speech training data
  - Long windows (100 ms) to capture reverberation
  Example-based enhancement
  - Corpus model: GMM w/ 4096 mixture components, trained on DOLPHIN-processed speech
  - Features: 60th order MFCC w/ log energy

  22. Settings - Recognition
  Recognizer
  - SOLON [Hori, 2007]
  Acoustic model (trained with SOLON, ML & discriminative (dMMI))
  - Clean: HMM w/ 254 states (including a silence state); each HMM state modeled by a GMM with 7 components
  - Multi-condition: 20 components per HMM state; no silence model
  Multi-condition data
  - Background noise samples added to the clean training data
  - 7 noise environments x 6 SNR conditions
  Adaptation
  - Unsupervised / speaker dependent
  - Uses all test data for a given speaker

  23. Development set results (accuracy, %)
            -6 dB   -3 dB   0 dB    3 dB    6 dB    9 dB    MEAN
  Baseline  49.75   52.58   64.25   75.08   84.25   90.58   69.42
  Proposed  84.33   88.58   90.17   92.33   94.50   95.00   90.82
  [Bar chart: accuracy of the intermediate systems and the relative improvement brought by each component, from the clean ML baseline (69.4 %) and multi-condition ML baseline (83.2 %) through DOLPHIN, example-based enhancement, dMMI and adaptation, up to system combination (90.8 %)]

  24. Evaluation set results (accuracy, %)
            -6 dB   -3 dB   0 dB    3 dB    6 dB    9 dB    MEAN
  Baseline  45.67   52.67   65.25   75.42   83.33   91.67   69.00
  Proposed  85.58   88.33   92.33   93.67   94.17   95.83   91.65
  [Bar chart: accuracy of the intermediate systems, from the clean dMMI baseline (69.0 %) through DOLPHIN, example-based enhancement and adaptation, up to system combination (91.7 %)]
