Feature-based Robust Techniques for Speech Recognition
Presented by Nguyen Duc Hoang Ha
Supervisors: Assoc. Prof. Chng Eng Siong, Prof. Li Haizhou
08-Mar-2017
Outline
➢ An Introduction to Robust ASR
➢ The 1st proposed method (Ch5), the major contribution: Feature Adaptation Using Spectro-Temporal Information
➢ The 2nd proposed method (Ch3): Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments
➢ The 3rd proposed method (Ch4): A Particle Filter Compensation Approach to Robust LVCSR
➢ Conclusions and Future Directions
Introduction ST Transform NN-VTS PFC Conclusion

Automatic Speech Recognition (ASR) [Huang2001]
[Figure: ASR pipeline with an acoustic model (AM) and a language model (LM), e.g. decoding /h e l o/ into the word "hello".]
The aim is to decode the speech signal into text.
Applications of the ASR system
➢ Siri (http://www.apple.com/ios/siri/)
➢ Amazon Echo (https://en.wikipedia.org/wiki/Amazon_Echo)
➢ Google Speech Recognition API (https://cloud.google.com/speech/)
...
Challenges of the ASR system [Chelba2010, Li2014]
➢ Non-native speakers
➢ Dialect variations
➢ Disfluencies
➢ Out-of-vocabulary words
➢ Language modeling
➢ Noise robustness
ASR in Noisy Environments [Xiao2009, Li2014]
[Figure: mismatch between noisy speech features and a clean speech model.]
Feature/Model Compensation [Xiao2009, Li2014]
Two major approaches:
(A) Feature-based approach
(B) Model-based approach
Feature/Model Compensation
➢ (A) Feature-based approach. Examples: spectral subtraction [Boll1979], MMSE [Ephraim1984], fMLLR [Digalakis1995, Gales1998], ...
➢ (B) Model-based approach. Examples: MAP model adaptation [Gauvain1994], MLLR/CMLLR model adaptation [Leggetter1995, Gales1998], vector Taylor series model adaptation [Acero2000, Li2009]
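As an illustration of the feature-based approach, here is a minimal magnitude-domain spectral subtraction sketch. It assumes the noise spectrum has already been estimated from non-speech frames; the function name and flooring constant are hypothetical, not from the original work.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, floor=0.01):
    """Basic magnitude-domain spectral subtraction in the spirit of
    [Boll1979]: subtract a noise estimate, then floor the result.

    noisy_mag: (frames, bins) magnitude spectrogram of noisy speech
    noise_mag: (bins,) average noise magnitude, assumed estimated
               from non-speech frames beforehand
    """
    cleaned = noisy_mag - noise_mag
    # Flooring prevents negative magnitudes (a source of musical noise).
    return np.maximum(cleaned, floor * noisy_mag)

# Toy usage with random data standing in for a real spectrogram
noisy = np.abs(np.random.randn(100, 257)) + 0.01
noise = 0.1 * np.ones(257)
clean = spectral_subtraction(noisy, noise)
```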
Multi-condition training approach [Ng2016]
[Figure: (C) noisy data collection / simulation, shown alongside the (A) feature-based and (B) model-based approaches.]
Robust ASR
(A) Feature-based approach:
➢ Clean feature estimation (e.g. SS [Boll1979], MMSE [Ephraim1984], ...)
➢ Filtering approach (e.g. RASTA [Hermansky1994], ...)
➢ Feature transformation (e.g. fMLLR [Digalakis1995, Gales1998])
➢ ...
(B) Model-based approach:
➢ MAP model adaptation [Gauvain1994]
➢ MLLR, CMLLR model adaptation [Leggetter1995, Gales1998]
➢ VTS model compensation [Acero2000, Li2009]
➢ ...
(C) Data collection / simulation:
➢ Deep learning approaches (e.g. DNN AM [Hinton2012])
Contributions – Three Proposed Methods
(A1) ST-Transform (for background noise and reverberation)
(A2) NN – (B2) VTS (for non-stationary noise)
(A3) PFC-LVCSR (for background noise)
Contributions – Three Proposed Methods
1) Spectro-Temporal Transformation (ST-Transform)
D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. Generalization of temporal filter and linear transformation for robust speech recognition. In ICASSP, Italy, 2014.
D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. Feature adaptation using linear spectro-temporal transform for robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, PP(99):1–1, 2016.
(Contributed to success at the REVERB 2014 Challenge under the clean-condition training scheme)
2) Noise Normalization (NN) – Vector Taylor Series (VTS) Model Compensation
D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. An analysis of vector Taylor series model compensation for non-stationary noise in speech recognition. In ISCSLP, Hong Kong, 2012.
3) Particle Filter Compensation (PFC) for LVCSR
D. H. H. Nguyen, A. Mushtaq, X. Xiao, E. S. Chng, H. Li, and C.-H. Lee. A particle filter compensation approach to robust LVCSR. In APSIPA ASC, Taiwan, 2013.
Contributions of ST Transform
(REVERB Challenge: http://reverb2014.dereverberation.com/introduction.html)
Feature Adaptation Using Spectro-Temporal Information
(A1) ST-Transform
Feature Adaptation Using Spectro-Temporal Information
[Figure: noisy features x_{1:T} pass through the ST transform to give transformed features ŷ_{1:T}; the KL divergence is measured between the distribution of the transformed features and the distribution of the training features.]
The ST transform W is estimated to minimize the KL divergence between the distribution of the transformed features and the reference distribution of the training features.
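The KL-divergence criterion can be made concrete for the diagonal-Gaussian case. This is an illustrative sketch, not the thesis's exact objective; the closed form below holds only for Gaussians.

```python
import numpy as np

def kl_gauss_diag(mu0, var0, mu1, var1):
    """KL divergence KL(N(mu0, var0) || N(mu1, var1)) between two
    diagonal-covariance Gaussians -- the kind of distance the
    transform estimation minimizes between the transformed-feature
    and training-feature distributions."""
    return 0.5 * np.sum(
        np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0
    )

# Identical distributions give zero divergence
d = kl_gauss_diag(np.zeros(3), np.ones(3), np.zeros(3), np.ones(3))
```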
Changing Notation for Generalization of the Feature Transformation
[Figure: input features x_{1:T} pass through the feature transformation y = f(x) to give transformed features y_{1:T}; the KL divergence is measured between the distribution of the transformed features and the distribution of the training features.]
x denotes the input feature and y denotes the output feature, so writing the transformation as "y = f(x)" is more natural.
ST Transform: Generalized Linear Transform
Special cases:
A) e.g. CMN [Atal1974], MVN [Viikki1998]
B) e.g. fMLLR [Digalakis1995, Gales1998]
C) e.g. RASTA [Hermansky1994], TSN [Xiao2009]
ST Transform: Generalized Linear Transform
[Figure: each output feature vector is computed from a window of input feature vectors.]
ST Transform: Generalized Linear Transform
Matrix form of W
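A minimal sketch of how such a generalized linear (spectro-temporal) transform can be applied, assuming one square matrix per context offset plus a bias; the variable names are illustrative, and edge frames are simply replicated here.

```python
import numpy as np

def st_transform(x, W_list, b):
    """Apply y_t = sum_k W_k x_{t+k-L} + b over a window of 2L+1 frames.

    x:      (T, D) input feature sequence
    W_list: 2L+1 matrices of shape (D, D), one per context offset
    b:      (D,) bias vector
    """
    T, D = x.shape
    L = (len(W_list) - 1) // 2
    # Replicate edge frames so every t sees a full context window.
    xp = np.vstack([np.repeat(x[:1], L, axis=0), x,
                    np.repeat(x[-1:], L, axis=0)])
    y = np.zeros_like(x)
    for t in range(T):
        for k, Wk in enumerate(W_list):
            y[t] += Wk @ xp[t + k]
        y[t] += b
    return y

# With an identity center matrix and zero side matrices, the
# transform reduces to the identity mapping.
x = np.arange(8.0).reshape(4, 2)
W_list = [np.zeros((2, 2)), np.eye(2), np.zeros((2, 2))]
y = st_transform(x, W_list, np.zeros(2))
```

The special cases fall out of this form: a single full center matrix gives an fMLLR-style spectral transform, while diagonal matrices across the window give a temporal filter.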
EM Algorithm for Parameter Estimation
[Figure: the objective combines a term from the L2-norm regularizer and a term from the KL-divergence criterion, involving the covariance matrix of the output features and the reference model.]
Insufficient Adaptation Data
Issues:
- Unreliable statistics
- Too many degrees of freedom in the ST transform
Solutions:
+ Statistics smoothing approach
+ Sparse ST transform
Statistics Smoothing Approach
[Figure: statistics computed from the test data are interpolated with statistics computed from the training or prior data.]
The idea of statistics smoothing is to interpolate the statistics computed from the adaptation data with statistics computed from some prior data.
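The interpolation can be sketched as a convex combination of test-data and prior statistics. The weight alpha and the function name are hypothetical, not the thesis's exact smoothing rule.

```python
import numpy as np

def smooth_stats(test_mean, test_cov, prior_mean, prior_cov, alpha=0.5):
    """Interpolate statistics from a (small) adaptation set with
    statistics from training/prior data to stabilize estimation."""
    mean = alpha * test_mean + (1.0 - alpha) * prior_mean
    cov = alpha * test_cov + (1.0 - alpha) * prior_cov
    return mean, cov

tm, tc = np.ones(3), np.eye(3)
pm, pc = np.zeros(3), 2.0 * np.eye(3)
m, c = smooth_stats(tm, tc, pm, pc, alpha=1.0)  # pure test statistics
```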
Sparse ST Transform – Cross Transform
A) e.g. CMN, MVN, HEQ
B) e.g. fMLLR
C) e.g. RASTA, ARMA, TSN
ST Transform: Generalized Linear Transform
Matrix form of W
Matrix form of W
Experimental Settings
➢ REVERB Challenge 2014 benchmark task for noisy and reverberant speech recognition
➢ Clean-condition training scheme:
  ➢ Training data: 7861 clean utterances from the WSJCAM0 database (about 17.5 hours from 92 speakers)
  ➢ Speech features: 13 MFCCs + 13 ∆ + 13 ∆∆, with MVN post-processing
  ➢ Acoustic model: 3115 tied states, 10 mixtures/state
➢ Development (dev) and evaluation (eval) data sets:
  ➢ Actual meeting-room recordings from the MC-WSJ-AV corpus
  ➢ Near setting: 100 cm between the microphone and the speaker
  ➢ Far setting: 250 cm between the microphone and the speaker
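The MVN post-processing step can be sketched as per-utterance normalization of each feature dimension (a standard formulation; the epsilon guard is an implementation detail added here).

```python
import numpy as np

def mvn(feats, eps=1e-8):
    """Mean and variance normalization over one utterance.

    feats: (T, D) feature matrix, e.g. T frames of 39-dim
           MFCC + delta + delta-delta vectors.
    """
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    # eps guards against constant (zero-variance) dimensions.
    return (feats - mu) / (sigma + eps)

normalized = mvn(np.random.randn(50, 39) * 5.0 + 2.0)
```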
An Analysis of Window Length on the Dev Set
We fix the window length to 21 for the temporal filter, the cross transform, and the full ST transform in the experiments on the eval set.
Three Adaptation Schemes
➢ Full batch mode: one transform for each subset (near and far)
➢ Speaker mode: one transform for each speaker
➢ Utterance mode: one transform for each utterance
Experiments for Cascaded Transforms
Cascaded schemes: Temporal Filter ◦ fMLLR, Cross Transform ◦ fMLLR, fMLLR ◦ Temporal Filter, fMLLR ◦ Cross Transform
[Figure: average WER (%) of each cascaded scheme under the speaker, full batch, and utterance adaptation modes; input features pass through Transform 1, then Transform 2, to give the output features.]
+ Cascading transforms in tandem is an effective way of using spectro-temporal information without a significant increase in the number of free parameters.
+ The best result is obtained by the cascade of the cross transform and fMLLR.
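The tandem cascade can be expressed as function composition. This is a generic sketch with hypothetical stand-in transforms, not the thesis's implementation.

```python
def cascade(*transforms):
    """Compose feature transforms in tandem: the first argument is
    applied first, its output feeds the next, and so on."""
    def apply(feats):
        for f in transforms:
            feats = f(feats)
        return feats
    return apply

# Hypothetical stand-ins for e.g. a cross transform and fMLLR
double = lambda xs: [2 * v for v in xs]
shift = lambda xs: [v + 1 for v in xs]
pipeline = cascade(double, shift)   # shift(double(xs))
out = pipeline([1, 2])              # [3, 5]
```

Because each stage keeps its own (small) parameter set, the cascade exploits spectro-temporal information without the full ST transform's parameter count, which is the point made above.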