Feature-based Robust Techniques for Speech Recognition
Presented by Nguyen Duc Hoang Ha
Supervisors: Assoc. Prof. Chng Eng Siong, Prof. Li Haizhou
08-Mar-2017
Outline
➢ An Introduction to Robust ASR
➢ The 1st proposed method (Ch5), the major contribution: Feature Adaptation Using Spectro-Temporal Information
➢ The 2nd proposed method (Ch3): Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments
➢ The 3rd proposed method (Ch4): A Particle Filter Compensation Approach to Robust LVCSR
➢ Conclusions and Future Directions
Introduction ST Transform NN-VTS PFC Conclusion

Automatic Speech Recognition (ASR) [Huang2001]
[Figure: ASR pipeline with an acoustic model (AM) and a language model (LM), e.g. decoding /h e l o/ into the word "hello".]
The aim is to decode the speech signal into text.
Applications of the ASR system
➢ Siri (http://www.apple.com/ios/siri/)
➢ Amazon Echo (https://en.wikipedia.org/wiki/Amazon_Echo)
➢ Google Speech Recognition API (https://cloud.google.com/speech/)
...
Challenges of the ASR system [Chelba2010, Li2014]
➢ Non-native speakers
➢ Dialect variations
➢ Disfluencies
➢ Out-of-vocabulary words
➢ Language modeling
➢ Noise robustness
ASR in Noisy Environments [Xiao2009, Li2014]
[Figure: mismatch between noisy speech features and a clean speech model.]
Feature/Model Compensation [Xiao2009, Li2014]
Two major approaches:
(A) Feature-based approach
(B) Model-based approach
Feature/Model Compensation
➢ (A) Feature-based approach. Examples: spectral subtraction [Boll1979], MMSE [Ephraim1984], fMLLR [Digalakis1995, Gales1998], ...
➢ (B) Model-based approach. Examples: MAP model adaptation [Gauvain1994], MLLR/CMLLR model adaptation [Leggetter1995, Gales1998], vector Taylor series model adaptation [Acero2000, Li2009]
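As an illustration of the feature-based approach, here is a minimal magnitude-domain spectral subtraction sketch. It assumes the noise spectrum has already been estimated from non-speech frames; the function name and flooring constant are hypothetical, not from the original work.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, floor=0.01):
    """Basic magnitude-domain spectral subtraction in the spirit of
    [Boll1979]: subtract a noise estimate, then floor the result.

    noisy_mag: (frames, bins) magnitude spectrogram of noisy speech
    noise_mag: (bins,) average noise magnitude, assumed estimated
               from non-speech frames beforehand
    """
    cleaned = noisy_mag - noise_mag
    # Flooring prevents negative magnitudes (a source of musical noise).
    return np.maximum(cleaned, floor * noisy_mag)

# Toy usage with random data standing in for a real spectrogram
noisy = np.abs(np.random.randn(100, 257)) + 0.01
noise = 0.1 * np.ones(257)
clean = spectral_subtraction(noisy, noise)
```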
Multi-condition training approach [Ng2016]
[Figure: (C) noisy data collection / simulation, shown alongside the (A) feature-based and (B) model-based approaches.]
Robust ASR
(A) Feature-based approach:
➢ Clean feature estimation (e.g. SS [Boll1979], MMSE [Ephraim1984], ...)
➢ Filtering approach (e.g. RASTA [Hermansky1994], ...)
➢ Feature transformation (e.g. fMLLR [Digalakis1995, Gales1998])
➢ ...
(B) Model-based approach:
➢ MAP model adaptation [Gauvain1994]
➢ MLLR, CMLLR model adaptation [Leggetter1995, Gales1998]
➢ VTS model compensation [Acero2000, Li2009]
➢ ...
(C) Data collection / simulation:
➢ Deep learning approaches (e.g. DNN AM [Hinton2012])
Contributions – Three Proposed Methods
(A1) ST-Transform (for background noise and reverberation)
(A2) NN – (B2) VTS (for non-stationary noise)
(A3) PFC-LVCSR (for background noise)
Contributions – Three Proposed Methods
1) Spectro-Temporal Transformation (ST-Transform)
D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. Generalization of temporal filter and linear transformation for robust speech recognition. In ICASSP, Italy, 2014.
D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. Feature adaptation using linear spectro-temporal transform for robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, PP(99):1–1, 2016.
(Contributed to success at the REVERB 2014 Challenge under the clean-condition training scheme)
2) Noise Normalization (NN) – Vector Taylor Series (VTS) Model Compensation
D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. An analysis of vector Taylor series model compensation for non-stationary noise in speech recognition. In ISCSLP, Hong Kong, 2012.
3) Particle Filter Compensation (PFC) for LVCSR
D. H. H. Nguyen, A. Mushtaq, X. Xiao, E. S. Chng, H. Li, and C.-H. Lee. A particle filter compensation approach to robust LVCSR. In APSIPA ASC, Taiwan, 2013.
Contributions of ST Transform
(REVERB Challenge: http://reverb2014.dereverberation.com/introduction.html)
Feature Adaptation Using Spectro-Temporal Information
(A1) ST-Transform
Feature Adaptation Using Spectro-Temporal Information
[Figure: noisy features x_{1:T} pass through the ST transform to give transformed features ŷ_{1:T}; the KL divergence is measured between the distribution of the transformed features and the distribution of the training features.]
The ST transform W is estimated to minimize the KL divergence between the distribution of the transformed features and the reference distribution of the training features.
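The KL-divergence criterion can be made concrete for the diagonal-Gaussian case. This is an illustrative sketch, not the thesis's exact objective; the closed form below holds only for Gaussians.

```python
import numpy as np

def kl_gauss_diag(mu0, var0, mu1, var1):
    """KL divergence KL(N(mu0, var0) || N(mu1, var1)) between two
    diagonal-covariance Gaussians -- the kind of distance the
    transform estimation minimizes between the transformed-feature
    and training-feature distributions."""
    return 0.5 * np.sum(
        np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0
    )

# Identical distributions give zero divergence
d = kl_gauss_diag(np.zeros(3), np.ones(3), np.zeros(3), np.ones(3))
```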
Changing Notation for Generalization of the Feature Transformation
[Figure: input features x_{1:T} pass through the feature transformation y = f(x) to give transformed features y_{1:T}; the KL divergence is measured between the distribution of the transformed features and the distribution of the training features.]
x denotes the input feature and y denotes the output feature, so writing the transformation as "y = f(x)" is more natural.
ST Transform: Generalized Linear Transform
Special cases:
A) e.g. CMN [Atal1974], MVN [Viikki1998]
B) e.g. fMLLR [Digalakis1995, Gales1998]
C) e.g. RASTA [Hermansky1994], TSN [Xiao2009]
ST Transform: Generalized Linear Transform
[Figure: each output feature vector is computed from a window of input feature vectors.]
ST Transform: Generalized Linear Transform
Matrix form of W
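A minimal sketch of how such a generalized linear (spectro-temporal) transform can be applied, assuming one square matrix per context offset plus a bias; the variable names are illustrative, and edge frames are simply replicated here.

```python
import numpy as np

def st_transform(x, W_list, b):
    """Apply y_t = sum_k W_k x_{t+k-L} + b over a window of 2L+1 frames.

    x:      (T, D) input feature sequence
    W_list: 2L+1 matrices of shape (D, D), one per context offset
    b:      (D,) bias vector
    """
    T, D = x.shape
    L = (len(W_list) - 1) // 2
    # Replicate edge frames so every t sees a full context window.
    xp = np.vstack([np.repeat(x[:1], L, axis=0), x,
                    np.repeat(x[-1:], L, axis=0)])
    y = np.zeros_like(x)
    for t in range(T):
        for k, Wk in enumerate(W_list):
            y[t] += Wk @ xp[t + k]
        y[t] += b
    return y

# With an identity center matrix and zero side matrices, the
# transform reduces to the identity mapping.
x = np.arange(8.0).reshape(4, 2)
W_list = [np.zeros((2, 2)), np.eye(2), np.zeros((2, 2))]
y = st_transform(x, W_list, np.zeros(2))
```

The special cases fall out of this form: a single full center matrix gives an fMLLR-style spectral transform, while diagonal matrices across the window give a temporal filter.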
EM Algorithm for Parameter Estimation
[Figure: the objective combines a term from the L2-norm regularizer and a term from the KL-divergence criterion, involving the covariance matrix of the output features and the reference model.]
Insufficient Adaptation Data
Issues:
- Unreliable statistics
- Too many degrees of freedom in the ST transform
Solutions:
+ Statistics smoothing approach
+ Sparse ST transform
Statistics Smoothing Approach
[Figure: statistics computed from the test data are interpolated with statistics computed from the training or prior data.]
The idea of statistics smoothing is to interpolate the statistics computed from the adaptation data with statistics computed from some prior data.
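The interpolation can be sketched as a convex combination of test-data and prior statistics. The weight alpha and the function name are hypothetical, not the thesis's exact smoothing rule.

```python
import numpy as np

def smooth_stats(test_mean, test_cov, prior_mean, prior_cov, alpha=0.5):
    """Interpolate statistics from a (small) adaptation set with
    statistics from training/prior data to stabilize estimation."""
    mean = alpha * test_mean + (1.0 - alpha) * prior_mean
    cov = alpha * test_cov + (1.0 - alpha) * prior_cov
    return mean, cov

tm, tc = np.ones(3), np.eye(3)
pm, pc = np.zeros(3), 2.0 * np.eye(3)
m, c = smooth_stats(tm, tc, pm, pc, alpha=1.0)  # pure test statistics
```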
Sparse ST Transform – Cross Transform
A) e.g. CMN, MVN, HEQ
B) e.g. fMLLR
C) e.g. RASTA, ARMA, TSN
ST Transform: Generalized Linear Transform
Matrix form of W
Matrix form of W
Experimental Settings
➢ REVERB Challenge 2014 benchmark task for noisy and reverberant speech recognition
➢ Clean-condition training scheme:
  ➢ Training data: 7861 clean utterances from the WSJCAM0 database (about 17.5 hours from 92 speakers)
  ➢ Speech features: 13 MFCCs + 13 ∆ + 13 ∆∆, with MVN post-processing
  ➢ Acoustic model: 3115 tied states, 10 mixtures/state
➢ Development (dev) and evaluation (eval) data sets:
  ➢ Actual meeting-room recordings from the MC-WSJ-AV corpus
  ➢ Near setting: 100 cm between the microphone and the speaker
  ➢ Far setting: 250 cm between the microphone and the speaker
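The MVN post-processing step can be sketched as per-utterance normalization of each feature dimension (a standard formulation; the epsilon guard is an implementation detail added here).

```python
import numpy as np

def mvn(feats, eps=1e-8):
    """Mean and variance normalization over one utterance.

    feats: (T, D) feature matrix, e.g. T frames of 39-dim
           MFCC + delta + delta-delta vectors.
    """
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    # eps guards against constant (zero-variance) dimensions.
    return (feats - mu) / (sigma + eps)

normalized = mvn(np.random.randn(50, 39) * 5.0 + 2.0)
```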
An Analysis of Window Length on the Dev Set
We fix the window length to 21 for the temporal filter, the cross transform, and the full ST transform in the experiments on the eval set.
Three Adaptation Schemes
➢ Full batch mode: one transform for each subset (near and far)
➢ Speaker mode: one transform for each speaker
➢ Utterance mode: one transform for each utterance
Experiments for Cascaded Transforms
Cascaded schemes: Temporal Filter ◦ fMLLR, Cross Transform ◦ fMLLR, fMLLR ◦ Temporal Filter, fMLLR ◦ Cross Transform
[Figure: average WER (%) of each cascaded scheme under the speaker, full batch, and utterance adaptation modes; input features pass through Transform 1, then Transform 2, to give the output features.]
+ Cascading transforms in tandem is an effective way of using spectro-temporal information without a significant increase in the number of free parameters.
+ The best result is obtained by the cascade of the cross transform and fMLLR.
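The tandem cascade can be expressed as function composition. This is a generic sketch with hypothetical stand-in transforms, not the thesis's implementation.

```python
def cascade(*transforms):
    """Compose feature transforms in tandem: the first argument is
    applied first, its output feeds the next, and so on."""
    def apply(feats):
        for f in transforms:
            feats = f(feats)
        return feats
    return apply

# Hypothetical stand-ins for e.g. a cross transform and fMLLR
double = lambda xs: [2 * v for v in xs]
shift = lambda xs: [v + 1 for v in xs]
pipeline = cascade(double, shift)   # shift(double(xs))
out = pipeline([1, 2])              # [3, 5]
```

Because each stage keeps its own (small) parameter set, the cascade exploits spectro-temporal information without the full ST transform's parameter count, which is the point made above.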