Bayesian and Discriminative Speaker Adaptation
Chih-Hsien Huang
Supervisor: Prof. Jen-Tzung Chien
National Cheng Kung University
Outline
• INTRODUCTION
• LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
• KEYPOINTS OF THIS TALK (CONTRIBUTIONS OF DISSERTATION)
• BAYESIAN DURATION ADAPTATION
• DISCRIMINATIVE LINEAR REGRESSION ADAPTATION
• EXPERIMENTS
• CONCLUSION AND FUTURE WORKS
INTRODUCTION
Why is Speech Recognition Important?
• Speech communication is one of the basic and essential capabilities of human beings.
• Speech is the only way to exchange information without any tools.
• Speech control is natural on mobile devices.
• Automatic speech recognition is important for broadcast news transcription.
• High-performance automatic speech recognition and summarization is desirable.
LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
Elements of Speech Recognition
• The state-of-the-art speech recognizer is based on hidden Markov models (HMMs).
• Parameter estimation is performed through the EM algorithm.
• The decoding rule follows the MAP criterion.
• The goal of the speech recognizer is to minimize the classification error.
Bayesian Decision Theory
• Bayes rule
$$P(W \mid X) = \frac{P(X \mid W)\, P(W)}{P(X)}$$
• MAP decoding criterion
$$\hat{W} = \arg\max_{W} P(X \mid W)\, P(W)$$
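To make the MAP decoding rule concrete, here is a minimal sketch (not from the original slides) that scores a few hypothetical word sequences in the log domain; `acoustic_loglik` and `lm_logprob` are made-up stand-ins for the HMM acoustic score log P(X|W) and the n-gram language-model score log P(W):

```python
# Hypothetical per-hypothesis scores: log P(X|W) from the acoustic
# model (HMMs) and log P(W) from the n-gram language model.
acoustic_loglik = {"w1": -120.4, "w2": -118.9, "w3": -125.1}
lm_logprob = {"w1": -8.2, "w2": -11.5, "w3": -6.7}

def map_decode(hypotheses):
    """Return the hypothesis maximizing log P(X|W) + log P(W).

    P(X) is constant over hypotheses and is dropped, exactly as in
    the MAP decoding criterion on the slide.
    """
    return max(hypotheses, key=lambda w: acoustic_loglik[w] + lm_logprob[w])

print(map_decode(["w1", "w2", "w3"]))  # -> "w1" under these toy scores
```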
Hidden Markov Models
• Left-to-Right HMM (figure: three emitting states with self-loop probabilities $a_{11}, a_{22}, a_{33}$, forward transitions $a_{12}, a_{23}$, and output densities $b_1, b_2, b_3$)
• Parameters of HMM $\lambda$:
  • Initial probabilities $\pi = \{\pi_i\}$
  • Transition probabilities $A = \{a_{ij}\}$
  • Output probabilities $B = \{b_i(\cdot)\}$
• Mixture of Gaussians
$$b_j(x) = \sum_{m=1}^{M} c_{jm}\, N(x; \mu_{jm}, \Sigma_{jm})$$
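As an illustration (not from the slides), a Gaussian-mixture output density with diagonal covariances can be evaluated as follows; the mixture weights, means, and variances here are made-up placeholders:

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """Log of b_j(x) = sum_m c_jm N(x; mu_jm, Sigma_jm), diagonal Sigma."""
    x = np.asarray(x, dtype=float)
    log_comps = []
    for c, mu, var in zip(weights, means, variances):
        # log N(x; mu, diag(var)) for one mixture component
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        log_comps.append(np.log(c) + ll)
    # log-sum-exp over components for numerical stability
    m = max(log_comps)
    return m + np.log(sum(np.exp(l - m) for l in log_comps))

# Toy 2-component mixture over 2-dimensional features
weights = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
variances = [np.array([1.0, 1.0]), np.array([0.5, 2.0])]
print(gmm_logpdf([1.0, 0.5], weights, means, variances))
```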
Large Vocabulary Continuous Speech Recognition
(figure: the speech signal passes through feature extraction to produce feature vectors; the recognizer combines hidden Markov models, an n-gram language model, and a lexicon tree to produce the recognition results)
Lexicon
• Linear structure: each word stores its full subsyllable sequence independently.
• Tree structure: words sharing initial subsyllables share a common prefix path.
(figure: Mandarin examples such as 辦公, 成功, 成功大學, and 成長, where 成功 and 成功大學 branch from the same tree node)
Search Algorithm
• Dynamic programming over the states of each subsyllable HMM.
• Transitions within the k-th subsyllable (states $1 \le j \le J(k)$):
$$Q(t, k, j) = b_{k,j}(x_t) + \max_{j-1 \le j' \le j} Q(t-1, k, j')$$
• Transitions across subsyllables (entering subsyllable k from the best predecessor k'):
$$Q(t, k, 0) = \max_{1 \le k' \le K} \max\big\{\, Q(t-1, k', J(k')),\; Q(t-1, k, 0) \,\big\}$$
(figure: trellis of states over observation times 1..T for the k-th and k'-th subsyllables)
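A minimal log-domain sketch of the within-subsyllable recursion (an illustration, not the thesis implementation); `log_b[j][t]` stands for the emission log-likelihood $\log b_{k,j}(x_t)$, and transition log-probabilities are omitted for brevity:

```python
import numpy as np

def viterbi_left_to_right(log_b):
    """Within-model DP: Q(t,j) = log_b[j,t] + max(Q(t-1,j), Q(t-1,j-1)).

    log_b: (J, T) array of emission log-likelihoods for a left-to-right
    model allowing only self-loops and single forward transitions.
    """
    J, T = log_b.shape
    Q = np.full((T, J), -np.inf)
    Q[0, 0] = log_b[0, 0]                 # must start in the first state
    for t in range(1, T):
        for j in range(J):
            stay = Q[t - 1, j]
            advance = Q[t - 1, j - 1] if j > 0 else -np.inf
            Q[t, j] = log_b[j, t] + max(stay, advance)
    return Q[T - 1, J - 1]                # best path ending in the last state

# Toy example: 3 states, 6 frames of made-up emission scores
rng = np.random.default_rng(0)
print(viterbi_left_to_right(rng.normal(-2.0, 1.0, size=(3, 6))))
```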
Tree-Copy Search Concept
• DP score Q(word history, arc, state): one copy of the lexicon tree is maintained per word history, giving V tree copies.
(figure: tree copies conditioned on predecessor words such as 我, 在, 開始, 從此, with bigram probabilities P(·|w) applied at word ends; language model look-ahead and acoustic look-ahead guide pruning)
(figure: word-level search network over time t, where hypotheses built from words A, B, C with optional inter-word silences are scored jointly by the acoustic model and the language model)
Search Algorithm
Proceed from left to right over time t.
• Acoustic level: process states of lexical trees
  • Initialization: $Q_v(t-1, s=0) = H(v; t-1)$, $B_v(t-1, s=0) = t-1$
  • Time alignment: $Q_v(t, s) = \max_{s'}\{\, p(x_t, s \mid s')\, Q_v(t-1, s') \,\}$
  • Propagate back pointers $B_v(t, s)$.
  • Prune unlikely hypotheses.
• Word pair level: process word ends
  • For each pair $(w; t)$:
$$H(w; t) = \max_{v}\{\, p(w \mid v)\, Q_v(t, S_w) \,\}, \qquad v_0(w; t) = \arg\max_{v}\{\, p(w \mid v)\, Q_v(t, S_w) \,\}$$
  • Store the best predecessor $v_0 = v_0(w; t)$ and the best boundary $\tau = B_{v_0}(t, S_w)$.
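The word-pair recombination step can be illustrated with a small sketch (an illustration with made-up bigram scores, not the original code): for each word w ending at the current frame, only the best predecessor v is kept, together with a back pointer:

```python
import math

def recombine_word_ends(words, log_bigram, Q_end):
    """H(w;t) = max_v [ log p(w|v) + Q_v(t, S_w) ], with the argmax kept
    as a back pointer so the word sequence can be traced back later.

    log_bigram[(v, w)]: bigram log-probability log p(w|v)
    Q_end[(v, w)]: best acoustic DP score of w's final state in the
                   tree copy of predecessor v at the current frame
    """
    H, back = {}, {}
    for w in words:
        scored = [(log_bigram[(v, w)] + Q_end[(v, w)], v)
                  for v in words if (v, w) in Q_end]
        H[w], back[w] = max(scored)
    return H, back

words = ["a", "b"]
log_bigram = {(v, w): math.log(0.5) for v in words for w in words}
Q_end = {("a", "a"): -40.0, ("b", "a"): -38.5,
         ("a", "b"): -41.2, ("b", "b"): -39.9}
H, back = recombine_word_ends(words, log_bigram, Q_end)
print(H["a"], back["a"])  # best score and best predecessor for word "a"
```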
Mismatch Problem
• Many mismatch sources exist between training and test data in real applications.
• The most popular technique is to conduct speaker/environment adaptation:
  • Maximum a posteriori (MAP)
  • Speaker clustering
  • Linear regression
(figure: in training, speaker-independent acoustic models are built from a speech database; a MISMATCH separates training and testing conditions, so adaptation data are used to produce adapted models for testing)
Keypoints of This Talk
• Bayesian Duration Adaptation
  • Parametric duration modeling: Gaussian, Poisson, and gamma distributions
  • Joint sequential learning of acoustic model and duration model
  • QB estimates of Gaussian and Poisson duration models were formulated.
  • The reproducible prior/posterior property was exploited.
• Aggregate a Posteriori Linear Regression (AAPLR)
  • Robustness: considers the prior information of the regression matrix; the relation between AAPLR and MAPLR was illustrated.
  • Discriminative adaptation: the AAP criterion can be represented in the form of minimum error rate.
  • Rapid adaptation: AAPLR has a closed-form solution, making it superior to traditional discriminative adaptation (MCELR).
BAYESIAN DURATION ADAPTATION
Background Knowledge
• Speaking rate is one of the mismatch sources between training and testing.
• In the standard HMM, state duration is represented by the transition probability.
• Non-parametric approaches
  • Ferguson explicitly modeled the duration.
  • Too many parameters.
• Parametric approaches
  • Russell and Moore applied the Poisson distribution.
  • Levinson applied the gamma distribution.
Parametric Duration Modeling
• The HMM parameter set is extended with state duration:
  • Initial state probability
  • Transition probability
  • Observation density
  • Duration density
• Maximum likelihood criterion
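The slide's symbols did not survive extraction; in standard explicit-duration HMM notation the extended parameter set and the ML criterion can be written as

$$\lambda = \{\pi, A, B, D\}, \qquad D = \{p_j(d)\}, \qquad \lambda_{\mathrm{ML}} = \arg\max_{\lambda} P(X \mid \lambda),$$

where $p_j(d)$ denotes the probability of occupying state $j$ for $d$ consecutive frames.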
(figure: relative frequency (%) of state duration length, comparing the empirical distribution with geometric, Gaussian, Poisson, and gamma distributions)
Parametric Duration Models
• Duration models and their prior distributions
  • Gaussian distribution with Gaussian prior
  • Poisson distribution with gamma prior
  • Gamma distribution with Gaussian prior
• Estimation criteria
  • ML estimation
  • MAP estimation
  • QB estimation
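As a concrete illustration (not from the slides), the ML and moment estimates of the three duration models can be computed from observed state durations as follows:

```python
import numpy as np

durations = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 8])  # toy durations (frames)

# Gaussian duration: ML estimates are the sample mean and variance.
mu_hat = durations.mean()
var_hat = durations.var()

# Poisson duration: the ML estimate of the rate is the sample mean.
lambda_hat = durations.mean()

# Gamma duration: no closed-form ML solution exists; method-of-moments
# estimates (shape nu, scale theta) are a common starting point.
nu_hat = mu_hat ** 2 / var_hat
theta_hat = var_hat / mu_hat

print(f"Gaussian: mu={mu_hat:.2f}, var={var_hat:.2f}")
print(f"Poisson:  lambda={lambda_hat:.2f}")
print(f"Gamma:    shape={nu_hat:.2f}, scale={theta_hat:.2f}")
```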
ML Parameter Estimation
• Auxiliary Q-function
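The Q-function itself was lost in extraction; for an HMM extended with duration densities it takes the standard EM form

$$Q(\lambda \mid \lambda^{(r)}) = E\!\left[\log P(X, S \mid \lambda) \,\middle|\, X, \lambda^{(r)}\right],$$

where the complete-data log-likelihood separates into initial, transition, observation, and duration terms, so each parameter group can be maximized independently in the M-step.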
ML Estimation for Different Duration Parameters
• Gaussian duration parameters
• Poisson duration parameters
• Gamma duration parameters
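The estimates themselves were lost in extraction; writing $\gamma_j(d)$ for the expected count of state $j$ lasting $d$ frames, the standard ML solutions are

$$\hat{\mu}_j = \frac{\sum_d \gamma_j(d)\, d}{\sum_d \gamma_j(d)}, \qquad \hat{\sigma}_j^2 = \frac{\sum_d \gamma_j(d)\,(d - \hat{\mu}_j)^2}{\sum_d \gamma_j(d)} \;\; \text{(Gaussian)}, \qquad \hat{\lambda}_j = \hat{\mu}_j \;\; \text{(Poisson)},$$

while the gamma shape parameter admits no closed form and is found numerically, as noted on the MAP slide below.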
Bayesian Learning of Duration Models
• MAP batch learning
• Risk function
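The risk function itself was lost in extraction; in the usual Bayesian formulation, MAP estimation of the duration parameters $\eta$ minimizes the expected loss under the posterior, which for a 0-1 loss reduces to maximizing

$$\eta_{\mathrm{MAP}} = \arg\max_{\eta}\; p(X \mid \eta)\, g(\eta),$$

where $g(\eta)$ is the prior density paired with each duration model on the earlier slide.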
QB Sequential Learning
• Risk function
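Again the formula was lost in extraction; the quasi-Bayes (QB) recursion processes adaptation data incrementally, using the posterior after the n-th batch as the prior for batch n+1:

$$g(\eta \mid X^{(1)}, \ldots, X^{(n)}) \;\propto\; p(X^{(n)} \mid \eta)\; g(\eta \mid X^{(1)}, \ldots, X^{(n-1)}).$$

With conjugate priors the posterior stays in a fixed family, which is the reproducible prior/posterior property highlighted among the keypoints.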
MAP Estimation for Gamma Duration Parameters
• Gamma duration with Gaussian prior
• M-step: update for the parameter η
• For the parameter ν:
  • No closed-form solution exists.
  • Newton's algorithm can be applied.
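As an illustration of the Newton step (a sketch based on the ML form of the gamma shape equation; the MAP version in the dissertation adds prior terms but iterates the same way):

```python
import numpy as np
from scipy.special import digamma, polygamma

def gamma_shape_newton(durations, nu0=1.0, iters=20):
    """Solve log(nu) - digamma(nu) = log(mean(d)) - mean(log(d))
    for the gamma shape nu by Newton's method."""
    d = np.asarray(durations, dtype=float)
    c = np.log(d.mean()) - np.log(d).mean()
    nu = nu0
    for _ in range(iters):
        f = np.log(nu) - digamma(nu) - c        # function whose root we seek
        fprime = 1.0 / nu - polygamma(1, nu)    # its derivative
        nu -= f / fprime                        # Newton update
    return nu

print(gamma_shape_newton([2, 3, 3, 4, 4, 4, 5, 5, 6, 8]))
```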
QB Estimation for Gaussian Duration Parameters
• Gaussian duration with Gaussian prior
• QB estimate is obtained in closed form.
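The update was lost in extraction; assuming a conjugate Gaussian prior $N(\mu \mid m, \sigma^2/\tau)$ on the duration mean (a hypothetical parameterization, not necessarily the dissertation's), with sufficient statistics $\bar{d}$ and $n$ from the current adaptation batch, the posterior mode has the familiar interpolated form

$$\hat{\mu} = \frac{\tau m + n \bar{d}}{\tau + n},$$

and the updated hyperparameters $(m', \tau') = \big((\tau m + n \bar{d})/(\tau + n),\; \tau + n\big)$ serve as the prior for the next batch.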
QB Estimation for Poisson Duration Parameters
• Poisson duration with gamma prior
• E-step
Updating Hyperparameters
• Gamma hyperparameters
• Poisson parameters
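The update equations were lost in extraction; for the conjugate Poisson-gamma pair, the standard recursion over an adaptation batch of $n$ durations $d_1, \ldots, d_n$ is

$$\alpha' = \alpha + \sum_{i=1}^{n} d_i, \qquad \beta' = \beta + n,$$

so the QB point estimate of the Poisson rate is the posterior mode $\hat{\lambda} = (\alpha' - 1)/\beta'$, and $(\alpha', \beta')$ carry over as the prior for the next batch.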
DISCRIMINATIVE LINEAR REGRESSION ADAPTATION
Estimation Criteria
• Distribution estimation and discriminative training are two categories of HMM parameter estimation approaches.
• Distribution estimation
  • Maximum likelihood criterion
  • Maximum a posteriori criterion
• Discriminative training
  • Minimum classification error (MCE) criterion
  • Maximum mutual information (MMI) criterion
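For reference, the standard forms of the two discriminative objectives (not reproduced from the slides): MMI maximizes

$$F_{\mathrm{MMI}}(\lambda) = \sum_{r} \log \frac{p(X_r \mid \lambda_{W_r})\, P(W_r)}{\sum_{W} p(X_r \mid \lambda_{W})\, P(W)},$$

while MCE minimizes a smoothed error count obtained by passing a misclassification measure (the correct-class discriminant minus a soft maximum over competing classes) through a sigmoid loss.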