Discriminative Linear Transforms for Feature Normalization and Speaker Adaptation in HMM Estimation

Stavros Tsakalidis, Vlasios Doumpiotis, William Byrne
Center for Language and Speech Processing
Department of Electrical and Computer Engineering
The Johns Hopkins University, Baltimore MD, USA
Discriminative Linear Transforms

Goal: Develop discriminative versions of existing Maximum Likelihood training procedures

Focus on: Techniques that incorporate ML estimation of linear transforms during training
– MLLT: Transform the acoustic data to ease the diagonal-covariance Gaussian modeling assumption
– SAT: Apply speaker dependent transforms to speaker independent models

Prior Work: Both MLLT and SAT were developed as ML techniques, but have also been used with MMI
– The AT&T LVCSR-2001 system used feature-based transforms that were obtained by ML estimation and then fixed throughout the subsequent iterations of MMI model estimation
– McDonough et al. (ICASSP'02) combined SAT with MMI by estimating the SD transforms under ML and subsequently using MMI for the estimation of the SI HMM Gaussian parameters

Estimation Criterion: To develop discriminative versions of these techniques, we use Conditional Maximum Likelihood (CML) estimation procedures
– CMLLR, developed by Asela Gunawardana, was used for unsupervised discriminative adaptation in the JHU LVCSR-2001 evaluation system
CML Auxiliary Function

– The CML criterion uses a general auxiliary function, similar in form to the EM auxiliary (a sketch of its general form follows this slide), where θ denotes the parameters we wish to estimate under the CML criterion
– Parameter values are tied over sets of states, defined by the regression classes
– We apply this criterion to two estimation problems:
  1. Covariance modeling
  2. Speaker adaptive training
– State dependent distributions are reparametrized to incorporate the linear transforms
– CML versions of MLLT and SAT are readily obtained
– The goal is to maximize the auxiliary function by alternately updating the transforms and the HMM parameters
– As a result, both the transforms and the HMM Gaussian parameters are estimated discriminatively
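A rough sketch of the general CML auxiliary, in notation we assume here (hidden state sequences s, acoustics O, reference word sequence W, current parameter values θ̃; the exact form and smoothing terms used in the original work may differ): it is the difference of the EM auxiliary functions for the numerator (transcription-constrained) and denominator (unconstrained) models.

```latex
Q(\theta;\tilde{\theta})
  = \sum_{s} P_{\tilde{\theta}}(s \mid O, W)\,\log P_{\theta}(O, s \mid W)
  \;-\; \sum_{s} P_{\tilde{\theta}}(s \mid O)\,\log P_{\theta}(O, s)
```

Increasing Q increases the conditional likelihood P_θ(W | O), since the two sums correspond to the numerator log P_θ(O | W) and the denominator log P_θ(O); in extended Baum-Welch style implementations a smoothing constant D is added to keep the updates stable.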
Discriminative Likelihood Linear Transforms

Goal: Transform the feature vector to capture the correlation between the vector components
– Apply an affine transform matrix W_r = [A_r b_r] to the (extended) observation vector ζ_t = [o_t' 1]'
– Under the preceding model, the reparametrized emission density of state j is

    b_j(o_t) = \sum_m c_{jm}\,|A_r|\,\mathcal{N}\!\big(A_r o_t + b_r;\ \mu_{jm}, \Sigma_{jm}\big)

  where r is the regression class containing state j (an evaluation sketch follows this slide)

Objective: We estimate the transforms and HMM parameters under the CML criterion
– The transforms obtained under this criterion are termed Discriminative Likelihood Linear Transforms (DLLT)
– The estimation is performed as a two-stage iterative procedure:
  a) First maximize the CML criterion with respect to the transforms while keeping the Gaussian parameters fixed
  b) Subsequently, compute the Gaussian parameters using the updated values of the affine transforms
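A minimal NumPy sketch of evaluating the reparametrized state density above, assuming diagonal covariances; the function and variable names are ours, not from the paper:

```python
import numpy as np

def dllt_log_density(o, A, b, weights, means, variances):
    """Log emission density of one state under a DLLT-style feature transform.

    o         : (d,) observation vector
    A, b      : (d, d) linear part and (d,) bias of the affine transform
                for this state's regression class
    weights   : (M,) mixture weights
    means     : (M, d) Gaussian means
    variances : (M, d) diagonal covariances
    """
    x = A @ o + b                      # transformed observation W * zeta
    log_jac = np.linalg.slogdet(A)[1]  # log |A| Jacobian term
    d = o.shape[0]
    # log N(x; mu_m, diag(var_m)) for every mixture component m
    log_norm = -0.5 * (d * np.log(2 * np.pi)
                       + np.sum(np.log(variances), axis=1)
                       + np.sum((x - means) ** 2 / variances, axis=1))
    # log sum_m c_m |A| N(x; mu_m, Sigma_m)
    return log_jac + np.logaddexp.reduce(np.log(weights) + log_norm)
```

The log |A| Jacobian term is what distinguishes this feature-space transform from a mean-only transform such as the one used in DSAT later.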
Effective DLLT Estimation

– As in MLLT (Gales '97), the i-th row w_i of the transformation matrix is found by

    w_i = \big(\alpha\, p_i + k^{(i)}\big)\, G^{(i)\,-1}

  where p_i is the extended cofactor row vector of the current transform, G^{(i)} and k^{(i)} are statistics accumulated from the occupancy-weighted data, and α is the root of a quadratic equation (a row-update sketch follows this slide)

Problem:
– When the Gaussian covariances are diagonal, the diagonal terms of G^{(i)} dominate
– The large values of the smoothing constant D, as used in MMI, further exaggerate this effect
– The resulting DLLT transform is effectively the identity

Solution: Replace the diagonal covariance in G^{(i)} by the estimate of its full covariance matrix
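A sketch of the Gales-style row update referenced above, under the assumption that the statistics G^{(i)}, k^{(i)} and the total occupancy β have already been accumulated (for DLLT these would be the smoothed numerator-minus-denominator counts); names and shapes are ours:

```python
import numpy as np

def update_transform_row(i, W, G_i, k_i, beta):
    """One row update in a Gales-style iterative transform estimation.

    W    : (d, d+1) current affine transform [A | b]
    G_i  : (d+1, d+1) accumulated outer-product statistics for row i
    k_i  : (d+1,) accumulated mean-weighted statistics for row i
    beta : total (smoothed) occupancy of the regression class
    """
    d = W.shape[0]
    A = W[:, :d]
    # extended cofactor row vector: cofactors of A for row i, last element 0
    cof = np.linalg.inv(A).T * np.linalg.det(A)   # cofactor matrix of A
    p_i = np.append(cof[i], 0.0)

    G_inv = np.linalg.inv(G_i)
    a = p_i @ G_inv @ p_i                 # quadratic coefficient
    b = p_i @ G_inv @ k_i                 # linear coefficient
    # solve a*alpha^2 + b*alpha - beta = 0; full implementations evaluate
    # both roots and keep the one maximizing the auxiliary function,
    # the positive root is used here for brevity
    alpha = (-b + np.sqrt(b * b + 4.0 * a * beta)) / (2.0 * a)

    w_i = (alpha * p_i + k_i) @ G_inv     # new row i of the transform
    return w_i
```

Rows are updated one at a time and the procedure is iterated until the transform converges.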
Discriminative Speaker Adaptive Training

Goal: Reduce the inter-speaker variability within the training set
– Apply speaker dependent transforms to the speaker independent means
– Under the preceding model, the reparametrized emission density for state j and speaker s is

    b_j^{(s)}(o_t) = \sum_m c_{jm}\,\mathcal{N}\!\big(o_t;\ A^{(s)}\mu_{jm} + b^{(s)},\ \Sigma_{jm}\big)

  (an evaluation sketch follows this slide)

Objective: Compute the speaker dependent transforms and the speaker independent parameters of the state dependent distributions under the CML criterion
– We call this procedure Discriminative Speaker Adaptive Training (DSAT)
– The estimation is performed as a two-stage iterative procedure:
  a) First maximize the CML criterion with respect to the speaker dependent affine transforms while keeping the speaker independent means fixed to their current values
  b) Subsequently, compute the speaker independent means and variances using the updated values of the speaker dependent affine transforms
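For contrast with the DLLT sketch, a minimal NumPy sketch of the speaker-adapted state density above; because the transform is applied to the means rather than the features, no Jacobian term appears (names and shapes are ours):

```python
import numpy as np

def dsat_log_density(o, A_s, b_s, weights, means, variances):
    """Log emission density of one state for speaker s under a mean transform.

    o         : (d,) observation vector
    A_s, b_s  : (d, d) and (d,) speaker-dependent affine transform
    weights   : (M,) mixture weights
    means     : (M, d) speaker-independent Gaussian means
    variances : (M, d) diagonal covariances
    """
    d = o.shape[0]
    adapted_means = means @ A_s.T + b_s          # A_s * mu_m + b_s for every m
    log_norm = -0.5 * (d * np.log(2 * np.pi)
                       + np.sum(np.log(variances), axis=1)
                       + np.sum((o - adapted_means) ** 2 / variances, axis=1))
    return np.logaddexp.reduce(np.log(weights) + log_norm)
```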
System Description

Acoustic Models
– Standard HTK flat-start training procedure
– Tied state, cross-word, context-dependent triphones
– 4000 unique triphone states
– 6 mixtures per speech state
– Tagged acoustic clustering to incorporate interjection and word-boundary information

Training/Test Set
– The collection defines the minitrain and minitest sets for the 2001 JHU LVCSR system
– Training: 16.4 hours from Switchboard-1 and 0.5 hours from Callhome English data
– Test: 866 utterances from the 2000 Hub-5 Switchboard-1 evaluation set (Swbd1) and 913 utterances from the 1998 Hub-5 Switchboard-2 evaluation set (Swbd2)
MMI Training & Regression Class Selection

Discriminative training requires alternate word sequences that are representative of the recognition errors made by the decoder:
– Obtain triphone lattices generated on the training data, using the AT&T FSM decoder
– Use the Viterbi procedure over triphone segments, rather than accumulating statistics via the Forward-Backward procedure at the word level
– These triphone segments are fixed throughout MMI training

Assignment of Gaussians into classes:
– Use a variation of the HTK regression class tree implementation
– All states of all context-dependent phones associated with the same monophone are assigned to the same initial class (sketched below)
– Apply the HTK splitting algorithm to each of the initial classes
– Constraint: all the mixture components associated with the same state belong to the same regression class
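A minimal sketch of the initial regression-class assignment described above; the state/monophone naming is illustrative and the subsequent HTK-style tree splitting is not shown:

```python
from collections import defaultdict

def initial_regression_classes(triphone_states):
    """Group tied triphone states into initial regression classes by monophone.

    triphone_states : iterable of (state_name, monophone) pairs, e.g.
                      [("ah-b+iy_s2", "b"), ("b-iy+n_s3", "iy"), ...]
    All mixture components of a state inherit that state's class, so the
    constraint that the components of one state share a regression class
    holds automatically.
    """
    classes = defaultdict(list)
    for state, monophone in triphone_states:
        classes[monophone].append(state)
    return classes   # each initial class would then be split further
```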
Goals of the Experiments

Compare ML trained transforms to CML trained transforms:
– Gaussian parameters are fixed throughout the transform updates
– Test whether CML transforms improve over ML transforms
– Validate CML as a modeling procedure

Compare ML training techniques (MLLT, SAT) to their fully discriminative counterparts:
– Investigate fully discriminative training compared to ML training

Identify a proper initialization point for our discriminative techniques:
– Proper seeding of DLLT and DSAT turns out to be crucial