A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation (SLSP-2016, October 11-12)


  1. A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation
     SLSP-2016, Statistical Language and Speech Processing, October 11-12
     Natalia Tomashenko 1,2,3 (natalia.tomashenko@univ-lemans.fr), Yuri Khokhlov 3 (khokhlov@speechpro.com), Yannick Esteve 1 (yannick.esteve@univ-lemans.fr)
     1 University of Le Mans, France; 2 ITMO University, Saint-Petersburg, Russia; 3 STC-innovations Ltd, Saint-Petersburg, Russia

  2. Outline
     1. Introduction
        • Speaker adaptation
        • GMM vs DNN acoustic models
        • GMM adaptation
        • DNN adaptation: related work
        • Combining GMM and DNN in speech recognition
     2. Proposed approach for speaker adaptation: GMM-derived features
     3. System fusion
     4. Experiments
     5. Conclusions
     6. Future work

  3. Outline (Section 1: Introduction)

  4. Adaptation: Motivation
     Why do we need adaptation? Differences between training and testing conditions may significantly degrade recognition accuracy in speech recognition systems. Adaptation is an efficient way to reduce the mismatch between the models and the data from a particular speaker or channel.
     Sources of speech variability:
        • Speaker: gender, age, emotional state, speaking rate, accent, style, …
        • Environment: channel, background noises, reverberation

  5. Speaker adaptation
     The adaptation of pre-existing models towards the optimal recognition of a new target speaker, using limited adaptation data from the target speaker.
     General speaker-independent (SI) acoustic models, trained on a large corpus of acoustic data from different speakers → (adaptation) → speaker-adapted acoustic models, obtained from the SI models using data of a new speaker.

  6. Acoustic Models: GMM vs DNN
     GMM (Gaussian Mixture Models):
        • GMM-HMMs have a long history: they have been used in speech recognition since the 1980s.
        • Speaker adaptation is a well-studied field of research.
     DNN (Deep Neural Networks):
        • Big advances in speech recognition over the past 3-5 years.
        • DNNs show higher performance than GMMs; neural networks are the state of the art in acoustic modelling.
        • Speaker adaptation is still a very challenging task.

  7. GMM adaptation
     Model-based: adapt the parameters of the acoustic models to better match the observed data.
        • Maximum a posteriori (MAP) adaptation of GMM parameters: each Gaussian is updated individually.
        • Maximum likelihood linear regression (MLLR) of Gaussian parameters: all Gaussians of the same regression class share the same transform.
     Feature-space: transform the features.
        • Feature-space maximum likelihood linear regression (fMLLR).
     (Standard forms of these update rules are sketched below.)
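
     The equations shown on this slide are not in the transcript; as a reference, the following is a minimal LaTeX sketch of the standard forms of these three adaptation rules, with generic notation chosen here (not taken from the slide):

       \begin{align*}
         \hat{\mu}_m &= \frac{\tau\,\mu_m + \sum_t \gamma_m(t)\,x_t}{\tau + \sum_t \gamma_m(t)}
           && \text{(MAP: each Gaussian mean updated individually)} \\
         \hat{\mu}_m &= A\,\mu_m + b
           && \text{(MLLR: one affine transform shared by a regression class)} \\
         \hat{x}_t &= A'\,x_t + b'
           && \text{(fMLLR: affine transform applied to the feature vectors)}
       \end{align*}

     Here $x_t$ is the feature vector at frame $t$, $\gamma_m(t)$ the occupation probability of Gaussian $m$ at frame $t$, and $\tau$ controls the weight of the speaker-independent prior relative to the adaptation data; this is the same $\tau$ parameter reported in the results table on slide 22.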

  8. DNN adaptation: Related work
        • Linear transformation: LIN [1], fDLR [2], LHN [1], LON [3], oDLR [4], fMLLR [2], …
        • Regularization techniques: L2-prior [5], KL-divergence [6], conservative training [7], …
        • Auxiliary features: speaker codes [10], i-vectors [11]
        • Model-space adaptation: LHUC [8], (fMAP) linear regression [9]
        • Multi-task learning (MTL) [12]
        • Adaptation based on GMM: fMLLR [2], TVWR [13], GMM-derived features [14]
     References: [1] Gemello et al., 2006; [2] Seide et al., 2011; [3] Li et al., 2010; [4] Yao et al., 2012; [5] Liao, 2013; [6] Yu et al., 2013; [7] Albesano, Gemello et al., 2006; [8] Swietojanski et al., 2014; [9] Huang et al., 2014; [10] Xue et al., 2014; [11] Senior et al., 2014; [12] Price et al., 2014; [13] Liu et al., 2014; [14] Tomashenko & Khokhlov, 2014.

  9. Combining GMM and DNN in speech recognition
        • Tandem features [17] (Hermansky et al., 2000)
        • Bottleneck features [18] (Grézl et al., 2007)
        • GMM log-likelihoods as features for an MLP [19] (Pinto & Hermansky, 2008)
        • Log-likelihood combination
        • ROVER*, lattice-based combination, CNC**, …
     * ROVER: Recognizer Output Voting Error Reduction. ** CNC: Confusion Network Combination.

  10. Outline (Section 2: Proposed approach for speaker adaptation: GMM-derived features)

  11. Proposed approach: Motivation
        • It has been shown that speaker adaptation is more effective for GMM acoustic models than for DNN acoustic models.
        • Many adaptation algorithms that work well for GMM systems cannot be easily applied to DNNs.
        • Neural networks and GMMs may be complementary and benefit from combination.
        • Goal: take advantage of existing adaptation methods developed for GMMs and apply them to DNNs.

  12. Proposed approach: GMM-derived (GMMD) features for DNN
        • Extract features using a GMM model and feed these GMM-derived features to the DNN.
        • Train the DNN model on the GMM-derived features.
        • Adapt the GMM-derived features using GMM adaptation algorithms.
     (A minimal sketch of the extraction step is given below.)
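
     The slides do not show an implementation, so the following is a minimal numpy sketch, assuming diagonal-covariance state GMMs stored as (weights, means, variances) arrays; the function names and data layout are hypothetical, not from the paper or any toolkit:

       import numpy as np

       def state_log_likelihood(x, weights, means, variances):
           """Log-likelihood of one frame x under a single state's diagonal-covariance GMM."""
           # per-component log N(x; mu, sigma^2), summed over the feature dimensions
           log_comp = -0.5 * (np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances).sum(axis=1)
           # log-sum-exp over the mixture components, weighted by the mixture weights
           return np.logaddexp.reduce(np.log(weights) + log_comp)

       def gmmd_features(frames, state_gmms):
           """Map each acoustic frame to a vector of per-state GMM log-likelihoods (GMMD features).

           frames:     (T, D) array of input feature vectors (e.g. bottleneck features)
           state_gmms: list of (weights, means, variances) tuples, one per state of the auxiliary GMM-HMM
           returns:    (T, number_of_states) GMM-derived feature matrix, later fed to the DNN
           """
           feats = np.empty((len(frames), len(state_gmms)))
           for t, x in enumerate(frames):
               for s, (w, mu, var) in enumerate(state_gmms):
                   feats[t, s] = state_log_likelihood(x, w, mu, var)
           return feats

     The point of this construction is that existing GMM adaptation algorithms (e.g. MAP) can be applied to the auxiliary GMM, which changes the DNN input features rather than the DNN parameters.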

  13. Bottleneck-based GMM-derived features for DNNs
     For a given acoustic bottleneck (BN) feature vector, a new GMM-derived feature vector is obtained by calculating the log-likelihoods across all the states of the auxiliary GMM (speaker-independent or adapted) on the given vector; each element is the log-likelihood estimated using the corresponding GMM state. (This is written out in generic notation below.)
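
     The mathematical symbols on this slide were garbled in extraction; in generic notation chosen here ($x_t$ for the BN feature vector at frame $t$, $s_1,\dots,s_N$ for the states of the auxiliary GMM-HMM, $M_s$ mixture components in state $s$), the construction described above is:

       \begin{equation*}
         g_t = \bigl[\log p(x_t \mid s_1),\ \log p(x_t \mid s_2),\ \dots,\ \log p(x_t \mid s_N)\bigr],
         \qquad
         p(x_t \mid s) = \sum_{m=1}^{M_s} w_{sm}\,\mathcal{N}\bigl(x_t;\ \mu_{sm},\ \Sigma_{sm}\bigr).
       \end{equation*}

     Computing $g_t$ with the speaker-independent auxiliary GMM gives unadapted GMMD features; replacing it with the adapted GMM (e.g. after MAP) gives speaker-adapted GMMD features.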

  14. Outline (Section 3: System fusion)

  15. System Fusion: feature level (applied at both the training and decoding stages)
     Input features 1 + input features 2 → feature concatenation → DNN → output posteriors → decoder → result.
     (A minimal concatenation sketch is given below.)
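
     As a minimal illustration of the feature-level fusion above, assuming both feature streams are frame-synchronous numpy arrays (names are hypothetical):

       import numpy as np

       def concat_features(feats_1, feats_2):
           """Feature-level fusion: frame-wise concatenation of two feature streams.

           feats_1: (T, D1) array, e.g. baseline bottleneck features
           feats_2: (T, D2) array, e.g. GMM-derived (GMMD) features
           returns: (T, D1 + D2) array fed to a single DNN for training and decoding
           """
           assert len(feats_1) == len(feats_2), "streams must be frame-synchronous"
           return np.concatenate([feats_1, feats_2], axis=1)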

  16. System Fusion: posterior combination
     Input features 1 → DNN 1 → output posteriors 1; input features 2 → DNN 2 → output posteriors 2; posterior combination → decoder → result.
     (A minimal sketch of the weighted combination is given below.)
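
     The results slide (23) reports a single weight α for the baseline model in the fusion but does not spell out the combination rule; this minimal sketch assumes a frame-level linear interpolation of the two DNNs' output posteriors, and the function and argument names are hypothetical:

       import numpy as np

       def combine_posteriors(post_baseline, post_gmmd, alpha):
           """Posterior-level fusion of two acoustic models before decoding.

           post_baseline: (T, num_states) frame posteriors from the baseline DNN
           post_gmmd:     (T, num_states) frame posteriors from the GMMD-feature DNN
           alpha:         weight of the baseline model (e.g. 0.45 in system #6 on slide 23)
           """
           combined = alpha * post_baseline + (1.0 - alpha) * post_gmmd
           # renormalise per frame for numerical safety; the result is passed to the decoder
           return combined / combined.sum(axis=1, keepdims=True)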

  17. System Fusion: lattice combination
     Input features 1 → DNN 1 → output posteriors 1 → decoder → lattices 1; input features 2 → DNN 2 → output posteriors 2 → decoder → lattices 2; confusion network combination → result.

  18. Outline (Section 4: Experiments)

  19. Experiments: Data
     TED-LIUM corpus*: 1495 TED talks, 207 hours (141 hours of male, 66 hours of female speech), 1242 speakers, 16 kHz.

     Data set       Duration, hours   Number of speakers   Mean duration per speaker, minutes
     Training       172               1029                 10
     Development    3.5               14                   15
     Test 1         3.5               14                   15
     Test 2         4.9               14                   21

     LM**: 150K-word vocabulary and a publicly available trigram LM.
     * A. Rousseau, P. Deleglise, and Y. Esteve, "Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks", 2014.
     ** cantab-TEDLIUMpruned.lm3

  20. Experiments: Baseline systems
     We follow the Kaldi TED-LIUM recipe for training the baseline models (DNNs trained with RBM pretraining, cross-entropy (CE) training, and sMBR sequence training):
        • Speaker-independent model → train DNN Model #1
        • Speaker-adaptive training with fMLLR → train DNN Model #2

  21. Experiments: Training models with GMMD features
     Two types of integration of GMMD features into the baseline recipe:
        1. Adapted features AF1 (with a monophone auxiliary GMM) → train DNN Models #3 and #4
        2. Adapted features AF2 (with a triphone auxiliary GMM) → train DNN Model #5

  22. Results: Adaptation performance for DNNs (WER, %)

     #   System     Adaptation   Features          τ   Dev     Test 1   Test 2
     1   baseline   No           BN                -   12.14   10.77    13.75
     2   baseline   fMLLR        BN                -   10.64    9.52    12.78
     3   GMMD       MAP          AF1               2   10.27    9.59    12.94
     4   GMMD       MAP          AF1 + align. #2   5   10.26    9.40    12.52
     5   GMMD       MAP+fMLLR    AF2 + align. #2   5   10.42    9.74    13.29

     τ is the τ parameter in MAP adaptation. Results better than the speaker-adapted baseline (#2) were highlighted on the original slide.

  23. Results: Adaptation and Fusion (WER, %)

     #   System     Adaptation / Fusion         Features          α      Dev      Test 1   Test 2   Rel. WERR vs #2 (%)
     1   baseline   No                          BN                -      12.14*   10.77*   13.75*   -
     2   baseline   fMLLR                       BN                -      10.57     9.46    12.67    -
     4   GMMD       MAP                         AF1 + align. #2   -      10.23     9.31    10.46    -
     5   GMMD       MAP+fMLLR                   AF2 + align. #2   -      10.37     9.69    13.23    -
     6   fusion     Posterior fusion: #2 + #4   -                 0.45    9.91     9.06    12.04    6.2 / 4.3 / 5.0
     7   fusion     Posterior fusion: #2 + #5   -                 0.55    9.91     9.10    12.23    6.2 / 3.8 / 3.5
     8   fusion     Lattice fusion: #2 + #4     -                 0.44   10.06     9.09    12.12    4.8 / 4.0 / 4.4
     9   fusion     Lattice fusion: #2 + #5     -                 0.50   10.01     9.17    12.25    5.3 / 3.1 / 3.3

     Rel. WERR: relative WER reduction compared with the adapted baseline #2 (Dev / Test 1 / Test 2). α is the weight of the baseline model in the fusion. * WER in line #1 was calculated from lattices; in the other lines, from the consensus hypothesis.
        • Both types of fusion (posterior level and lattice level) provide additional, comparable improvement over the adapted baseline.
        • In most cases posterior-level fusion gives slightly better results than lattice-level fusion; the best overall results come from the posterior fusion of systems #2 and #4.

  24. Outline (Section 5: Conclusions)
