


  1. Speaker Verification Under Additive Noise Conditions With Non-stationary SNR Using PMC
     Michael L P Wong & Martin J Russell, The University of Birmingham

  2. References
     • M.J. Gales and S.J. Young, “Robust Continuous Speech Recognition Using Parallel Model Combination,” IEEE Transactions on Speech and Audio Processing, Vol. 4, No. 5, pp. 352-359, September 1996.
     • T. Matsui, T. Kanno and S. Furui, “Speaker Recognition Using HMM Composition in Noisy Environments,” Computer Speech and Language, Vol. 10, pp. 107-116, 1996.
     • O. Bellot, D. Matrouf, T. Merlin and J.-F. Bonastre, “Additive and Convolutive Noises Compensation for Speaker Recognition,” Proceedings of ICSLP 2000, Beijing, China, 2000.

  3. Task Definition
     • Clean verification speech: verification performance is good
     • Noise-contaminated verification speech with non-stationary SNR: verification performance is bad

  4. Preview of Results
     • Clean speech models tested on non-stationary SNR phrases
       – Speech noise: 38.55% EER
       – Operations room noise: 34.78% EER
     • Performance of compensated models
       – Speech noise: 19.92% EER
       – Operations room noise: 18.84% EER

  5. Structure of Presentation
     • Stage One
       – Evaluation of PMC on speaker verification tasks: stationary SNR conditions
     • Stage Two
       – Recognition of unknown SNR conditions
     • Stage Three
       – Modelling the dynamics of SNR in noise-contaminated verification phrases

  6. Problem Formulation
     • Text-dependent speaker verification
     • Deployment in dynamic real-world environments
     • Model-based approach
     • Ultimate goal: a multi-noise, multi-SNR scenario

  7. Evaluation Using PMC
     • PMC has been successful in improving the performance of ASR systems
     • Based on work by Mark Gales
     • Aim: evaluate the use of PMC in text-dependent speaker verification tasks

  8. Performance of PMC in ASR Experiments
     [Figure: ASR accuracy (%) against signal-to-noise ratio (18, 12, 6, 0 and -6 dB) for un-compensated and compensated models. Reference: Gales]

  9. Design Criteria
     • Additive noises considered
     • Scaling to be performed on noises
       $\mu^{l}_{S \otimes N} = \log\left(\exp(\mu^{l}_{S}) + g \exp(\mu^{l}_{N})\right)$
       $\Sigma^{l}_{S \otimes N} = \log\left(\exp(\Sigma^{l}_{S}) + g \exp(\Sigma^{l}_{N})\right)$
     • Compensate only for static parameters
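The mean combination on this slide, log(exp(µ_S) + g·exp(µ_N)) with superscript l marking log-domain parameters, can be sketched numerically as follows. This is a minimal illustration and not the authors' HTK modification: the function name `log_add_combine` is hypothetical, the inputs are assumed to already be log-domain mean vectors, and the mapping between cepstral and log-spectral parameters is left out.

```python
import numpy as np

def log_add_combine(mu_speech_log, mu_noise_log, g=1.0):
    """Return log(exp(mu_S) + g * exp(mu_N)), element-wise.

    mu_speech_log, mu_noise_log: log-domain mean vectors (assumed inputs).
    g: positive noise scaling factor, as named on the slide.
    """
    mu_speech_log = np.asarray(mu_speech_log, dtype=float)
    mu_noise_log = np.asarray(mu_noise_log, dtype=float)
    # logaddexp(a, b) = log(exp(a) + exp(b)); the gain enters as log(g),
    # which avoids overflow for large log-domain values.
    return np.logaddexp(mu_speech_log, mu_noise_log + np.log(g))

# Example: where the noise term is negligible the speech mean is unchanged.
combined = log_add_combine([0.0, 1.0], [0.0, -50.0], g=3.0)
```

Using `np.logaddexp` rather than exponentiating directly is a numerical-stability choice made for this sketch only.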

  10. Implementation
     • Selection of databases
     • Preparation of data
     • System structure
     • Scoring procedures

  11. Selection of Databases
     • YOHO speaker verification database
       – Standard database, so performance comparisons are available
     • TIMIT database
       – Used to initialise isolated phone models prior to YOHO training
     • NOISEX-92 noise database
       – Selection of repetitive noise sources. Two noise sources are reported in this paper: speech noise and operations room noise

  12. Preparation of Data
     • Scaling of both enrolment and verification data
     • Measurement of verification speech power
       – Silence periods ignored [ref 7, ITU-T Rec.]
     • Mixing of speech and noise from -18dB to +18dB at 6dB intervals. The multiplication factor, g, is retained for each mix and then averaged
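The mixing step can be sketched as below. This is a sketch under stated assumptions rather than the procedure used for the slides: g is treated as an amplitude gain applied to the noise, the silence handling is reduced to an optional boolean `active_mask` (hypothetical, standing in for the ITU-T measurement the slide cites), and the noise recording is assumed to be at least as long as the speech.

```python
import numpy as np

def noise_gain_for_snr(speech, noise, snr_db, active_mask=None):
    """Amplitude gain g such that P_speech / (g^2 * P_noise) equals the target SNR."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if active_mask is not None:
        # Measure speech power over active (non-silent) samples only.
        speech = speech[np.asarray(active_mask, dtype=bool)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    return np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))

def mix(speech, noise, snr_db, active_mask=None):
    """Return (noisy signal, g) for one utterance at the requested SNR."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    g = noise_gain_for_snr(speech, noise, snr_db, active_mask)
    return speech + g * noise, g

# The seven SNR conditions named on the slide.
SNRS_DB = [18, 12, 6, 0, -6, -12, -18]
```

Averaging the returned g over all utterances at one SNR would give a single factor per condition, which is how the slide describes retaining g.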

  13. System Structure
     • Front-end
       – 25ms frames, Hamming windowed, mel-scale warped
       – 12 cepstral coefficients with the 0th (energy) coefficient appended; 1st and 2nd order derivatives included
     • HTK software for both training and recognition
     • 3-state, 4-component tied-triphone speaker-dependent models; 1-state, 4-component noise models

  14. System Structure
     • Training
       – 96 phrases per speaker
       – 118 authorised speakers
       – 20 speakers for the General Speaker Model (GSM)
     • Recognition
       – 40 phrases used for both false rejection (FR) and false acceptance (FA) experiments

  15. Scoring Procedures
     • Likelihood ratio test employed
       $\dfrac{P(X \mid S)}{P(X \mid \mathrm{GSM})} \geq t$
     • Performance quoted in % EER
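The scoring procedure, accepting a claim when P(X|S) / P(X|GSM) reaches a threshold t and quoting the equal error rate, can be sketched as follows. The function names are hypothetical, the scores are assumed to be log-likelihood ratios, and the EER is approximated by a simple threshold sweep rather than by whatever tool produced the figures in these slides.

```python
import numpy as np

def accept(loglik_speaker, loglik_gsm, log_threshold):
    """Likelihood ratio test in the log domain: log P(X|S) - log P(X|GSM) >= log t."""
    return (loglik_speaker - loglik_gsm) >= log_threshold

def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER: sweep thresholds and take the point where FR and FA are closest."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best_gap, best_eer = None, None
    for t in np.unique(np.concatenate([genuine, impostor])):
        fr = np.mean(genuine < t)     # true speakers rejected
        fa = np.mean(impostor >= t)   # impostors accepted
        gap = abs(fr - fa)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fr + fa) / 2.0
    return best_eer

# Toy scores: one genuine trial and one impostor trial fall on the wrong side.
eer = equal_error_rate([2.0, 3.0, 4.0, 5.0], [-1.0, 0.0, 1.0, 2.5])
```

Working with log-likelihoods keeps the ratio numerically stable for long phrases.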

  16. Experiment Methodology
     • Establish baseline performance using clean speaker models and clean verification data
     • Evaluate performance of clean speaker models under multi-SNR verification data
     • Evaluate performance of PMC-compensated speaker models under multi-SNR verification data

  17. Un-compensated Models
     • Clean speech with clean models: 0.57% EER
     [Figure: equal error rate (%) against signal-to-noise ratio (18dB to -18dB in 2dB steps) for operations room noise and speech noise]

  18. Compensated Models
     [Figure: equal error rate (%) against signal-to-noise ratio (18dB to -18dB in 6dB steps); curves: Operations Room Noise, Speech Noise, Operations Room Noise (Std), Speech Noise (Std)]

  19. Stage One Summary
     • Text-dependent SV task
     • HTK software used with modifications for PMC
     • YOHO, TIMIT and NOISEX-92 databases used
     • 7 SNR scenarios considered (-18dB to +18dB)

  20. Stage One Summary
     • PMC improves SV performance
     • 2 additive noises considered
     • Static parameters compensated
     • Baseline used: clean models, clean/contaminated speech

  21. Experimental Extension
     • We now have 7 SNR-specific PMC models
     • Can SNR-specific PMC models be used at other SNRs? How sensitive are they to a mismatch?
     • If yes, how well do they perform?

  22. Evaluation of Non-ideal PMC Models
     • For each SNR-specific PMC model, perform the SV task on noise-contaminated verification phrases from -18dB to +18dB at 2dB intervals
     • Observe any degradation in performance from using non-ideal models

  23. Speech Noise Result
     [Figure: speech noise; equal error rate (%) against signal-to-noise ratio (18dB to -18dB in 2dB steps), one curve per SNR-specific model. Legend, with values in parentheses as shown on the slide: 18dB (0.06645), 12dB (0.132584), 6dB (0.264541), 0dB (0.527828), -6dB (1.053155), -12dB (2.101321), -18dB (4.192687)]

  24. Operations Room Noise Result
     [Figure: operations room noise; equal error rate (%) against signal-to-noise ratio (18dB to -18dB in 2dB steps), one curve per SNR-specific model. Legend, with values in parentheses as shown on the slide: 18dB (0.074531), 12dB (0.14871), 6dB (0.296715), 0dB (0.592023), -6dB (1.181242), -12dB (2.356887), -18dB (4.702608)]

  25. Discussion
     • Allow automatic selection among the SNR-specific PMC models: choose the model with the highest probability for a given observation
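Selecting the SNR-specific model with the highest probability for an observation can be sketched as a maximum over per-model log-likelihoods. In the sketch below a single diagonal Gaussian stands in for each PMC-compensated HMM, purely to keep the example self-contained; `diag_gauss_loglik`, `select_snr_model` and the toy parameters are hypothetical.

```python
import numpy as np

def diag_gauss_loglik(frames, mean, var):
    """Total log-likelihood of frames under one diagonal Gaussian (stand-in for an HMM score)."""
    frames = np.atleast_2d(np.asarray(frames, dtype=float))
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    per_dim = -0.5 * (np.log(2.0 * np.pi * var) + (frames - mean) ** 2 / var)
    return float(per_dim.sum())

def select_snr_model(frames, models):
    """models maps an SNR label in dB to (mean, var); return the best label and its score."""
    scores = {snr: diag_gauss_loglik(frames, m, v) for snr, (m, v) in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy one-dimensional models for three SNR conditions (illustrative values only).
toy_models = {18: ([0.0], [1.0]), 0: ([3.0], [1.0]), -18: ([6.0], [1.0])}
best_snr, best_score = select_snr_model([[2.9], [3.2]], toy_models)
```

In a full system the selected model's score would then feed the likelihood ratio test against the General Speaker Model.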

  26. Automatic Model Selection
     [Figure: equal error rate (%) against signal-to-noise ratio (18dB to -18dB in 2dB steps) for operations room noise and speech noise]

  27. Stage Two Summary
     • Limiting the number of SNR-specific PMC models to 7 does not affect SV performance on unknown SNR
     • Better performance is achieved by automatic selection of models

  28. Varying SNR Task

  29. Modelling SNR Dynamics
     • Operating models in parallel assumes that SNR changes occur at model boundaries
     • Create one model from multiple models, with the SNR dynamics embedded within the transition probabilities

  30. Implementation of a Composite HMM
     • Rows and columns correspond to different SNRs; the first row gives the entry probabilities

                 +18dB  +12dB   +6dB    0dB   -6dB  -12dB  -18dB
       Entry      0.3    0.2    0.1    0.1    0.1    0.1    0.1
       +18dB      0.4    0.1    0.1    0.1    0.1    0.1    0.1
       +12dB      0.1    0.4    0.1    0.1    0.1    0.1    0.1
       +6dB       0.1    0.1    0.4    0.1    0.1    0.1    0.1
       0dB        0.1    0.1    0.1    0.4    0.1    0.1    0.1
       -6dB       0.1    0.1    0.1    0.1    0.4    0.1    0.1
       -12dB      0.1    0.1    0.1    0.1    0.1    0.4    0.1
       -18dB      0.1    0.1    0.1    0.1    0.1    0.1    0.4
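The SNR transition structure on this slide (entry probabilities 0.3, 0.2 and then 0.1; self-transition 0.4 and 0.1 to every other SNR) can be written down and checked as below. The sketch assumes the columns follow the same SNR order as the rows, and `sample_snr_track` is a hypothetical helper that draws an SNR trajectory frame by frame to show what dynamics these values imply.

```python
import numpy as np

SNR_LEVELS_DB = [18, 12, 6, 0, -6, -12, -18]

# Entry probabilities (first row on the slide).
entry = np.array([0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1])

# 7x7 SNR transition matrix: 0.4 on the diagonal, 0.1 elsewhere.
trans = np.full((7, 7), 0.1)
np.fill_diagonal(trans, 0.4)

def sample_snr_track(n_frames, rng):
    """Draw a sequence of SNR labels (dB) from the entry and transition probabilities."""
    state = rng.choice(7, p=entry)
    track = [SNR_LEVELS_DB[state]]
    for _ in range(n_frames - 1):
        state = rng.choice(7, p=trans[state])
        track.append(SNR_LEVELS_DB[state])
    return track

track = sample_snr_track(20, np.random.default_rng(0))
```

With a self-transition probability of 0.4, the expected stay in one SNR state is 1 / (1 - 0.4), roughly 1.7 frames, so these placeholder values describe rapidly changing SNR.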

  31. Implementation of a Composite HMM
     • 3-dimensional model (axes: speech, noise, SNR)
     • 1-state noise model
     • 3-state speech model
     • 7-state SNR model
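One way to picture the composite state space is as a product of the three models: 1 noise state × 3 speech states × 7 SNR states gives 21 composite states. The sketch below builds a composite transition matrix with a Kronecker product, which assumes that the SNR and speech state sequences evolve independently; that independence, the speech transition values, and the omission of HTK-style entry and exit states are all assumptions made for illustration, not details taken from the slides.

```python
import numpy as np

# SNR transitions as given on the previous slide: 0.4 self-loop, 0.1 otherwise.
snr_trans = np.full((7, 7), 0.1)
np.fill_diagonal(snr_trans, 0.4)

# Hypothetical 3-state left-to-right speech transitions (illustrative values only).
speech_trans = np.array([
    [0.6, 0.4, 0.0],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
])

def compose_transitions(snr_t, speech_t):
    """Composite matrix over (snr, speech) pairs; index = snr_index * n_speech + speech_index."""
    return np.kron(snr_t, speech_t)

composite = compose_transitions(snr_trans, speech_trans)

# Probability of moving from (+18dB, speech state 0) to (+12dB, speech state 1).
p = composite[0 * 3 + 0, 1 * 3 + 1]
```

The single noise state contributes a factor of one to the product, so it does not enlarge the state space here.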

  32. Expectations
     • Extracting the true SNR dynamics and embedding them into the transition probabilities is expected to further improve performance [to be evaluated]

  33. Varying SNR Task
