

  1. BAT System Description for NIST LRE 2015 BUT+Agnitio+Torino Oldrich Plchot, Pavel Matejka, Radek Fer, Ondrej Glembek,Ondrej Novotny, Jan Pesan, Lukas Burget, Martin Karafiat, Karel Vesely, Lucas Ondel, Santosh Kesiraju, Frantisek Grezl, Sri Harish Mallidi (JHU), Ruizhi Li (JHU), Niko Brummer, Albert Swart, Sandro Cumani June 22, Bilbao, Odyssey 2016

  2. Data
  ● Fixed training condition
    ○ Train - 60% of the training data; short cuts generated evenly from 3 to 30 seconds
    ○ Dev - 40%; short cuts with lengths uniformly distributed between 3 and 30 seconds
  ● Open training condition
    ○ All relevant data we managed to find ;) (no Babel data for i-vectors, just for BN features)
    ○ Main additions are KALAKA-3 (European Spanish, British English) and Arabic - Al Jazeera free corpus
  ● Details in our system description / Odyssey paper
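The cut generation above can be sketched as follows. This is a minimal sketch: how cut boundaries are actually drawn from each recording is an assumption, not the exact recipe used.

```python
import random

def make_cuts(duration, rng, lo=3.0, hi=30.0):
    """Chop one recording into consecutive cuts whose lengths are drawn
    uniformly from [lo, hi] seconds; a final remainder is kept only if
    it is at least lo seconds long (an assumed edge-handling choice)."""
    cuts, t = [], 0.0
    while duration - t >= hi:
        d = rng.uniform(lo, hi)
        cuts.append((t, t + d))
        t += d
    if duration - t >= lo:
        cuts.append((t, duration))
    return cuts

rng = random.Random(0)
cuts = make_cuts(300.0, rng)
# every cut lies in the 3-30 s range required by the recipe
assert all(3.0 <= b - a <= 30.0 for a, b in cuts)
```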

  3. Stacked Bottleneck features (SBN)
  ● Based on a hierarchy of two NNs: bottlenecks from the first network are stacked in time and used as inputs to the second NN.
  ● Bottlenecks from the second NN are the final features.
  ● Fixed condition training data
    ○ Switchboard with ~7k triphone state targets
    ○ LRE15 training data with labels obtained using an acoustic unit discovery tool (200 3-state units)
  ● Open condition training data
    ○ 17 languages from the Babel project (IARPA) as multilingual BN, with ~100 phone states per language
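The stacking-in-time step above can be sketched as follows. A minimal sketch: the offset set and the clamping at utterance edges are assumptions for illustration, not the exact SBN configuration.

```python
def stack_in_time(frames, offsets=(-10, -5, 0, 5, 10)):
    """Stack first-NN bottleneck frames at the given time offsets to
    form the input for the second NN.

    frames: list of equal-length feature vectors, one per frame.
    Out-of-range offsets are clamped to the first/last frame (an
    assumed edge-handling choice).
    """
    T = len(frames)
    stacked = []
    for t in range(T):
        ctx = []
        for o in offsets:
            ctx.extend(frames[min(max(t + o, 0), T - 1)])
        stacked.append(ctx)
    return stacked

# 100 frames of 80-dim bottlenecks -> 100 frames of 5 * 80 = 400 dims
feats = [[0.0] * 80 for _ in range(100)]
out = stack_in_time(feats)
assert len(out) == 100 and len(out[0]) == 400
```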

  4. General system overview
  ● i-vector based systems:
    ○ Features:
      ■ DNN bottlenecks trained on
        ● Switchboard English (Fixed cond.)
        ● Babel data – multilingual bottleneck features (Open cond.)
      ■ MFCC-SDC + PLLR (phone LLH ratios)
    ○ 2048 Full or Diagonal GMM/UBM, 600-dimensional i-vectors
    ○ Gaussian Linear Classifier (GLC) seems sufficient
      ■ Including i-vector uncertainty in scoring helps
  ● Frame-level Sequence Summarizing NN (SSNN)
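The GLC scoring mentioned above can be sketched as follows. A minimal sketch assuming a shared diagonal covariance on toy 2-dim "i-vectors"; the real systems use full covariances, 600-dim i-vectors, and optionally i-vector uncertainty, none of which is shown here.

```python
import math

def glc_scores(x, means, var):
    """Gaussian Linear Classifier: one mean per language, one shared
    (here diagonal) covariance.  Because the covariance is shared, the
    quadratic term in x is common to all classes, so class
    log-likelihoods differ only by a linear function of x."""
    scores = {}
    for lang, mu in means.items():
        ll = 0.0
        for xi, mi, vi in zip(x, mu, var):
            ll += -0.5 * ((xi - mi) ** 2 / vi + math.log(2 * math.pi * vi))
        scores[lang] = ll
    return scores

# hypothetical per-language means and shared variance
means = {"spa": [1.0, 0.0], "eng": [-1.0, 0.0]}
var = [1.0, 1.0]
s = glc_scores([0.9, 0.1], means, var)
assert s["spa"] > s["eng"]  # test vector sits near the "spa" mean
```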

  5. Fusion with Prior-weighted Logistic Regression
  ● Fusion is trained on dev data in the score domain
  ● One weight per system and one bias per language
  ● Cluster prior: for the data of each cluster, we used a cluster-specific prior, with zero probabilities for out-of-cluster languages and equal weights within the cluster
  ● Alternative system to allow between-cluster analysis: uniform (flat) prior over all languages
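The fusion recipe above (one weight per system, one bias per language, then a language prior) can be sketched as follows. The weights, biases, and the three-language cluster are hypothetical toy values, not the trained parameters.

```python
import math

def fuse(system_scores, weights, biases):
    """Score-level fusion: one weight per system, one bias per language."""
    return {lang: sum(w * s[lang] for w, s in zip(weights, system_scores))
                  + biases[lang]
            for lang in biases}

def posterior(fused, prior):
    """Turn fused scores into posteriors under a given prior; a zero
    prior (out-of-cluster language) yields zero posterior."""
    post = {l: math.exp(fused[l]) * prior[l] for l in fused}
    z = sum(post.values())
    return {l: p / z for l, p in post.items()}

# two toy systems over three languages (hypothetical scores)
sysA = {"fra": 2.0, "hat": 0.5, "eng": 1.0}
sysB = {"fra": 1.5, "hat": 0.2, "eng": 2.5}
fused = fuse([sysA, sysB], [0.8, 0.6], {"fra": 0.0, "hat": 0.1, "eng": -0.2})

# cluster-specific prior: equal within the cluster, zero outside it
cluster_prior = {"fra": 0.5, "hat": 0.5, "eng": 0.0}
p = posterior(fused, cluster_prior)
assert p["eng"] == 0.0 and abs(sum(p.values()) - 1.0) < 1e-9
```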

  6. Fixed Training Condition (DEV results; EVAL columns are filled in on the next slide)

  Fusion        DEV cavg*
  Primary       1.9
  Alternate1    1.24

  Single systems                  classf    DEV cavg*
  SBN80-SWB1-KALDI--CD            GLC COV   2.41
  SBN80-SWB1--CD                  NN        2.80
  SDC-PLLR--CD                    GLC       4.72
  SBN80-AUTO600-KALDI--CD         GLC COV   5.46
  SSNN / Alternate 2              NN        10.46
  SBN80-SWB1-KALDI--CD / Alt3     GLC       2.31

  7. Fixed Training Condition

  Fusion        DEV cavg*   EVL cavg / cavg*
  Primary       1.9         18.1 / 13.5
  Alternate1    1.24        19.4 / 13.4

  Single systems                  classf    DEV cavg*   EVL cavg
  SBN80-SWB1-KALDI--CD            GLC COV   2.41        16.9
  SBN80-SWB1--CD                  NN        2.80        19.9
  SDC-PLLR--CD                    GLC       4.72        22.0
  SBN80-AUTO600-KALDI--CD         GLC COV   5.46        27.0
  SSNN / Alternate 2              NN        10.46       35.0
  SBN80-SWB1-KALDI--CD / Alt3     GLC       2.31        18.48

  ● Eval: the single best system outperforms the Primary fusion
  ● Calibration
    ○ Almost no calibration loss on Dev
    ○ Fairly large calibration loss on Eval

  8. Cluster-dependent i-vectors
  ● Score is the average of scores from 6 systems, where the UBM of each system is trained only on data from a given cluster

  Fixed Training Condition      DEV cavg*   EVAL cavg   EVAL cavg*
  SBN80-SWB1-KALDI              2.9         20.1        16.2
  SBN80-SWB1-KALDI-CD           2.5         19.7        15.4
  SBN80-SWB1-KALDI-CD diag      2.3         18.5        14.9
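The score averaging described above can be sketched as follows (two toy systems instead of the six cluster-dependent ones, with hypothetical scores):

```python
def average_cluster_scores(per_system_scores):
    """Average per-language scores over cluster-dependent systems
    (one UBM per language cluster; 6 systems in the submitted setup)."""
    n = len(per_system_scores)
    return {lang: sum(s[lang] for s in per_system_scores) / n
            for lang in per_system_scores[0]}

avg = average_cluster_scores([{"fra": 1.0, "eng": 0.0},
                              {"fra": 3.0, "eng": 1.0}])
assert avg == {"fra": 2.0, "eng": 0.5}
```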

  9. Sequence Summarizing NN

  10. Open Training Condition (DEV results; EVAL columns are filled in on the next slide)

  Fusion        DEV cavg*
  Primary       7.14
  Alternate1    7.15

  Single systems                  classf    DEV cavg*
  SSNN                            NN        30.0
  ML-17-SBN-CD                    GLC       8.8
  MultilangRDT                    GLC       10.4
  SBN80-SWB1-KALDI--CD            GLC       10.4
  SDC-PLLR-CD                     NN        12.7
  SBN80-AUTO600-KALDI             NN        15.6
  ML-17-SBN (trained on Open)     GLC COV   8.9

  11. Open Training Condition

  Fusion        DEV cavg*   EVL cavg / cavg*
  Primary       7.14        14.1 / 10.3
  Alternate1    7.15        14.1 / 10.4

  Single systems                  classf    DEV cavg*   EVL cavg
  SSNN                            NN        30.0        41.3
  ML-17-SBN-CD                    GLC       8.8         13.9
  MultilangRDT                    GLC       10.4        13.6
  SBN80-SWB1-KALDI--CD            GLC       10.4        17.6
  SDC-PLLR-CD                     NN        12.7        21.4
  SBN80-AUTO600-KALDI             NN        15.6        25.0
  ML-17-SBN (trained on Open)     GLC COV   8.9         12.0

  ● The single best system trained fully on the Open condition beats the fusion

  12. Analysis of training data
  - Analysis of using different training data for the UBM/i-vector extractor and the classifier
  - Important to train both the i-vector extractor and the classifier on the Open dataset
  - (Results shown as a chart of UBM/IVEC_Classifier combinations; legend: F = Fixed training data, O = Open training data; the submitted combination is marked)

  13. Comparison of different features
  - Fixed Training Condition
  - All systems: 2048G FullCov UBM, 600-dim i-vectors and Gaussian classifier
  - (Results shown as a chart; systems marked * violate the fixed data condition - post-eval analysis only)

  14. French cluster disaster
  ● Radio vs. telephone in DEV - most probably overtrained to the channel
  ● Channel takes over on the EVAL data
  ● Calibration on eval data cannot fix a wrong classifier

  15. Comparison of different i-vector classifiers
  - Different classifiers perform similarly:
    - Gaussian Linear Classifier (GLC)
    - Language Dependent I-vector (LDI)
    - Multiclass Multivariate Fully Bayesian Gaussian Classifier (MMFBG)
    - Neural Network
    - Logistic Regression

  16. Automatically derived acoustic units for BN training
  - Variational Bayes trained Dirichlet Process mixture of HMMs
  - Open loop over a potentially infinite number of phone-like units
  - 3-state HMMs, 2 Gaussians per state
  - 2048G FullCov UBM, 600-dim i-vectors and GLC + cuts

  Fixed data condition: Features    DEV cavg*   EVAL cavg / cavg*
  MFCC-SDC                          6.3         23.8 / 21.5
  SBN80-AUTO600-KALDI               5.4         28.9 / 24.2
  SBN80-SWB1-KALDI                  2.9         20.1 / 16.2

  - We can beat the SDC baseline on DEV even without transcriptions
  - A conventional bottleneck trained on (probably) any data is still better

  17. Conclusion - lessons learned
  ● The state-of-the-art system is an i-vector system with Bottleneck features
  ● GLC with uncertainty performs similarly to GLC trained on a lot of small cuts
  ● Phonotactic systems do not contribute to the final fusion
  ● Data engineering is always important
  ● Frame-level NN approaches
    ○ prone to overtraining
    ○ better to use the NN as a source of counts which are modelled by another classifier
  ● Other systems
    ○ Denoising/dereverberation with NN - helps on EVL but not on DEV
    ○ Phonotactic systems - with a Switchboard phoneme recognizer
    ○ Frame-level DNN

  18. THANK YOU
