  1. The Sheffield language recognition system in NIST LRE 2015
     Raymond Ng, Mauro Nicolao, Oscar Saz, Madina Hasan, Bhusan Chettri, Mortaza Doulaty, Tan Lee and Thomas Hain
     University of Sheffield, UK & The Chinese University of Hong Kong
     22 June 2016

  2. Outline
     - Introduction
     - Segmentation
     - Component systems
     - System fusion
     - Conclusion

  3. Introduction
     - Background:
       - Classical approaches with acoustic-phonetic and phonotactic features [Zissman 1996; Ambikairajah et al. 2011; Li et al. 2013]
       - Shifted-delta cepstral coefficients [Torres-Carrasquillo et al. 2002]
       - I-vectors [Dehak, Torres-Carrasquillo, et al. 2011; Martinez et al. 2012], DNNs [Ferrer et al. 2014; Richardson et al. 2015] and their combination [Ferrer et al. 2016]
     - Sheffield LRE system: four LR systems
       - I-vector
       - Phonotactic
       - "Direct" DNN
       - Bottleneck + I-vector

  4. Data and target languages
     - Training data
       - Switchboard 1, Switchboard Cellular Part 2
       - LRE 2015 training data

     Cluster  | Target languages
     ---------|-----------------
     Arabic   | Egyptian (ara-arz), Iraqi (ara-acm), Levantine (ara-apc), Maghrebi (ara-ary), Modern Standard (ara-arb)
     English  | British (eng-gbr), General American (eng-usg), Indian (eng-sas)
     French   | West African (fre-waf), Haitian Creole (fre-hat)
     Slavic   | Polish (qsl-pol), Russian (qsl-rus)
     Iberian  | Caribbean Spanish (spa-car), European Spanish (spa-eur), Latin American Spanish (spa-lac), Brazilian Portuguese (por-brz)
     Chinese  | Cantonese (zho-yue), Mandarin (zho-cmn), Min (zho-cdo), Wu (zho-wuu)

  5. Voice activity detection
     - Training data:
       - CTS data: CMLLR+BMMI SWB model → alignment → SIL vs non-SIL
       - BNBS data: VAD reference from 1% of VOA2, VOA3 files
     - Class balancing: add more non-speech data

     Dataset       | Speech | Non-speech
     --------------|--------|-----------
     Switchboard 1 | 210h   | 288h
     VOA2          | 55h    | 61h
     VOA3          | 93h    | 72h
     Total         | 358h   | 421h

  6. Voice activity detection
     - DNN frame-based speech / non-speech classifier
     - Features: filterbank (23D) ± 15 frames, DCT across time → 368 dimensions
     - Framewise classification: DNN 368-1000-1000-2, learning rate 0.001, newbob schedule
     - Sequence alignment: 2-state HMM, minimum state duration 20 frames (200 ms)
     - Smoothing: merging heuristic to bridge non-speech gaps < 2 seconds (see the sketch below)
     - Results (collar: 10 ms)

     Dataset       | Duration | Miss   | False alarm | SER
     --------------|----------|--------|-------------|-------
     Switchboard 1 | 17.3h    | 2.21%  | 2.63%       | 4.84%
     VOA2-test     | 7.9h     | 19.43% | 78.61%      | 98.04%
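
The gap-merging heuristic only relabels short pauses that sit between speech regions. Below is a minimal NumPy sketch of that step, assuming 10 ms frame decisions and a 2-second threshold; the function and variable names are illustrative, not taken from the system.

```python
import numpy as np

def merge_short_gaps(speech_mask, frame_shift=0.01, max_gap=2.0):
    """Bridge non-speech gaps shorter than `max_gap` seconds.

    speech_mask : 1-D array of 0/1 frame decisions (1 = speech).
    Returns a copy with short non-speech runs relabelled as speech.
    """
    mask = speech_mask.astype(int).copy()
    max_gap_frames = int(round(max_gap / frame_shift))

    # Find runs of consecutive identical labels.
    change = np.flatnonzero(np.diff(mask)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(mask)]))

    for s, e in zip(starts, ends):
        # A short non-speech run flanked by speech on both sides is bridged.
        if mask[s] == 0 and s > 0 and e < len(mask) and (e - s) < max_gap_frames:
            mask[s:e] = 1
    return mask

# Example: two speech segments separated by a 1.5 s pause are merged.
frames = np.concatenate([np.ones(300), np.zeros(150), np.ones(200)])
merged = merge_short_gaps(frames)
assert merged.sum() == len(frames)
```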

  7. Segmentation of LRE data
     - V1 (30s) and V3 (3s, 10s, 30s) data sets
     - V1 data
       - VAD, sequence alignment, smoothing
       - Filtering (20s ≤ segment length < 45s)
       - Total: 147.8h
     - V3 data
       - Phone decoding with SWB tokeniser (and V1 segmentation)
       - Resegmentation
       - (30s) 320.5h, (10s) 262.0h, (3s) 308.4h
     - Data partition: 80% train, 10% development, 10% internal test

  8. NIST LRE 2015 primary system
     [System diagram: VAD and Switchboard-trained tokeniser / frame-based features feed the DNN, phonotactic and i-vector components; UBM / total-variability training; language DNN, SVM / logistic-regression and SVM classifiers; Gaussian backends; system fusion.]

  9. I-vector LR system
     - Feature processing
       - Normalisation: VTLN
       - Shifted delta cepstra: 7 + 7-1-3-7 [Torres-Carrasquillo et al. 2002] (see the sketch below)
       - Mean normalisation and frame-based VAD
     - UBM and total variability
       - UBM: 2048-component, full-covariance GMM
       - Total variability: 114688 × 600 [Dehak, Kenny, et al. 2011]
     - Language classifier
       - Support vector machine
       - Logistic regression classifier
     - Focus of study
       - Data used for UBM and total variability matrix training
       - Language classifier
       - Global vs within-cluster classifier
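
The 7-1-3-7 shifted delta cepstrum stacks k = 7 delta blocks, each computed d = 1 frames apart and shifted by P = 3 frames, on top of the N = 7 static coefficients (56 dimensions in total, the "7 + 7-1-3-7" above). A minimal NumPy sketch; edge padding at utterance boundaries is our assumption, not part of the recipe on the slide.

```python
import numpy as np

def sdc(cep, N=7, d=1, P=3, k=7):
    """Shifted-delta cepstra with an N-d-P-k parameterisation.

    cep : (T, >=N) array of frame-level cepstral coefficients.
    Returns a (T, N + N*k) array: static coefficients plus k stacked
    delta blocks, each computed d frames apart and shifted by P frames.
    """
    c = cep[:, :N]
    T = len(c)
    # Pad at the edges so every shifted index stays in range.
    pad = d + P * (k - 1)
    cp = np.pad(c, ((pad, pad), (0, 0)), mode="edge")

    blocks = []
    for i in range(k):
        shift = P * i
        delta = cp[pad + shift + d : pad + shift + d + T] - \
                cp[pad + shift - d : pad + shift - d + T]
        blocks.append(delta)
    return np.hstack([c] + blocks)

# 7 static + 7x7 delta blocks -> 56-dimensional SDC features.
feats = sdc(np.random.randn(500, 13))
assert feats.shape == (500, 56)
```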

  10. I-vector LR system: results on V1 data
     - Configurations:
       - A: UBM and total variability (Tv) matrix trained on 148h of selected data
       - B: augmenting the UBM and Tv training data in A to the full training set (884h)
       - C: using logistic regression (LogReg) instead of SVM as the LR classifier
       - D: augmenting the LogReg training data in C to the full training set (884h)
     [Bar chart: min DCF (%) for global vs within-cluster classifiers across configurations A to D; bar values 10.75, 6.35, 6.00, 4.54, 4.42]
     - Observations:
       - Within-cluster classifier outperforms global classifier
       - Best training data (UBM and Tv): 887h

  11. I-vector LR system: results on V3 data
     - Configurations:
       - B: augmenting the UBM and Tv training data in A to the full training set (884h)
       - C: using logistic regression (LogReg) instead of SVM as the LR classifier
       - D: augmenting the LogReg training data in C to the full training set (884h)
     [Bar chart: min DCF (%) for global vs within-cluster classifiers across configurations B, C(V1), C and D; bar values 7.90, 7.74, 7.48, 6.78, 6.09]
     - Observations:
       - Within-cluster classifier outperforms global classifier
       - Best training data (LR classifier): 332h
       - Logistic regression outperforms SVM

  12. Phonotactic LR system
     - DNN phone tokeniser
       - LDA, speaker CMLLR
       - 400-2048(×6)-64-3815 DNN
       - Phone-bigram LM (scale factor = 0.5)
       - (Optional) sequence training on SWB data
     - Language classifier: phone n-gram tf-idf statistics (see the sketch below)
       - Phone bi-gram / phone tri-gram (~5M dimensions)
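
One way to picture the phonotactic classifier: treat each utterance's 1-best phone string from the tokeniser as a document, compute tf-idf statistics over phone bi-/tri-grams, and feed the vectors to a linear classifier. The scikit-learn sketch below illustrates this; the phone strings and language labels are invented placeholders, and the actual system operated on far higher-dimensional (around 5M) statistics with an SVM or logistic regression.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each utterance is represented by its 1-best phone string (toy examples).
phone_strings = [
    "sil dh ax k ae t sil",
    "sil l ax m ae n sil",
]
labels = ["eng-usg", "eng-gbr"]  # illustrative labels only

# Bigram/trigram counts over phone tokens, tf-idf weighted, fed to a
# linear classifier.
model = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(2, 3), token_pattern=r"\S+"),
    LogisticRegression(max_iter=1000),
)
model.fit(phone_strings, labels)
print(model.predict(["sil dh ax k ae t sil"]))
```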

  13. Phonotactic LR system: results
     - Test on V1 30-second internal test data
     [Bar chart: min DCF (%) for 2-gram and 3-gram features with DNN and fMPE DNN tokenisers; bar values 10.7, 10.5, 9.8, 9.8, 9.5, 9.0]
     - Observations
       - 3-gram tf-idf outperforms 2-gram
       - Discriminatively trained DNN: ✗ (no improvement)
     - Test on V3 30s data → 11.3%

  14. DNN LR system
     - Features:
       - 64-dimensional bottleneck features from the Switchboard tokeniser
       - Feature splicing ± 4 frames
     - Language recogniser DNN: 576-750(×4)-20
     - Prior normalisation: test probabilities multiplied by the inverse of the language prior (training set)
     - Decision: frame-based language likelihood averaged over the whole utterance (see the sketch below)
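
A hedged sketch of the scoring rule above: divide the frame-level DNN posteriors by the training-set language priors and average over the utterance. Averaging in the log domain is our assumption; the slide only states that frame-based likelihoods are averaged over the whole utterance.

```python
import numpy as np

def utterance_score(frame_posteriors, train_priors):
    """Score an utterance from frame-level DNN language posteriors.

    frame_posteriors : (T, L) softmax outputs for L target languages.
    train_priors     : (L,) language priors estimated on the training set.
    Divides posteriors by the training priors (prior normalisation) and
    averages the resulting log-likelihoods over all frames.
    """
    likelihoods = frame_posteriors / train_priors        # p(x|l) up to a constant
    log_lik = np.log(np.clip(likelihoods, 1e-10, None))
    return log_lik.mean(axis=0)                          # one score per language

# Toy example: 300 frames, 20 target languages, uniform training priors.
scores = utterance_score(np.random.dirichlet(np.ones(20), size=300),
                         np.full(20, 1 / 20))
predicted_language = int(np.argmax(scores))
```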

  15. DNN LR system: results
     - Test on V1 and V3 (internal test) data with different durations (30s, 10s, 3s)
     [Bar chart: min DCF (%) on V1 and V3 test data at 30s, 10s and 3s; bar values 21.55, 21.71, 18.74, 18.07, 15.96]

  16. Enhanced system
     [System diagram: the primary system extended with ASR-based silence detection and a bottleneck i-vector branch; bottleneck features feed a second UBM / total-variability training; logistic regression replaces SVM for the i-vector systems; Gaussian backends; system fusion.]

  17. Bottleneck i-vector system
     - Feature processing
       - 64-dimensional bottleneck features from the Switchboard tokeniser
       - No VTLN, no SDC, no mean/variance normalisation
       - Frame-based VAD
     - UBM and total variability (see the sketch below)
       - UBM: 2048-component, full-covariance GMM
       - Total variability: 131072 × 600 [Dehak, Kenny, et al. 2011]
     - Language classifier
       - Logistic regression classifier
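
For readers unfamiliar with the UBM / total-variability step shared by the i-vector and bottleneck i-vector systems, the sketch below computes the standard i-vector point estimate w = (I + Tᵀ Σ⁻¹ N T)⁻¹ Tᵀ Σ⁻¹ F from Baum-Welch statistics. It assumes a diagonal covariance for brevity, whereas the UBMs above are full-covariance; all names and the toy dimensions are illustrative.

```python
import numpy as np

def extract_ivector(N, F, T, Sigma):
    """MAP point estimate of the i-vector given Baum-Welch statistics.

    N     : (C,) zeroth-order stats (per-component frame counts).
    F     : (C, D) centred first-order stats (UBM means subtracted).
    T     : (C*D, R) total variability matrix.
    Sigma : (C*D,) UBM covariance (diagonal approximation used here).
    """
    R = T.shape[1]
    D = F.shape[1]
    N_sv = np.repeat(N, D)                    # expand counts to supervector layout
    F_sv = F.reshape(-1)                      # (C*D,)
    T_w = T / Sigma[:, None]                  # Sigma^-1 T
    precision = np.eye(R) + T_w.T @ (N_sv[:, None] * T)   # I + T' Sigma^-1 N T
    return np.linalg.solve(precision, T_w.T @ F_sv)       # posterior mean

# Toy example: 8-component UBM, 5-dim features, 10-dim i-vector.
C, D, R = 8, 5, 10
w = extract_ivector(np.random.rand(C), np.random.randn(C, D),
                    np.random.randn(C * D, R), np.ones(C * D))
assert w.shape == (R,)
```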

  18. Bottleneck i-vector system: results
     - I-vector vs bottleneck i-vector systems on internal test data
     [Bar chart: min DCF (%) for the i-vector and bottleneck i-vector systems at 30s, 10s and 3s; bar values 17.2, 13.69, 12.23, 9.06, 6.09, 5.13]

  19. System calibration and fusion
     - Gaussian backend applied to each single-system output (see the sketch below)
       - GMMs (4/8/16 components) trained on the score vectors from training data (30s)
       - GMMs are target-language dependent
     - Logistic regression
       - Log-likelihood-ratio conversion
       - System combination weights trained on dev data (10%) [Brümmer et al. 2006]
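
A rough scikit-learn sketch of this calibration and fusion chain: per-language GMMs over raw system score vectors, a simple log-likelihood centring in place of the full log-likelihood-ratio conversion, and a logistic regression trained on the development partition to fuse the component systems. Component counts, function names and the centring step are simplifications, not the exact recipe of [Brümmer et al. 2006].

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def train_gaussian_backend(scores, labels, languages, n_components=4):
    """One GMM per target language over raw system score vectors.

    scores : (n, L) array of system scores; labels : (n,) array of language ids.
    Needs enough training score vectors per language for a full-covariance fit.
    """
    return {
        lang: GaussianMixture(n_components=n_components,
                              covariance_type="full",
                              random_state=0).fit(scores[labels == lang])
        for lang in languages
    }

def backend_llrs(backend, scores):
    """Per-language log-likelihoods, centred as a crude stand-in for LLRs."""
    ll = np.column_stack([backend[l].score_samples(scores) for l in sorted(backend)])
    return ll - ll.mean(axis=1, keepdims=True)

def fuse(system_llrs_dev, dev_labels, system_llrs_test):
    """Stack calibrated scores of the component systems and train a
    multi-class logistic regression on the development partition."""
    fusion = LogisticRegression(max_iter=1000)
    fusion.fit(np.hstack(system_llrs_dev), dev_labels)
    return fusion.predict_proba(np.hstack(system_llrs_test))
```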

  20. System calibration results
     - Overall min DCF on internal test, 30s
     [Bar chart: min DCF (global threshold) without calibration vs with the Gaussian backend for the i-vector, DNN and phonotactic systems; bar values 29.51, 26.54, 22.5, 22, 20.17, 12.48]

  21. System fusion results
     [Bar chart, internal test 30s: min DCF (global threshold) for Ivector, DNN, Phonotactic, 3-sys, bn-ivector and 4-sys; bar values 19.97, 19.21, 10.84, 10.21, 9.42, 8.87]
     [Bar chart, internal test 3s: min DCF (global threshold) for Ivector, DNN, Phonotactic, 3-sys, bn-ivector and 4-sys; bar values 36.11, 22.83, 21.81, 18.47, 17.7, 15.53]

  22. System fusion results: LRE 2015 EVAL
     - Overall eval system results
     [Bar chart: min DCF (global threshold) for Ivector, DNN, Phonotactic, 3-sys, bn-ivector and 4-sys; bar values 40.16, 36.93, 32.92, 32.44, 29.56, 29.2]

  23. Pairwise system contribution
     - Results shown on eval 30s data
     - System fusion always improves performance, except for fusion with the DNN system
     - For any given system, pairwise fusion with a better system generally gives better results

  24. Conclusion
     - Introduced the three LR component systems submitted to NIST LRE 2015
     - Described segmentation, data selection and classifier training
     - An enhanced bottleneck i-vector system demonstrated good performance
     - Future work
       - Data selection and augmentation
       - Multi-lingual NN, bottleneck features
       - Variability compensation
     - Suggestions and collaborations welcome
