The Sheffield language recognition system in NIST LRE 2015
Raymond Ng, Mauro Nicolao, Oscar Saz, Madina Hasan, Bhusan Chettri, Mortaza Doulaty, Tan Lee and Thomas Hain
University of Sheffield, UK & The Chinese University of Hong Kong
22 June 2016
Outline
• Introduction
• Segmentation
• Component systems
• System fusion
• Conclusion
Introduction
• Background:
  • Classical approaches with acoustic-phonetic and phonotactic features [Zissman 1996; Ambikairajah et al. 2011; Li et al. 2013]
  • Shifted-delta cepstral coefficients [Torres-Carrasquillo et al. 2002]
  • I-vectors [Dehak, Torres-Carrasquillo, et al. 2011; Martinez et al. 2012], DNNs [Ferrer et al. 2014; Richardson et al. 2015] and their combination [Ferrer et al. 2016]
• Sheffield LRE system: four LR systems
  • I-vector
  • Phonotactic
  • "Direct" DNN
  • Bottleneck + i-vector
Data and target languages
• Training data
  • Switchboard 1, Switchboard Cellular Part 2
  • LRE 2015 training data

Cluster  | Target languages
Arabic   | Egyptian (ara-arz), Iraqi (ara-acm), Levantine (ara-apc), Maghrebi (ara-ary), Modern Standard (ara-arb)
English  | British (eng-gbr), General American (eng-usg), Indian (eng-sas)
French   | West African (fre-waf), Haitian Creole (fre-hat)
Slavic   | Polish (qsl-pol), Russian (qsl-rus)
Iberian  | Caribbean Spanish (spa-car), European Spanish (spa-eur), Latin American Spanish (spa-lac), Brazilian Portuguese (por-brz)
Chinese  | Cantonese (zho-yue), Mandarin (zho-cmn), Min (zho-cdo), Wu (zho-wuu)
Voice activity detection
• Training data:
  • CTS data: CMLLR+BMMI SWB model → alignment → SIL vs non-SIL
  • BNBS data: VAD reference from 1% of VOA2, VOA3 files
  • Class balancing: add more non-speech data

Dataset       | Speech | Non-speech
Switchboard 1 | 210h   | 288h
VOA2          | 55h    | 61h
VOA3          | 93h    | 72h
Total         | 358h   | 421h
Voice activity detection
• DNN frame-based speech / non-speech classifier
• Features: filterbank (23D) ± 15 frames, DCT across time → 368
• Framewise classification: DNN 368-1000-1000-2, learning rate 0.001, newbob scheduling
• Sequence alignment: 2-state HMM, minimum state duration 20 frames (200ms)
• Smoothing: merging heuristics to bridge non-speech gaps < 2 seconds
• Results (collar: 10ms)

Dataset       | Duration | Miss   | False alarm | SER
Switchboard 1 | 17.3h    | 2.21%  | 2.63%       | 4.84%
VOA2-test     | 7.9h     | 19.43% | 78.61%      | 98.04%
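The smoothing step above (bridging non-speech gaps shorter than 2 seconds) can be sketched as a simple merge over time-stamped speech segments. This is a minimal illustration of the heuristic, not the actual pipeline; the function name and segment representation are assumptions.

```python
# Hypothetical sketch of the VAD smoothing heuristic: speech segments
# are (start, end) pairs in seconds; adjacent segments separated by a
# non-speech gap shorter than max_gap (2 s on the slide) are merged.

def bridge_gaps(segments, max_gap=2.0):
    """Merge (start, end) speech segments separated by short gaps."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < max_gap:
            # Gap shorter than max_gap: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(bridge_gaps([(0.0, 4.5), (5.0, 9.0), (12.0, 15.0)]))
# gaps: 0.5 s (bridged) and 3.0 s (kept) -> [(0.0, 9.0), (12.0, 15.0)]
```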
Segmentation of LRE data
• V1 (30s) and V3 (3s, 10s, 30s)
• V1 data
  • VAD, sequence alignment, smoothing
  • Filtering (20s ≤ segment length < 45s)
  • Total: 147.8h
• V3 data
  • Phone decoding with the SWB tokeniser (and V1 segmentation)
  • Resegmentation
  • (30s) 320.5h, (10s) 262.0h, (3s) 308.4h
• Data partition: 80% train, 10% development, 10% internal test
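The V1 filtering and the 80/10/10 partition above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the shuffle and the fixed seed are mine, not the actual data-selection code.

```python
# Hypothetical sketch: keep segments whose duration satisfies
# 20 s <= length < 45 s (the V1 filter), then split 80/10/10 into
# train / development / internal test.
import random

def filter_and_split(segments, min_len=20.0, max_len=45.0, seed=0):
    """segments: list of (start, end) pairs in seconds."""
    kept = [s for s in segments if min_len <= s[1] - s[0] < max_len]
    rng = random.Random(seed)          # fixed seed for reproducibility
    rng.shuffle(kept)
    n = len(kept)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (kept[:n_train],
            kept[n_train:n_train + n_dev],
            kept[n_train + n_dev:])
```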
NIST LRE 2015 primary system
[System diagram: VAD and a Switchboard-trained feature extractor / phone tokeniser feed three component systems — a frame-based language DNN, a phonotactic system with an SVM classifier, and an i-vector system (UBM/Tv training, SVM or LogReg classifier). Gaussian backends calibrate the component outputs before system fusion.]
I-vector LR system
• Feature processing
  • Normalisation: VTLN
  • Shifted-delta cepstra: 7 + 7-1-3-7 [Torres-Carrasquillo et al. 2002]
  • Mean normalisation and frame-based VAD
• UBM and total variability
  • UBM: 2048-component, full-covariance GMM
  • Total variability: 114688 × 600 [Dehak, Kenny, et al. 2011]
• Language classifier
  • Support vector machine
  • Logistic regression classifier
• Focus of study
  • Data in UBM and total variability matrix training
  • Language classifier
  • Global / within-cluster classifier
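The 7-1-3-7 shifted-delta-cepstra configuration above follows the standard SDC definition (N-d-P-k): for each frame, k=7 delta blocks spaced P=3 frames apart, each computed over a ±d=1 window on the first N=7 cepstra, then stacked; appending the 7 static cepstra gives the "7 + 7-1-3-7" layout. A minimal sketch (my own implementation, with edge padding at utterance boundaries as an assumption):

```python
import numpy as np

def sdc(cep, N=7, d=1, P=3, k=7):
    """Shifted-delta cepstra (N-d-P-k): for each frame t, stack the k
    deltas c[t + i*P + d] - c[t + i*P - d], i = 0..k-1, over the first
    N cepstral coefficients. Returns a (T, N*k) array."""
    c = cep[:, :N]
    T = c.shape[0]
    pad = d + (k - 1) * P
    cp = np.pad(c, ((pad, pad), (0, 0)), mode='edge')  # edge-pad in time
    blocks = []
    for i in range(k):
        off = pad + i * P
        blocks.append(cp[off + d: off + d + T] - cp[off - d: off - d + T])
    return np.hstack(blocks)  # (T, 49) for 7-1-3-7

feats = sdc(np.random.randn(100, 13))
print(feats.shape)  # (100, 49); with the 7 statics appended: 56-dim
```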
I-vector LR system: results on V1 data
• Configurations:
  • A: UBM & total variability (Tv) matrix trained on 148h of selected data
  • B: augmenting the UBM & Tv training data in A to the full training set (884h)
  • C: using logistic regression (LogReg) instead of SVM as the LR classifier
  • D: augmenting the LogReg training data in C to the full training set (884h)
• [Bar chart: min DCF (%), global vs within-cluster classifiers over configurations A–D; within-cluster results fall from 6.35 (A) through 6.00 and 4.54 to 4.42 (D), against 10.75 for the global classifier on A]
• Observations:
  • The within-cluster classifier outperforms the global classifier
  • Best training data (UBM and Tv): the full 884h set
I-vector LR system: results on V3 data
• Configurations:
  • B: augmenting the UBM & Tv training data in A to the full training set (884h)
  • C: using logistic regression (LogReg) instead of SVM as the LR classifier
  • D: augmenting the LogReg training data in C to the full training set (884h)
• [Bar chart: within-cluster min DCF (%) over configurations B, C(V1), C and D, improving from 7.90 through 7.74, 7.48 and 6.78 down to 6.09]
• Observations:
  • The within-cluster classifier outperforms the global classifier
  • Best training data (LR classifier): 332h
  • Logistic regression outperforms the SVM
Phonotactic LR system
• DNN phone tokeniser
  • LDA, speaker CMLLR
  • 400-2048(×6)-64-3815 DNN
  • Phone-bigram LM (scale factor = 0.5)
  • (Optional) sequence training on SWB data
• Language classifier: phone n-gram tf-idf statistics
  • Phone bi-grams / phone tri-grams (5M dimensions)
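The phonotactic classifier above turns each tokenised utterance into a tf-idf vector over phone n-grams. A minimal bigram sketch (my own dense-dict implementation; the real system uses sparse vectors of up to 5M dimensions and feeds them to an SVM):

```python
# Hypothetical sketch: tf-idf statistics over phone bigrams.
# phone_seqs: one list of phone labels per utterance (tokeniser output).
import math
from collections import Counter

def tfidf_bigrams(phone_seqs):
    """Return one {bigram: tf-idf weight} dict per utterance."""
    counts = [Counter(zip(seq, seq[1:])) for seq in phone_seqs]
    n_docs = len(phone_seqs)
    df = Counter(bg for c in counts for bg in c)  # document frequency
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({bg: (n / total) * math.log(n_docs / df[bg])
                        for bg, n in c.items()})
    return vectors
```

A bigram that occurs in every utterance gets idf log(1) = 0, so only discriminative phone sequences carry weight.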
Phonotactic LR system: results
• Test on the V1 30-second internal test data:
• [Bar chart: min DCF (%) for 2-gram and 3-gram tf-idf features with DNN and fMPE-DNN tokenisers; values 10.7, 10.5, 9.8, 9.8, 9.5 and 9.0, the best being 9.0]
• Observations
  • 3-gram tf-idf outperforms 2-gram
  • Discriminatively trained (fMPE) DNN gives no improvement
• Test on V3 30s data → 11.3%
DNN LR system
• Features:
  • 64-dimensional bottleneck features from the Switchboard tokeniser
  • Feature splicing ± 4 frames
• Language recogniser: 576-750(×4)-20 DNN
• Prior normalisation: test probabilities multiplied by the inverse of the language priors (from training)
• Decision: frame-based language likelihoods averaged over the whole utterance
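The prior normalisation and utterance-level decision above can be sketched in a few lines. This is an illustration under my reading of the slide (posteriors divided by the training priors, then averaged over frames); the function name and array shapes are assumptions.

```python
import numpy as np

def utterance_scores(frame_probs, train_priors):
    """frame_probs: (T, L) per-frame language posteriors from the DNN.
    train_priors: (L,) language priors estimated on the training set.
    Dividing posteriors by priors gives scaled likelihoods, which are
    averaged over the whole utterance to give one score per language."""
    normalised = frame_probs / train_priors  # prior normalisation
    return normalised.mean(axis=0)           # (L,) utterance scores

scores = utterance_scores(np.full((10, 4), 0.25), np.full(4, 0.25))
print(scores)  # uniform posteriors over uniform priors -> all ones
```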
DNN LR system: results
• Test on V1 and V3 (internal test) data with different durations (30s, 10s, 3s)
• [Bar chart: min DCF (%) on V1 and V3; reported values 21.55, 21.71, 18.74, 18.07 and 15.96]
Enhanced system
[System diagram: as the primary system, with ASR-based silence detection added alongside the VAD, and a fourth component — a bottleneck i-vector system with its own UBM/Tv training and a logistic-regression classifier; the i-vector and phonotactic systems also switch to LogReg. Gaussian backends and system fusion as before.]
Bottleneck i-vector system
• Feature processing
  • 64-dimensional bottleneck features from the Switchboard tokeniser
  • No VTLN, no SDC, no mean/variance normalisation
  • Frame-based VAD
• UBM and total variability
  • UBM: 2048-component, full-covariance GMM
  • Total variability: 131072 × 600 [Dehak, Kenny, et al. 2011]
• Language classifier: logistic regression
Bottleneck i-vector system: results
• I-vector vs bottleneck i-vector systems on internal test data

min DCF (%) | i-vector | bn-ivector
30s         | 6.09     | 5.13
10s         | 12.23    | 9.06
3s          | 17.2     | 13.69
System calibration and fusion
• Gaussian backend applied to each single-system output
  • GMM (4/8/16 components) trained on the score vectors from training data (30s)
  • GMMs are target-language dependent
• Logistic regression
  • Log-likelihood-ratio conversion
  • System combination weights trained on dev data (10%) [Brümmer et al. 2006]
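The Gaussian backend above fits a generative model per target language over the component systems' score vectors. A minimal sketch with a single diagonal Gaussian per language (the actual system used per-language GMMs with 4/8/16 components; the function and variable names are mine):

```python
import numpy as np

def gaussian_backend(train_scores, train_labels):
    """Fit one diagonal Gaussian per target language over the score
    vectors; return a scorer mapping a score vector to per-language
    log-likelihoods."""
    langs = sorted(set(train_labels))
    labels = np.array(train_labels)
    params = {}
    for lang in langs:
        x = train_scores[labels == lang]
        # Mean and variance of this language's score vectors
        # (small floor on the variance for numerical stability).
        params[lang] = (x.mean(axis=0), x.var(axis=0) + 1e-6)

    def score(v):
        return {lang: float(-0.5 * np.sum((v - m) ** 2 / s
                                          + np.log(2 * np.pi * s)))
                for lang, (m, s) in params.items()}
    return score
```

Calibration then converts these log-likelihoods to log-likelihood ratios, and the fusion weights over the component systems are trained on the held-out dev scores.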
System calibration results
• Overall min DCF (global threshold) on internal test, 30s
• [Bar chart: uncalibrated vs Gaussian-backend scores for the i-vector, DNN and phonotactic systems; calibration reduces min DCF in every case, with values ranging from 29.51 down to 12.48]
System fusion results
• [Bar charts: min DCF (global threshold) on internal test for i-vector, DNN, phonotactic, 3-system fusion, bn-ivector and 4-system fusion. 30s: values 19.97, 19.21, 10.84, 10.21, 9.42 and 8.87, with the 4-system fusion best at 8.87%. 3s: values 36.11, 22.83, 21.81, 18.47, 17.7 and 15.53, with the 4-system fusion best at 15.53%.]
System fusion results - LRE 2015 EVAL
• Overall eval system results
• [Bar chart: min DCF (global threshold) for i-vector, DNN, phonotactic, 3-system fusion, bn-ivector and 4-system fusion; values 40.16, 36.93, 32.92, 32.44, 29.56 and 29.2, with the 4-system fusion best at 29.2%]
Pairwise system contribution
• Results shown on eval 30s data
• System fusion always improves performance, except for fusion with the DNN system
• For any given system, pairwise fusion with a better system generally gives better results
Conclusion
• Introduced the 3 LR component systems submitted to NIST LRE 2015
• Described segmentation, data selection and classifier training
• An enhanced bottleneck i-vector system demonstrated good performance
• Future work
  • Data selection and augmentation
  • Multilingual NN, bottleneck features
  • Variability compensation
• Suggestions and collaborations welcome