  1. The Sheffield language recognition system in NIST LRE 2015
     Raymond Ng, Mauro Nicolao, Oscar Saz, Madina Hasan, Bhusan Chettri, Mortaza Doulaty, Tan Lee and Thomas Hain
     University of Sheffield, UK & The Chinese University of Hong Kong
     22 June 2016

  2. Outline
     - Introduction
     - Segmentation
     - Component systems
     - System fusion
     - Conclusion

  3. Introduction
     - Background:
       - Classical approaches with acoustic-phonetic and phonotactic features [Zissman 1996; Ambikairajah et al. 2011; Li et al. 2013]
       - Shifted-delta cepstral coefficients [Torres-Carrasquillo et al. 2002]
       - I-vectors [Dehak, Torres-Carrasquillo, et al. 2011; Martinez et al. 2012], DNNs [Ferrer et al. 2014; Richardson et al. 2015] and their combination [Ferrer et al. 2016]
     - Sheffield LRE system: four LR systems
       - I-vector
       - Phonotactic
       - "Direct" DNN
       - Bottleneck + I-vector

  4. Data and target languages
     - Training data
       - Switchboard 1, Switchboard Cellular Part 2
       - LRE 2015 training data

     Cluster  | Target languages
     ---------|-----------------
     Arabic   | Egyptian (ara-arz), Iraqi (ara-acm), Levantine (ara-apc), Maghrebi (ara-ary), Modern Standard (ara-arb)
     English  | British (eng-gbr), General American (eng-usg), Indian (eng-sas)
     French   | West African (fre-waf), Haitian Creole (fre-hat)
     Slavic   | Polish (qsl-pol), Russian (qsl-rus)
     Iberian  | Caribbean Spanish (spa-car), European Spanish (spa-eur), Latin American Spanish (spa-lac), Brazilian Portuguese (por-brz)
     Chinese  | Cantonese (zho-yue), Mandarin (zho-cmn), Min (zho-cdo), Wu (zho-wuu)

  5. Voice activity detection
     - Training data:
       - CTS data: CMLLR+BMMI SWB model → alignment → SIL vs non-SIL
       - BNBS data: VAD reference from 1% of VOA2, VOA3 files
     - Class balancing: add more non-speech data

     Dataset       | Speech | Non-speech
     --------------|--------|-----------
     Switchboard 1 | 210h   | 288h
     VOA2          | 55h    | 61h
     VOA3          | 93h    | 72h
     Total         | 358h   | 421h

  6. Voice activity detection
     - DNN frame-based speech / non-speech classifier
     - Features: filterbank (23D) ± 15 frames, DCT across time → 368 dimensions
     - Framewise classification: DNN 368-1000-1000-2, learning rate 0.001, newbob schedule
     - Sequence alignment: 2-state HMM, minimum state duration 20 frames (200 ms)
     - Smoothing: merging heuristic to bridge non-speech gaps < 2 seconds (see the sketch below)
     - Results (collar: 10 ms)

     Dataset       | Duration | Miss   | False alarm | SER
     --------------|----------|--------|-------------|-------
     Switchboard 1 | 17.3h    | 2.21%  | 2.63%       | 4.84%
     VOA2-test     | 7.9h     | 19.43% | 78.61%      | 98.04%
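
The gap-merging heuristic only relabels short pauses that sit between speech regions. Below is a minimal NumPy sketch of that step, assuming 10 ms frame decisions and a 2-second threshold; the function and variable names are illustrative, not taken from the system.

```python
import numpy as np

def merge_short_gaps(speech_mask, frame_shift=0.01, max_gap=2.0):
    """Bridge non-speech gaps shorter than `max_gap` seconds.

    speech_mask : 1-D array of 0/1 frame decisions (1 = speech).
    Returns a copy with short non-speech runs relabelled as speech.
    """
    mask = speech_mask.astype(int).copy()
    max_gap_frames = int(round(max_gap / frame_shift))

    # Find runs of consecutive identical labels.
    change = np.flatnonzero(np.diff(mask)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(mask)]))

    for s, e in zip(starts, ends):
        # A short non-speech run flanked by speech on both sides is bridged.
        if mask[s] == 0 and s > 0 and e < len(mask) and (e - s) < max_gap_frames:
            mask[s:e] = 1
    return mask

# Example: two speech segments separated by a 1.5 s pause are merged.
frames = np.concatenate([np.ones(300), np.zeros(150), np.ones(200)])
merged = merge_short_gaps(frames)
assert merged.sum() == len(frames)
```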

  7. Segmentation of LRE data
     - V1 (30s) and V3 (3s, 10s, 30s) data sets
     - V1 data
       - VAD, sequence alignment, smoothing
       - Filtering (20s ≤ segment length < 45s)
       - Total: 147.8h
     - V3 data
       - Phone decoding with SWB tokeniser (and V1 segmentation)
       - Resegmentation
       - (30s) 320.5h, (10s) 262.0h, (3s) 308.4h
     - Data partition: 80% train, 10% development, 10% internal test

  8. NIST LRE 2015 primary system
     [System diagram: VAD and Switchboard-trained tokeniser / frame-based features feed the DNN, phonotactic and i-vector components; UBM / total-variability training; language DNN, SVM / logistic-regression and SVM classifiers; Gaussian backends; system fusion.]

  9. I-vector LR system
     - Feature processing
       - Normalisation: VTLN
       - Shifted delta cepstra: 7 + 7-1-3-7 [Torres-Carrasquillo et al. 2002] (see the sketch below)
       - Mean normalisation and frame-based VAD
     - UBM and total variability
       - UBM: 2048-component, full-covariance GMM
       - Total variability: 114688 × 600 [Dehak, Kenny, et al. 2011]
     - Language classifier
       - Support vector machine
       - Logistic regression classifier
     - Focus of study
       - Data used for UBM and total variability matrix training
       - Language classifier
       - Global vs within-cluster classifier
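
The 7-1-3-7 shifted delta cepstrum stacks k = 7 delta blocks, each computed d = 1 frames apart and shifted by P = 3 frames, on top of the N = 7 static coefficients (56 dimensions in total, the "7 + 7-1-3-7" above). A minimal NumPy sketch; edge padding at utterance boundaries is our assumption, not part of the recipe on the slide.

```python
import numpy as np

def sdc(cep, N=7, d=1, P=3, k=7):
    """Shifted-delta cepstra with an N-d-P-k parameterisation.

    cep : (T, >=N) array of frame-level cepstral coefficients.
    Returns a (T, N + N*k) array: static coefficients plus k stacked
    delta blocks, each computed d frames apart and shifted by P frames.
    """
    c = cep[:, :N]
    T = len(c)
    # Pad at the edges so every shifted index stays in range.
    pad = d + P * (k - 1)
    cp = np.pad(c, ((pad, pad), (0, 0)), mode="edge")

    blocks = []
    for i in range(k):
        shift = P * i
        delta = cp[pad + shift + d : pad + shift + d + T] - \
                cp[pad + shift - d : pad + shift - d + T]
        blocks.append(delta)
    return np.hstack([c] + blocks)

# 7 static + 7x7 delta blocks -> 56-dimensional SDC features.
feats = sdc(np.random.randn(500, 13))
assert feats.shape == (500, 56)
```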

  10. I-vector LR system: results on V1 data
     - Configurations:
       - A: UBM and total variability (Tv) matrix trained on 148h of selected data
       - B: augmenting the UBM and Tv training data in A to the full training set (884h)
       - C: using logistic regression (LogReg) instead of SVM as the LR classifier
       - D: augmenting the LogReg training data in C to the full training set (884h)
     [Bar chart: min DCF (%) for global vs within-cluster classifiers across configurations A to D; bar values 10.75, 6.35, 6.00, 4.54, 4.42]
     - Observations:
       - Within-cluster classifier outperforms global classifier
       - Best training data (UBM and Tv): 887h

  11. I-vector LR system: results on V3 data
     - Configurations:
       - B: augmenting the UBM and Tv training data in A to the full training set (884h)
       - C: using logistic regression (LogReg) instead of SVM as the LR classifier
       - D: augmenting the LogReg training data in C to the full training set (884h)
     [Bar chart: min DCF (%) for global vs within-cluster classifiers across configurations B, C(V1), C and D; bar values 7.90, 7.74, 7.48, 6.78, 6.09]
     - Observations:
       - Within-cluster classifier outperforms global classifier
       - Best training data (LR classifier): 332h
       - Logistic regression outperforms SVM

  12. Phonotactic LR system
     - DNN phone tokeniser
       - LDA, speaker CMLLR
       - 400-2048(×6)-64-3815 DNN
       - Phone-bigram LM (scale factor = 0.5)
       - (Optional) sequence training on SWB data
     - Language classifier: phone n-gram tf-idf statistics (see the sketch below)
       - Phone bi-gram / phone tri-gram (~5M dimensions)
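
One way to picture the phonotactic classifier: treat each utterance's 1-best phone string from the tokeniser as a document, compute tf-idf statistics over phone bi-/tri-grams, and feed the vectors to a linear classifier. The scikit-learn sketch below illustrates this; the phone strings and language labels are invented placeholders, and the actual system operated on far higher-dimensional (around 5M) statistics with an SVM or logistic regression.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each utterance is represented by its 1-best phone string (toy examples).
phone_strings = [
    "sil dh ax k ae t sil",
    "sil l ax m ae n sil",
]
labels = ["eng-usg", "eng-gbr"]  # illustrative labels only

# Bigram/trigram counts over phone tokens, tf-idf weighted, fed to a
# linear classifier.
model = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(2, 3), token_pattern=r"\S+"),
    LogisticRegression(max_iter=1000),
)
model.fit(phone_strings, labels)
print(model.predict(["sil dh ax k ae t sil"]))
```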

  13. Phonotactic LR system: results
     - Test on V1 30-second internal test data
     [Bar chart: min DCF (%) for 2-gram and 3-gram features with DNN and fMPE DNN tokenisers; bar values 10.7, 10.5, 9.8, 9.8, 9.5, 9.0]
     - Observations
       - 3-gram tf-idf outperforms 2-gram
       - Discriminatively trained DNN: ✗ (no improvement)
     - Test on V3 30s data → 11.3%

  14. DNN LR system
     - Features:
       - 64-dimensional bottleneck features from the Switchboard tokeniser
       - Feature splicing ± 4 frames
     - Language recogniser DNN: 576-750(×4)-20
     - Prior normalisation: test probabilities multiplied by the inverse of the language prior (training set)
     - Decision: frame-based language likelihood averaged over the whole utterance (see the sketch below)
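
A hedged sketch of the scoring rule above: divide the frame-level DNN posteriors by the training-set language priors and average over the utterance. Averaging in the log domain is our assumption; the slide only states that frame-based likelihoods are averaged over the whole utterance.

```python
import numpy as np

def utterance_score(frame_posteriors, train_priors):
    """Score an utterance from frame-level DNN language posteriors.

    frame_posteriors : (T, L) softmax outputs for L target languages.
    train_priors     : (L,) language priors estimated on the training set.
    Divides posteriors by the training priors (prior normalisation) and
    averages the resulting log-likelihoods over all frames.
    """
    likelihoods = frame_posteriors / train_priors        # p(x|l) up to a constant
    log_lik = np.log(np.clip(likelihoods, 1e-10, None))
    return log_lik.mean(axis=0)                          # one score per language

# Toy example: 300 frames, 20 target languages, uniform training priors.
scores = utterance_score(np.random.dirichlet(np.ones(20), size=300),
                         np.full(20, 1 / 20))
predicted_language = int(np.argmax(scores))
```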

  15. DNN LR system: results
     - Test on V1 and V3 (internal test) data with different durations (30s, 10s, 3s)
     [Bar chart: min DCF (%) on V1 and V3 test data at 30s, 10s and 3s; bar values 21.55, 21.71, 18.74, 18.07, 15.96]

  16. Enhanced system
     [System diagram: the primary system extended with ASR-based silence detection and a bottleneck i-vector branch; bottleneck features feed a second UBM / total-variability training; logistic regression replaces SVM for the i-vector systems; Gaussian backends; system fusion.]

  17. Bottleneck i-vector system
     - Feature processing
       - 64-dimensional bottleneck features from the Switchboard tokeniser
       - No VTLN, no SDC, no mean/variance normalisation
       - Frame-based VAD
     - UBM and total variability (see the sketch below)
       - UBM: 2048-component, full-covariance GMM
       - Total variability: 131072 × 600 [Dehak, Kenny, et al. 2011]
     - Language classifier
       - Logistic regression classifier
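
For readers unfamiliar with the UBM / total-variability step shared by the i-vector and bottleneck i-vector systems, the sketch below computes the standard i-vector point estimate w = (I + Tᵀ Σ⁻¹ N T)⁻¹ Tᵀ Σ⁻¹ F from Baum-Welch statistics. It assumes a diagonal covariance for brevity, whereas the UBMs above are full-covariance; all names and the toy dimensions are illustrative.

```python
import numpy as np

def extract_ivector(N, F, T, Sigma):
    """MAP point estimate of the i-vector given Baum-Welch statistics.

    N     : (C,) zeroth-order stats (per-component frame counts).
    F     : (C, D) centred first-order stats (UBM means subtracted).
    T     : (C*D, R) total variability matrix.
    Sigma : (C*D,) UBM covariance (diagonal approximation used here).
    """
    R = T.shape[1]
    D = F.shape[1]
    N_sv = np.repeat(N, D)                    # expand counts to supervector layout
    F_sv = F.reshape(-1)                      # (C*D,)
    T_w = T / Sigma[:, None]                  # Sigma^-1 T
    precision = np.eye(R) + T_w.T @ (N_sv[:, None] * T)   # I + T' Sigma^-1 N T
    return np.linalg.solve(precision, T_w.T @ F_sv)       # posterior mean

# Toy example: 8-component UBM, 5-dim features, 10-dim i-vector.
C, D, R = 8, 5, 10
w = extract_ivector(np.random.rand(C), np.random.randn(C, D),
                    np.random.randn(C * D, R), np.ones(C * D))
assert w.shape == (R,)
```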

  18. Bottleneck i-vector system: results
     - I-vector vs bottleneck i-vector systems on internal test data
     [Bar chart: min DCF (%) for the i-vector and bottleneck i-vector systems at 30s, 10s and 3s; bar values 17.2, 13.69, 12.23, 9.06, 6.09, 5.13]

  19. System calibration and fusion
     - Gaussian backend applied to each single-system output (see the sketch below)
       - GMMs (4/8/16 components) trained on the score vectors from training data (30s)
       - GMMs are target-language dependent
     - Logistic regression
       - Log-likelihood-ratio conversion
       - System combination weights trained on dev data (10%) [Brümmer et al. 2006]
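
A rough scikit-learn sketch of this calibration and fusion chain: per-language GMMs over raw system score vectors, a simple log-likelihood centring in place of the full log-likelihood-ratio conversion, and a logistic regression trained on the development partition to fuse the component systems. Component counts, function names and the centring step are simplifications, not the exact recipe of [Brümmer et al. 2006].

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def train_gaussian_backend(scores, labels, languages, n_components=4):
    """One GMM per target language over raw system score vectors.

    scores : (n, L) array of system scores; labels : (n,) array of language ids.
    Needs enough training score vectors per language for a full-covariance fit.
    """
    return {
        lang: GaussianMixture(n_components=n_components,
                              covariance_type="full",
                              random_state=0).fit(scores[labels == lang])
        for lang in languages
    }

def backend_llrs(backend, scores):
    """Per-language log-likelihoods, centred as a crude stand-in for LLRs."""
    ll = np.column_stack([backend[l].score_samples(scores) for l in sorted(backend)])
    return ll - ll.mean(axis=1, keepdims=True)

def fuse(system_llrs_dev, dev_labels, system_llrs_test):
    """Stack calibrated scores of the component systems and train a
    multi-class logistic regression on the development partition."""
    fusion = LogisticRegression(max_iter=1000)
    fusion.fit(np.hstack(system_llrs_dev), dev_labels)
    return fusion.predict_proba(np.hstack(system_llrs_test))
```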

  20. System calibration results
     - Overall min DCF on internal test, 30s
     [Bar chart: min DCF (global threshold) without calibration vs with the Gaussian backend for the i-vector, DNN and phonotactic systems; bar values 29.51, 26.54, 22.5, 22, 20.17, 12.48]

  21. System fusion results
     [Bar chart, internal test 30s: min DCF (global threshold) for Ivector, DNN, Phonotactic, 3-sys, bn-ivector and 4-sys; bar values 19.97, 19.21, 10.84, 10.21, 9.42, 8.87]
     [Bar chart, internal test 3s: min DCF (global threshold) for Ivector, DNN, Phonotactic, 3-sys, bn-ivector and 4-sys; bar values 36.11, 22.83, 21.81, 18.47, 17.7, 15.53]

  22. System fusion results: LRE 2015 EVAL
     - Overall eval system results
     [Bar chart: min DCF (global threshold) for Ivector, DNN, Phonotactic, 3-sys, bn-ivector and 4-sys; bar values 40.16, 36.93, 32.92, 32.44, 29.56, 29.2]

  23. Pairwise system contribution
     - Results shown on eval 30s data
     - System fusion always improves performance, except for fusion with the DNN system
     - For any given system, pairwise fusion with a better system generally gives better results

  24. Conclusion
     - Introduced the three LR component systems submitted to NIST LRE 2015
     - Described segmentation, data selection and classifier training
     - An enhanced bottleneck i-vector system demonstrated good performance
     - Future work
       - Data selection and augmentation
       - Multi-lingual NN, bottleneck features
       - Variability compensation
     - Suggestions and collaborations welcome
