IBM Research
The IBM 2016 Speaker Recognition System
Seyed Omid Sadjadi, Sriram Ganapathy, Jason Pelecanos
Outline
• Introduction
• Speaker Recognition System
• Experimental Setup
• Results
• Conclusions
Introduction
Recent Progress
Major advancements have been made over the past several years. State-of-the-art i-vector systems use UBMs to estimate sufficient statistics.
Previous work:
• Gaussian mixture model UBM [Reynolds 1997]
• Phonetically-inspired UBM (PI-UBM) [Omar 2010]
• DNN-based phonetically-aware UBM [Lei 2014]
• TDNN-based UBM (full covariance) [Snyder 2015]
DNN bottleneck based features are also used in state-of-the-art systems [Heck 1998; Richardson 2015; Matějka 2016]
Objectives
• To share state-of-the-art results on the NIST 2010 SRE
• To present the key system components that helped us achieve these results:
  - Speaker- and channel-adapted fMLLR based features
  - A DNN acoustic model with a large number of senones (10k)
  - A nearest-neighbor discriminant analysis (NDA) technique
• To quantify the contribution of each component
Speaker Recognition System
Speaker Recognition System
[Block diagram] Speech → SAD → Acoustic Feats. → fMLLR → Suff. Stats → i-vector Extraction (T matrix) → Dim. Reduc. (LDA/NDA) → PLDA → Score
Our i-vector based speaker recognition system:
• Speaker- and channel-normalized fMLLR based features
• i-vectors are estimated using DNN senone posteriors (~10k)
• LDA/NDA based intersession variability compensation
Feature-space MLLR (fMLLR)
[Block diagram as above, highlighting the fMLLR stage]
DNN Senone i-vectors
[Block diagram as above, highlighting the sufficient-statistics stage: DNN senone (10k) posteriors → Baum-Welch statistics]
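The statistics stage above can be sketched in a few lines of numpy. This is a minimal illustration, not the IBM implementation: the function names are mine, and it assumes a diagonal-covariance model with the component means and covariances stacked into flat arrays.

```python
import numpy as np

def baum_welch_stats(feats, posteriors):
    """Zeroth- and first-order sufficient statistics.

    feats:      (T, D) acoustic features (e.g. fMLLR)
    posteriors: (T, C) per-frame senone posteriors from the DNN
    """
    N = posteriors.sum(axis=0)   # (C,)   zeroth-order: soft frame counts
    F = posteriors.T @ feats     # (C, D) first-order: posterior-weighted sums
    return N, F

def extract_ivector(N, F, T_mat, Sigma, means):
    """MAP point estimate of the i-vector from the sufficient statistics.

    T_mat: (C*D, R) total variability matrix
    Sigma: (C*D,)   stacked diagonal covariances
    means: (C, D)   senone/component means
    """
    C, D = means.shape
    F_c = (F - N[:, None] * means).reshape(C * D)  # center first-order stats
    N_rep = np.repeat(N, D)                        # expand counts to C*D
    TtSinv = T_mat.T / Sigma                       # T^T Sigma^{-1}
    # posterior precision: I + T^T Sigma^{-1} N T
    L = np.eye(T_mat.shape[1]) + (TtSinv * N_rep) @ T_mat
    return np.linalg.solve(L, TtSinv @ F_c)
```

Because the per-frame posteriors sum to one, the zeroth-order statistics always sum to the number of frames, regardless of whether a GMM or a DNN produced them.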
Linear Discriminant Analysis (LDA)
[Block diagram as above, highlighting the dimensionality reduction stage]
• LDA assumes unimodal, Gaussian class distributions
• It cannot effectively handle multimodal data
• Its between-class scatter can be rank deficient
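The rank-deficiency point can be made concrete: with C classes, the LDA between-class scatter is a sum of C outer products subject to one linear constraint, so its rank is at most C-1. A minimal numpy sketch (the function name is mine, not from the slides):

```python
import numpy as np

def lda_between_scatter(X, labels):
    """Classical LDA between-class scatter over global class means:
    S_b = sum_i p_i (mu_i - mu)(mu_i - mu)^T
    """
    mu = X.mean(axis=0)
    S_b = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(labels):
        Xc = X[labels == c]
        diff = Xc.mean(axis=0) - mu
        S_b += (len(Xc) / len(X)) * np.outer(diff, diff)
    return S_b
```

With only C-1 informative directions, LDA cannot produce more than C-1 useful dimensions no matter how high-dimensional the i-vectors are.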
Nearest Neighbor Discriminant Analysis (NDA)
[Illustration: Class 1 vs. Class 2 samples with the NDA and LDA projection directions]

LDA between-class scatter (global class means):
S_b = \sum_{i=1}^{C} p_i \, (\mu_i - \mu)(\mu_i - \mu)^T

NDA between-class scatter (local K-NN means):
S_b = \sum_{i=1}^{C} \sum_{\substack{j=1 \\ j \neq i}}^{C} \sum_{l=1}^{N_i} w_l^{ij} \left( x_l^i - M_j(x_l^i) \right) \left( x_l^i - M_j(x_l^i) \right)^T

where M_j(x_l^i) is the local mean of the K nearest neighbors to x_l^i from class j, and the weights emphasize samples near the class boundary:
w_l^{ij} = \frac{\min \left\{ d^{\alpha}\!\left( x_l^i, \mathrm{NN}_K(x_l^i, i) \right),\; d^{\alpha}\!\left( x_l^i, \mathrm{NN}_K(x_l^i, j) \right) \right\}}{d^{\alpha}\!\left( x_l^i, \mathrm{NN}_K(x_l^i, i) \right) + d^{\alpha}\!\left( x_l^i, \mathrm{NN}_K(x_l^i, j) \right)}
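A direct, unoptimized numpy sketch of the NDA between-class scatter (function names and the small smoothing constant are mine; for brevity the weight here measures distance to the local K-NN mean rather than to the K-th neighbor, and a sample is not excluded from its own class's neighbor search):

```python
import numpy as np

def nn_mean(x, X_class, K):
    """Mean of the K nearest neighbors of x within X_class."""
    d = np.linalg.norm(X_class - x, axis=1)
    return X_class[np.argsort(d)[:K]].mean(axis=0)

def nda_between_scatter(X, labels, K=5, alpha=2.0):
    """NDA between-class scatter built from local K-NN means,
    with boundary-emphasizing weights."""
    D = X.shape[1]
    S_b = np.zeros((D, D))
    classes = np.unique(labels)
    for i in classes:
        Xi = X[labels == i]
        for j in classes:
            if j == i:
                continue
            Xj = X[labels == j]
            for x in Xi:
                # distances (to the alpha) to own-class and rival-class local means
                d_own = np.linalg.norm(x - nn_mean(x, Xi, K)) ** alpha
                d_riv = np.linalg.norm(x - nn_mean(x, Xj, K)) ** alpha
                # weight is ~0.5 near the boundary, ~0 deep inside a class
                w = min(d_own, d_riv) / (d_own + d_riv + 1e-12)
                diff = x - nn_mean(x, Xj, K)
                S_b += w * np.outer(diff, diff)
    return S_b
```

Because the scatter is built from local neighborhoods rather than global class means, it remains informative when a class is multimodal (e.g. one speaker recorded over several channels), which is exactly where LDA's unimodal assumption breaks.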
Experimental Setup
Data
Training data:
• NIST 2004-2008 SRE (English telephony and microphone data)
• Switchboard (SWB) Cellular Parts I and II, SWB2 Phases II and III
• Total of 60,178 recordings
Evaluation data:
• NIST 2010 SRE (extended evaluation set)

Cond. | Enroll    | Test                       | Mismatch | #Targets | #Impostors
C1    | Int. mic. | Int. mic. (same type)      | No       | 4,034    | 795,995
C2    | Int. mic. | Int. mic. (different type) | Yes      | 15,084   | 2,789,534
C3    | Int. mic. | Telephony                  | Yes      | 3,989    | 637,850
C4    | Int. mic. | Room microphone            | Yes      | 3,637    | 756,775
C5    | Telephony | Telephony (different type) | Yes      | 7,169    | 408,950
DNN System Configuration
• 6 fully connected hidden layers with 2048 units each
• The bottleneck layer has 512 units
• Trained using 600 hours of speech from Fisher
• Input is a 9-frame context of 40-D fMLLR features
• Estimates posterior probabilities of 10k senones
• 2k and 4k senone posteriors are also explored
Speaker Recognition System Configuration
• 500-dimensional total variability subspace trained using a subset of 48,325 recordings from NIST SRE, SWBCELL, and SWB2
• Sufficient statistics are generated using posteriors from:
  - Gender-independent 2048-component GMM-UBM (21,207 recordings)
  - DNN with 7 hidden layers and 2k, 4k, or 10k senones
• MFCC and fMLLR based features are evaluated
• LDA/NDA is applied to obtain 250-dimensional feature vectors
• Gaussian PLDA backend trained with 60,178 speech segments
• Evaluation metrics: equal error rate (EER) and minDCF'08, minDCF'10
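For reference, the EER reported in the following slides is the operating point where the miss rate equals the false-alarm rate, computed from lists of target and impostor trial scores. A minimal numpy sketch (the function name is mine; it assumes higher scores indicate target trials):

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Equal error rate: sweep the threshold over all scores and
    return the point where miss and false-alarm rates cross."""
    scores = np.concatenate([target_scores, impostor_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(impostor_scores))])
    order = np.argsort(scores)
    labels = labels[order]
    # at threshold just above scores[k]: everything up to k is rejected
    fnr = np.cumsum(labels) / labels.sum()                   # miss rate
    fpr = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # false-alarm rate
    idx = np.argmin(np.abs(fnr - fpr))
    return 0.5 * (fnr[idx] + fpr[idx])
```

The minDCF metrics weight misses and false alarms by the application-dependent costs and priors defined in the NIST SRE 2008 and 2010 evaluation plans, instead of treating the two error types equally.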
Results
LDA vs. NDA (MFCC, 2048-GMM, 10k DNN, C5)

System       | EER [%] | minDCF08 | minDCF10
GMM-MFCC-LDA | 2.40    | 0.120    | 0.439
GMM-MFCC-NDA | 1.55    | 0.076    | 0.286
DNN-MFCC-LDA | 1.02    | 0.045    | 0.168
DNN-MFCC-NDA | 0.76    | 0.036    | 0.147

NDA outperforms LDA for both GMM and DNN based systems
MFCC vs. fMLLR (10k DNN, C5)

System        | EER [%] | minDCF08 | minDCF10
DNN-MFCC-LDA  | 1.02    | 0.045    | 0.168
DNN-fMLLR-LDA | 0.82    | 0.032    | 0.120
DNN-MFCC-NDA  | 0.76    | 0.036    | 0.147
DNN-fMLLR-NDA | 0.67    | 0.028    | 0.092

Speaker- and channel-normalized fMLLRs outperform MFCCs
Impact of #Senones (fMLLR, C5)

#Senones | System  | EER [%] | minDCF08 | minDCF10
2k       | DNN-LDA | 1.19    | 0.054    | 0.212
2k       | DNN-NDA | 0.95    | 0.043    | 0.166
4k       | DNN-LDA | 0.98    | 0.041    | 0.169
4k       | DNN-NDA | 0.86    | 0.033    | 0.116
10k      | DNN-LDA | 0.82    | 0.032    | 0.120
10k      | DNN-NDA | 0.67    | 0.028    | 0.092

Using 10k senones gives the best performance
NDA consistently outperforms LDA for 2k, 4k, and 10k senones
Note: in contrast to DNNs, increasing the number of components in GMMs (beyond 2k, with diag. cov. matrices) does not improve the results [Lei 2014; Snyder 2015].
DET Plot Performance (C5)
[DET curves comparing the systems above on condition C5]
System Progression (C5)

System        | EER [%] | minDCF08 | minDCF10
GMM-MFCC-LDA  | 2.40    | 0.120    | 0.439
GMM-MFCC-NDA  | 1.55    | 0.076    | 0.286
DNN-MFCC-NDA  | 0.76    | 0.036    | 0.147
DNN-fMLLR-NDA | 0.67    | 0.028    | 0.092

Achieved the best published performance (EER = 0.67%) on NIST 2010 SRE (C5), building upon previous best results:
• EER = 1.09% [Snyder 2015], gender-dependent (both genders)
• EER = 0.94% [Matějka 2016], gender-dependent (female trials)
Conclusions
Conclusions
Presented the IBM i-vector speaker recognition system:
• Speaker- and channel-normalized fMLLR based features may be more effective than raw MFCCs in matched conditions
• Using a DNN-UBM with 10k senones to partition the acoustic space provides the best performance
• NDA is more effective than LDA for channel compensation in the i-vector space (with multimodal data)
• Achieved the best published performance (EER = 0.67%) on NIST 2010 SRE (C5)
For further progress on our system, see us at IS-2016