The IBM 2016 Speaker Recognition System




1. The IBM 2016 Speaker Recognition System
   Seyed Omid Sadjadi, Sriram Ganapathy, Jason Pelecanos
   IBM Research

2. Outline
   - Introduction
   - Speaker Recognition System
   - Experimental Setup
   - Results
   - Conclusions

3. Introduction

4. Recent Progress
   - Major advancements over the past several years
   - State-of-the-art (SOTA) i-vector systems use universal background models (UBMs) to estimate sufficient statistics
   - Previous work:
     - Gaussian mixture model UBM [Reynolds 1997]
     - Phonetically-inspired UBM (PI-UBM) [Omar 2010]
     - DNN-based phonetically-aware UBM [Lei 2014]
     - TDNN-based UBM (full covariance) [Snyder 2015]
   - DNN bottleneck features are also used in SOTA systems [Heck 1998; Richardson 2015; Matějka 2016]

5. Objectives
   - To share state-of-the-art results on the NIST 2010 SRE
   - To present the key system components that helped us achieve these results:
     - Speaker- and channel-adapted fMLLR based features
     - A DNN acoustic model with a large number of senones (10k)
     - A nearest-neighbor discriminant analysis (NDA) technique
   - To quantify the contribution of each component

6. Speaker Recognition System

7. Speaker Recognition System
   [Block diagram: Speech -> SAD -> Acoustic Feats. -> fMLLR -> Suff. Stats (T matrix) -> i-vector Extraction -> Dim. Reduc. (LDA/NDA) -> PLDA -> Score]
   - Our i-vector based speaker recognition system:
     - Speaker- and channel-normalized fMLLR based features
     - i-vectors are estimated using DNN senone posteriors (~10k senones)
     - LDA -> NDA based intersession variability compensation

8. Feature-space MLLR (fMLLR)
   [Block diagram as in slide 7, with the fMLLR stage highlighted]
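
fMLLR applies a per-speaker (and per-channel) affine transform to the acoustic features. Below is a minimal NumPy sketch of the application step only; the function name `apply_fmllr` is illustrative, and the transform `W = [A | b]` is assumed to have been estimated beforehand by an ASR-style fMLLR estimation pass (not shown):

```python
import numpy as np

def apply_fmllr(feats, W):
    """Apply a feature-space MLLR (fMLLR) affine transform to a feature matrix.

    feats: (T, d) acoustic features, one d-dimensional frame per row
    W:     (d, d+1) speaker/channel-specific transform, laid out as [A | b]
    Returns the transformed features A x_t + b for every frame x_t.
    """
    A, b = W[:, :-1], W[:, -1]
    return feats @ A.T + b

# Hypothetical check: the identity transform must leave features unchanged.
T, d = 5, 40
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d))
W = np.hstack([np.eye(d), np.zeros((d, 1))])
y = apply_fmllr(x, W)
```

In the full system these normalized frames, not raw MFCCs, feed both the DNN and the statistics accumulation downstream.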

9. DNN Senone i-vectors
   [Block diagram as in slide 7; the DNN emits posteriors over ~10k senones, from which Baum-Welch (B-W) statistics are accumulated]
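
In a DNN-UBM system, the senone posteriors take the role that GMM component posteriors play when accumulating the zeroth- and first-order Baum-Welch statistics. A minimal NumPy sketch (the name `bw_stats` is illustrative; second-order statistics and the actual i-vector extraction against the T matrix are omitted):

```python
import numpy as np

def bw_stats(feats, post):
    """Accumulate zeroth- and first-order Baum-Welch statistics.

    feats: (T, d) frame features (e.g. fMLLR-normalized frames)
    post:  (T, C) per-frame posteriors over C senones (each row sums to 1)
    Returns N (C,) soft occupation counts and F (C, d) weighted feature sums.
    """
    N = post.sum(axis=0)   # zeroth order: soft frame count per senone
    F = post.T @ feats     # first order: posterior-weighted feature sums
    return N, F

# Hypothetical toy data: softmax over random logits stands in for DNN outputs.
T, C, d = 100, 8, 4
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, d))
logits = rng.standard_normal((T, C))
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
N, F = bw_stats(feats, post)
```

Because the posteriors of each frame sum to one, the soft counts N always sum to the total number of frames, regardless of how many senones the DNN models.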

10. Linear Discriminant Analysis (LDA)
    [Block diagram as in slide 7, with the dimensionality-reduction stage highlighted]
    - LDA assumes unimodal, Gaussian class distributions
    - It cannot effectively handle multimodal data
    - It can be rank deficient
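
For contrast with NDA, classical LDA can be sketched directly from its scatter matrices; the code also makes the rank limitation concrete, since the between-class scatter has rank at most C-1 for C classes. This is an illustrative sketch (the name `lda_directions` is hypothetical), not the production implementation:

```python
import numpy as np

def lda_directions(X, y, p):
    """Classical LDA: directions maximizing between-/within-class scatter ratio.

    X: (n, d) row vectors, y: (n,) integer class labels, p: output dimension.
    Solves the generalized eigenproblem S_b v = lambda S_w v via S_w^{-1} S_b.
    """
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)            # within-class scatter
        diff = (mc - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)          # between-class (global means)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]         # keep top-p discriminant axes
    return evecs.real[:, order[:p]]

# Hypothetical two-class toy data separated along the first axis.
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((50, 3)) + [5.0, 0.0, 0.0],
               rng.standard_normal((50, 3))])
y = np.array([0] * 50 + [1] * 50)
V = lda_directions(X, y, p=1)
```

With two classes only one discriminant direction is meaningful, which is exactly the rank deficiency the slide warns about when projecting 500-dimensional i-vectors.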

11. Nearest Neighbor Discriminant Analysis (NDA)
    [Figure: a two-class example where LDA's global class means fail but NDA's local k-NN means capture the boundary]
    - LDA between-class scatter uses global class means:
      S_b = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T
    - NDA between-class scatter uses local K-NN means:
      S_b = \sum_{i=1}^{C} \sum_{j=1, j \neq i}^{C} \sum_{l=1}^{M_i} w_l^{ij} (x_l^i - M_j(x_l^i)) (x_l^i - M_j(x_l^i))^T
      where M_j(x_l^i) is the mean of the K nearest neighbors of x_l^i in class j
    - The weight emphasizes samples near the class boundary:
      w_l^{ij} = min{ d^\alpha(x_l^i, NN_K(x_l^i, i)), d^\alpha(x_l^i, NN_K(x_l^i, j)) } / ( d^\alpha(x_l^i, NN_K(x_l^i, i)) + d^\alpha(x_l^i, NN_K(x_l^i, j)) )
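
The boundary-emphasis weight can be sketched on its own. One reading assumption here: d(x, NN_K(x, c)) is taken as the distance from x to its K-th nearest neighbor in class c, raised to the power alpha; the full NDA scatter accumulation is omitted, and the function name `nda_weight` is illustrative:

```python
import numpy as np

def nda_weight(x, same_class, other_class, K=3, alpha=2):
    """Boundary-emphasis weight w_l^{ij} for a sample x of class i vs class j.

    The weight is min(d_i, d_j) / (d_i + d_j), where d_c is the (alpha-powered)
    distance from x to its K-th nearest neighbor in class c. It approaches 0.5
    near the class boundary (d_i ~ d_j) and 0 deep inside class i.
    """
    def d_knn(pool):
        dists = np.sort(np.linalg.norm(pool - x, axis=1))
        return dists[min(K, len(dists)) - 1] ** alpha

    di, dj = d_knn(same_class), d_knn(other_class)
    return min(di, dj) / (di + dj)

# Hypothetical clusters: class i at the origin, class j shifted far along x.
rng = np.random.default_rng(2)
cls_i = rng.standard_normal((30, 2))
cls_j = rng.standard_normal((30, 2)) + [8.0, 0.0]
w_interior = nda_weight(np.array([0.0, 0.0]), cls_i, cls_j)
w_boundary = nda_weight(np.array([4.0, 0.0]), cls_i, cls_j)
```

Points deep inside a class get near-zero weight, so the scatter is dominated by samples near the decision boundary, which is what lets NDA cope with the multimodal data that defeats LDA.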

12. Experimental Setup

13. Data
    - Training data:
      - NIST 2004-2008 SRE (English telephony and microphone data)
      - Switchboard (SWB) Cellular Parts I and II, SWB2 Phases II and III
      - Total of 60,178 recordings
    - Evaluation data: NIST 2010 SRE (extended evaluation set)

      Cond.  Enroll      Test                         Mismatch  #Targets  #Impostors
      C1     Int. mic.   Int. mic. (same type)        No        4,034     795,995
      C2     Int. mic.   Int. mic. (different type)   Yes       15,084    2,789,534
      C3     Int. mic.   Telephony                    Yes       3,989     637,850
      C4     Int. mic.   Room microphone              Yes       3,637     756,775
      C5     Telephony   Telephony (different type)   Yes       7,169     408,950

14. DNN System Configuration
    - 6 fully connected hidden layers with 2048 units each
    - The bottleneck layer has 512 units
    - Trained using 600 hours of speech from Fisher
    - Input is a 9-frame context of 40-D fMLLR features
    - Estimates posterior probabilities of 10k senones
    - 2k and 4k senone sets are also explored

15. Speaker Recognition System Configuration
    - 500-dimensional total variability subspace trained using a subset of 48,325 recordings from NIST SRE, SWBCELL, and SWB2
    - Sufficient statistics are generated using posteriors from:
      - Gender-independent 2048-component GMM-UBM (21,207 recordings)
      - DNN with 7 hidden layers and 2k, 4k, or 10k senones
    - MFCC and fMLLR based features are evaluated
    - LDA/NDA is applied to obtain 250-dimensional feature vectors
    - Gaussian PLDA backend trained with 60,178 speech segments
    - Evaluation metrics: equal error rate (EER), minDCF'08, and minDCF'10
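
The EER reported in the result tables is the operating point where the miss rate equals the false-alarm rate. A simple threshold-sweep sketch of the metric (illustrative only, not the official NIST scoring tooling):

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Approximate equal error rate by sweeping candidate thresholds.

    Sweeps every observed score as a threshold and returns the average of the
    miss and false-alarm rates at the point where they are closest.
    """
    thresholds = np.unique(np.concatenate([target_scores, impostor_scores]))
    best_gap, best_eer = np.inf, 1.0
    for t in thresholds:
        miss = np.mean(target_scores < t)    # true trials wrongly rejected
        fa = np.mean(impostor_scores >= t)   # impostor trials wrongly accepted
        if abs(miss - fa) < best_gap:
            best_gap, best_eer = abs(miss - fa), (miss + fa) / 2
    return best_eer

# Hypothetical scores: targets strictly above impostors, so EER is zero.
tgt = np.array([2.0, 2.5, 3.0, 3.5])
imp = np.array([-1.0, -0.5, 0.0, 0.5])
e = eer(tgt, imp)
```

The minDCF metrics weight misses and false alarms asymmetrically via the NIST cost parameters, which is why the tables report them alongside EER.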

16. Results

17. LDA vs. NDA (MFCC, 2048-GMM, 10k DNN, C5)

    System         EER [%]  minDCF08  minDCF10
    GMM-MFCC-LDA   2.40     0.120     0.439
    GMM-MFCC-NDA   1.55     0.076     0.286
    DNN-MFCC-LDA   1.02     0.045     0.168
    DNN-MFCC-NDA   0.76     0.036     0.147

    - NDA outperforms LDA for both GMM and DNN based systems

18. MFCC vs. fMLLR (10k DNN, C5)

    System          EER [%]  minDCF08  minDCF10
    DNN-MFCC-LDA    1.02     0.045     0.168
    DNN-fMLLR-LDA   0.82     0.032     0.120
    DNN-MFCC-NDA    0.76     0.036     0.147
    DNN-fMLLR-NDA   0.67     0.028     0.092

    - Speaker- and channel-normalized fMLLR features outperform MFCCs

19. Impact of #Senones (fMLLR, C5)

    System    #Senones  EER [%]  minDCF08  minDCF10
    DNN-LDA   2k        1.19     0.054     0.212
    DNN-NDA   2k        0.95     0.043     0.166
    DNN-LDA   4k        0.98     0.041     0.169
    DNN-NDA   4k        0.86     0.033     0.116
    DNN-LDA   10k       0.82     0.032     0.120
    DNN-NDA   10k       0.67     0.028     0.092

    - Using 10k senones gives the best performance
    - NDA consistently outperforms LDA for 2k, 4k, and 10k senones
    - Note: in contrast to DNNs, increasing the number of components in GMMs (beyond 2k, with diagonal covariance matrices) does not improve the results [Lei 2014; Snyder 2015]

20. DET Plot Performance (C5)

21. System Progression (C5)

    System          EER [%]  minDCF08  minDCF10
    GMM-MFCC-LDA    2.40     0.120     0.439
    GMM-MFCC-NDA    1.55     0.076     0.286
    DNN-MFCC-NDA    0.76     0.036     0.147
    DNN-fMLLR-NDA   0.67     0.028     0.092

    - Achieved the best published performance (EER = 0.67%) on NIST 2010 SRE (C5)
    - Building upon previous best results:
      - EER = 1.09% [Snyder 2015], gender-dependent (both genders)
      - EER = 0.94% [Matějka 2016], gender-dependent (female trials)

22. Conclusions

23. Conclusions
    - Presented the IBM i-vector speaker recognition system:
      - Speaker- and channel-normalized fMLLR based features may be more effective than raw MFCCs in matched conditions
      - Using a DNN-UBM with 10k senones to partition the acoustic space provides the best performance
      - NDA is more effective than LDA for channel compensation in the i-vector space (with multimodal data)
    - Achieved the best published performance (EER = 0.67%) on NIST 2010 SRE (C5)
    - For further progress on our system, see us at IS-2016
