IBM Research
The IBM 2016 Speaker Recognition System
Seyed Omid Sadjadi, Sriram Ganapathy, Jason Pelecanos
Outline
• Introduction
• Speaker Recognition System
• Experimental Setup
• Results
• Conclusions
Introduction
Recent Progress
Major advancements have been made over the past several years. State-of-the-art i-vector systems use UBMs to estimate sufficient statistics.
Previous work:
• Gaussian mixture model UBM [Reynolds 1997]
• Phonetically-inspired UBM (PI-UBM) [Omar 2010]
• DNN-based phonetically-aware UBM [Lei 2014]
• TDNN-based UBM (full covariance) [Snyder 2015]
DNN bottleneck based features are also used in state-of-the-art systems [Heck 1998; Richardson 2015; Matějka 2016]
Objectives
• To share state-of-the-art results on the NIST 2010 SRE
• To present the key system components that helped us achieve these results:
  - Speaker- and channel-adapted fMLLR based features
  - A DNN acoustic model with a large number of senones (10k)
  - A nearest-neighbor discriminant analysis (NDA) technique
• To quantify the contribution of each component
Speaker Recognition System
Speaker Recognition System
[Block diagram] Speech → SAD → Acoustic Feats. → fMLLR → Suff. Stats → i-vector Extraction (T matrix) → Dim. Reduc. (LDA/NDA) → PLDA → Score
Our i-vector based speaker recognition system:
• Speaker- and channel-normalized fMLLR based features
• i-vectors are estimated using DNN senone posteriors (~10k)
• LDA/NDA based intersession variability compensation
Feature-space MLLR (fMLLR)
[Block diagram as above, highlighting the fMLLR stage]
DNN Senone i-vectors
[Block diagram as above, highlighting the sufficient-statistics stage: DNN senone (10k) posteriors → Baum-Welch statistics]
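The statistics stage above can be sketched in a few lines of numpy. This is a minimal illustration, not the IBM implementation: the function names are mine, and it assumes a diagonal-covariance model with the component means and covariances stacked into flat arrays.

```python
import numpy as np

def baum_welch_stats(feats, posteriors):
    """Zeroth- and first-order sufficient statistics.

    feats:      (T, D) acoustic features (e.g. fMLLR)
    posteriors: (T, C) per-frame senone posteriors from the DNN
    """
    N = posteriors.sum(axis=0)   # (C,)   zeroth-order: soft frame counts
    F = posteriors.T @ feats     # (C, D) first-order: posterior-weighted sums
    return N, F

def extract_ivector(N, F, T_mat, Sigma, means):
    """MAP point estimate of the i-vector from the sufficient statistics.

    T_mat: (C*D, R) total variability matrix
    Sigma: (C*D,)   stacked diagonal covariances
    means: (C, D)   senone/component means
    """
    C, D = means.shape
    F_c = (F - N[:, None] * means).reshape(C * D)  # center first-order stats
    N_rep = np.repeat(N, D)                        # expand counts to C*D
    TtSinv = T_mat.T / Sigma                       # T^T Sigma^{-1}
    # posterior precision: I + T^T Sigma^{-1} N T
    L = np.eye(T_mat.shape[1]) + (TtSinv * N_rep) @ T_mat
    return np.linalg.solve(L, TtSinv @ F_c)
```

Because the per-frame posteriors sum to one, the zeroth-order statistics always sum to the number of frames, regardless of whether a GMM or a DNN produced them.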
Linear Discriminant Analysis (LDA)
[Block diagram as above, highlighting the dimensionality reduction stage]
• LDA assumes unimodal, Gaussian class distributions
• It cannot effectively handle multimodal data
• Its between-class scatter can be rank deficient
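The rank-deficiency point can be made concrete: with C classes, the LDA between-class scatter is a sum of C outer products subject to one linear constraint, so its rank is at most C-1. A minimal numpy sketch (the function name is mine, not from the slides):

```python
import numpy as np

def lda_between_scatter(X, labels):
    """Classical LDA between-class scatter over global class means:
    S_b = sum_i p_i (mu_i - mu)(mu_i - mu)^T
    """
    mu = X.mean(axis=0)
    S_b = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(labels):
        Xc = X[labels == c]
        diff = Xc.mean(axis=0) - mu
        S_b += (len(Xc) / len(X)) * np.outer(diff, diff)
    return S_b
```

With only C-1 informative directions, LDA cannot produce more than C-1 useful dimensions no matter how high-dimensional the i-vectors are.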
Nearest Neighbor Discriminant Analysis (NDA)
[Illustration: Class 1 vs. Class 2 samples with the NDA and LDA projection directions]

LDA between-class scatter (global class means):
S_b = \sum_{i=1}^{C} p_i \, (\mu_i - \mu)(\mu_i - \mu)^T

NDA between-class scatter (local K-NN means):
S_b = \sum_{i=1}^{C} \sum_{\substack{j=1 \\ j \neq i}}^{C} \sum_{l=1}^{N_i} w_l^{ij} \left( x_l^i - M_j(x_l^i) \right) \left( x_l^i - M_j(x_l^i) \right)^T

where M_j(x_l^i) is the local mean of the K nearest neighbors to x_l^i from class j, and the weights emphasize samples near the class boundary:
w_l^{ij} = \frac{\min \left\{ d^{\alpha}\!\left( x_l^i, \mathrm{NN}_K(x_l^i, i) \right),\; d^{\alpha}\!\left( x_l^i, \mathrm{NN}_K(x_l^i, j) \right) \right\}}{d^{\alpha}\!\left( x_l^i, \mathrm{NN}_K(x_l^i, i) \right) + d^{\alpha}\!\left( x_l^i, \mathrm{NN}_K(x_l^i, j) \right)}
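A direct, unoptimized numpy sketch of the NDA between-class scatter (function names and the small smoothing constant are mine; for brevity the weight here measures distance to the local K-NN mean rather than to the K-th neighbor, and a sample is not excluded from its own class's neighbor search):

```python
import numpy as np

def nn_mean(x, X_class, K):
    """Mean of the K nearest neighbors of x within X_class."""
    d = np.linalg.norm(X_class - x, axis=1)
    return X_class[np.argsort(d)[:K]].mean(axis=0)

def nda_between_scatter(X, labels, K=5, alpha=2.0):
    """NDA between-class scatter built from local K-NN means,
    with boundary-emphasizing weights."""
    D = X.shape[1]
    S_b = np.zeros((D, D))
    classes = np.unique(labels)
    for i in classes:
        Xi = X[labels == i]
        for j in classes:
            if j == i:
                continue
            Xj = X[labels == j]
            for x in Xi:
                # distances (to the alpha) to own-class and rival-class local means
                d_own = np.linalg.norm(x - nn_mean(x, Xi, K)) ** alpha
                d_riv = np.linalg.norm(x - nn_mean(x, Xj, K)) ** alpha
                # weight is ~0.5 near the boundary, ~0 deep inside a class
                w = min(d_own, d_riv) / (d_own + d_riv + 1e-12)
                diff = x - nn_mean(x, Xj, K)
                S_b += w * np.outer(diff, diff)
    return S_b
```

Because the scatter is built from local neighborhoods rather than global class means, it remains informative when a class is multimodal (e.g. one speaker recorded over several channels), which is exactly where LDA's unimodal assumption breaks.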
Experimental Setup
Data
Training data:
• NIST 2004-2008 SRE (English telephony and microphone data)
• Switchboard (SWB) Cellular Parts I and II, SWB2 Phases II and III
• Total of 60,178 recordings
Evaluation data:
• NIST 2010 SRE (extended evaluation set)

Cond. | Enroll    | Test                       | Mismatch | #Targets | #Impostors
C1    | Int. mic. | Int. mic. (same type)      | No       | 4,034    | 795,995
C2    | Int. mic. | Int. mic. (different type) | Yes      | 15,084   | 2,789,534
C3    | Int. mic. | Telephony                  | Yes      | 3,989    | 637,850
C4    | Int. mic. | Room microphone            | Yes      | 3,637    | 756,775
C5    | Telephony | Telephony (different type) | Yes      | 7,169    | 408,950
DNN System Configuration
• 6 fully connected hidden layers with 2048 units each
• The bottleneck layer has 512 units
• Trained using 600 hours of speech from Fisher
• Input is a 9-frame context of 40-D fMLLR features
• Estimates posterior probabilities of 10k senones
• 2k and 4k senone posteriors are also explored
Speaker Recognition System Configuration
• 500-dimensional total variability subspace trained using a subset of 48,325 recordings from NIST SRE, SWBCELL, and SWB2
• Sufficient statistics are generated using posteriors from:
  - Gender-independent 2048-component GMM-UBM (21,207 recordings)
  - DNN with 7 hidden layers and 2k, 4k, or 10k senones
• MFCC and fMLLR based features are evaluated
• LDA/NDA is applied to obtain 250-dimensional feature vectors
• Gaussian PLDA backend trained with 60,178 speech segments
• Evaluation metrics: equal error rate (EER) and minDCF'08, minDCF'10
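For reference, the EER reported in the following slides is the operating point where the miss rate equals the false-alarm rate, computed from lists of target and impostor trial scores. A minimal numpy sketch (the function name is mine; it assumes higher scores indicate target trials):

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Equal error rate: sweep the threshold over all scores and
    return the point where miss and false-alarm rates cross."""
    scores = np.concatenate([target_scores, impostor_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(impostor_scores))])
    order = np.argsort(scores)
    labels = labels[order]
    # at threshold just above scores[k]: everything up to k is rejected
    fnr = np.cumsum(labels) / labels.sum()                   # miss rate
    fpr = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # false-alarm rate
    idx = np.argmin(np.abs(fnr - fpr))
    return 0.5 * (fnr[idx] + fpr[idx])
```

The minDCF metrics weight misses and false alarms by the application-dependent costs and priors defined in the NIST SRE 2008 and 2010 evaluation plans, instead of treating the two error types equally.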
Results
LDA vs. NDA (MFCC, 2048-GMM, 10k DNN, C5)

System       | EER [%] | minDCF08 | minDCF10
GMM-MFCC-LDA | 2.40    | 0.120    | 0.439
GMM-MFCC-NDA | 1.55    | 0.076    | 0.286
DNN-MFCC-LDA | 1.02    | 0.045    | 0.168
DNN-MFCC-NDA | 0.76    | 0.036    | 0.147

NDA outperforms LDA for both GMM and DNN based systems
MFCC vs. fMLLR (10k DNN, C5)

System        | EER [%] | minDCF08 | minDCF10
DNN-MFCC-LDA  | 1.02    | 0.045    | 0.168
DNN-fMLLR-LDA | 0.82    | 0.032    | 0.120
DNN-MFCC-NDA  | 0.76    | 0.036    | 0.147
DNN-fMLLR-NDA | 0.67    | 0.028    | 0.092

Speaker- and channel-normalized fMLLRs outperform MFCCs
Impact of #Senones (fMLLR, C5)

#Senones | System  | EER [%] | minDCF08 | minDCF10
2k       | DNN-LDA | 1.19    | 0.054    | 0.212
2k       | DNN-NDA | 0.95    | 0.043    | 0.166
4k       | DNN-LDA | 0.98    | 0.041    | 0.169
4k       | DNN-NDA | 0.86    | 0.033    | 0.116
10k      | DNN-LDA | 0.82    | 0.032    | 0.120
10k      | DNN-NDA | 0.67    | 0.028    | 0.092

Using 10k senones gives the best performance
NDA consistently outperforms LDA for 2k, 4k, and 10k senones
Note: in contrast to DNNs, increasing the number of components in GMMs (beyond 2k, with diag. cov. matrices) does not improve the results [Lei 2014; Snyder 2015].
DET Plot Performance (C5)
[DET curves comparing the systems above on condition C5]
System Progression (C5)

System        | EER [%] | minDCF08 | minDCF10
GMM-MFCC-LDA  | 2.40    | 0.120    | 0.439
GMM-MFCC-NDA  | 1.55    | 0.076    | 0.286
DNN-MFCC-NDA  | 0.76    | 0.036    | 0.147
DNN-fMLLR-NDA | 0.67    | 0.028    | 0.092

Achieved the best published performance (EER = 0.67%) on NIST 2010 SRE (C5), building upon previous best results:
• EER = 1.09% [Snyder 2015], gender-dependent (both genders)
• EER = 0.94% [Matějka 2016], gender-dependent (female trials)
Conclusions
Conclusions
Presented the IBM i-vector speaker recognition system:
• Speaker- and channel-normalized fMLLR based features may be more effective than raw MFCCs in matched conditions
• Using a DNN-UBM with 10k senones to partition the acoustic space provides the best performance
• NDA is more effective than LDA for channel compensation in the i-vector space (with multimodal data)
• Achieved the best published performance (EER = 0.67%) on NIST 2010 SRE (C5)
For further progress on our system, see us at IS-2016