APSIPA Asia-Pacific Signal and Information Processing Association
Speaker Verification – The present and future of voiceprint based security
Prof. Eliathamby Ambikairajah, Head of School of Electrical Engineering & Telecommunications, University of New South Wales, Australia
21 Oct 2013
Outline
• Introduction
• Speaker Verification Applications
• Speaker Verification System
• Performance measure
• NIST Speaker Recognition Evaluation (SRE)
• Discussion
APSIPA Distinguished Lecture Series @ IIU, Malaysia
Introduction
[Figure: the utterance "How are you?" feeds five recognition tasks. Speech recognition → "How are you?"; language recognition → English; accent recognition → Taiwanese; speaker recognition → Hsing Ming; emotion recognition → Happy. The outputs are labelled as linguistic or paralinguistic.]
• Speech conveys several types of information
– Linguistic: message and language information
– Paralinguistic: emotional and physiological characteristics
Introduction
[Figure: "How are you?" feeds speech, language, speaker, emotion and accent recognition; speaker recognition divides into the three tasks below.]
• Speaker Diarization: partitions an input audio stream into homogeneous segments according to the speaker identity
• Speaker Identification: determines who is speaking given a set of enrolled speakers
• Speaker Verification: determines if the unknown voice is from the claimed speaker
[Figure: Speaker identification compares the unknown speaker against every model in the model repository (Speaker 1, Speaker 2, ..., Speaker M) and returns the best matching speaker. Speaker verification compares the unknown speaker only against the claimed speaker's model from the repository and rejects the claim if they do not match.]
Speaker Verification Applications - Biometrics
• Access control: physical facilities
• Transaction authentication: telephone credit card purchases
Speaker Verification System – Basic Overview
[Figure: Speech → Feature Extraction (front-end) → Classification against a Speaker Model → Decision Making (back-end) → Accept/Reject]
• In automatic speaker verification,
– The front-end converts the speech signal into a more convenient representation (typically a set of feature vectors)
– The back-end compares this representation to a model of a speaker to determine how well they match
Speaker Verification System
[Figure: the claim "I am John" passes through feature extraction to give feature vectors (c0, c1, c2, ..., cn). The level of match is determined against John's speaker model and against a Universal Background Model (UBM; generic male or generic female), giving a likelihood under each. Decision making uses the likelihood ratio; in this example the outcome is NOT JOHN.]
UBM: a general, speaker-independent model that is compared against a person-specific model when making an accept or reject decision.
Speaker Verification System – Speaker Enrolment
Step 1: Creating a male UBM
[Figure: background male speaker data (Speaker 1, Speaker 2, ..., Speaker N) → Feature Extraction → Model Training → Universal Background Models (UBM): Generic Male, Generic Female]
Step 2: Creating male speaker-specific models
[Figure: target male speaker data (Speaker x1, Speaker x2, ..., Speaker xM) → Feature Extraction → Model Adaptation → one speaker model per target speaker]
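The slide does not say which adaptation method Step 2 uses. One common choice for GMM-UBM systems is MAP adaptation of the UBM means, and the sketch below assumes that choice. The relevance factor of 16, the diagonal covariances and all function names are illustrative assumptions, not taken from the lecture.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def posteriors(x, weights, means, variances):
    """Posterior probability of each UBM mixture given one frame x."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)                       # subtract the max for stability
    exps = [math.exp(value - peak) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation (assumed method; relevance=16 is a guess)."""
    num_mix, dim = len(weights), len(means[0])
    counts = [0.0] * num_mix                          # soft frame counts
    sums = [[0.0] * dim for _ in range(num_mix)]      # posterior-weighted sums
    for x in frames:
        post = posteriors(x, weights, means, variances)
        for k in range(num_mix):
            counts[k] += post[k]
            for d in range(dim):
                sums[k][d] += post[k] * x[d]
    # Interpolate between the data mean and the UBM mean for each mixture.
    return [[(sums[k][d] + relevance * means[k][d]) / (counts[k] + relevance)
             for d in range(dim)]
            for k in range(num_mix)]
```

Mixtures that see little target data keep means close to the UBM, while mixtures that see a lot of data move towards the target speaker's statistics.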
Detailed Speaker Verification System
[Figure: Speech → Feature Extraction → Feature Normalisation → Speaker Modelling → Model Normalisation → Classification (Scoring) → Score Normalisation → Decision Making → Accept/Reject]
Feature normalisation:
· Cepstral Mean Subtraction (CMS)
· RelAtive SpecTrAl (RASTA)
· Feature Warping
· Feature Mapping
Model normalisation:
· Nuisance Attribute Projection (NAP)
· Joint Factor Analysis (JFA)
· i-vectors
· Within Class Covariance Normalisation (WCCN)
· Linear Discriminant Analysis (LDA)
· Probabilistic Linear Discriminant Analysis (PLDA)
Score normalisation:
· Zero-normalisation (Z-norm)
· Test-normalisation (T-Norm)
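As an illustration of the score normalisation stage, Z-norm can be sketched as follows. It assumes the target model has already been scored against a cohort of impostor utterances; the function name and the use of the population standard deviation are assumptions made for this example.

```python
import statistics

def z_norm(raw_score, impostor_scores):
    """Normalise a raw score using impostor-score statistics for one target model."""
    mu = statistics.mean(impostor_scores)
    sigma = statistics.pstdev(impostor_scores)   # population standard deviation
    return (raw_score - mu) / sigma

# Hypothetical scores: impostor trials against the target model, then a test score.
print(z_norm(2.0, [-1.0, 0.0, 1.0]))
```

After normalisation, scores from different target models share a common scale, which makes a single decision threshold more meaningful.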
Front-end: Feature Extraction
[Figure: the speech signal is divided into 25 ms frames (Frame 1, Frame 2, Frame 3, ..., Frame N). Each frame is windowed and passed through feature extraction to give a basic feature vector (c0, c1, c2, ..., cn). Feature normalisation then gives a normalised feature vector per frame. Two histograms show the c0 distribution before and after normalisation, on an axis from -5 to 5.]
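The framing and normalisation steps can be sketched as below. Only the 25 ms frame length comes from the slide; the 10 ms hop, the Hamming window, the choice of cepstral mean subtraction as the normalisation, and the function names are assumptions for illustration. The feature extraction itself (e.g. cepstral analysis) is left out.

```python
import math

def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split samples into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)     # hop size is an assumption
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

def cepstral_mean_subtraction(features):
    """Subtract the per-dimension mean taken over all frames of an utterance."""
    num_frames, dim = len(features), len(features[0])
    means = [sum(f[d] for f in features) / num_frames for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in features]

# Usage: one second of a 440 Hz tone sampled at 8 kHz.
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
frames = frame_signal(signal, sample_rate=8000)
print(len(frames), len(frames[0]))
```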
[Figure: for each frame (Frame 1, Frame 2, ..., Frame P), a temporal derivative of the normalised feature vector (c0, c1, c2, ..., cn) gives the delta feature vector (d0, d1, d2, ..., dn), and a second temporal derivative gives the acceleration feature vector (a0, a1, a2, ..., an). The three are concatenated to form the frame's features (c0 ... cn, d0 ... dn, a0 ... an), e.g. 39 dimensions.]
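One way to compute the temporal derivatives is a regression over neighbouring frames, sketched below. The two-frame context, the edge handling (repeating the first and last frame) and the split of 39 dimensions into 13 static, 13 delta and 13 acceleration coefficients are assumptions; the slide states only the 39-dimension example.

```python
def delta(features, context=2):
    """Regression-based temporal derivative of a list of feature vectors."""
    num_frames, dim = len(features), len(features[0])
    denom = 2 * sum(k * k for k in range(1, context + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, context + 1):
                nxt = features[min(t + k, num_frames - 1)][d]  # clamp at the end
                prv = features[max(t - k, 0)][d]               # clamp at the start
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out

def add_dynamic_features(static):
    """Concatenate static, delta and acceleration features for every frame."""
    d = delta(static)
    a = delta(d)
    return [c + dd + aa for c, dd, aa in zip(static, d, a)]

# Usage: 6 frames of 13 hypothetical static coefficients become 39-dimensional.
static = [[float(t)] * 13 for t in range(6)]
full = add_dynamic_features(static)
print(len(full), len(full[0]))
```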
Speaker Modelling
[Figure: modelling the probability distribution of the feature space, with axes Dimension 1 (C0) and Dimension 2 (C1). A second plot shows the overall PDF as the sum of Weighted Gaussian 1, Weighted Gaussian 2 and Weighted Gaussian 3.]
• The probability density function is approximated by a 3-component Gaussian mixture model
• Each Gaussian mixture consists of a mean (µ), a covariance (Σ) and a weight (w)
• All weights must sum to 1
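A one-dimensional version of the 3-component mixture can be evaluated as in the sketch below. The weights, means and variances are made-up values chosen only to illustrate the weighted sum; they are not read from the slide's plot.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, weights, means, variances):
    """Overall PDF: the weighted sum of the component Gaussians."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical 3-component mixture.
weights = [0.5, 0.3, 0.2]
means = [-2.0, 1.0, 4.0]
variances = [1.0, 2.0, 1.5]
print(gmm_pdf(0.0, weights, means, variances))
```

Because the weights sum to 1 and each component integrates to 1, the overall PDF also integrates to 1.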
Database for creating UBM (example)
• Training set – 56 male speakers (2 minutes of active speech per speaker) for creating the UBM
• Target set – 20 male speakers (2 minutes of active speech per speaker) for the speaker-specific models
• Test set – 250 male utterances (each speaker has many test utterances) with known identity
[Figure: the Universal Background Model (UBM) consists of 1024 Gaussian mixtures, and the target speaker model, built with the target speaker data, also consists of 1024 Gaussian mixtures. Each Gaussian mixture consists of a mean (µ), a covariance (Σ) and a weight (w). Both models are drawn in the plane of Feature Dimension 1 and Feature Dimension 2 with mixtures numbered 1, 2, ..., 998, ..., 1024. An example UBM mixture has weight = 0.2, mean = 0.9 and covariance = 0.9; an example target model mixture has weight = 0.3, mean = 0.8 and covariance = 0.5.]
Representing GMMs
• The UBM and each speaker model is a GMM
• Each of them will be represented by a vector of weights, a matrix of means and a matrix of covariances
GMM representation (Mixture 1, Mixture 2, ..., Mixture 1024, each with 39x1 means, 39x1 covariances and a 1x1 weight):
• 1x1024 weight vector
• 39x1024 means matrix
• 39x1024 covariances matrix
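A minimal container for this representation might look like the sketch below. The 39x1 covariance per mixture suggests diagonal covariances, which is what the sketch assumes. It stores one row per mixture (the transpose of the 39x1024 layout on the slide), and the class name and `flat_start` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DiagonalGMM:
    weights: List[float]            # one weight per mixture
    means: List[List[float]]        # one mean vector per mixture
    variances: List[List[float]]    # one diagonal covariance per mixture

    def __post_init__(self):
        m = len(self.weights)
        if len(self.means) != m or len(self.variances) != m:
            raise ValueError("need one mean and one variance vector per mixture")
        d = len(self.means[0])
        if any(len(row) != d for row in self.means + self.variances):
            raise ValueError("all mean and variance vectors must share one dimension")
        if abs(sum(self.weights) - 1.0) > 1e-6:
            raise ValueError("mixture weights must sum to 1")

    @property
    def num_mixtures(self):
        return len(self.weights)

    @property
    def dim(self):
        return len(self.means[0])

    def num_parameters(self):
        """Means and variances per dimension, plus one weight, for every mixture."""
        return self.num_mixtures * (2 * self.dim + 1)

def flat_start(num_mixtures=1024, dim=39):
    """Hypothetical placeholder model: equal weights, zero means, unit variances."""
    return DiagonalGMM(
        weights=[1.0 / num_mixtures] * num_mixtures,
        means=[[0.0] * dim for _ in range(num_mixtures)],
        variances=[[1.0] * dim for _ in range(num_mixtures)],
    )

model = flat_start()
print(model.num_mixtures, model.dim, model.num_parameters())
```

With 1024 mixtures and 39 dimensions, each model holds 1024 × (39 + 39 + 1) = 80,896 values.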
Decision Making
[Figure: after feature extraction, the level of match is determined against the speaker model (e.g. John's model) and against the Universal Background Model (generic male or generic female), giving a likelihood under each.]
Score: $L = \log \dfrac{\text{Likelihood } S \text{ came from speaker model}}{\text{Likelihood } S \text{ did not come from speaker model}}$
Decision: $L \lessgtr \theta$ (Reject if $L < \theta$, Accept if $L > \theta$)
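The score and threshold test can be sketched as below, using the UBM likelihood as the "did not come from speaker model" term. Averaging the log-likelihood ratio over frames, the diagonal covariances, the threshold of 0 and all names are assumptions made for this example.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one frame under a GMM, via log-sum-exp."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(value - peak) for value in logs))

def llr_score(frames, speaker, ubm):
    """Average per-frame log-likelihood ratio between speaker model and UBM."""
    total = sum(gmm_log_likelihood(x, *speaker) - gmm_log_likelihood(x, *ubm)
                for x in frames)
    return total / len(frames)

def verify(frames, speaker, ubm, threshold=0.0):
    """Accept the claim if the score exceeds the (hypothetical) threshold."""
    return "Accept" if llr_score(frames, speaker, ubm) > threshold else "Reject"

# Toy single-mixture models as (weights, means, variances); values are made up.
ubm = ([1.0], [[0.0]], [[1.0]])
john = ([1.0], [[2.0]], [[1.0]])
print(verify([[2.0], [2.1], [1.9]], john, ubm))
```

In practice the threshold θ is tuned on development data to balance false acceptances against false rejections.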