Uncertainty Modeling without Subspace Methods for Text-Dependent Speaker Recognition


  1. Uncertainty Modeling without Subspace Methods for Text-Dependent Speaker Recognition
     Patrick Kenny, Themos Stafylakis, Md. Jahangir Alam and Marcel Kockmann
     Odyssey Speaker and Language Recognition Workshop, Bilbao, Spain, June 2016

  2. Uncertainty Modeling in Text-Dependent Speaker Recognition
     - Large numbers of mixture components are surprisingly effective in text-dependent speaker recognition, where utterances are typically of 1 or 2 seconds duration.
     - The number of times a mixture component is observed is typically ≪ 1, and it could be 0 (particularly at test time), so observations ought to be treated as noisy in the statistical sense.
     - Some progress has been made in uncertainty modeling in text-independent speaker recognition with subspace methods (i-vectors, speaker factors), but these are of limited use in text-dependent speaker recognition.
     - We tackle the problem of uncertainty modeling without resorting to subspace methods.

  3. RSR2015 Part III (Random Digits)
     - Background set (97 speakers) used for JFA and backend training.
     - Results reported on the development set.
     - Enrollment consists of 3 utterances of the 10 digits in random order.
     - Each test utterance consists of a random string of 5 digits.
     - Error rates are much higher than on Part I.
     - Counterintuitively, it is hard to beat a naive GMM/UBM benchmark using HMMs.
     - We focus on backend modeling with a standard 60-dimensional PLP front end.

  4. JFA for Speaker Recognition with Digits
     - Given a speaker and a collection of enrollment recordings, the recordings are modeled by supervectors of the form
       \[ m + U x_r + D z \qquad (1) \]
     - Speakers are characterized by z-vectors (supervector sized); the x-vectors (low-dimensional) model channel effects.
     - To perform speaker recognition, for each digit d in a test utterance, compare the z-vectors z_e and z_t, where z_e is extracted from the enrollment utterances and z_t is extracted from the test utterance.
     - z-vectors may be digit-independent (global) or digit-dependent (local). (A toy sketch of the decomposition (1) follows below.)
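As a concrete illustration of the decomposition in (1), here is a minimal numeric sketch in Python/NumPy. The dimensions and the parameters m, U, D are toy values invented for illustration; in practice they are trained on the background set.

```python
import numpy as np

rng = np.random.default_rng(0)

SV_DIM, CHANNEL_DIM = 12, 2  # toy sizes; a real supervector is (num. components x feature dim.)

# Hypothetical JFA model parameters (trained on the background set in practice)
m = rng.normal(size=SV_DIM)                  # UBM mean supervector
U = rng.normal(size=(SV_DIM, CHANNEL_DIM))   # low-rank channel loading matrix
D = np.diag(rng.uniform(0.5, 1.5, SV_DIM))   # diagonal speaker loading matrix

# Latent variables for one recording r
x_r = rng.normal(size=CHANNEL_DIM)  # channel factors (low-dimensional)
z = rng.normal(size=SV_DIM)         # speaker z-vector (supervector sized)

# Supervector for the recording, as in (1)
supervector = m + U @ x_r + D @ z
```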

  5. Two Backends
     - The Joint Density Backend uses point estimates of z_e and z_t.
     - The Hidden Supervector Backend treats z_e and z_t as latent variables. Inference requires:
       - Baum-Welch statistics;
       - a joint prior distribution P(w) under the same-speaker hypothesis, where w = (z_e, z_t);
       - calculating the posterior of w given the Baum-Welch statistics.

  6. Joint Density Backend
     - The joint distribution for target trials, P_T(z_e, z_t), is modeled by a Gaussian for each mixture component.
     - Insufficient data to train full-covariance Gaussians, and diagonal Gaussians obviously incorrect; "semi-diagonal" constraints are used instead (see paper).
     - Gaussians estimated by arranging the background set into a collection of target trials.
     - For non-target trials, assume statistical independence, i.e. P_N(z_e, z_t) = P_T(z_e) × P_T(z_t).
     - Likelihood ratio for speaker verification:
       \[ \prod \frac{P_T(z_e, z_t)}{P_N(z_e, z_t)} \]
       where the product ranges over the digits in the test utterance and the mixture components in the UBM. (A scoring sketch follows below.)
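A minimal scoring sketch for one digit and one mixture component, assuming a full-covariance target-trial Gaussian for simplicity (the paper uses semi-diagonal constraints instead). The function name and the use of scipy.stats are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def jdb_component_llr(z_e, z_t, mu, cov):
    """Log-likelihood ratio log P_T(z_e, z_t) - log P_N(z_e, z_t)
    for one digit and one mixture component.

    mu, cov : mean and covariance of the target-trial Gaussian over
    (z_e, z_t), estimated from background target trials."""
    w = np.concatenate([z_e, z_t])
    d = len(z_e)
    # Same-speaker hypothesis: joint Gaussian over (z_e, z_t)
    log_pt = multivariate_normal.logpdf(w, mu, cov)
    # Different-speaker hypothesis: independence, so score the marginals
    log_pn = (multivariate_normal.logpdf(z_e, mu[:d], cov[:d, :d])
              + multivariate_normal.logpdf(z_t, mu[d:], cov[d:, d:]))
    return log_pt - log_pn
```

The verification score is then the sum of these log-ratios over the digits in the test utterance and the mixture components of the UBM.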

  7. Hidden Supervector Backend
     - For each mixture component, treat z_e, z_t as a pair of hidden mean vectors which are correlated in the case of a target trial.
     - Use an "i-vector extractor" to do probability calculations (not to extract factors).
     - The "i-vector" w is the pair (z_e, z_t), so its dimension is twice that of the acoustic feature vectors.
     - The i-vector model has full rank, so we can take the total variability matrix to be the identity and shift the burden of modeling the correlation between z_e and z_t to the prior.
     - The prior cannot be standard normal, so it needs to be estimated.

  8. Posterior Calculations
     For an i-vector extractor with a non-standard prior,
     \[ \mathrm{Cov}(w, w) = \Big( P + \sum_c N_c\, T_c^\top T_c \Big)^{-1} \]
     \[ \langle w \rangle = \mathrm{Cov}(w, w) \Big( P \mu + \sum_c T_c^\top F_c \Big) \]
     where \mu is the prior expectation and P the precision. (In the standard case, \mu = 0 and P = I.) A numerical sketch follows below.
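The two formulas translate directly into NumPy. A sketch assuming whitened statistics (so no covariance matrices appear) and explicit loading matrices T_c; for the full-rank model of the preceding slide, each T_c is simply the identity, and the enrollment and test statistics attach to the z_e and z_t blocks of w respectively.

```python
import numpy as np

def posterior(N, F, T, mu, P):
    """Posterior of w under an i-vector model with prior N(mu, P^{-1}).

    N : zero-order Baum-Welch statistics, shape (C,)
    F : first-order Baum-Welch statistics, shape (C, d_feat)
    T : loading matrices T_c, shape (C, d_feat, d_w)
    """
    # Cov(w, w) = (P + sum_c N_c T_c^T T_c)^{-1}
    cov = np.linalg.inv(P + np.einsum('c,cij,cik->jk', N, T, T))
    # <w> = Cov(w, w) (P mu + sum_c T_c^T F_c)
    mean = cov @ (P @ mu + np.einsum('cij,ci->j', T, F))
    return mean, cov
```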

  9. Minimum Divergence Estimation of the Prior
     - We need to supply the mean \mu and precision matrix P that specify the prior distribution of "i-vectors" for same-speaker trials.
     - Arrange the background set into a collection of target trials indexed by s = 1, ..., S and let w(s) be the "i-vector" for trial s:
       \[ \mu = \frac{1}{S} \sum_s \langle w(s) \rangle \]
       \[ P^{-1} = \frac{1}{S} \sum_s \langle w(s)\, w^\top(s) \rangle - \mu \mu^\top \]
     - Minor modifications make \mu and P digit-dependent or impose semi-diagonal constraints. (A sketch of this estimate follows below.)
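A sketch of the two estimates, using point estimates w(s) in place of the posterior moments ⟨w(s)⟩ and ⟨w(s) w^⊤(s)⟩ for brevity (the posterior second moment would add the posterior covariance to the outer product):

```python
import numpy as np

def minimum_divergence_prior(W):
    """Same-speaker prior N(mu, P^{-1}) from background target trials.

    W : shape (S, 2d); row s is the "i-vector" w(s) = (z_e, z_t)
        for target trial s."""
    S = len(W)
    mu = W.sum(axis=0) / S                # mu = (1/S) sum_s w(s)
    cov = W.T @ W / S - np.outer(mu, mu)  # P^{-1} = (1/S) sum_s w w^T - mu mu^T
    return mu, np.linalg.inv(cov)
```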

  10. For the different-speaker hypothesis, treat z_e and z_t as being statistically independent. In other words, suppress the cross-correlations in the covariance matrix P^{-1} that defines the prior under the same-speaker hypothesis. (A sketch follows below.)
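Concretely, the different-speaker prior can be obtained by zeroing the off-diagonal blocks of the same-speaker prior covariance. A sketch (the function name is hypothetical):

```python
import numpy as np

def different_speaker_prior(mu, P, d):
    """Suppress the cross-correlations between z_e and z_t in the
    same-speaker prior covariance P^{-1}; z_e and z_t each have dim d."""
    cov = np.linalg.inv(P)  # inv returns a fresh array, safe to modify
    cov[:d, d:] = 0.0       # Cov(z_e, z_t) := 0
    cov[d:, :d] = 0.0       # Cov(z_t, z_e) := 0
    return mu, np.linalg.inv(cov)
```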

  11. Likelihood Ratio
     - Given data and a probability model with hidden variables, the evidence is the likelihood of the data calculated by integrating out the hidden variables.
     - For an i-vector model the integral can be evaluated in closed form (it is a Gaussian integral) and expressed in terms of the Baum-Welch statistics (see paper).
     - To evaluate the likelihood ratio for a speaker verification trial, evaluate the evidence twice:
       - using the prior for the same-speaker hypothesis;
       - using the prior for the different-speaker hypothesis.
     (A sketch of this computation follows below.)
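For the full-rank model (T_c = I, whitened statistics), the Gaussian integral gives, up to a prior-independent constant, log E = ½(log|P| − log|Λ| + b^⊤ Λ^{-1} b − μ^⊤ P μ) with Λ = (Σ_c N_c) I + P and b = Σ_c F_c + P μ. This closed form is a reconstruction under those simplifying assumptions, not the paper's exact expression; the dropped constant cancels in the likelihood ratio.

```python
import numpy as np

def log_evidence(N, F, mu, P):
    """Log evidence of pooled Baum-Welch statistics (N, F) under the
    prior N(mu, P^{-1}), up to an additive constant that does not
    depend on the prior. Assumes T_c = I and whitened statistics."""
    d = F.shape[1]
    Lam = N.sum() * np.eye(d) + P  # posterior precision
    b = F.sum(axis=0) + P @ mu
    _, logdet_P = np.linalg.slogdet(P)
    _, logdet_L = np.linalg.slogdet(Lam)
    return 0.5 * (logdet_P - logdet_L
                  + b @ np.linalg.solve(Lam, b) - mu @ (P @ mu))

def verification_llr(N, F, same, diff):
    """same, diff : (mu, P) pairs for the two hypotheses.
    The prior-independent constants cancel in the difference."""
    return log_evidence(N, F, *same) - log_evidence(N, F, *diff)
```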

  12. Preparing the Baum-Welch Statistics
     - For each speaker, we have a collection of (enrollment or test) recordings indexed by r.
     - For each mixture component c, zero- and first-order statistics denoted by N_c^r and F_c^r.
     - Remove the channel effects from each recording and pool over recordings:
       \[ N_c = \sum_r N_c^r \]
       \[ F_c = \sum_r \big( F_c^r - N_c^r U_c \langle x_r \rangle \big) \]
       where \langle x_r \rangle is a point estimate of the hidden variable x_r in (1).
     - One set of "synthetic" statistics per speaker (regardless of the number of recordings). (A pooling sketch follows below.)
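A pooling sketch following the slide's formulas; the array shapes and the block layout of U are assumptions made for illustration:

```python
import numpy as np

def pool_statistics(N_r, F_r, U, x_hat):
    """One set of "synthetic" statistics per speaker, with channel
    effects removed from each recording before pooling.

    N_r   : zero-order stats,  shape (R, C)    -- R recordings, C components
    F_r   : first-order stats, shape (R, C, d)
    U     : channel loading blocks U_c, shape (C, d, q)
    x_hat : point estimates <x_r>, shape (R, q)
    """
    N = N_r.sum(axis=0)                                   # N_c = sum_r N_c^r
    channel = np.einsum('rc,cdq,rq->rcd', N_r, U, x_hat)  # N_c^r U_c <x_r>
    F = (F_r - channel).sum(axis=0)                       # F_c = sum_r (F_c^r - N_c^r U_c <x_r>)
    return N, F
```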

  13. "Length Normalization" of the Synthetic Statistics
     - In the JFA model (1), z_c is a hidden variable.
     - The posterior covariance and expectation, C_c and \langle z_c \rangle, are given by
       \[ C_c = ( I + N_c D_c^* D_c )^{-1} \]
       \[ \langle z_c \rangle = C_c D_c^* F_c \]
       so that
       \[ \langle \| z_c \|^2 \rangle = \| \langle z_c \rangle \|^2 + \mathrm{trace}(C_c) \qquad (2) \]
     - For each speaker, we scale the synthetic first-order statistics so that \sum_c \langle \| z_c \|^2 \rangle is the same for all speakers. (A sketch follows below.)
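A sketch of one plausible implementation: since scaling F_c scales ⟨z_c⟩ but leaves trace(C_c) unchanged, the scale factor solves s² · Σ_c ‖⟨z_c⟩‖² + Σ_c trace(C_c) = target. The exact normalization used in the paper may differ; D is taken real and diagonal here, so D_c^* = D_c and C_c is diagonal.

```python
import numpy as np

def length_normalize(N, F, D, target):
    """Scale a speaker's synthetic first-order statistics so that
    sum_c <||z_c||^2> takes a common target value.

    N : pooled zero-order stats,  shape (C,)
    F : pooled first-order stats, shape (C, d)
    D : diagonals of the D_c blocks, shape (C, d) (real, so D_c* = D_c)
    """
    C_diag = 1.0 / (1.0 + N[:, None] * D * D)  # diag of C_c = (I + N_c D_c* D_c)^{-1}
    z_mean = C_diag * D * F                    # <z_c> = C_c D_c* F_c
    trace_sum = C_diag.sum()                   # sum_c trace(C_c)
    norm_sq = (z_mean ** 2).sum()              # sum_c ||<z_c>||^2
    # scaling F by s scales <z_c> by s but not trace(C_c);
    # assumes target > trace_sum
    scale = np.sqrt((target - trace_sum) / norm_sq)
    return F * scale
```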

  14. Remarks
     - The dominant term in (2) is trace(C_c).
     - An experiment in Appendix A demonstrates its usefulness.
     - The posterior covariance matrix C_c depends critically on the relevance factor.

  15. 128 Mixture Components, Global z-vectors

       #  System  norm.?  EER (M/F)   DCF (M/F)
       1  GMM     -       4.8%/8.0%   0.217/0.356
       2  JDB     -       4.8%/7.6%   0.219/0.353
       3  HSB     ×       4.5%/6.8%   0.201/0.338
       4  HSB     ✓       3.9%/6.1%   0.177/0.307

     Table 1: Results on the development set obtained with 128 Gaussians. The systems are a GMM/UBM system, the Joint Density Backend (JDB) and the Hidden Supervector Backend (HSB), both with global z-vectors. Baum-Welch statistics normalization is indicated by "norm.".

  16. 512 Components, Global z-vectors

       #  System  r  EER (M/F)   DCF (M/F)
       1  GMM     2  4.7%/8.2%   0.195/0.336
       2  JDB     2  4.3%/6.1%   0.196/0.288
       5  HSB     1  3.3%/4.6%   0.148/0.234

     Table 2: Results on the development set obtained with 512 Gaussians and global z-vectors (the r column gives the relevance factor).
