I–vector transformation and scaling for PLDA based speaker recognition
Sandro Cumani, Pietro Laface
Politecnico di Torino, Italy


  1. I–vector transformation and scaling for PLDA based speaker recognition
     Sandro Cumani, Pietro Laface
     Politecnico di Torino, Italy

  2. Outline
     - I–vectors and PLDA
     - I–vector transformation
     - Dataset mismatch compensation
     - Results and conclusions

  3. PLDA assumptions
     - I–vectors are sampled from a Gaussian distribution
     - Development and evaluation i–vectors follow similar distributions
     [Figures: distribution of squared i–vector norms for the Dev and Eval sets against the χ² reference; histogram of the two (whitened) i–vector components with highest skewness, φ1 and φ2, against N(0, 1).]

  4. HT–PLDA / length norm
     - HT–PLDA tries to deal with non–Gaussianity
     - Length normalization (LN) mainly deals with dataset mismatch
     [Figures: histograms of the two (whitened) i–vector components with highest skewness, φ1 and φ2, before LN and after LN, compared with N(0, 1).]

  5. HT–PLDA / length norm
     Length–normalized i–vectors are still far from Gaussian.
     Can we transform i–vectors so as to better fit the Gaussian PLDA assumptions and, at the same time, perform a similar dataset mismatch compensation?

  6. I–vector transformation
     Assume that i–vectors are sampled from a random variable Φ.
     Represent Φ as a function of a standard normal R.V. Y:

         Φ = f⁻¹(Y)

     The (log) p.d.f. of Φ follows from the change of variables formula:

         log P_Φ(φ) = log P_Y(f(φ)) + log |J_f(φ)|
                    = −(1/2) f(φ)ᵀ f(φ) + log |J_f(φ)| + c
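The change-of-variables density above can be made concrete with a small sketch. This is our own illustration, not the authors' code: f is a hypothetical affine map f(φ) = Aφ + b, whose Jacobian is constant and equal to A.

```python
import numpy as np

# Hypothetical affine transform f(phi) = A @ phi + b (illustration only).
A = np.array([[2.0, 0.0],
              [0.5, 1.0]])
b = np.array([0.1, -0.3])

def f(phi):
    return A @ phi + b

def log_pdf_phi(phi):
    """log P_Phi(phi) = -1/2 f(phi)^T f(phi) + log|det J_f(phi)| + c."""
    y = f(phi)
    d = phi.shape[0]
    log_det_jac = np.log(abs(np.linalg.det(A)))   # Jacobian of an affine map is A
    c = -0.5 * d * np.log(2.0 * np.pi)            # standard-normal constant
    return -0.5 * y @ y + log_det_jac + c

phi = np.array([0.4, -1.2])
print(log_pdf_phi(phi))
```

For a nonlinear f the log-determinant term varies with φ, but the structure of the objective stays the same.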

  7. I–vector transformation
     How do we model the (unknown) function f?
     Neural–network–style approach: composition of (invertible) layers

         f(φ; θ1, …, θn) = f1(·; θ1) ∘ ⋯ ∘ fn(·; θn)

     f is estimated so as to maximize the likelihood of the samples of Φ (in our case the i–vectors).
     The function f allows transforming Φ–distributed samples into (almost) Gaussian–distributed samples.
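The layer composition can be sketched as follows (our notation, not the authors' code): each invertible layer returns its output together with its log |det Jacobian|, and the log-determinant of the composition is simply the sum over layers, which is exactly the correction term the likelihood needs.

```python
import numpy as np

def make_affine(A, b):
    # For an affine layer the Jacobian is the constant matrix A.
    log_det = np.log(abs(np.linalg.det(A)))
    return lambda phi: (A @ phi + b, log_det)

def compose(layers, phi):
    """Apply f = f_n o ... o f_1 and accumulate the total log|det J_f|."""
    total_log_det = 0.0
    for layer in layers:
        phi, log_det = layer(phi)
        total_log_det += log_det
    return phi, total_log_det

layers = [make_affine(2.0 * np.eye(3), np.zeros(3)),
          make_affine(np.eye(3), np.ones(3))]
y, log_det = compose(layers, np.array([0.5, -0.2, 1.0]))
print(y, log_det)  # y = 2*phi + 1, log_det = 3*log(2)
```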

  8. I–vector transformation
     [Figures: (a) histogram of φ1 and the estimated p.d.f. of Φ; (b) the estimated transformation function f; (c) histogram of f(φ1) and the p.d.f. of Y ~ N(0, 1).]

  9. Transformation model
     Simple structure: a cascade of two types of transformations.
     - Affine layer: f_A(φ; A, b) = Aφ + b
     - SAS layer (acting as the non–linear units), itself a cascade of
         inverse sinh layer:    f_S1(φi) = sinh⁻¹(φi)
         diagonal affine layer: f_S2(φi; δi, εi) = δi φi + εi
         sinh layer:            f_S3(φi) = sinh(φi)
     The SAS layer can be summarized as

         f_SAS(φi; δi, εi) = sinh(δi sinh⁻¹(φi) + εi)
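A sketch of the SAS layer (our code, with the δi assumed positive). The layer acts elementwise and is invertible; its Jacobian is diagonal, so the log |det| is a sum of elementwise log-derivatives: d/dφ sinh(u) = cosh(u) · δ / sqrt(1 + φ²) with u = δ·asinh(φ) + ε.

```python
import numpy as np

def sas_forward(phi, delta, eps):
    u = delta * np.arcsinh(phi) + eps
    out = np.sinh(u)
    log_det = np.sum(np.log(np.cosh(u)) + np.log(delta) - 0.5 * np.log1p(phi ** 2))
    return out, log_det

def sas_inverse(y, delta, eps):
    # Undo each sub-layer in reverse: sinh -> asinh, affine, asinh -> sinh.
    return np.sinh((np.arcsinh(y) - eps) / delta)

phi = np.array([0.3, -1.5, 2.0])
delta = np.array([1.2, 0.8, 1.0])
eps = np.array([0.0, 0.1, -0.2])
y, log_det = sas_forward(phi, delta, eps)
print(np.allclose(sas_inverse(y, delta, eps), phi))  # True
```

Note that with δ = 1 and ε = 0 the layer reduces to the identity, which is what makes identity initialization of the whole transformation possible.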

  10. Transformation model
     The parameters of the transformation function f are estimated using a maximum likelihood criterion.
     Gradients can be computed with an algorithm similar to back–propagation with an MSE loss, except that the Jacobian log–determinants must also be taken into account.
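The objective being minimized can be sketched as follows (our illustration): given the transformed vectors yᵢ = f(φᵢ) and the Jacobian log-determinants, the negative log-likelihood is the MSE-like Gaussian term corrected by the log-determinants, the extra term mentioned above.

```python
import numpy as np

def neg_log_lik(Y, log_dets):
    """Average negative log-likelihood over n transformed samples.

    Y:        (n, d) array of transformed vectors y_i = f(phi_i)
    log_dets: (n,)  array of log|det J_f(phi_i)|
    """
    n, d = Y.shape
    gauss = 0.5 * np.sum(Y ** 2, axis=1) + 0.5 * d * np.log(2.0 * np.pi)
    return np.mean(gauss - log_dets)

# For Y = 0 and zero log-determinants only the normalization term remains.
print(neg_log_lik(np.zeros((4, 3)), np.zeros(4)))
```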

  11. SAS transformation on SRE ’10 data (female)
     [Figures: distribution of squared i–vector norms before the transformation, and of squared norms of 1–layer–SAS transformed i–vectors, for the Dev and Eval sets against the χ² reference.]

                         Cond 1        Cond 2        Cond 3        Cond 4        Cond 5
     System              EER   DCF10   EER   DCF10   EER   DCF10   EER   DCF10   EER   DCF10
     PLDA (w/o LN)       2.06  0.288   3.60  0.541   3.27  0.481   1.71  0.335   3.91  0.417
     1–layer AS          2.15  0.221   3.36  0.462   2.96  0.414   1.61  0.290   3.19  0.391
     PLDA (with LN)      1.81  0.255   2.83  0.476   1.95  0.367   1.21  0.295   2.19  0.347

  12. Dataset mismatch compensation
     Length–norm can be interpreted as the ML solution for the estimate of a scaling parameter of the i–vector distribution.
     An i–vector φi is sampled from the R.V. Φi:

         Φi ~ N(0, αi⁻² Σ)

     Given Σ, the ML estimate for αi satisfies

         α̂i⁻¹ = sqrt( φiᵀ Σ⁻¹ φi / D )
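The ML scaling estimate can be sketched directly (our illustration). With Σ = I the estimate reduces to α⁻¹ = ||φ|| / sqrt(D), so scaling by α is length-normalization up to the constant factor sqrt(D).

```python
import numpy as np

def ml_alpha(phi, Sigma):
    """ML estimate of alpha for phi ~ N(0, alpha^-2 Sigma)."""
    D = phi.shape[0]
    return 1.0 / np.sqrt(phi @ np.linalg.solve(Sigma, phi) / D)

phi = np.array([3.0, 4.0])                 # ||phi|| = 5, D = 2
alpha = ml_alpha(phi, np.eye(2))
print(np.linalg.norm(alpha * phi))         # the scaled vector has norm sqrt(D)
```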

  13. Dataset mismatch compensation
     Φi can be represented as

         Φi = αi⁻¹ Σ^(1/2) Y ,   Y ~ N(0, I)

     The transformation function f is given by

         f(φi; A, αi) = αi A φi ,   A = Σ^(−1/2)

     Applying whitening followed by length–norm is equivalent to applying the transformation f using the ML estimates of A and αi.
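The stated equivalence can be checked numerically (our sketch, on synthetic data): whitening with A = Σ^(−1/2) followed by length-norm matches f(φ) = α A φ with the ML α, up to the constant factor sqrt(D), since φᵀΣ⁻¹φ = ||Aφ||².

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
M = rng.standard_normal((D, D))
Sigma = M @ M.T + D * np.eye(D)              # a synthetic SPD covariance

# Symmetric inverse square root of Sigma via eigendecomposition.
w, V = np.linalg.eigh(Sigma)
A = V @ np.diag(w ** -0.5) @ V.T

phi = rng.standard_normal(D)
length_normed = (A @ phi) / np.linalg.norm(A @ phi)

alpha = 1.0 / np.sqrt(phi @ np.linalg.solve(Sigma, phi) / D)   # ML estimate
f_phi = alpha * (A @ phi)

print(np.allclose(f_phi, np.sqrt(D) * length_normed))  # True
```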

  14. Dataset mismatch compensation
     - We introduce an α–layer (scaling layer), whose single parameter is i–vector dependent
     - The function f is a cascade of the α–layer and the original SAS layers
     - For efficiency reasons we perform alternate estimation of the SAS parameters and the αi’s
     - At testing time, for each test i–vector, we need to estimate the corresponding αi

  15. SRE ’10 results (female)
     - 400–dimensional i–vectors reduced to 150 dimensions through LDA
     - i–vectors are whitened (this allows initializing the transformation as the identity function)
     Results of the α–scaled SAS transformation on the female set of the NIST SRE 2010 dataset:

                              Cond 1        Cond 2        Cond 3        Cond 4        Cond 5
     System                   EER   DCF10   EER   DCF10   EER   DCF10   EER   DCF10   EER   DCF10
     PLDA (w/o LN)            2.06  0.288   3.60  0.541   3.27  0.481   1.71  0.335   3.91  0.417
     1–layer AS               2.15  0.221   3.36  0.462   2.96  0.414   1.61  0.290   3.19  0.391
     PLDA (with LN)           1.81  0.255   2.83  0.476   1.95  0.367   1.21  0.295   2.19  0.347
     1–layer α–AS iter. 1     1.80  0.204   2.83  0.424   2.15  0.373   1.20  0.280   2.03  0.333
     1–layer α–AS iter. 3     1.38  0.192   2.58  0.406   2.30  0.361   1.20  0.237   2.16  0.322
