On autoencoders in the i-vector space for speaker recognition
Timur Pekhovsky, Sergey Novoselov, Aleksey Sholokhov, Oleg Kudashev
Speech Technology Center Ltd., Russia

This work was financially supported by the Ministry of Education and Science of the Russian Federation (contract 14.578.21.0126, ID RFMEFI57815X0126).
OUTLINE
• Motivation and goals
• Detailed study of the DAE system
  – Datasets and experimental setup
  – Front-End and i-vector extractor
  – DAE system description & DAE training procedure
  – Back-End and scoring. Replacing back-end
  – Analysis of the DAE system performance
• An improved DAE system
  – Dropout regularization
  – Deep architectures
• DAE system in the domain mismatch scenario
  – Dataset
  – Back-Ends
  – Results
• Conclusions
Motivation and goals
The denoising autoencoder (DAE) based speaker verification system achieved a performance improvement over the commonly used baseline (PLDA on raw i-vectors) [1]. This motivated a detailed investigation:
• to study the properties of the DAE in the i-vector space
• to analyze different strategies for initializing and training the back-end parameters
• to investigate dropout regularization
• to explore different deep DAE architectures
• to investigate the DAE based system under domain mismatch conditions

[1] Sergey Novoselov, Timur Pekhovsky, Oleg Kudashev, Valentin Mendelev, and Alexey Prudnikov, “Non-linear PLDA for i-vector speaker verification,” in INTERSPEECH 2015, Dresden, Germany, September 6-10, 2015, pp. 214-218.
Detailed study of the DAE system
Datasets and experimental setup
Training data:
• telephone channel recordings from the NIST SRE 1998-2008 corpora
• 16618 sessions of 1763 male speakers (English language only)
Evaluation data:
• the NIST 2010 SRE protocol (extended condition 5, males, English language)
Operating points:
• equal error rate (EER)
• minimum detection cost function (minDCF 2010)
Front-End and i-vector extractor
• 20 MFCCs (including C0) with their first- and second-order derivatives (Kaldi version)
• DNN based posterior extraction with 11-frame splicing for the DNN input
• DNN with 2700 triphone states and 20 non-speech states (trained on the Switchboard corpus using Kaldi)
• “SoftVAD” solution using the DNN outputs (a code sketch of these statistics follows after this slide):

$$\hat{F}_c = F_c - m\,N_c, \qquad c \in J_{ts}$$
$$m = \frac{\sum_{c \in J_{ts}} F_c}{\sum_{c \in J_{ts}} N_c}, \qquad \sigma^2 = \frac{\sum_{c \in J_{ts}} S_c}{\sum_{c \in J_{ts}} N_c} - m^2$$

where $J_{ts}$ is the set of DNN output indexes corresponding to triphone (speech) states, and $N_c$, $F_c$, $S_c$ are the 0th-, 1st- and 2nd-order statistics.
• 400-dimensional i-vectors
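A minimal sketch of computing these soft statistics, assuming per-frame DNN posteriors are used directly as soft alignments and only the diagonal of the second-order statistics is kept; all names (`soft_stats`, `speech_idx`, array shapes) are illustrative, not taken from the slides:

```python
import numpy as np

def soft_stats(feats, gamma, speech_idx):
    """feats: [T, D] frame features; gamma: [T, C] DNN output posteriors;
    speech_idx: the indexes J_ts of the triphone (speech) outputs."""
    g = gamma[:, speech_idx]                    # soft alignments, speech states only
    N = g.sum(axis=0)                           # zeroth-order stats N_c
    F = g.T @ feats                             # first-order stats F_c
    S = g.T @ (feats ** 2)                      # diagonal second-order stats S_c
    m = F.sum(axis=0) / N.sum()                 # global mean m over speech frames
    sigma2 = S.sum(axis=0) / N.sum() - m ** 2   # global variance
    F_centered = F - np.outer(N, m)             # center the first-order stats
    return N, F_centered, sigma2
```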
DAE system description & DAE training procedure
Learning the denoising transform:
• $i(s,h)$ is the i-vector representing the $h$-th session of the $s$-th speaker
• $i(s)$ is the mean i-vector of speaker $s$
• The RBM parameters are used to initialize the denoising neural network
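A minimal training sketch of this idea: a one-hidden-layer autoencoder regressed from session i-vectors onto speaker means. The layer sizes, optimizer, and hyperparameters are illustrative assumptions; the slides only specify the targets and the RBM initialization:

```python
import torch
import torch.nn as nn

dim, hidden = 400, 1000          # illustrative sizes

# In the slides, the first layer's weights would be copied from the pretrained RBM.
dae = nn.Sequential(
    nn.Linear(dim, hidden), nn.Sigmoid(),
    nn.Linear(hidden, dim),
)
opt = torch.optim.SGD(dae.parameters(), lr=0.01, momentum=0.9)
mse = nn.MSELoss()

def train_step(i_sh, i_s):
    """i_sh: batch of session i-vectors i(s,h); i_s: matching speaker means i(s)."""
    opt.zero_grad()
    loss = mse(dae(i_sh), i_s)   # the denoising target is the speaker mean
    loss.backward()
    opt.step()
    return loss.item()
```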
DAE system description & DAE training procedure
Figure: Block diagram of the speaker recognition systems compared in our experiments.
Back-End and scoring
Two-covariance model:

$$\mathrm{Score}(i_1, i_2) = i_1^T Q\, i_1 + i_2^T Q\, i_2 + 2\, i_1^T P\, i_2 \qquad (1)$$

where the square matrices $Q$ and $P$ can be expressed in terms of the between- and within-speaker covariance matrices (2) and (3):

$$\Sigma_B = \frac{1}{S} \sum_{s} \bar{i}_s\, \bar{i}_s^T \qquad (2)$$
$$\Sigma_W = \frac{1}{S} \sum_{s} \frac{1}{H_s} \sum_{h} \left(i_{s,h} - \bar{i}_s\right)\left(i_{s,h} - \bar{i}_s\right)^T \qquad (3)$$
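A sketch of the scoring in code. The closed forms for $Q$ and $P$ below are the standard two-covariance-model expressions (valid up to an additive constant and a positive scale of the LLR, assuming centered i-vectors); they are an assumption here, not quoted from the slides:

```python
import numpy as np

def two_cov_params(Sigma_B, Sigma_W):
    """Q and P of Eq. (1) from the covariances of Eqs. (2)-(3)."""
    T = Sigma_B + Sigma_W                             # total covariance
    T_inv = np.linalg.inv(T)
    M = np.linalg.inv(T - Sigma_B @ T_inv @ Sigma_B)
    Q = T_inv - M
    P = T_inv @ Sigma_B @ M
    return Q, P

def two_cov_score(i1, i2, Q, P):
    """Verification score for a pair of (centered) i-vectors."""
    return i1 @ Q @ i1 + i2 @ Q @ i2 + 2 * i1 @ P @ i2
```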
Back-End and scoring. Replacing back-end
Analysis of the DAE system performance

Table 1: NIST SRE 2010 test
| System   | EER(%) | minDCF |
| Baseline | 1.67   | 0.347  |
| RBM      | 1.55   | 0.332  |
| DAE      | 1.43   | 0.284  |

Table 2: “Rus-Telecom”* test
| System   | EER(%) | minDCF |
| Baseline | 1.63   | 0.64   |
| RBM      | 1.65   | 0.63   |
| DAE      | 1.43   | 0.55   |

* Rus-Telecom is a Russian-language corpus of telephone recordings. The training set consists of 6508 male speakers and 33678 speech cuts; the evaluation part consists of 235 male speakers and 4210 speech cuts. The evaluation protocol (single-session enrollments) contains 37184 target trials and 111660 impostor trials.
Analysis of the DAE system performance
Assessing the denoising transform with a class-separability criterion:

$$J = \mathrm{Tr}\!\left(\Sigma_W^{-1} \Sigma_B\right) = \mathrm{Tr}(F)$$

where $\Sigma_W$ and $\Sigma_B$ are the within-speaker and between-speaker covariance matrices.

Figure 1: Eigenvalues of the matrix $F$.

No normalization was applied to the outputs of the RBM and DAE!

Table 3: NIST SRE 2010 test, cosine scoring
| System   | EER(%) | minDCF | J      |
| Baseline | 5.34   | 0.603  | 501.45 |
| RBM      | 5.27   | 0.611  | 525.65 |
| DAE      | 3.19   | 0.427  | 537.76 |
| AE       | 5.42   | 0.583  | 494.13 |
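A short sketch of estimating the criterion from labeled vectors, assuming globally centered data (consistent with Eq. (2)); the function and variable names are illustrative:

```python
import numpy as np

def separability(X, y):
    """X: [n, d] vectors (assumed globally centered); y: speaker labels."""
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    speakers = np.unique(y)
    for spk in speakers:
        Xs = X[y == spk]
        ms = Xs.mean(axis=0)
        D = Xs - ms
        Sw += D.T @ D / len(Xs)        # per-speaker scatter, as in Eq. (3)
        Sb += np.outer(ms, ms)         # between-speaker term, as in Eq. (2)
    Sw /= len(speakers)
    Sb /= len(speakers)
    F = np.linalg.solve(Sw, Sb)        # F = Sigma_W^{-1} Sigma_B
    return np.trace(F), np.linalg.eigvals(F)   # J and the spectrum of Fig. 1
```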
Analysis of the DAE system performance
Effect of normalization:

$$J = \mathrm{Tr}\!\left(\Sigma_W^{-1} \Sigma_B\right) = \mathrm{Tr}(F)$$

Figure 2: Eigenvalues of the matrix $F$.

Whitening & length normalization (LN) were applied to the outputs of the RBM and DAE (a sketch of this step follows below)!

Table 4: NIST SRE 2010 test, cosine scoring
| System   | EER(%) | minDCF | J      |
| Baseline | 5.34   | 0.603  | 501.45 |
| RBM      | 4.96   | 0.565  | 525.35 |
| DAE      | 4.95   | 0.558  | 533.37 |
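A sketch of the whitening + LN step; the choice of whitener (inverse Cholesky factor of the training covariance) is one common convention and an assumption here:

```python
import numpy as np

def whitening_params(X_train):
    """Estimate the whitening mean and matrix on a training set."""
    mu = X_train.mean(axis=0)
    C = np.cov(X_train - mu, rowvar=False)
    A = np.linalg.inv(np.linalg.cholesky(C))   # one common choice of whitener
    return mu, A

def whiten_ln(X, mu, A):
    """Apply whitening, then project onto the unit sphere (LN)."""
    Xw = (X - mu) @ A.T
    return Xw / np.linalg.norm(Xw, axis=1, keepdims=True)
```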
Analysis of the DAE system performance
Effect of replacing whitening parameters:
Whitening & LN were applied to the outputs of the RBM and DAE; the whitening parameters of the DAE system are replaced by the RBM ones.

Figure 3: Eigenvalues of the matrix $F$.

Table 5: NIST SRE 2010 test, cosine scoring
| System   | EER(%) | minDCF | J      |
| Baseline | 5.34   | 0.603  | 501.45 |
| RBM      | 4.96   | 0.565  | 525.35 |
| DAE      | 2.83   | 0.393  | 537.32 |
Analysis of the DAE system performance
Effect of replacing whitening parameters:

Table 6: NIST SRE 2010 test, cosine scoring
| System   | Whitening | EER(%) | minDCF | J      |
| Baseline | raw       | 5.34   | 0.603  | 501.45 |
| RBM      | no        | 5.27   | 0.611  | 525.65 |
| DAE      | no        | 3.19   | 0.427  | 537.76 |
| RBM      | RBM       | 4.96   | 0.565  | 525.35 |
| DAE      | DAE       | 4.95   | 0.558  | 533.37 |
| DAE      | RBM       | 2.83   | 0.393  | 537.32 |

“no” means no normalization was applied; otherwise the entry names the outputs on which the whitening parameters (matrix and mean) were estimated.
Analysis of the DAE system performance
Effect of replacing back-end parameters (a sketch of this parameter swapping follows below):

Table 7: Performance comparison for different configurations of the DAE system, NIST SRE 2010 test
| System   | PLDA {Q, P} | Whitening (matrix/mean) | EER(%) | minDCF |
| Baseline | raw         | raw/raw                 | 1.67   | 0.347  |
| RBM      | RBM         | RBM/RBM                 | 1.55   | 0.332  |
| DAE      | DAE         | DAE/DAE                 | 1.58   | 0.336  |
| DAE      | DAE         | DAE/RBM                 | 1.55   | 0.338  |
| DAE      | RBM         | DAE/DAE                 | 1.56   | 0.330  |
| DAE      | DAE         | RBM/DAE                 | 1.43   | 0.291  |
| DAE      | DAE         | RBM/RBM                 | 1.44   | 0.287  |
| DAE      | RBM         | RBM/RBM                 | 1.43   | 0.284  |

Each entry names the system whose outputs were used to estimate the corresponding back-end parameters.
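A sketch of the swapping idea behind Table 7: back-end parameters are estimated on RBM outputs but applied to DAE outputs. The arrays `x_rbm`/`x_dae` are hypothetical stand-ins for the transformed training i-vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
x_rbm = rng.standard_normal((2000, 400))   # stand-in for RBM outputs
x_dae = rng.standard_normal((2000, 400))   # stand-in for DAE outputs

def whitening_params(X):
    mu = X.mean(axis=0)
    A = np.linalg.inv(np.linalg.cholesky(np.cov(X - mu, rowvar=False)))
    return mu, A

mu, A = whitening_params(x_rbm)            # estimated on the RBM outputs...
x = (x_dae - mu) @ A.T                     # ...applied to the DAE outputs
x /= np.linalg.norm(x, axis=1, keepdims=True)
# The PLDA parameters {Q, P} are swapped in the same way: estimate the
# two-covariance model on the normalized RBM outputs and use it to score
# the normalized DAE outputs.
```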
An improved DAE system
Dropout regularization
Dropout for RBM training (a sketch follows below):

Table 8: Effect of dropout for RBM training (the RBM is used to initialize the DAE), NIST SRE 2010 test
| System      | EER(%) | minDCF |
| DAE         | 1.43   | 0.284  |
| DAE+dropout | 1.41   | 0.270  |

Applying dropout at the stage of discriminative fine-tuning was not helpful!
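A sketch of what dropout during RBM pretraining could look like: one CD-1 update with a Bernoulli mask on the hidden units. The slides give no RBM details, so the Gaussian-Bernoulli form, learning rate, and dropout rate are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_dropout_step(v0, W, b_v, b_h, lr=1e-3, p_drop=0.5):
    """One CD-1 update for a Gaussian-Bernoulli RBM with hidden-unit dropout."""
    mask = (rng.random(b_h.shape) >= p_drop).astype(v0.dtype)  # drop hidden units
    h0 = sigmoid(v0 @ W + b_h) * mask
    h0_s = (rng.random(h0.shape) < h0).astype(v0.dtype)        # sample hiddens
    v1 = h0_s @ W.T + b_v                  # Gaussian visibles: use the mean
    h1 = sigmoid(v1 @ W + b_h) * mask
    n = len(v0)
    W += lr * (v0.T @ h0 - v1.T @ h1) / n  # contrastive-divergence gradient
    b_v += lr * (v0 - v1).mean(axis=0)
    b_h += lr * (h0 - h1).mean(axis=0)
```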
An improved DAE system
Deep denoising autoencoders: stacking RBMs

Table 9: NIST SRE 2010 test, PLDA scoring
| System   | EER(%) | minDCF |
| Baseline | 1.67   | 0.347  |
| DAE      | 1.43   | 0.284  |
| DAE 5    | 1.43   | 0.297  |
An improved DAE system
Deep denoising autoencoders: stacking DAEs (a sketch of the stacking follows below)

Table 10: NIST SRE 2010 test, PLDA scoring
| System   | EER(%) | minDCF |
| Baseline | 1.67   | 0.347  |
| RBM 1    | 1.55   | 0.332  |
| DAE 1    | 1.43   | 0.284  |
| RBM 2    | 1.58   | 0.329  |
| DAE 2    | 1.30   | 0.282  |
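A sketch of the stacking suggested by the RBM 1 / DAE 1 / RBM 2 / DAE 2 naming: the second denoising block is trained on the outputs of the first, frozen one. The exact recipe (layer sizes, whether DAE 1 stays frozen) is an assumption:

```python
import torch
import torch.nn as nn

dim, hidden = 400, 1000

def dae_block():
    return nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid(), nn.Linear(hidden, dim))

dae1, dae2 = dae_block(), dae_block()   # DAE 1 and DAE 2, each RBM-initialized

# Stage 1: train dae1 on (i(s,h) -> i(s)) pairs as before.
# Stage 2: freeze dae1 and train dae2 on (dae1(i(s,h)) -> i(s)) pairs.
for p in dae1.parameters():
    p.requires_grad_(False)

def denoise(x):
    """Test-time transform of the stacked system."""
    return dae2(dae1(x))
```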
DAE system in the domain mismatch scenario
Domain Adaptation Challenge
DAC setup:
• GMM-UBM based i-vector extractor (600-dimensional i-vectors)
• in-domain SRE set (SRE 04, 05, 06 and 08)
• out-of-domain Switchboard set
Evaluation data:
• the NIST 2010 SRE protocol (extended condition 5, males, English language)
Operating points:
• equal error rate (EER)
• minimum detection cost function (minDCF 2010)
Back-Ends
The results are presented for the following scoring types:
• cosine scoring
• two-covariance model (referred to as PLDA)
• simplified PLDA with a 400-dimensional speaker subspace (referred to as SPLDA)
In our experiments we ignore the labels of the in-domain data; the in-domain SRE set is used only to estimate the whitening parameters of our systems.
Results

Table 11: Performance summary of speaker verification systems with PLDA and cosine back-ends
| System   | Whitening/Training | Cos EER(%) | Cos minDCF | PLDA EER(%) | PLDA minDCF |
| Baseline | SRE/SRE | 5.45 | 0.621 | 2.18 | 0.360 |
| RBM      | SRE/SRE | 5.47 | 0.634 | 2.16 | 0.348 |
| DAE      | SRE/SRE | 3.67 | 0.467 | 1.67 | 0.307 |
| Baseline | SWB/SWB | 9.13 | 0.788 | 6.45 | 0.660 |
| RBM      | SWB/SWB | 8.97 | 0.778 | 6.28 | 0.667 |
| DAE      | SWB/SWB | 8.97 | 0.764 | 6.01 | 0.644 |
| Baseline | SRE/SWB | 5.45 | 0.621 | 4.23 | 0.554 |
| RBM      | SRE/SWB | 5.35 | 0.631 | 2.97 | 0.447 |
| DAE      | SRE/SWB | 4.62 | 0.560 | 2.63 | 0.401 |

“Whitening/Training” names the datasets used to estimate the whitening parameters and to train the system, respectively.
Results

Table 12: Performance summary of speaker verification systems with SPLDA
| System   | Whitening/Training | EER(%) | minDCF |
| Baseline | SRE/SRE | 2.23 | 0.312 |
| RBM      | SRE/SRE | 2.07 | 0.317 |
| DAE      | SRE/SRE | 1.61 | 0.292 |
| Baseline | SRE/SWB | 4.21 | 0.531 |
| RBM      | SRE/SWB | 2.66 | 0.410 |
| DAE      | SRE/SWB | 2.36 | 0.400 |
CONCLUSIONS
• A study of denoising autoencoders in the i-vector space was presented.
• We found that the observed performance gain of the DAE based system is due to employing back-end parameters (whitening & PLDA) derived from the RBM outputs.
• The question of why the RBM transform provides better back-end parameters for a test set is still open.
• Dropout helps when applied at the RBM training stage and does not help when applied at the fine-tuning stage.
• A deep architecture in the form of stacked DAEs provides further improvements.
• All our findings regarding speaker verification systems in matched conditions hold true in the mismatched conditions case.
• Using whitening parameters from the target domain along with a DAE trained on the out-of-domain set avoids the significant performance gap caused by domain mismatch.