On the Use of PLDA i-Vector Scoring for Clustering Short Segments

  1. On the Use of PLDA i-Vector Scoring for Clustering Short Segments
     Itay Salmun (itaysa@afeka.ac.il), Irit Opher (irito@afeka.ac.il), Itshak Lapidot (itshakl@afeka.ac.il)

  2. Outline
     • Briefly about DNNs
     • Motivation
     • Problem Definition
     • Basic Mean-Shift Algorithm
     • Modified Mean-Shift Algorithm
     • Speaker Clustering System
     • Experiments and Results
     • Summary
     June 2016

  3. Briefly about DNNs

  4. Motivation Bli… Blo… Blu… Pli… Pla… Tra… Tra Ta Ta Gla… Ta Ta… Dla…

  5. Motivation (cont.) Bli… Gla… Blu… Pla… Tra… Tra Ta Ta Ta Ta… Blo… Pli… Dla…

  6. Motivation (cont.) Bli… Gla… Blu… Pla… Tra… Tra Ta Ta Ta Ta… Blo… Pli… Dla…

  7. Problem Definition
     • Given many short speech segments, cluster them into homogeneous groups such that:
       – Each cluster is occupied mostly by a single speaker (cluster purity).
       – Each speaker belongs mostly to a single cluster (speaker purity).

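  8. Basic Mean-Shift Algorithm
     • Objective: find the densest region.
       (Figure labels: region of interest, center of mass, mean shift vector.)
     • The mean shift vector:
       $$ m_h(\varphi) = \frac{\sum_{i=1}^{n} \varphi_i \, g\left(\left\| \frac{\varphi - \varphi_i}{h} \right\|^2\right)}{\sum_{i=1}^{n} g\left(\left\| \frac{\varphi - \varphi_i}{h} \right\|^2\right)} - \varphi $$
     • The uniform kernel with bandwidth $h$ for Euclidean pairwise distances:
       $$ g(\varphi, \varphi_i, h) = \begin{cases} 1, & \|\varphi - \varphi_i\|^2 \le h^2 \\ 0, & \|\varphi - \varphi_i\|^2 > h^2 \end{cases} $$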
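A minimal Python sketch of the basic mean shift step on this slide, assuming NumPy and toy 2-D data. The function names `mean_shift_vector` and `seek_mode`, the tolerance, and the iteration cap are illustrative choices, not part of the presentation.

```python
import numpy as np

def mean_shift_vector(phi, points, h):
    """Mean shift vector at phi under the uniform kernel with bandwidth h."""
    sq_dist = np.sum((points - phi) ** 2, axis=1)
    inside = sq_dist <= h ** 2          # kernel is 1 inside the window, 0 outside
    if not inside.any():                # empty window: nothing to shift towards
        return np.zeros_like(phi)
    return points[inside].mean(axis=0) - phi

def seek_mode(phi, points, h, tol=1e-6, max_iter=100):
    """Follow the mean shift vector until the shift is negligible."""
    phi = np.asarray(phi, dtype=float)
    for _ in range(max_iter):
        shift = mean_shift_vector(phi, points, h)
        phi = phi + shift
        if np.linalg.norm(shift) < tol:
            break
    return phi

# Toy data (hypothetical): three nearby points and one far away.
points = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [5.0, 5.0]])
mode = seek_mode(points[0], points, h=1.0)
```

With the uniform kernel the weighted average reduces to the plain mean of the points inside the window, which is why the sketch calls `mean` directly.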

  9. Modified Mean-Shift Algorithm
     • The mean shift vector:
       $$ m_h(\varphi_i) = \frac{\sum_{l=1}^{k_i} \varphi_{il} \, g(\varphi_{il}, \varphi_i, h_i)}{\sum_{l=1}^{k_i} g(\varphi_{il}, \varphi_i, h_i)} - \varphi_i $$
     • The adaptive bandwidth parameter $h_i$ is calculated using the K-nearest-neighbor algorithm. If $\varphi_{iK}$ is the $K$-th nearest neighbor of $\varphi_i$, then the bandwidth is calculated as:
       $$ h_i = s(\varphi_i, \varphi_{iK}) $$
     • Here $s(\varphi_i, \varphi_{ik})$ is the two-covariance score.
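The adaptive bandwidth can be sketched as follows, assuming a precomputed symmetric score matrix in which a higher score means a closer pair. The function name `adaptive_bandwidths` and the choice to exclude a point from its own neighbor list are assumptions; the toy scores are not real two-covariance scores.

```python
import numpy as np

def adaptive_bandwidths(scores, K):
    """Return h_i = score between point i and its K-th nearest neighbor.

    `scores` is an (n, n) symmetric similarity matrix (higher = closer).
    Assumption: a point does not count as its own neighbor.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    if not 1 <= K < n:
        raise ValueError("K must satisfy 1 <= K < n")
    masked = scores.copy()
    np.fill_diagonal(masked, -np.inf)      # drop self-scores from the ranking
    ranked = -np.sort(-masked, axis=1)     # each row in descending score order
    return ranked[:, K - 1]

# Toy similarity matrix (hypothetical values).
toy_scores = np.array([[1.0, 0.9, 0.2],
                       [0.9, 1.0, 0.5],
                       [0.2, 0.5, 1.0]])
h = adaptive_bandwidths(toy_scores, K=2)
```

Because the bandwidth is a score rather than a distance, a smaller K gives a higher threshold and therefore a tighter neighborhood.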

  10. Modified Mean-Shift Algorithm (cont.)
     • We select the subset $S_h(\varphi_i)$ of data points whose PLDA pairwise score with $\varphi_i$ is greater than or equal to the adaptive bandwidth $h_i$:
       $$ S_h(\varphi_i) = \{ \varphi_{il} : s(\varphi_i, \varphi_{il}) \ge h_i \} $$
     • We use a score-weighted mean shift kernel:
       $$ g(\varphi_{il}, \varphi_i, h_i) = \begin{cases} s(\varphi_i, \varphi_{il}), & s(\varphi_i, \varphi_{il}) \ge h_i \\ 0, & s(\varphi_i, \varphi_{il}) < h_i \end{cases} $$
       where
       $$ s(\varphi_1, \varphi_2) = \log \frac{p(\varphi_1, \varphi_2 \mid H_s)}{p(\varphi_1, \varphi_2 \mid H_d)} $$
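A hedged sketch of one score-weighted shift, assuming the pairwise scores and per-point bandwidths are already available. The function name `weighted_mean_shift`, the toy score values, and the guard for a non-positive weight sum are assumptions; a real system would fill `scores` with PLDA log-likelihood ratios, which this sketch does not compute.

```python
import numpy as np

def weighted_mean_shift(i, points, scores, h):
    """Shift vector for point i with kernel g = s if s >= h_i, else 0."""
    s = scores[i]
    weights = np.where(s >= h[i], s, 0.0)   # keep only the selected subset
    total = weights.sum()
    if total <= 0:                          # assumption: skip degenerate weights
        return np.zeros(points.shape[1])
    return weights @ points / total - points[i]

# Toy inputs (hypothetical): two close points and one distant point.
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
scores = np.array([[3.0, 2.0, -5.0],
                   [2.0, 3.0, -4.0],
                   [-5.0, -4.0, 3.0]])
h = np.array([2.0, 2.0, 3.0])
shift0 = weighted_mean_shift(0, points, scores, h)
```

Log-likelihood ratios can be negative, so the sketch only behaves as a weighted average when the selected scores are positive; how the presented system handles that case is not stated on the slide.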

  11. Speaker Clustering System
     Block diagram: “i-vectors” → Mean Shift algorithm (PLDA score*) → Clustering results
     * In previous work:
       I. a fixed threshold h
       II. a cosine distance instead of PLDA
       III. random mean shift

  12. Speaker Clustering System
     Before clustering:
     • Train the UBM and the TV matrix.
     • Train the PCA matrix $T$ and the whitening transformation matrix $C$.
     • Calculate the low-rank i-vectors:
       $$ \phi = \frac{C T \varphi}{\| C T \varphi \|} $$
     • Using the low-rank i-vectors, train the two-covariance model parameters.
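The PCA and whitening step can be sketched as below, assuming NumPy and random stand-in data in place of real i-vectors. The mean subtraction, the final length normalization, and the names `train_pca_whitening` and `project` are assumptions made for illustration; the slide only names the matrices T and C.

```python
import numpy as np

def train_pca_whitening(ivectors, rank):
    """Estimate a PCA matrix T (rank x dim) and a whitening matrix C (rank x rank)."""
    mean = ivectors.mean(axis=0)
    centered = ivectors - mean
    cov = centered.T @ centered / (len(ivectors) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:rank]         # keep the top `rank` directions
    T = eigvecs[:, order].T
    C = np.diag(1.0 / np.sqrt(eigvals[order]))       # unit variance in the PCA basis
    return mean, T, C

def project(ivectors, mean, T, C):
    """Map i-vectors to low-rank, whitened, length-normalized vectors."""
    low = (ivectors - mean) @ (C @ T).T
    return low / np.linalg.norm(low, axis=1, keepdims=True)

# Stand-in data (hypothetical): 200 random 10-dimensional vectors.
rng = np.random.default_rng(0)
fake_ivectors = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
mean, T, C = train_pca_whitening(fake_ivectors, rank=4)
low_rank = project(fake_ivectors, mean, T, C)
```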

  13. Speaker Clustering System
     Given a set of speech segments, cluster them according to the following steps:
     1. For each speech segment, extract the i-vectors $\{\varphi_i\}$.
     2. Calculate the low-rank i-vectors $\{\phi_i\}$.
     3. Apply two-covariance score mean-shift.
     4. Merge all shifted points according to Euclidean distance with a fixed threshold.
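Step 4 can be sketched as a simple greedy merge, assuming the shifted points are rows of a NumPy array. The slide does not say how the merge is carried out, so the greedy first-match rule and the name `merge_shifted_points` are assumptions.

```python
import numpy as np

def merge_shifted_points(shifted, threshold):
    """Assign each shifted point to the first cluster center within `threshold`.

    Assumption: the first point that opens a cluster serves as its center.
    """
    labels = np.empty(len(shifted), dtype=int)
    centers = []
    for idx, point in enumerate(shifted):
        for c, center in enumerate(centers):
            if np.linalg.norm(point - center) <= threshold:
                labels[idx] = c
                break
        else:                                # no center close enough: open a new cluster
            centers.append(point)
            labels[idx] = len(centers) - 1
    return labels

# Toy shifted points (hypothetical): two tight groups.
shifted = np.array([[0.0, 0.0], [0.01, 0.0], [5.0, 5.0], [5.0, 5.02]])
labels = merge_shifted_points(shifted, threshold=0.1)
```

The number of distinct labels is the number of detected speakers for that run.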

  14. Experiments and Results
     Experimental setup: experiments on telephone conversations
     • NIST-2008 is cut into segments according to a given statistic.
     • Average segment length: 2.5 sec
     • Average number of segments per speaker: 33
     Clustering evaluation:
     1. Average Speaker Purity (ASP).
     2. Average Cluster Purity (ACP).
     3. K: $K = \sqrt{acp \cdot asp}$.
     4. Average Number of Detected Speakers (ANDS).
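A small sketch of the evaluation measures, assuming a common purity definition in which the purity of cluster i is the sum over speakers j of n_ij² / n_i.² (and symmetrically for speakers), and K is the geometric mean of ACP and ASP. The slides do not spell out the purity formulas, so that definition and the name `purity_scores` are assumptions.

```python
import numpy as np

def purity_scores(cluster_ids, speaker_ids):
    """Return (acp, asp, K) from per-segment cluster and speaker labels."""
    _, ci = np.unique(np.asarray(cluster_ids), return_inverse=True)
    _, si = np.unique(np.asarray(speaker_ids), return_inverse=True)
    counts = np.zeros((ci.max() + 1, si.max() + 1))
    np.add.at(counts, (ci, si), 1)                    # n_ij: segments of speaker j in cluster i
    total = counts.sum()
    per_cluster = counts.sum(axis=1)
    per_speaker = counts.sum(axis=0)
    # Size-weighted averages of the per-cluster and per-speaker purities.
    acp = ((counts ** 2).sum(axis=1) / per_cluster).sum() / total
    asp = ((counts ** 2).sum(axis=0) / per_speaker).sum() / total
    return acp, asp, np.sqrt(acp * asp)

# Toy labels (hypothetical): speaker "a" is split across two clusters.
acp, asp, k = purity_scores([0, 0, 1, 1], ["a", "a", "a", "b"])
```

Splitting a speaker across clusters lowers ASP, while mixing speakers inside a cluster lowers ACP, so K penalizes both kinds of error.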

  15. Experiments and Results (cont.)
     Bandwidth parameter h (for 30 speakers)
     Cosine-based random mean shift clustering: adaptive threshold using kNN vs. a fixed threshold

  16. Experiments and Results (cont.)
     Mean shift point-selection configuration (for 30 speakers)
     Cosine-based mean shift clustering with adaptive threshold: full mean shift vs. random mean shift

  17. Experiments and Results (cont.)
     PLDA-based mean shift (for 30 speakers)
     Clustering with adaptive threshold: PLDA-based mean shift vs. cosine-based mean shift

  18. Experiments and Results (cont.)
     PLDA training (for 30 speakers)
     PLDA-based mean shift: PLDA model trained on short segments vs. PLDA model trained on long segments

  19. Experiments and Results (cont.)
     Summary of mean shift configurations (for 30 speakers)
     Comparing the K value of the mean shift configurations

  20. Experiments and Results (cont.)
     Summary of mean shift configurations (for 30 speakers)
     Comparing the average number of detected speakers (ANDS) of the mean shift configurations

  21. Experiments and Results (cont.)
     Influence of the Population Size (Baseline System)
     Table 1: Results for different numbers of speakers for the cosine-based mean shift (baseline system)

     Number of Speakers   h      ACP    ASP    K      ANDS
     3                    0.35   92.2   80.1   85.7   6.1
     7                    0.40   89.5   71.6   79.9   21.1
     15                   0.45   77.6   63.3   70.0   60.6
     22                   0.50   85.0   57.6   69.9   136.6
     30                   0.50   81.7   53.2   65.9   195.0
     60                   0.55   84.6   44.3   61.2   614.1
     188                  0.55   68.4   42.8   54.1   1742.1

  22. Experiments and Results (cont.)
     Influence of the Population Size (Proposed System)
     Table 2: Results for different numbers of speakers for the PLDA-based mean shift (proposed system)

     Number of Speakers   k (kNN)   ACP    ASP    K      ANDS
     3                    19        90.0   71.3   79.8   5.0
     7                    17        84.8   67.5   75.5   11.2
     15                   15        86.6   63.6   74.1   26.9
     22                   15        86.6   65.3   75.1   36.4
     30                   17        80.8   64.3   72.1   46.6
     60                   17        73.8   61.1   67.2   90.0
     188                  17        61.4   53.1   57.1   283.0

  23. Experiments and Results (cont.)
     Baseline vs. New System
     Results for different numbers of speakers for the PLDA-based mean shift (proposed system), with the cosine-based baseline values in parentheses

     Number of Speakers   K             ANDS
     3                    79.8 (85.7)   5.0 (6.1)
     7                    75.5 (79.9)   11.2 (21.1)
     15                   74.1 (70.0)   26.9 (60.6)
     22                   75.1 (69.9)   36.4 (136.6)
     30                   72.1 (65.9)   46.6 (195.0)
     60                   67.2 (61.2)   90.0 (614.1)
     188                  57.1 (54.1)   283.0 (1742.1)
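One way to read the ANDS columns is as an overestimation factor, the detected speaker count divided by the true count. The short sketch below uses only the ANDS values reported on the results slides; the ratio itself and the name `overestimation` are illustrative and not a measure used in the presentation.

```python
# ANDS values copied from the results tables (proposed vs. baseline system).
speakers = [3, 7, 15, 22, 30, 60, 188]
ands_proposed = [5.0, 11.2, 26.9, 36.4, 46.6, 90.0, 283.0]
ands_baseline = [6.1, 21.1, 60.6, 136.6, 195.0, 614.1, 1742.1]

def overestimation(ands, true_counts):
    """Ratio of detected to true speaker counts (1.0 would be exact)."""
    return [a / n for a, n in zip(ands, true_counts)]

ratios_proposed = overestimation(ands_proposed, speakers)
ratios_baseline = overestimation(ands_baseline, speakers)

for n, p, b in zip(speakers, ratios_proposed, ratios_baseline):
    print(f"{n:>4} speakers: proposed x{p:.2f}, baseline x{b:.2f}")
```

For 30 speakers, for instance, the proposed system detects 46.6 / 30 ≈ 1.55 times the true number of speakers, against 195.0 / 30 = 6.5 for the baseline.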

  24. Summary
     • While the proposed system is more time-consuming, it outperforms the baseline system in the following aspects:
       1. It yields better results when clustering large numbers of speakers.
       2. It is more robust to changes in the number of speakers.
       3. (Almost) no bandwidth adjustment is needed.
       4. The average number of detected speakers is far more accurate.

  25. Thanks
