
First Investigations on Self Trained Speaker Diarization

Gaël Le Lan 1,2, Sylvain Meignier 2, Delphine Charlet 1, Anthony Larcher 2. 1 Orange Labs, France (first.lastname@orange.com); 2 LIUM, Université du Maine, France


  1. First Investigations on Self Trained Speaker Diarization
     - Gaël Le Lan 1,2, Sylvain Meignier 2, Delphine Charlet 1, Anthony Larcher 2
     - 1 Orange Labs, France (first.lastname@orange.com)
     - 2 LIUM, Université du Maine, France (first.lastname@lium.univ-lemans.fr)
     - June 22, 2016
     - Slide footer: Gaël Le Lan (Orange Labs/LIUM), Self Trained Speaker Diarization, June 22, 2016

  2. Context
     - Cross-recording speaker diarization of French TV archives
       - Speaker indexing of collections of multiple recordings
     - Two-pass approach
       - Speaker segmentation and clustering, within each recording
       - Cross-recording speaker linking
     - State-of-the-art speaker recognition framework
       - i-vector/PLDA
       - hierarchical agglomerative clustering
     - PLDA maximizes the inter-speaker variability while minimizing the intra-speaker variability.
     - Using the target data as training material, how well can we estimate this variability?
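To make the inter-speaker versus intra-speaker variability idea concrete, here is a minimal sketch that measures both quantities on labeled vectors. It is not PLDA: it only computes the within-speaker and between-speaker scatter (summed squared distances) that a PLDA model is meant to separate. The function name `scatter` and the toy 2-D vectors are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict


def scatter(vectors, labels):
    """Return (within, between) scatter for speaker-labeled vectors.

    within  = sum of squared distances of each vector to its speaker mean
    between = sum over speakers of n_speaker * squared distance between
              the speaker mean and the global mean
    """
    by_speaker = defaultdict(list)
    for vec, spk in zip(vectors, labels):
        by_speaker[spk].append(vec)

    def mean(vs):
        return [sum(col) / len(vs) for col in zip(*vs)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    global_mean = mean(vectors)
    within = 0.0
    between = 0.0
    for vs in by_speaker.values():
        spk_mean = mean(vs)
        within += sum(sqdist(v, spk_mean) for v in vs)
        between += len(vs) * sqdist(spk_mean, global_mean)
    return within, between


# Toy data (hypothetical): two speakers, two sessions each.
vectors = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = ["a", "a", "b", "b"]
print(scatter(vectors, labels))  # (4.0, 100.0)
```

A large between-speaker value relative to the within-speaker value means the speakers are easy to tell apart in this space.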

  3. State-of-the-Art two-pass Diarization Framework (baseline)
     [Diagram, built up step by step over successive views of the same slide. Labels in the final view:]
     - Inputs: target data (unlabeled) and train data (labeled by speaker), with an acoustic mismatch between them
     - Target data, for each recording: frontend, speaker segmentation, i-vector extraction, similarity scoring, speaker clustering
     - Cross-recording: similarity scoring (speaker clusters), speaker linking, diarization output
     - Train data (baseline - supervised): frontend, i-vector extraction
     - Models: Universal Background Model, Total Variability Matrix, PLDA parameters
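The two-pass structure of the baseline diagram (clustering within each recording, then linking clusters across recordings) can be sketched as follows. This is a toy illustration under stated assumptions: cosine similarity between cluster centroids stands in for PLDA scoring, plain lists stand in for i-vectors, and the names `hac`, `two_pass` and the threshold value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; a stand-in for PLDA scoring in this sketch."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean(vs):
    return [sum(col) / len(vs) for col in zip(*vs)]


def hac(vectors, threshold):
    """Agglomerative clustering: repeatedly merge the most similar pair of
    clusters (compared by centroid) while their similarity exceeds threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = cosine(mean([vectors[k] for k in clusters[i]]),
                               mean([vectors[k] for k in clusters[j]]))
                if score > best:
                    best, pair = score, (i, j)
        if pair is None:
            break
        i, j = pair
        clusters[i] += clusters.pop(j)
    return clusters


def two_pass(recordings, threshold=0.9):
    """recordings: dict mapping a recording id to its list of segment vectors."""
    # Pass 1: cluster segments within each recording.
    local = []  # (recording id, segment indices, cluster centroid)
    for rec_id, vectors in recordings.items():
        for members in hac(vectors, threshold):
            local.append((rec_id, members, mean([vectors[k] for k in members])))
    # Pass 2: link the per-recording clusters across recordings.
    links = hac([centroid for _, _, centroid in local], threshold)
    return [[(local[k][0], local[k][1]) for k in group] for group in links]


recordings = {
    "ep1": [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]],
    "ep2": [[0.98, 0.1], [0.05, 1.0]],
}
for speaker in two_pass(recordings):
    print(speaker)
```

Each printed group lists, per recording, the segment indices attributed to one cross-recording speaker.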

  16. "Self Trained" Framework
      [Diagram: same pipeline as the baseline framework, with one added label, "self trained - unsup.", next to "baseline - supervised".]

  17. Adapted Framework
      [Diagram: same pipeline as the baseline framework, now carrying three labels: "baseline - supervised", "self trained - unsup." and "adapted - unsup.".]

  18. "Self Trained" Diarization? (1/2)
      - Goal: avoid acoustic mismatch by using the target data as training material
      - Requirements to train an i-vector/PLDA system
        - UBM/TV: clean speech segments, straightforward
        - PLDA: several sessions per speaker, in various acoustic conditions
      - Are there several speakers appearing in different episodes?
      - Assuming we know how to effectively cluster the target data, can we train a system with those clusters?
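The PLDA requirement above (several sessions per speaker) suggests a simple filter when clusters are used in place of speaker labels: keep only clusters observed in enough distinct episodes. The sketch below assumes clustering output is available as `(cluster_id, episode_id)` pairs; the function name `select_training_classes` and the `min_episodes` default are hypothetical, not taken from the slides.

```python
from collections import defaultdict


def select_training_classes(assignments, min_episodes=2):
    """Keep clusters that appear in at least `min_episodes` distinct episodes.

    assignments: iterable of (cluster_id, episode_id) pairs, one per segment.
    Returns a dict mapping each kept cluster to its sorted list of episodes.
    """
    episodes = defaultdict(set)
    for cluster_id, episode_id in assignments:
        episodes[cluster_id].add(episode_id)
    return {c: sorted(e) for c, e in episodes.items() if len(e) >= min_episodes}


# Toy clustering output (hypothetical identifiers).
assignments = [
    ("spk_a", "ep1"), ("spk_a", "ep2"), ("spk_a", "ep2"),
    ("spk_b", "ep1"),
    ("spk_c", "ep3"), ("spk_c", "ep1"),
]
print(select_training_classes(assignments))
# {'spk_a': ['ep1', 'ep2'], 'spk_c': ['ep1', 'ep3']}
```

Clusters seen in a single episode (here `spk_b`) carry no cross-session information and are dropped.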

  19. Which Data?
      - 200 hours of French broadcast news (drawn from the REPERE, ETAPE and ESTER evaluation campaigns)
      - Two shows selected as target corpora: LCP Info and BFM Story
      - Train corpus: all other recordings

      | Corpus | LCP target | BFM target |
      |---|---|---|
      | Number of episodes | 45 | 42 |
      | Episode duration | 25m | 60m |
      | Evaluated (labeled) speech duration | 10h08m | 19h57m |
      | One-time speakers | 127 | 345 |
      | Recurring speakers (2+ occurrences) | 93 | 77 |
      | Recurring speakers (3+ occurrences) | 48 | 35 |
      | Total speakers | 220 | 422 |
      | One-time speakers, speech proportion | 20.12% | 44.84% |
      | Recurring speakers (2+ occurrences), speech proportion | 79.88% | 55.16% |
      | Recurring speakers (3+ occurrences), speech proportion | 67.06% | 45.94% |
      | Average speaker time per episode | 1m08s | 1m58s |

      Table: Composition of target corpora.
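As a quick consistency check on the corpus table, the sketch below verifies that one-time plus recurring (2+) speakers equals the total, that the two speech proportions add up to 100%, and derives the share of speakers who are recurring. The figures are copied from the table; the dictionary keys and the function name `summarize` are illustrative choices.

```python
# Figures copied from the "Composition of target corpora" table.
corpora = {
    "LCP": {"one_time": 127, "recurring_2plus": 93, "total": 220,
            "one_time_speech_pct": 20.12, "recurring_speech_pct": 79.88},
    "BFM": {"one_time": 345, "recurring_2plus": 77, "total": 422,
            "one_time_speech_pct": 44.84, "recurring_speech_pct": 55.16},
}


def summarize(row):
    """Return (speaker counts add up, speech shares add up, % recurring speakers)."""
    speakers_ok = row["one_time"] + row["recurring_2plus"] == row["total"]
    speech_ok = abs(row["one_time_speech_pct"] + row["recurring_speech_pct"] - 100.0) < 0.01
    recurring_share = 100.0 * row["recurring_2plus"] / row["total"]
    return speakers_ok, speech_ok, round(recurring_share, 1)


for name, row in corpora.items():
    print(name, summarize(row))
# LCP (True, True, 42.3)
# BFM (True, True, 18.2)
```

Recurring speakers make up 42.3% of LCP speakers but 79.88% of its speech; for BFM they are 18.2% of speakers and 55.16% of speech, so the BFM corpus offers fewer recurring speakers and less speech from them for PLDA estimation.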

  20. Oracle Framework
      [Diagram, shown in two successive views: the baseline pipeline plus a second "target data" input with labels, and an added label "oracle - supervised" next to "baseline - supervised".]
      - Values displayed under the headings "LCP target" and "BFM target": 10.87 and X (from the first view), 17.72 and 13.22 (added in the second view)
      - The metric and the system behind each pair of values are not legible in this transcript

  22. Minimum Requirements for PLDA Parameters Estimation
      - Oracle experiment
        - For the LCP target corpus, suitable PLDA parameters can be estimated with a minimum of 37 episodes
          - 40 recurring speakers, appearing in 7.2 episodes on average
        - For the BFM target corpus, the EM algorithm does not converge
          - all episodes: 35 recurring speakers, appearing in 5.45 episodes
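The statistics quoted in the oracle experiment (number of recurring speakers and the average number of episodes they appear in, for a given number of episodes) can be computed from speaker labels as sketched below. The default `min_occurrences=3` is an assumption: it is chosen because the 35 recurring speakers quoted for BFM match the "3+ occurrences" row of the corpus table, but the slide does not state the threshold. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def recurrence_stats(appearances, episode_order, n_episodes, min_occurrences=3):
    """Count recurring speakers within the first `n_episodes` episodes.

    appearances: iterable of (speaker, episode) pairs from reference labels.
    episode_order: list of episode ids, in the order they are added.
    Returns (number of recurring speakers, average episodes per recurring speaker).
    """
    kept = set(episode_order[:n_episodes])
    seen = defaultdict(set)
    for speaker, episode in appearances:
        if episode in kept:
            seen[speaker].add(episode)
    counts = [len(eps) for eps in seen.values() if len(eps) >= min_occurrences]
    average = sum(counts) / len(counts) if counts else 0.0
    return len(counts), average


# Toy labels (hypothetical): speaker c never reaches three episodes.
appearances = [
    ("a", "e1"), ("a", "e2"), ("a", "e3"), ("a", "e4"),
    ("b", "e1"), ("b", "e2"), ("b", "e3"),
    ("c", "e1"), ("c", "e4"),
]
order = ["e1", "e2", "e3", "e4"]
for n in (2, 3, 4):
    print(n, recurrence_stats(appearances, order, n))
# 2 (0, 0.0)
# 3 (2, 3.0)
# 4 (2, 3.5)
```

Sweeping `n_episodes` in this way shows how many recurring speakers become available as episodes are added, which is the kind of quantity the minimum-requirements slide reports.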
