

  1. Soft VAD in Factor Analysis Based Speaker Segmentation of Broadcast News Brecht Desplanques, Kris Demuynck & Jean-Pierre Martens ELIS Data Science Lab – Ghent University/iMinds 21 June 2016

  2. Speaker Diarization for Automatic Subtitle Generation
  • VRT STON project: subtitling of TV shows by means of speech technology
  • Subtitle generation is a time-consuming process which can be (partially) automated
  [Diagram: diarization labels successive segments as Speaker 1 / Speaker 2]
  Why solve the "who spoke when?" problem?
  • Subtitles with color codes
  • Enables the use of speaker-adapted models for speech recognition (SR)
  • Extra information for the SR language model through detected sentence boundaries

  3. STON platform http://www.esat.kuleuven.be/psi/spraak/demo/STON/

  4. Main Approach to Diarization
  STON subtitling workflow: audio signal → speech/nonspeech segmentation → speaker diarization → language detection → speech recognition → post-processing
  Two-step diarization process:
  1. Speaker segmentation, i.e. speaker change point detection
  2. Speaker clustering
  Focus on more accurate speaker segmentation:
  • Too-short segments do not provide enough data for reliable speaker models
  • Non-homogeneous segments result in error propagation
  • Oversegmentation makes clustering a lot slower

  5. Speaker Diarization Architecture
  1st pass: speaker segmentation through generic eigenvoices → speech segments → speaker clustering → speaker clusters
  Retrain the UBM on the speech segments; retrain the eigenvoices on the speaker clusters
  2nd pass: speaker segmentation through the file-specific eigenvoices → speech segments → speaker clustering → speaker clusters

  6. Speaker Segmentation: Boundary Generation
  Speaker segmentation = boundary generation followed by boundary elimination
  1. Boundary generation: creation of candidate speaker change points
  [Diagram: left (L) and right (R) comparison windows sliding over the audio]
  • Two hypotheses: different or identical speakers in the left and right fixed-length sliding windows
  • Search for maximal dissimilarity by comparing the distributions of the acoustic features (MFCCs)
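The boundary-generation search can be sketched as follows. The window length, hop size, and the diagonal-Gaussian symmetric-KL dissimilarity are illustrative choices for the sketch, not the exact configuration used in the presented system.

```python
import numpy as np

def gaussian_divergence(left, right, eps=1e-6):
    """Symmetric KL divergence between diagonal Gaussians fitted to the
    feature frames (frames x dims) of two comparison windows."""
    mu_l, var_l = left.mean(0), left.var(0) + eps
    mu_r, var_r = right.mean(0), right.var(0) + eps
    d = 0.5 * ((var_l / var_r + var_r / var_l - 2.0)
               + (mu_l - mu_r) ** 2 * (1.0 / var_l + 1.0 / var_r))
    return float(d.sum())

def dissimilarity_curve(feats, win=100, hop=10):
    """Slide two adjacent fixed-length windows over the feature stream and
    score their dissimilarity at every hypothesised change point t."""
    return [(t, gaussian_divergence(feats[t - win:t], feats[t:t + win]))
            for t in range(win, len(feats) - win, hop)]
```

A genuine speaker change shows up as a maximum of this curve, since the two windows then cover acoustically different material.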

  7. Overlapping Comparison Windows
  Speech/nonspeech segmentation does not eliminate short pauses (<1 s) between speakers
  [Diagram: adjacent vs. overlapping L and R comparison windows around a pause]
  • Adjacent comparison windows maximize the dissimilarity at the pause boundaries
  • Overlapping windows maximize the dissimilarity at the center of the pause

  8. Speaker Segmentation via Speaker Factor Extraction
  Extract the speaker-specific information in each comparison window through factor analysis:
  • GMM-UBM speech model with 32 components and a low-dimensional speaker variability (eigenvoice) matrix V (R = 20)
  • Extract speaker factors y_t with a sliding-window (1 s) approach: m_t ≈ m_UBM + V y_t
  • Training data for the UBM and the eigenvoice model: HUB4 BN96 English broadcast news
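Under the model m_t ≈ m_UBM + V y_t, the speaker factors are usually obtained as the MAP point estimate given the Baum-Welch statistics of the window. A minimal sketch with a diagonal UBM covariance; the matrix shapes and synthetic sizes below are assumptions for illustration, not values from the slides:

```python
import numpy as np

def speaker_factors(N, F, m_ubm, sigma_inv, V):
    """MAP point estimate of the speaker factors y in m ~ m_UBM + V y.
    N: zero-order Baum-Welch counts expanded to supervector size,
    F: first-order statistics supervector,
    sigma_inv: diagonal of the inverse UBM covariance supervector,
    V: eigenvoice matrix (supervector dim x R)."""
    R = V.shape[1]
    # Posterior precision L = I + V^T Sigma^{-1} N V (diagonal weights).
    L = np.eye(R) + V.T @ (sigma_inv[:, None] * N[:, None] * V)
    # Center the first-order stats around the UBM means.
    rhs = V.T @ (sigma_inv * (F - N * m_ubm))
    return np.linalg.solve(L, rhs)
```

With plenty of frames the estimate approaches the true offset in the eigenvoice subspace; with the short 1 s windows of the slides, the prior term (the identity in L) shrinks the estimate toward zero.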

  9. Speaker Factor Distance Measures
  • Significant local changes in the speaker factors indicate a speaker change
  • The phonetic content has an impact on the speaker factors due to the short extraction windows
  • Estimate this intra-speaker variability on the test data
  • Model the intra-speaker variability of the left (L) and right (R) speaker with two covariance matrices Σ_L and Σ_R
  • Emphasize local changes that are not explained by intra-speaker variability with the Mahalanobis distance:
    d_MAH(t) = Δy_t^T Σ_L^{-1} Δy_t + Δy_t^T Σ_R^{-1} Δy_t with Δy_t = y_{t+τ} − y_{t−τ}
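The distance can be computed directly on the stream of speaker factors. In this sketch the left and right covariances are estimated from a fixed number of neighbouring factors; that context size is an assumption, not a value from the slides.

```python
import numpy as np

def d_mah(y, t, tau, ctx=20):
    """Mahalanobis distance d_MAH(t) = dy^T inv(S_L) dy + dy^T inv(S_R) dy
    with dy = y[t+tau] - y[t-tau]. The intra-speaker covariances S_L and
    S_R are estimated from `ctx` speaker factors on either side of t
    (ctx is an assumed context size)."""
    dy = y[t + tau] - y[t - tau]
    s_l = np.cov(y[t - ctx:t].T)   # left-speaker variability
    s_r = np.cov(y[t:t + ctx].T)   # right-speaker variability
    return float(dy @ np.linalg.inv(s_l) @ dy + dy @ np.linalg.inv(s_r) @ dy)
```

Because the covariances are estimated on the test data itself, phonetic jitter within one speaker is discounted, while a true change between two speakers remains a large outlier.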

  10. Generate Candidate Change Points
  Peak selection on d_MAH(t) = Δy_t^T Σ_L^{-1} Δy_t + Δy_t^T Σ_R^{-1} Δy_t:
  • Averaging filter to avoid the detection of spurious peaks
  • Select a number of maxima according to the length of the speech segment
  • Enforce a minimum duration of 1 s for each generated speaker turn
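The three steps above amount to a smooth-then-greedy-select procedure; the smoothing width below is an illustrative choice.

```python
import numpy as np

def pick_change_points(scores, n_peaks, min_gap, smooth=5):
    """Greedy peak selection on a dissimilarity curve: average-filter the
    scores, then repeatedly take the highest remaining index while
    enforcing a minimum gap (e.g. the number of frames in 1 s) between
    selected peaks."""
    kernel = np.ones(smooth) / smooth
    s = np.convolve(scores, kernel, mode="same")   # averaging filter
    peaks = []
    for idx in np.argsort(s)[::-1]:                # highest score first
        if len(peaks) == n_peaks:
            break
        if all(abs(int(idx) - p) >= min_gap for p in peaks):
            peaks.append(int(idx))
    return sorted(peaks)
```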

  11. Speaker Segmentation: Boundary Elimination
  2. Boundary elimination: eliminate false positives
  • Split the speech segment into speaker turns i, i+1, … defined by the candidate boundaries
  1st pass: ΔBIC agglomerative clustering of adjacent speaker turns
  • ΔBIC = (N_i + N_{i+1}) log|Σ_{i,i+1}| − N_i log|Σ_i| − N_{i+1} log|Σ_{i+1}| − λP
  2nd pass: CDS (cosine distance scoring) agglomerative clustering of adjacent speaker turns
  • d_CDS(i, i+1) = 1 − (y_i · y_{i+1}) / (‖y_i‖ ‖y_{i+1}‖)
  • The clustering threshold controls the number of eliminated boundaries
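The ΔBIC test on two adjacent turns can be sketched as below; the penalty term P for a full-covariance Gaussian of dimension d is taken as the usual (d + d(d+1)/2)/2 · log N, which is an assumption about the exact form used on the slide.

```python
import numpy as np

def delta_bic(turn_i, turn_j, lam=1.0):
    """ΔBIC between two adjacent speaker turns (frames x dims) modelled by
    full-covariance Gaussians:
    ΔBIC = (N_i+N_j) log|S_ij| - N_i log|S_i| - N_j log|S_j| - lam*P.
    Negative values mean the merged single-speaker model is preferred,
    so the candidate boundary between the turns is eliminated."""
    def logdet(x):
        return np.linalg.slogdet(np.cov(x, rowvar=False))[1]
    n_i, n_j = len(turn_i), len(turn_j)
    d = turn_i.shape[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n_i + n_j)
    merged = np.vstack([turn_i, turn_j])
    return float((n_i + n_j) * logdet(merged)
                 - n_i * logdet(turn_i) - n_j * logdet(turn_j)
                 - lam * penalty)
```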

  12. Speaker Segmentation: Baseline Results
  • COST278 broadcast news test set: 11(+1) languages, 30 hours, 4400 speaker turns; maximum recall: 90.6%
  Evaluation:
  • Mapping with a 500 ms margin
  • Recall: percentage of real boundaries mapped to computed ones
  • Precision: percentage of computed boundaries mapped to real ones
  • Baseline: popular ΔBIC boundary generation (with overlapping comparison windows)
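The evaluation protocol can be sketched as a one-to-one mapping between reference and computed boundaries within the 500 ms margin; the greedy nearest-match strategy below is one reasonable realisation, not necessarily the exact mapping algorithm used.

```python
def boundary_scores(ref, hyp, margin=0.5):
    """Map each reference boundary to at most one hypothesis boundary
    within +/-margin seconds and return (recall, precision)."""
    ref, hyp = sorted(ref), sorted(hyp)
    matched_ref, matched_hyp = set(), set()
    for i, r in enumerate(ref):
        best = None
        for j, h in enumerate(hyp):
            if j in matched_hyp or abs(h - r) > margin:
                continue
            if best is None or abs(h - r) < abs(hyp[best] - r):
                best = j
        if best is not None:
            matched_ref.add(i)
            matched_hyp.add(best)
    recall = len(matched_ref) / len(ref) if ref else 1.0
    precision = len(matched_hyp) / len(hyp) if hyp else 1.0
    return recall, precision
```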

  13. Two-Pass Adaptive Speaker Segmentation
  1. Cluster the speaker turns generated by our best system (d_MAH + ΔBIC)
  2. Retrain the UBM and the eigenvoice model on the speech and speaker clusters of the test file
  3. Repeat the boundary generation (d_MAH) and elimination (CDS) with the adapted models

  14. Soft VAD for Speaker Factor Extraction
  • Our speaker factor extraction does not differentiate between speech and nonspeech frames
  • Give speech frames more weight during the speaker factor extraction
  • Integrate a GMM-based soft Voice Activity Detection (VAD) into the estimation of the Baum-Welch statistics:
    p(S|o_t) = e^{log p(o_t|UBM_S)} / (e^{log p(o_t|UBM_S)} + e^{log p(o_t|UBM_NS)})
    N_m = Σ_t p(S|o_t) γ(UBM_{S,m}|o_t)
    F_m = Σ_t p(S|o_t) γ(UBM_{S,m}|o_t) o_t
  Modifications to the 2nd-pass adaptive system:
  • Also retrain the nonspeech UBM on the test file
  • Use the soft VAD speaker factor extraction during the CDS boundary elimination
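The speech posterior and the VAD-weighted statistics can be sketched directly from these formulas; the per-frame component posteriors γ are assumed to be given (e.g. from the speech UBM).

```python
import numpy as np
from math import exp

def soft_vad_posterior(ll_speech, ll_nonspeech):
    """Frame-level speech posterior from the log-likelihoods of the
    speech and nonspeech UBMs:
    p(S|o_t) = e^{llS} / (e^{llS} + e^{llNS})."""
    m = max(ll_speech, ll_nonspeech)            # for numerical stability
    es, en = exp(ll_speech - m), exp(ll_nonspeech - m)
    return es / (es + en)

def weighted_bw_stats(frames, post_speech, gammas):
    """Zero- and first-order Baum-Welch statistics with every frame
    weighted by its speech posterior:
    N_m = sum_t p(S|o_t) gamma_m(o_t),
    F_m = sum_t p(S|o_t) gamma_m(o_t) o_t."""
    w = post_speech[:, None] * gammas           # (frames, components)
    N = w.sum(0)                                # (components,)
    F = w.T @ frames                            # (components, dims)
    return N, F
```

Frames the soft VAD judges to be nonspeech thus contribute (almost) nothing to the speaker factor estimate, instead of dragging it toward the nonspeech model.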

  15. Soft VAD Speaker Segmentation

  16. Agglomerative Clustering
  • Initial ΔBIC clustering, followed by iVector PLDA clustering
  [Diagram: per-cluster supervectors m_c1, m_c2, m_c3 mapped to iVectors y_c1, y_c2, y_c3; candidate merge between Cluster 1 and Cluster 2]
  1. Extract an iVector y_c for each cluster (after VAD and feature warping): m_c = m_UBM + T y_c
  2. Hypothesis test with Gaussian PLDA: p(y_c1, y_c3 | H_same_speaker) / p(y_c1, y_c3 | H_different_speaker)
  3. Merge the cluster pair with the largest ratio
  4. Iterate the whole process until the maximum ratio is too small
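The merge loop itself is independent of the scoring model, so it can be sketched with a generic pairwise `score` callback standing in for the PLDA same/different-speaker likelihood ratio; re-extracting an iVector for a merged cluster is approximated here by pooling the vectors, which is a simplification.

```python
import numpy as np

def agglomerative_merge(vectors, score, threshold):
    """Greedy agglomerative clustering over per-cluster vectors (e.g.
    iVectors): repeatedly merge the pair with the highest pairwise score
    until no pair scores above the threshold. `score(a, b)` stands in for
    the PLDA likelihood ratio of the slides; any symmetric similarity
    works for this sketch."""
    clusters = [[i] for i in range(len(vectors))]
    vecs = [np.asarray(v, float) for v in vectors]
    while len(clusters) > 1:
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = score(vecs[a], vecs[b])
                if s > best:
                    best, best_pair = s, (a, b)
        if best < threshold:
            break                                # maximum ratio too small
        a, b = best_pair
        # A real system would re-extract the iVector of the merged
        # cluster; the sketch pools by a size-weighted mean instead.
        n_a, n_b = len(clusters[a]), len(clusters[b])
        vecs[a] = (n_a * vecs[a] + n_b * vecs[b]) / (n_a + n_b)
        clusters[a] += clusters[b]
        del clusters[b]
        del vecs[b]
    return clusters
```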

  17. Results Clustering + Conclusion
  • COST278 broadcast news test set: 11(+1) languages, 30 hours, 4400 speaker turns
  • Diarization Error Rate (DER): percentage of frames attributed to a wrong speaker

  Initial segmentation   DER(%)  Boundary Precision(%)  Boundary Recall(%)
  ΔBIC                   10.1    65.2                   75.8
  d_MAH                  9.7     74.3                   84.2
  2-pass d_MAH           9.8     79.7                   81.3
  2-pass with soft VAD   8.9     81.7                   85.0

  • Factor-analysis-based speaker segmentation produces more accurate boundaries
  • File-by-file adaptation further improves the results
  • Soft VAD makes the speaker factor extraction more accurate (adaptive system)
  • Viterbi resegmentation deteriorates the boundary accuracy of the proposed system
