Odyssey 2016, Bilbao, Spain
Short- and Long-Term Speech Features for Hybrid HMM-i-Vector based Speaker Diarization System
Abraham Woubie (1), Jordi Luque (2), and Javier Hernando (1)
(1) TALP Research Center, Dept. of Signal Theory and Communications, Universitat Politècnica de Catalunya, Barcelona, Spain
(2) Telefonica Research, Edificio Telefonica-Diagonal, Barcelona, Spain
June 24, 2016
Outline
❑ Introduction
❑ Objectives
❑ Voice-quality and Prosodic Features
❑ Speaker Diarization Architecture
❑ Fusion Techniques
❑ Experimental Setup and Results
❑ Conclusions
Introduction
❑ Speaker diarization = speaker segmentation + speaker clustering
[Figure: speaker segmentation followed by speaker clustering; cluster labels: Speaker 1, Speaker 2, Speaker 3]
❑ Motivation
o MFCC and GMM are, respectively, the most widely used short-term speech features and speaker clustering technique in speaker diarization.
o Jitter and shimmer voice-quality measurements (JS) and prosodic features have been successfully used together with MFCC in GMM-based speaker diarization.
o We propose fusing the scores of i-vectors extracted from MFCC and from long-term speech features for the speaker clustering task.
Objectives
❑ Feature selection of long-term voice-quality and prosodic features.
o The voice-quality features are absolute jitter, absolute shimmer and shimmer apq3.
o The prosodic features are pitch, intensity and the first four formant frequencies.
❑ Stacking these voice-quality and prosodic features in the same feature vector.
❑ Extraction of i-vectors from the short-term spectral features and from these long-term feature sets.
❑ Fusion of the scores of the i-vectors extracted from these features for speaker clustering.
Speech Features
❑ Mel Frequency Cepstral Coefficients (MFCC): They are the most widely used short-term features in speaker diarization.
❑ Voice-quality features: They measure variations in the fundamental frequency and amplitude of a speaker's voice.
o We have extracted absolute jitter, absolute shimmer and shimmer apq3.
❑ Prosodic features: They are estimated by capturing the evolution over time of fundamental frequency, acoustic intensity and formant frequencies.
o We have extracted pitch, intensity and the first four formant frequencies.
Voice-quality
❑ Jitter (absolute): It is the average absolute difference between consecutive periods.
❑ Shimmer (absolute): It is the average absolute logarithm of the ratio between amplitudes of consecutive periods.
❑ Shimmer (apq3): It is the three-point Amplitude Perturbation Quotient.
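The three definitions above can be sketched in a few lines of Python. This is only an illustration, not the extraction code used in the presented system. It assumes that the pitch periods (in seconds) and the per-period peak amplitudes have already been estimated by some pitch tracker. The function names are hypothetical. Two conventions are assumptions borrowed from common practice: absolute shimmer is scaled as 20·log10 so that it is expressed in dB, and apq3 compares each amplitude with the three-point average centred on it, normalised by the mean amplitude.

```python
import numpy as np

def jitter_absolute(periods):
    """Mean absolute difference between consecutive pitch periods (seconds)."""
    p = np.asarray(periods, dtype=float)
    return float(np.mean(np.abs(np.diff(p))))

def shimmer_absolute_db(amplitudes):
    """Mean absolute log ratio of consecutive amplitudes.

    Assumption: scaled as 20*log10 so the result is in dB.
    """
    a = np.asarray(amplitudes, dtype=float)
    return float(np.mean(np.abs(20.0 * np.log10(a[1:] / a[:-1]))))

def shimmer_apq3(amplitudes):
    """Three-point Amplitude Perturbation Quotient.

    Assumption: mean absolute difference between each amplitude and the
    average of itself and its two neighbours, divided by the mean amplitude.
    """
    a = np.asarray(amplitudes, dtype=float)
    three_point_mean = (a[:-2] + a[1:-1] + a[2:]) / 3.0
    return float(np.mean(np.abs(a[1:-1] - three_point_mean)) / np.mean(a))

# Toy values for illustration only (not taken from any recording).
periods = [0.010, 0.011, 0.010, 0.012]
amplitudes = [1.0, 4.0, 1.0]
print(jitter_absolute(periods))
print(shimmer_absolute_db(amplitudes))
print(shimmer_apq3(amplitudes))
```

In practice these measurements are computed over voiced frames only, since periods and amplitudes are undefined in unvoiced speech.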
Prosody
❑ Prosody is estimated by capturing the evolution over time of fundamental frequency, acoustic intensity and formant frequencies.
o Pitch: It is the perceived fundamental frequency.
o Intensity: It is the energy of a speech signal.
o Formant frequencies: They are concentrations of acoustic energy around particular frequencies.
[Figure: example plot showing pitch, intensity (dB axis) and formant frequency (0 Hz to 5000 Hz axis) contours]
Speaker Diarization Architecture
[Figure: block diagram of the diarization system. Recoverable block labels: Speech; Reference; Speech-only frames; MFCC extraction; JS extraction; Prosody extraction; Stack features; Initialization (Initialize clusters); Speaker segmentation (GMM complexity, HMM training, Viterbi segmentation); MFCC score; JS and Prosody score; Score fusion; Speaker clustering (Clusters, BIC computation, Merge clusters? Yes: Merged clusters; No: Final hypothesis)]
Proposed Speaker Clustering Architecture
[Figure: block diagram of the proposed speaker clustering. Recoverable block labels: Clusters; UBM; i-vector extraction from MFCC; i-vector extraction from JS and Prosody; Cosine score fusion; Merge clusters? Yes: Merged clusters; No: Final hypothesis]
o Cluster merging is based on a threshold value.
o The i-vectors are extracted using the Alize toolkit.
o Two gender-independent UBMs of 512 GMM components are trained using 100 AMI shows (not included in the testing set).
o The T-matrices are trained using the same dataset.
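The threshold-based merge decision can be sketched as a simple loop. This is a hedged illustration, not the system's implementation: each cluster is represented here by a single vector, the pairwise score is a plain cosine score, and averaging the two vectors after a merge is only a placeholder for re-extracting an i-vector from the merged cluster's data. The function names and the threshold values are hypothetical.

```python
import numpy as np

def cosine_score(u, v):
    """Cosine similarity between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def merge_until_threshold(cluster_vectors, threshold):
    """Repeatedly merge the most similar pair while its score >= threshold."""
    clusters = [np.asarray(v, dtype=float) for v in cluster_vectors]
    while len(clusters) > 1:
        best_score, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = cosine_score(clusters[i], clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break  # stopping criterion: no pair is similar enough
        i, j = best_pair
        # Placeholder (assumption): a real system would re-extract the
        # i-vector from the merged cluster rather than average two vectors.
        merged = (clusters[i] + clusters[j]) / 2.0
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Toy vectors: the first two point in nearly the same direction.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(len(merge_until_threshold(toy, threshold=0.9)))
```

A higher threshold stops merging earlier and leaves more clusters; a lower one merges more aggressively.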
Stopping Criterion Selection
[Figure: five panels (Show one, Show two, Show three, Show four, Show five). Caption: Diarization Error Rate (DER) and cosine-distance score per iteration for five selected shows from the development set]
Fusion Techniques: Segmentation
❑ The fusion of voice-quality features with the prosodic ones is carried out at the feature level.
❑ The fusion of short- and long-term speech features is carried out at the score likelihood level as follows for speaker segmentation:
$\log P(x, y \mid \theta_{ix}, \theta_{iy}) = \alpha \log P(x \mid \theta_{ix}) + (1 - \alpha) \log P(y \mid \theta_{iy})$
o $\log P(x, y)$ is the fused GMM score
o $\theta_{ix}$ is the model of the spectral features
o $\theta_{iy}$ is the model of JS and Prosody
o $\alpha$ is the (tuned) weight of MFCC
o $1 - \alpha$ is the weight of JS and Prosody
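As a minimal sketch of this score-level fusion, the snippet below combines two sets of per-cluster GMM log-likelihoods with weights α and 1 − α. The log-likelihood values, the value α = 0.9 and the function name are illustrative assumptions; the slides only state that α is tuned.

```python
import numpy as np

def fuse_log_likelihoods(ll_mfcc, ll_long, alpha):
    """Weighted sum of log-likelihoods: alpha * MFCC + (1 - alpha) * long-term."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * np.asarray(ll_mfcc, dtype=float) + \
        (1.0 - alpha) * np.asarray(ll_long, dtype=float)

# Rows are segments, columns are candidate clusters (toy numbers).
ll_mfcc = np.array([[-10.0, -12.0], [-11.0, -9.0]])  # MFCC GMM scores
ll_long = np.array([[-4.0, -3.0], [-5.0, -2.0]])     # JS + prosody GMM scores

fused = fuse_log_likelihoods(ll_mfcc, ll_long, alpha=0.9)
best_cluster = fused.argmax(axis=1)  # most likely cluster per segment
print(fused)
print(best_cluster)
```

Because the two streams have different dimensionality, their raw log-likelihoods live on different scales, which is one reason the weight is tuned rather than fixed at 0.5.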
Fusion Techniques: Clustering
❑ The fusion of the cosine-distance scores of i-vectors from the short-term and long-term speech features is carried out at the score level for speaker clustering as follows:
[Figure: MFCC extraction feeds one i-vector extraction; voice-quality extraction and prosody extraction feed "Stack features" and then a second i-vector extraction; both i-vector streams feed "Cosine score fusion"]
Fused score $= \beta \, S(x_i, x_j) + (1 - \beta) \, S(y_i, y_j)$, where $S(\cdot, \cdot)$ is the cosine-distance score
o $x_i$ and $x_j$ are the i-vectors extracted from MFCC for clusters i and j
o $y_i$ and $y_j$ are the i-vectors extracted from voice-quality and Prosody for clusters i and j
o $\beta$ is the weight of the cosine-distance score of the i-vectors extracted from MFCC
o $(1 - \beta)$ is the weight of the cosine-distance score of the i-vectors extracted from voice-quality and Prosody
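The fused clustering score can be illustrated with a short sketch. It assumes the cosine-distance score is the usual cosine similarity between two i-vectors; the function names, the toy vectors and β = 0.8 are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def cosine_score(u, v):
    """Cosine similarity between two i-vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def fused_cosine_score(x_i, x_j, y_i, y_j, beta):
    """beta * score on MFCC i-vectors + (1 - beta) * score on long-term i-vectors."""
    return beta * cosine_score(x_i, x_j) + (1.0 - beta) * cosine_score(y_i, y_j)

# Toy i-vectors: the MFCC pair is identical, the long-term pair is orthogonal.
x_i, x_j = [1.0, 0.0], [1.0, 0.0]
y_i, y_j = [1.0, 0.0], [0.0, 1.0]
print(fused_cosine_score(x_i, x_j, y_i, y_j, beta=0.8))
```

With these toy inputs the MFCC score is 1 and the long-term score is 0, so the fused score equals β.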
Experimental Setup
❑ The experiments have been developed and tested on the AMI corpus, a set of multi-party, spontaneous speech recordings.
❑ The number of speakers per show ranges from 3 to 5.
❑ We have selected 10 shows from the AMI corpus as a development set.
❑ Two experimental scenarios have been defined for the test sets:
o Single-site: 10 shows from the Idiap site (total duration = 307 minutes)
o Multiple-site: 10 shows from the Idiap, Edinburgh and TNO sites (294 minutes)
❑ The sizes of the i-vectors for the short- and long-term speech features are 100 and 50, respectively.
❑ Oracle SAD has been used for speech activity detection.