multimodal meeting manager - m4 Crosstalk Analysis Stuart N Wrigley Vincent Wan Guy J Brown Steve Renals 29 January 2003 Speech and Hearing Research Group, University of Sheffield, UK
Crosstalk Analysis
Goals
• Detection of crosstalk.
• Ideally, we would like to segment each channel into channel speaker, channel speaker alone, channel speaker + crosstalk, crosstalk alone.
• Segmentation must be channel (i.e. speaker, meeting, environment) independent.
Data
• ICSI: close-talking mics for each participant (mix of lapel and head-mounted), plus 4 tabletop mics. A large amount of data, already checked and transcribed.
• IDIAP: lapel mics, plus 12 tabletop mics and a manikin. Still in the initial stages of collection and transcription.
Initial notes
Despite its attractiveness, channel energy may be an unreliable cue to speaker activity.
• ICSI data were recorded primarily with head-mounted microphones, which are fixed relative to the mouth (with one or two notable exceptions).
• However, M4 recordings are made with lapel microphones, so head and body movement will change the channel gain throughout the meeting. e.g. if a speaker turns their head to speak to a colleague, the signal energy in that channel may drop significantly.
Channel Activity Classifier
Our goal is to produce a system that will classify a frame of a meeting as either:
• Current channel speaker alone
• Current channel speaker + crosstalk
• Crosstalk alone
• Silence / background noise
We have taken a similar approach to that of ICSI by using an ergodic HMM (EHMM). However, our classifier differs:
• Four main states as opposed to ICSI's two (speech / nonspeech).
• No intermediate state pairs (used by ICSI to impose time constraints on transitions).
Ergodic HMM (EHMM)
• Four states, each representing a particular label.
• Equal prior probability of the first state being any one of the four.
• Each state modelled as a multivariate GMM.
• Transitions allowed between every state pair.
• No minimum residency time in each state.
[Figure: state diagram with four states labelled S, C, N and SC.]
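A minimal sketch of how a fully connected four-state model of this kind could be decoded with the Viterbi algorithm. The slides do not give transition probabilities or GMM parameters, so the self-transition value of 0.9 and the per-frame log-likelihoods below are hypothetical placeholders; in the real system the per-frame scores would come from the state GMMs.

```python
import math

# Labels follow the four classes named on the classifier slide.
LABELS = ["speaker alone", "speaker + crosstalk", "crosstalk alone", "silence"]


def viterbi(frame_loglik, log_trans, log_prior):
    """Return the most likely state index sequence.

    frame_loglik[t][s] is the log-likelihood of frame t under state s
    (assumed here to be supplied by a per-state GMM).
    """
    n_states = len(log_prior)
    score = [log_prior[s] + frame_loglik[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for obs in frame_loglik[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + obs[s])
        back.append(pointers)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


n = len(LABELS)
log_prior = [math.log(1.0 / n)] * n  # equal prior for every state
stay = 0.9  # hypothetical self-transition probability, not from the slides
log_trans = [
    [math.log(stay) if i == j else math.log((1.0 - stay) / (n - 1)) for j in range(n)]
    for i in range(n)
]

# Hypothetical frame scores: two frames favour state 0, then two favour state 3.
GOOD, BAD = -0.1, -5.0
frame_loglik = [
    [GOOD, BAD, BAD, BAD],
    [GOOD, BAD, BAD, BAD],
    [BAD, BAD, BAD, GOOD],
    [BAD, BAD, BAD, GOOD],
]
decoded = [LABELS[s] for s in viterbi(frame_loglik, log_trans, log_prior)]
```

Because every entry of `log_trans` is finite, any state can follow any other, and nothing forces a minimum number of frames in a state, which matches the two design points listed on the slide.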
Features
As mentioned at the August meeting, we wished to look at many different features and determine which were the best. The list (which is still growing) is:
• MFCCs (20 coeffs)
• Energy
• Zero crossing rate (ZCR)
• Time-domain kurtosis (a measure of non-Gaussianity of the signal)
• Frequency-domain kurtosis (a measure of non-Gaussianity of the spectrum)
• Spectral autocorrelation peak-valley ratio (SAPVR)
• Fundamentalness (a measure related to AM and FM at different frequencies)
• Max, min and mean crosscorrelation of all channel pairs
• Autocorrelation-normalised max, min and mean crosscorrelation of all channel pairs
Total number of features: 13 (total number of dimensions: 32)
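A rough sketch of three of the simpler features for a single frame: energy, zero crossing rate, and the max/min/mean crosscorrelation summary. The slides do not say exactly how the crosscorrelation values are formed, so this assumes the peak crosscorrelation between one target channel and each other channel, optionally divided by the square root of the product of the two zero-lag autocorrelations. The function names and the `max_lag` parameter are illustrative only.

```python
import math


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def peak_cross_correlation(x, y, max_lag):
    """Largest raw crosscorrelation of two equal-length frames over +/- max_lag."""
    n = len(x)
    best = -math.inf
    for lag in range(-max_lag, max_lag + 1):
        total = sum(x[i] * y[i + lag] for i in range(n) if 0 <= i + lag < n)
        best = max(best, total)
    return best


def xc_summary(target, others, max_lag=8, normalise=False):
    """(max, min, mean) of the peak crosscorrelation of target with each other channel."""
    peaks = []
    for other in others:
        peak = peak_cross_correlation(target, other, max_lag)
        if normalise:
            # Assumed normalisation: zero-lag autocorrelations of both channels.
            denom = math.sqrt(sum(s * s for s in target) * sum(s * s for s in other))
            peak = peak / denom if denom > 0 else 0.0
        peaks.append(peak)
    return max(peaks), min(peaks), sum(peaks) / len(peaks)


# Toy frames: the second "channel" is the first at half the gain.
target = [0.0, 1.0, 0.0, -1.0] * 4
quieter = [0.5 * s for s in target]
summary = xc_summary(target, [target, quieter], normalise=True)
```

With the normalisation switched on, a channel that carries the same signal at a lower gain scores the same as the original, which is the property that makes the normalised variants less sensitive to channel gain than raw energy.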
Spotlight on... Kurtosis
(fourth central moment divided by the fourth power of the standard deviation, minus 3 so that a Gaussian scores zero)
• Kurtosis is based on the size of a distribution's tails, i.e. it is a measure of Gaussianity.
• With this normalisation, kurtosis is zero for a Gaussian random variable; most non-Gaussian random variables have a nonzero kurtosis.
• Kurtosis of co-channel speech (crosstalk) is generally less than the kurtosis of the individual speech utterances (see LeBlanc and de Leon, "Speech Separation by Kurtosis Maximization", IEEE ICASSP 1998, 1029-1032).
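A small illustration of why kurtosis drops when signals overlap. Real speech is not used here: as a stand-in, each "talker" is drawn from a Laplacian distribution (heavier tails than a Gaussian), which is an assumption made purely for the demonstration. Summing two independent heavy-tailed signals moves the result towards a Gaussian, so its excess kurtosis falls.

```python
import random


def excess_kurtosis(samples):
    """Fourth central moment over squared variance, minus 3 (zero for a Gaussian)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m4 / (m2 * m2) - 3.0


def laplace_samples(rng, n):
    """The difference of two independent exponentials is Laplacian."""
    return [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 100_000
talker_a = laplace_samples(rng, n)
talker_b = laplace_samples(rng, n)
mixture = [a + b for a, b in zip(talker_a, talker_b)]  # stand-in for crosstalk
gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]

k_a = excess_kurtosis(talker_a)
k_b = excess_kurtosis(talker_b)
k_mix = excess_kurtosis(mixture)
k_gauss = excess_kurtosis(gaussian)
```

In this toy setting `k_mix` comes out below both `k_a` and `k_b`, and `k_gauss` is close to zero, mirroring the behaviour the slide describes for co-channel speech.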
Spotlight on... Fundamentalness
(see Speech Comm. 27 (1999) page 196, eqns (13)-(19))
• A wide analysing wavelet makes the output corresponding to the fundamental component have smaller FM and AM than other outputs.
• Fundamentalness is defined to take its maximum value when the FM and AM magnitudes are at a minimum, which corresponds to the fundamental component.
• Although this was developed to analyse a single harmonic series, the concept that a single fundamental produces high fundamentalness is useful:
• If more than one fundamental is present, interference between the two components will cause AM and FM, thus decreasing the fundamentalness measure.
Data
• For each classifier, the multivariate GMMs were trained on 1M frames (16 ms) per class, taken randomly from four ICSI meetings (bro012, bmr006, bed010, bed008).
• The classifier was evaluated using 1K frames (16 ms) per class, taken randomly from one ICSI meeting (bmr001).
Note: the crosscorrelation information is incorporated into the feature set, as opposed to being a post-processing stage as in the ICSI classifier.
Selection of best features
• The parcel algorithm (see Scott, Niranjan and Prager) was used to assess the classification performance of the different feature combinations.
• A receiver operating characteristic (ROC) curve shows classification performance.
• For each feature combination, the GMMs are trained and then evaluated to create a ROC for each class.
• Each point on a ROC curve represents the performance of a classifier with a different decision threshold between two classes (i.e. the class of interest vs all others).
• Given a number of ROCs (one per feature combination), a maximum realisable ROC (MRROC) can be calculated by fitting a convex hull over the existing ROCs.
• Therefore, each point on a MRROC represents the optimum feature combination for that class at a particular trade-off between true positives and false positives.
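The convex hull step can be sketched as follows. This is only the hull-fitting part, not the full parcel algorithm, and the two ROC curves labelled "A" and "B" are made-up operating points rather than results from the slides. Each hull vertex remembers which feature combination supplied it, which is how a MRROC point maps back to a choice of features.

```python
def cross(o, a, b):
    """2-D cross product of vectors o->a and o->b (negative for a right turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def mrroc(rocs):
    """Upper convex hull over all ROC points.

    rocs maps a feature-combination label to a list of
    (false_alarm, correct_detection) points, both in the range 0..1.
    Returns hull vertices from left to right as (point, label) pairs; the
    trivial end points (0, 0) and (1, 1) carry the label None.
    """
    owner = {(0.0, 0.0): None, (1.0, 1.0): None}
    for label, curve in rocs.items():
        for point in curve:
            owner.setdefault(point, label)
    hull = []
    for p in sorted(owner):
        # Drop the last vertex while it lies on or below the line to p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return [(p, owner[p]) for p in hull]


# Hypothetical operating points for two feature combinations.
rocs = {
    "A": [(0.1, 0.6), (0.4, 0.7)],
    "B": [(0.2, 0.5), (0.3, 0.9)],
}
hull = mrroc(rocs)
```

Here combination "A" wins at low false alarm rates and "B" takes over further along the curve, while the points (0.2, 0.5) and (0.4, 0.7) fall under the hull and are discarded.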
ROCs
After initial inspection of the ‘raw’ ROCs, it was determined that only a subset of features should be investigated (thus reducing the number of combinations from 8191 to 127). For example, the performance of the MFCCs was sufficiently low that they were not considered in combination with other features.
e.g. MFCCs vs crosscorrelation in detecting speaker alone:
[Figure: two ROC plots, "Single feature: max normalised XC" and "Single feature: mfcc". x-axis: False Alarm probability (in %); y-axis: Correct detection probability (in %). Curves: speaker alone, crosstalk alone, speaker+crosstalk, silence. Annotations: ~64 %, ~78 %.]
MRROCs
We computed the MRROCs of each combination of:
• energy
• kurtosis
• fundamentalness
• max XC
• mean XC
• max normalised XC
• mean normalised XC
... and then the MRROCs of those MRROCs!!
The final MRROC tells us which feature combination to use in the final classifier.
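For reference, the non-empty combinations of the seven features listed above can be enumerated as below; there are 2^7 - 1 = 127 of them, against 2^13 - 1 = 8191 for the full set of 13 features. The helper name `feature_combinations` is illustrative.

```python
from itertools import combinations

FEATURES = [
    "energy",
    "kurtosis",
    "fundamentalness",
    "max XC",
    "mean XC",
    "max normalised XC",
    "mean normalised XC",
]


def feature_combinations(features):
    """Yield every non-empty subset of the features, smallest subsets first."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


combos = list(feature_combinations(FEATURES))
n_reduced = len(combos)                                   # 7 features
n_full = sum(1 for _ in feature_combinations(range(13)))  # all 13 features
```

Each tuple in `combos` would correspond to one set of GMMs to train and one ROC per class.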
MRROC
[Figure: MRROC using features: energy, kurtosis, fundamentalness, max XC, mean XC, max normalised XC, mean normalised XC. x-axis: False Alarm probability (in %); y-axis: Correct detection probability (in %). Curves: speaker alone, crosstalk alone, speaker+crosstalk, silence.]
Speaker alone: ~83 %
MRROC discarding energy
[Figure: MRROC using features: kurtosis, fundamentalness, max XC, mean XC, max normalised XC, mean normalised XC. x-axis: False Alarm probability (in %); y-axis: Correct detection probability (in %). Curves: speaker alone, crosstalk alone, speaker+crosstalk, silence.]
Speaker alone: ~81 %
MRROC discarding energy and crosscorrelation
(note ~10 % performance drop when not using crosscorrelation)
[Figure: MRROC using features: kurtosis, fundamentalness. x-axis: False Alarm probability (in %); y-axis: Correct detection probability (in %). Curves: speaker alone, crosstalk alone, speaker+crosstalk, silence.]
Speaker alone: ~71 %