Monaural speech separation using source-adapted models
Ron Weiss, Dan Ellis
{ronw,dpwe}@ee.columbia.edu
LabROSA, Department of Electrical Engineering, Columbia University
2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2007)
Monaural speech separation
- Given a single-channel recording of multiple talkers
- Infer the original source signals from the mixture
- Under-determined: more unknowns (sources) than observations
Speech separation challenge [Cooke and Lee, 2006]
- Single-channel, two-talker mixtures of utterances from 34 speakers
- Constrained grammar: command(4) color(4) preposition(4) letter(25) digit(10) adverb(4)
- Task: determine the letter and digit spoken by the source that said "white"
- Target-to-masker ratio (TMR) from -9 to 6 dB
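As a worked illustration of the TMR figure above, the sketch below scales a masker so that the target-to-masker power ratio hits a requested value in dB. It uses the usual power-ratio definition of TMR; how the challenge mixtures were actually constructed is not stated on the slide, and the function names `mix_at_tmr` and `measure_tmr_db` and the white-noise stand-in signals are hypothetical.

```python
import numpy as np

def measure_tmr_db(target, masker):
    """Target-to-masker ratio in dB, computed from mean signal power."""
    return 10.0 * np.log10(np.mean(np.square(target)) / np.mean(np.square(masker)))

def mix_at_tmr(target, masker, tmr_db):
    """Scale `masker` so the mixture has the requested TMR, then add it to `target`.

    Returns the mixture and the scaled masker.
    """
    target = np.asarray(target, dtype=float)
    masker = np.asarray(masker, dtype=float)
    p_target = np.mean(target ** 2)
    p_masker = np.mean(masker ** 2)
    # Solve 10*log10(p_target / (gain^2 * p_masker)) == tmr_db for gain.
    gain = np.sqrt(p_target / (p_masker * 10.0 ** (tmr_db / 10.0)))
    scaled = gain * masker
    return target + scaled, scaled

# Toy signals (white noise standing in for two utterances; an assumption).
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
masker = rng.standard_normal(16000)
mixture, scaled_masker = mix_at_tmr(target, masker, -3.0)
```

At negative TMR the masker is louder than the target, which is why the low end of the range is the harder condition.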
Model-based separation
[Figure: model means; frequency (kHz) vs. state index]
- Use constraints from prior signal models to guide separation
  - HMM, log spectral features
- Factorial model inference
  - Explain each frame of mixed signal as a combination of model states
- e.g. Iroquois [Kristjansson et al., 2006]
  - Speaker-dependent models
  - Acoustic dynamics and grammar constraints
  - Superhuman performance
Model-based separation - Limitations
- Rely on speaker-dependent models to disambiguate sources
- What if the task isn't so well defined?
  - No a priori knowledge of speaker identities or grammar
- Adapt speaker-independent source model [Ozerov et al., 2005]
- Problems
  - Want to adapt to a single utterance, not enough data for MLLR
    1. Use PCA to reduce number of adaptation parameters: "Eigenvoices"
  - Only observation is mixed signal
    2. Iterative separation/adaptation algorithm
Eigenvoices [Kuhn et al., 2000]
- Train N speaker-dependent models as priors on the space of speaker variation
- Pack model parameters (Gaussian means) into speaker supervector
- Principal component analysis to find orthonormal bases
- Speaker model is a linear combination of bases:
  $\mu = \bar{\mu} + U w + g$
  where $\mu$ is the adapted model, $\bar{\mu}$ the mean voice, $U$ the eigenvoice bases, $w$ the weights, and $g$ the gain
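The eigenvoice construction can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the supervectors are random toy data, the dimensions (34 speakers, 200-dimensional supervectors, 10 bases) are chosen only for the example, the gain `g` is treated as a plain additive offset, and the helper names `train_eigenvoices`, `project`, and `adapt` are hypothetical.

```python
import numpy as np

def train_eigenvoices(supervectors, num_bases):
    """PCA over speaker supervectors.

    supervectors: (N, D) array, one row of stacked Gaussian means per
    speaker-dependent model.
    Returns the mean voice (D,) and U (D, num_bases) with orthonormal columns.
    """
    X = np.asarray(supervectors, dtype=float)
    mean_voice = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered supervectors.
    _, _, vt = np.linalg.svd(X - mean_voice, full_matrices=False)
    return mean_voice, vt[:num_bases].T

def project(supervector, mean_voice, U):
    """Eigenvoice weights w of a supervector (orthogonal projection onto U)."""
    return U.T @ (supervector - mean_voice)

def adapt(mean_voice, U, w, g=0.0):
    """Adapted model mu = mean_voice + U w + g (g assumed to be an additive offset)."""
    return mean_voice + U @ w + g

# Toy data standing in for N = 34 speaker-dependent models (an assumption).
rng = np.random.default_rng(0)
supervectors = rng.standard_normal((34, 200))
mean_voice, U = train_eigenvoices(supervectors, num_bases=10)
w = project(supervectors[0], mean_voice, U)
approx = adapt(mean_voice, U, w)  # low-dimensional approximation of speaker 0
```

The point of the construction is that a speaker is described by the short vector `w` instead of the full supervector, so far fewer parameters must be estimated from a single utterance.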
Eigenvoice example
[Figure: four panels (mean voice; eigenvoice dimensions 1, 2, and 3); frequency (kHz) vs. phone: b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao ow uw ax]
Separation algorithm - Signal separation
- Compose factorial HMM from adapted models
- Find maximum likelihood path using Viterbi algorithm
- Reconstruct source signals from Viterbi path
[Figure: factorial HMM state paths for model 1 and model 2 over observations / time]
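The three steps above can be sketched as a Viterbi search over the product of the two source models' state spaces. Several things here are assumptions rather than details from the slides: the mixture log spectrum is modeled as the elementwise maximum of the two source state means (a common approximation), the observation noise is an isotropic Gaussian, the initial state distribution is uniform, and the name `factorial_viterbi` and the toy models are hypothetical.

```python
import numpy as np

def factorial_viterbi(obs, means1, means2, log_trans1, log_trans2, var=1.0):
    """Most likely joint state path of a two-chain factorial HMM.

    obs: (T, F) mixture log spectra; means1: (S1, F); means2: (S2, F);
    log_trans1, log_trans2: (S, S) log transition matrices.
    Returns one state path per source.
    """
    T = obs.shape[0]
    S1, S2 = means1.shape[0], means2.shape[0]
    # Mean of each joint state (i, j) under the max approximation: (S1, S2, F).
    combined = np.maximum(means1[:, None, :], means2[None, :, :])
    diff = obs[:, None, None, :] - combined[None]
    loglik = (-0.5 * np.sum(diff ** 2, axis=-1) / var).reshape(T, S1 * S2)
    # Joint transition (i, j) -> (k, l) is the sum of per-chain log transitions.
    log_trans = (log_trans1[:, None, :, None]
                 + log_trans2[None, :, None, :]).reshape(S1 * S2, S1 * S2)

    delta = loglik[0].copy()
    backptr = np.zeros((T, S1 * S2), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # rows: from, columns: to
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores.max(axis=0) + loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta)
    for t in range(T - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return np.unravel_index(path, (S1, S2))

# Toy models: each source has two states with energy in distinct bands.
means1 = np.array([[5.0, 0, 0, 0], [0, 5.0, 0, 0]])
means2 = np.array([[0, 0, 5.0, 0], [0, 0, 0, 5.0]])
uniform = np.log(np.full((2, 2), 0.5))
true1, true2 = np.array([0, 0, 1, 1, 0]), np.array([1, 0, 0, 1, 1])
obs = np.maximum(means1[true1], means2[true2])
path1, path2 = factorial_viterbi(obs, means1, means2, uniform, uniform)
# Per-source reconstruction: the state means along each decoded path.
est1, est2 = means1[path1], means2[path2]
```

The joint state space has S1 x S2 states, so the search cost grows with the product of the model sizes; this sketch makes no attempt to prune it.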
Separation algorithm - Model adaptation
- Find projection of reconstructed source signals onto eigenvoice bases
- But state sequence is hidden, need EM
  - E-step: HMM forward-backward
  - M-step: for each possible state sequence, project signal frames onto corresponding sequence of states from each eigenvoice basis vector
- Iterate...
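One way to read the M-step is as a weighted least-squares fit of the eigenvoice weights, with the E-step state posteriors as the weights. The sketch below assumes unit covariances and no gain term, takes the posteriors as given instead of running forward-backward, and uses hypothetical names (`update_weights`, `bases`); it illustrates the projection idea and is not the authors' exact update.

```python
import numpy as np

def update_weights(frames, posteriors, mean_voice, bases):
    """Weighted least-squares estimate of eigenvoice weights w.

    frames: (T, F) reconstructed source frames.
    posteriors: (T, S) state occupation probabilities from the E-step.
    mean_voice: (S, F) mean-voice state means.
    bases: (S, F, K) eigenvoice bases, reshaped per state.
    Minimizes sum_t sum_s posteriors[t, s] * ||x_t - mean_voice[s] - bases[s] @ w||^2.
    """
    K = bases.shape[2]
    A = np.zeros((K, K))
    b = np.zeros(K)
    occupancy = posteriors.sum(axis=0)        # (S,) expected frames per state
    weighted_frames = posteriors.T @ frames   # (S, F) sum_t gamma_ts * x_t
    for s in range(bases.shape[0]):
        U_s = bases[s]
        A += occupancy[s] * (U_s.T @ U_s)
        b += U_s.T @ (weighted_frames[s] - occupancy[s] * mean_voice[s])
    return np.linalg.solve(A, b)

# Toy check: frames generated exactly from known weights are recovered.
rng = np.random.default_rng(0)
S, F, K = 3, 6, 2
mean_voice = rng.standard_normal((S, F))
bases = rng.standard_normal((S, F, K))
w_true = np.array([0.7, -1.2])
states = np.array([0, 1, 2, 1, 0, 2])
frames = mean_voice[states] + bases[states] @ w_true
posteriors = np.eye(S)[states]                # one-hot posteriors for the toy case
w_est = update_weights(frames, posteriors, mean_voice, bases)
```

With soft posteriors the same normal equations apply; each frame simply contributes to every state in proportion to its occupation probability.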
Separation example
Mixture: t32_swil2a_m18_sbar9n
[Figure: spectrogram panels for adaptation iterations 1, 3, and 5 and for SD model separation; frequency (kHz) vs. time (sec)]
Performance
[Figure: accuracy vs. iteration (1 to 15) for Diff Gender, Same Gender, and Same Talker mixtures]
- Letter-digit accuracy averaged across all TMRs
- Adaptation improves separation
- Same talker case hard: source permutations
Performance - Adapted vs. source-dependent models
[Figure: accuracy vs. TMR (6 dB to -9 dB) for SD, SA, SI, and Baseline; panels: Diff Gender, Same Gender, Same Talker]
Performance - Held out speakers
[Figure: accuracy vs. number of training speakers (10, 20, 30, 34) for SA and SD; panels: Same Gender, Diff Gender]
- Trained models on subset of speakers
- Tested on mixtures from held-out speakers
- Performance suffers for both systems
- Relative decrease significantly bigger for SD than SA
- Open question: scale
Summary
- Limitations of model-based source separation
- Algorithm for model adaptation from mixed signal
- Significant improvement over speaker-independent models
- Source-dependent models better on matched training/testing data
- Adaptation helps generalize better to held-out speakers
References
- Cooke, M. and Lee, T. W. (2006). The speech separation challenge.
- Kristjansson, T., Hershey, J., Olsen, P., Rennie, S., and Gopinath, R. (2006). Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system. In Proceedings of Interspeech.
- Kuhn, R., Junqua, J., Nguyen, P., and Niedzielski, N. (2000). Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing, 8(6):695-707.
- Ozerov, A., Philippe, P., Gribonval, R., and Bimbot, F. (2005). One microphone singing voice separation using source-adapted models. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.
Separation algorithm - Initialization
- Fast convergence needs good initialization
- Want to differentiate source models to get best separation
- Get initial coefficient for each eigenvoice dimension independently
- Coarsely quantize eigenvoice weights
- Find most likely combination in mixture
[Figure: eigenvoice weights vs. speaker gender; w1 vs. w2 (both axes -2000 to 2000), male and female speakers]