A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis M. Sam Ribeiro, Junichi Yamagishi, Robert A. J. Clark School Of Informatics The University of Edinburgh m.f.s.ribeiro@sms.ed.ac.uk 8 September 2015 1 / 29
Overview Introduction Motivation Hypotheses Experiments Experiment 2 Experiment 3 Discussion Summary References 2 / 29
Introduction • Wavelets in Speech Processing [Farouk, 2014] • Annotation of prominence [Vainio et al, 2013] • Pre-processing step for f0 modeling [Suni et al, 2013], [Ribeiro and Clark, 2015] • Voice Conversion [Sanchez et al, 2014] • Modeling of f0 • The general conclusion is that signal decomposition is beneficial for f0 modeling • But it is assumed that all components are equally relevant to the reconstructed signal. The individual importance of each wavelet scale to the overall signal is not fully understood. 3 / 29
The Continuous Wavelet Transform • The CWT decomposes an input signal into various scales of selected frequency. • 10-scale decomposition • Each scale approximately 1 octave apart. • f0 reconstruction [Suni et al, 2013]:
$f_0(x) = \sum_{i=1}^{10} C_i(x)\,(i + 2.5)^{-5/2}$
5 / 29
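As a rough illustration of "each scale approximately 1 octave apart", the sketch below lists one frequency per scale by halving from the top scale downwards. The 50 Hz upper value is taken from the experimental conditions table in this deck. The function name `scale_frequencies` and the exact power-of-two spacing are assumptions made for illustration, not the authors' implementation.

```python
# Sketch: approximate frequencies for 10 scales spaced one octave apart.
# Assumption: the highest scale (index 10) sits near 50 Hz, and each lower
# scale is exactly half the frequency of the one above it.
def scale_frequencies(n_scales=10, top_hz=50.0):
    """Return approximate frequencies in Hz for scales 1..n_scales (low to high)."""
    return [top_hz / 2 ** (n_scales - i) for i in range(1, n_scales + 1)]


freqs = scale_frequencies()
for i, f in enumerate(freqs, start=1):
    print(f"scale {i:2d}: {f:.2f} Hz")
```

Under these assumptions the lowest scale lands close to 0.1 Hz, which agrees with the approximate ranges listed for the conditions.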
Hypotheses • Middle frequencies (scales 5-8) are associated with higher levels of naturalness • Low frequencies (scales 1-4) don't contain much information and are comparable to HMM-generated f0 • High frequencies (scales 9-10) are mostly noise and do not contribute much to the perceived naturalness 6 / 29
Conditions and Reconstruction
Experimental conditions

Condition  Description                                               Freq. (Hz)
natural    Vocoded speech using natural parameters                   -
all        All f0 frequencies.                                       0.1-50
1-2        Low frequencies. Scales indexed at 1 and 2.               0.1-0.2
3-4        Low frequencies. Scales indexed at 3 and 4.               0.4-0.8
1-4        All low frequencies. Scales indexed at 1, 2, 3, and 4.    0.1-0.8
5-6        Middle frequencies. Scales indexed at 5 and 6.            1.6-3.2
7-8        Middle frequencies. Scales indexed at 7 and 8.            6.3-13
5-8        All middle frequencies. Scales indexed at 5, 6, 7, and 8. 1.6-13
9-10       High frequencies. Scales indexed at 9 and 10.             25-50
MSD-HMM    f0 signal predicted from an MSD-HMM.                      -

Table: Experimental conditions with approximate CWT frequency ranges.

F0 reconstruction
$f_0(x) = \sum_{i=1}^{10} w_i\, C_i(x)\,(i + 2.5)^{-5/2}$
where $w_i \in \{0, 1\}$ is the weight given to scale $i$.
7 / 29
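The weighted reconstruction can be sketched in a few lines of Python. This is a minimal illustration, assuming the ten scale signals C_1..C_10 are already available as equal-length lists of frame values. The helper names `condition_weights` and `reconstruct_f0` are hypothetical and are not taken from the authors' code.

```python
# Sketch of f0(x) = sum_i w_i * C_i(x) * (i + 2.5)^(-5/2) with binary weights.
# Assumption: `scales` holds C_1..C_10 in order, each a list of frame values.
def condition_weights(active_scales, n_scales=10):
    """Binary weights w_i: 1 for scales kept in a condition, 0 otherwise."""
    return [1 if i in active_scales else 0 for i in range(1, n_scales + 1)]


def reconstruct_f0(scales, weights):
    """Sum the selected scales frame by frame, applying the per-scale gain."""
    n_frames = len(scales[0])
    f0 = [0.0] * n_frames
    for i, (c_i, w_i) in enumerate(zip(scales, weights), start=1):
        gain = (i + 2.5) ** (-5.0 / 2.0)
        for x in range(n_frames):
            f0[x] += w_i * c_i[x] * gain
    return f0


# Toy input: ten constant scales over two frames (not real data).
toy_scales = [[1.0, 2.0] for _ in range(10)]
middle = reconstruct_f0(toy_scales, condition_weights({5, 6, 7, 8}))
everything = reconstruct_f0(toy_scales, condition_weights(set(range(1, 11))))
```

Setting every weight to 1 corresponds to the "all" condition, while a set such as {5, 6, 7, 8} corresponds to the "5-8" condition.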
Perceptual Experiments Experiment 1 - Prominence Detection Participants are asked to judge which word appears more prominent in an utterance. Experiment 2 - Similarity Experiment Participants are asked to judge whether utterances are similar in terms of naturalness. Experiment 3 - MUSHRA Test Participants are asked to rate an utterance on a 100-point scale with respect to a reference and all other conditions. Experiment 4 - MOS Test Participants are asked to rate an utterance on a 5-point scale with no references. 8 / 29
Experiment 2 - Similarity • Data • Expressive audiobook data • mel-cepstral, aperiodicity, voicing parameters • HMM system trained on roughly 5000 utterances • duration is force-aligned • natural (vocoded) condition uses original parameters • remaining conditions use natural f0 processed accordingly • Design • 20 utterances synthesized for each of the 10 conditions • 10 native listeners. Each rating 144 utterance pairs • Each pair consists of different utterances and different conditions • No repetitions (utterance or condition) within any three consecutive pairs • Participants asked to judge if the pair is similar or different in terms of naturalness 13 / 29
Experiment 2 - Similarity • 45 distinct condition pairs, each pair judged at least 32 times • Create a 10x10 dissimilarity matrix and embed it into a 2-dimensional space with MDS • Kruskal's normalized stress-1 value of 0.086 14 / 29
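The two bookkeeping steps around the MDS embedding can be sketched as follows. The slides do not say how the dissimilarity values were derived from the similar/different judgments, so the proportion of "different" responses per condition pair is an assumption here. The stress function uses the raw dissimilarities as target distances (a metric variant), which is also an assumption. The MDS fit itself is not implemented, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def dissimilarity_matrix(judgments, conditions):
    """Build a symmetric matrix from (cond_a, cond_b, is_different) judgments.

    Assumption: dissimilarity = proportion of 'different' responses per pair.
    """
    index = {c: k for k, c in enumerate(conditions)}
    n = len(conditions)
    diff = [[0] * n for _ in range(n)]
    total = [[0] * n for _ in range(n)]
    for a, b, is_different in judgments:
        i, j = index[a], index[b]
        for r, c in ((i, j), (j, i)):
            total[r][c] += 1
            diff[r][c] += 1 if is_different else 0
    return [
        [diff[i][j] / total[i][j] if total[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def stress1(dissim, coords):
    """Kruskal's stress-1 between target dissimilarities and embedded distances."""
    num = den = 0.0
    for i, j in combinations(range(len(coords)), 2):
        d = sqrt(sum((p - q) ** 2 for p, q in zip(coords[i], coords[j])))
        num += (dissim[i][j] - d) ** 2
        den += d ** 2
    return sqrt(num / den)


# Ten conditions give 10 * 9 / 2 = 45 distinct unordered pairs.
n_pairs = len(list(combinations(range(10), 2)))

# Toy judgments (not real data) for three hypothetical conditions.
toy = dissimilarity_matrix(
    [("a", "b", True), ("a", "b", False), ("a", "c", True)], ["a", "b", "c"]
)
```

A stress value near zero means the 2-dimensional distances reproduce the dissimilarities closely. Values below about 0.1 are conventionally read as a fair to good fit.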
Experiment 2 - Similarity • Listeners naturally clustered low, middle, and high frequencies • The all-frequencies condition seems to be similar to the middle frequencies • It is also farther from natural speech than the middle frequencies • Listeners tend to prefer the CWT middle frequencies • These have been previously associated with the word level 15 / 29
Experiment 3 - MUSHRA • Data • Expressive audiobook data (same as Similarity Experiment). • Design • Ask participants to judge all conditions simultaneously (1 to 100). • Reference is given as the natural condition. • 10 Participants rate 20 sets of 10 stimuli. • From the 200 expected sets, 48 were discarded as the hidden reference was not judged as natural. • 152 sets were used for analysis. 17 / 29
Experiment 3 - MUSHRA 18 / 29
Main Conclusions • Mid-frequencies • Consistently achieve better results • Naturalness is almost comparable to all frequencies • Have been associated previously with the word level [Suni et al, 2013], [Ribeiro and Clark, 2015] • Low-frequencies • Comparable to HMM-generated f0 (Prominence, MUSHRA, MOS tests) • Although not really the same (similarity test) • Previously associated with phrase and utterance levels • High-frequencies • Consistently judged the most unnatural condition • Not really relevant to naturalness • Previously associated with the phone level 19 / 29
Earlier assumptions [Suni et al, 2013], [Ribeiro and Clark, 2015] 1. All wavelet components are equally relevant to the reconstructed signal 2. The association of wavelet components to linguistic levels is meaningful • First assumption shown not to be true. • Middle frequencies carry most of the information • Low and high frequencies not so relevant • How about their association with linguistic levels? 20 / 29
Unit and Peak Rates • Compute unit and peak rates at utterance level for 5000 utterances • Count units and peaks (local maxima) and divide by utterance duration in seconds 21 / 29
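The rate computation described above can be sketched as below. This is a minimal illustration, assuming a peak is a sample strictly greater than both of its neighbours. The slide does not say how plateaus or utterance edges are handled, and the function names are hypothetical.

```python
# Sketch: count local maxima in a signal and convert a count to a per-second rate.
# Assumption: a peak is a sample strictly greater than both neighbours;
# the first and last samples are never counted.
def count_peaks(signal):
    """Count strict local maxima in a sequence of values."""
    return sum(
        1
        for k in range(1, len(signal) - 1)
        if signal[k - 1] < signal[k] > signal[k + 1]
    )


def rate_per_second(count, duration_s):
    """Divide a unit or peak count by the utterance duration in seconds."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return count / duration_s


# Toy wavelet-scale contour (not real data) over a 4-second utterance.
toy_contour = [0.0, 1.0, 0.0, 2.0, 0.0]
peak_rate = rate_per_second(count_peaks(toy_contour), 4.0)
```

The same `rate_per_second` helper applies to unit counts, for example the number of units of a given linguistic level taken from an alignment, so that unit and peak rates are compared on the same per-second scale.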
Summary • Main Findings • Wavelet components do not carry equal weights for the f0 signal • Middle frequencies convey most of the information • HMM-generated f0 is somewhat similar to low frequencies • Association with linguistic levels is not very good • Speech Samples • http://homepages.inf.ed.ac.uk/s1250520/samples/interspeech15.html • Future Work • Associate each scale with meaningful linguistic levels • Use middle frequencies to learn relevant syllable and word-level features 22 / 29