Modeling Prosody Pattern of Chinese Expressive Speech
Application in Personalized Speech Conversion
Zhang Zhang, Wu Zhiyong, Jia Jia, Cai Lianhong
Outline
• Background
• Introduction
• Modeling prosody pattern of expressive speech
• Application in personalized speech conversion
• Experiments
• Conclusion
Background
• Neutral and expressive speech
  – Audio samples: neutral, expressive
• Personality of different speakers uttering the same expressive speech
  – Audio samples: Speaker 1, Speaker 2
• Related research
  – Meng, F. et al., Synthesizing Expressive Speech to Convey Focus
  – Li, K. et al., Automatic Lexical and Pitch Accent Detection
  – Yang, H. et al., Modeling the Acoustic Correlates of Expressive Elements
Introduction
• To model the prosody pattern of expressive speech
  – Focus on the pitch, intensity and duration of the speech
  – Identify the core and non-core syllables of a prosodic word
  – Propose a double-layer perturbation model
• To apply the model in personalized speech conversion
  – Propose a two-step method to convert the speech
Modeling prosody pattern
• Corpus: text prompts
  – Text prompts are extracted from the Hong Kong Tourism Board
  – Each text prompt introduces the attractive features of a scenic spot
  – 25 utterances in total
    • 120 phrases, 416 prosodic words and 1231 syllables

<Name of tourist spot>
太平山顶 (English: Victoria Peak)
<Descriptive text>
太平山顶是香港最受欢迎的名胜景点之一,登临其间,可俯瞰山下鳞次栉比的摩天高楼和享誉全球的维多利亚港景色。
(Victoria Peak is one of the most popular scenic spots in Hong Kong. When you climb up, you can overlook the row upon row of skyscrapers and the world-famous Victoria Harbour.)
Modeling prosody pattern
• Corpus: expressivity annotation
  – Adopt the PAD model (Mehrabian 1995) to describe the expressivity
  – Use the A (arousal-nonarousal) descriptor to measure the expressive degree (e.g. superlative, comparative, etc.)
  – A = 0.2, 0.4, 0.6, 0.8, 1.0
  – 272 prosodic words with A > 0
• Corpus: contrastive speech recordings
  – Four native Mandarin speakers (two males and two females)
  – Record the text prompts twice: neutral and expressive
  – 50 files of speech recordings for each speaker
    • Saved in wav format (16 bit mono, sampled at 16 kHz)
Modeling prosody pattern
• Acoustic features
  – Mean F0
  – F0 Range
  – Duration
  – RMS Energy
• Acoustic measurements
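The four features listed above can be sketched in a few lines of Python. This is only an illustration: the slide's own measurement formulas did not survive, so the definitions here are assumptions (F0 range taken as maximum minus minimum over voiced frames, unvoiced frames marked with F0 = 0, duration derived from the sample count at the corpus rate of 16 kHz). All function names are hypothetical.

```python
import math

def _voiced(f0_hz):
    """Keep voiced frames only (assumption: unvoiced frames are stored as F0 <= 0)."""
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames in F0 contour")
    return voiced

def mean_f0(f0_hz):
    """Mean F0 in Hz over voiced frames."""
    voiced = _voiced(f0_hz)
    return sum(voiced) / len(voiced)

def f0_range(f0_hz):
    """F0 range in Hz, taken here as max minus min over voiced frames (assumed definition)."""
    voiced = _voiced(f0_hz)
    return max(voiced) - min(voiced)

def duration_s(num_samples, sample_rate=16000):
    """Duration in seconds from the number of waveform samples."""
    return num_samples / sample_rate

def rms_energy(samples):
    """Root-mean-square amplitude of the waveform samples."""
    if not samples:
        raise ValueError("empty sample list")
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def syllable_features(samples, f0_hz, sample_rate=16000):
    """Bundle the four features for one syllable segment."""
    return {
        "mean_f0": mean_f0(f0_hz),
        "f0_range": f0_range(f0_hz),
        "duration": duration_s(len(samples), sample_rate),
        "rms_energy": rms_energy(samples),
    }

# Toy input: 1600 samples (0.1 s at 16 kHz) and a five-frame F0 contour.
example = syllable_features([0.5, -0.5] * 800, [0.0, 200.0, 220.0, 240.0, 0.0])
```

In practice the F0 contour would come from a pitch tracker and the segment boundaries from a syllable-level alignment; neither is specified in the slides.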
Modeling prosody pattern
• Classification of core and non-core syllables
  – The acoustic variations (from neutral to expressive speech) of core syllables are more significant than those of non-core syllables.
• Neutral and expressive speech
  – Audio samples: neutral, expressive
    太平山顶是香港最受欢迎的名胜景点之一……
    (Victoria Peak is one of the most popular scenic spots in Hong Kong……)
Modeling prosody pattern
• Acoustic analysis of the core syllables
  – Core syllables (272)

    R \ A         0.2    0.4    0.6    0.8    1.0
    Mean F0       1.09   1.11   1.14   1.16   1.18
    F0 Range      1.12   1.16   1.19   1.25   1.31
    Duration      1.06   1.09   1.10   1.11   1.13
    RMS Energy    1.20   1.38   1.54   1.71   1.94

  – R is positively correlated with A.
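The positive correlation between R and A can be checked directly from the core-syllable table. The sketch below computes a Pearson correlation coefficient per feature from the tabulated values; treating the five A levels as equally weighted points is an assumption, since the slide gives no per-level counts.

```python
import math

A_LEVELS = [0.2, 0.4, 0.6, 0.8, 1.0]

# R values for core syllables, copied from the table on this slide.
CORE_R = {
    "Mean F0":    [1.09, 1.11, 1.14, 1.16, 1.18],
    "F0 Range":   [1.12, 1.16, 1.19, 1.25, 1.31],
    "Duration":   [1.06, 1.09, 1.10, 1.11, 1.13],
    "RMS Energy": [1.20, 1.38, 1.54, 1.71, 1.94],
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

correlations = {name: pearson(A_LEVELS, r) for name, r in CORE_R.items()}

for name, rho in correlations.items():
    print(f"{name}: r = {rho:.3f}")
```

With only five points per feature the coefficients are descriptive rather than a significance test.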
Modeling prosody pattern
• Acoustic analysis (R) of the non-core syllables
  – Non-core syllables (615)

    Distance to the          A
    core syllable     0.2    0.4    0.6    0.8    1.0
    0                 1.20   1.38   1.54   1.71   1.94
    1                 1.11   1.30   1.48   1.68   1.88
    2                 1.03   1.17   1.31   1.42   1.58
    3                 0.97   1.07   1.16   1.27   1.55

  – R is larger for core syllables than for non-core syllables
  – R is negatively correlated with the distance
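The table above can be used as a lookup from (A, distance) to R. The sketch below is one possible way to do that: linear interpolation between the annotated A levels and clamping of distances beyond 3 are both assumptions added for illustration, not something the slides state. `lookup_r` is a hypothetical helper name.

```python
from bisect import bisect_right

A_LEVELS = [0.2, 0.4, 0.6, 0.8, 1.0]

# R by distance to the core syllable (rows) and A level (columns), copied from the slide.
R_BY_DISTANCE = {
    0: [1.20, 1.38, 1.54, 1.71, 1.94],
    1: [1.11, 1.30, 1.48, 1.68, 1.88],
    2: [1.03, 1.17, 1.31, 1.42, 1.58],
    3: [0.97, 1.07, 1.16, 1.27, 1.55],
}

def lookup_r(a, distance):
    """Return R for arousal a and syllable distance, interpolating linearly in a."""
    if not A_LEVELS[0] <= a <= A_LEVELS[-1]:
        raise ValueError("a must lie between 0.2 and 1.0")
    if distance < 0:
        raise ValueError("distance must be non-negative")
    # Assumption: distances beyond the last tabulated row reuse that row.
    row = R_BY_DISTANCE[min(distance, max(R_BY_DISTANCE))]
    i = bisect_right(A_LEVELS, a) - 1
    if i == len(A_LEVELS) - 1:
        return row[-1]
    t = (a - A_LEVELS[i]) / (A_LEVELS[i + 1] - A_LEVELS[i])
    return row[i] + t * (row[i + 1] - row[i])

# R shrinks as the syllable moves away from the core, at every tabulated A level.
decreasing = all(
    R_BY_DISTANCE[d][k] > R_BY_DISTANCE[d + 1][k]
    for d in range(3)
    for k in range(len(A_LEVELS))
)
```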
Modeling prosody pattern
• Acoustic analysis (R) of the non-core syllables

    Mean F0                  A
    Dis.      0.2    0.4    0.6    0.8    1.0
    0         1.09   1.11   1.14   1.16   1.18
    1         1.07   1.08   1.08   1.09   1.09
    2         1.04   1.05   1.05   1.06   1.06
    3         1.02   1.02   1.03   1.03   1.04

    F0 Range                 A
    Dis.      0.2    0.4    0.6    0.8    1.0
    0         1.12   1.16   1.19   1.25   1.31
    1         1.02   1.06   1.09   1.05   1.07
    2         1.01   1.02   1.03   1.04   1.05
    3         0.97   0.98   1.01   1.02   1.06

    Duration                 A
    Dis.      0.2    0.4    0.6    0.8    1.0
    0         1.06   1.09   1.10   1.11   1.18
    1         1.05   1.06   1.06   1.07   1.07
    2         1.01   1.01   0.99   0.99   0.99
    3         0.93   0.92   0.90   0.89   0.88
Modeling prosody pattern
• Double-layer perturbation model
  – Core syllable
  – Non-core syllable
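The model equations on this slide did not survive, so the following is only a rough sketch of how a two-layer perturbation could be applied, not the authors' formulation. It assumes that R is a multiplicative ratio applied to the neutral feature value, that each prosodic word has one core syllable, and that distance is counted in syllables from the core. The ratios are the Mean F0 values from the analysis tables; `perturb_word` is a hypothetical name.

```python
A_LEVELS = [0.2, 0.4, 0.6, 0.8, 1.0]

# Mean F0 ratios by distance to the core syllable, copied from the analysis tables.
# Row 0 is the core-syllable layer; rows 1-3 are the non-core layer.
MEAN_F0_R = {
    0: [1.09, 1.11, 1.14, 1.16, 1.18],
    1: [1.07, 1.08, 1.08, 1.09, 1.09],
    2: [1.04, 1.05, 1.05, 1.06, 1.06],
    3: [1.02, 1.02, 1.03, 1.03, 1.04],
}

def perturb_word(neutral_values, core_index, a, table):
    """Scale each syllable's neutral feature by the ratio for its distance to the core.

    Assumptions: multiplicative perturbation, a restricted to the annotated levels,
    and distances beyond the last tabulated row reuse that row.
    """
    if a not in A_LEVELS:
        raise ValueError("a must be one of the annotated levels")
    if not 0 <= core_index < len(neutral_values):
        raise IndexError("core_index outside the prosodic word")
    col = A_LEVELS.index(a)
    max_distance = max(table)
    expressive = []
    for i, value in enumerate(neutral_values):
        distance = min(abs(i - core_index), max_distance)
        expressive.append(value * table[distance][col])
    return expressive

# Three-syllable prosodic word, flat 200 Hz neutral Mean F0, core in the middle, A = 0.6.
expressive_f0 = perturb_word([200.0, 200.0, 200.0], core_index=1, a=0.6, table=MEAN_F0_R)
```

The same function could be reused with the F0 Range, Duration or RMS Energy tables by passing a different `table`.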
Application in personalized speech conversion
• Acoustic analysis of different speakers
  – Core syllables
Application in personalized speech conversion
• Acoustic analysis of different speakers
  – Non-core syllables
Application in personalized speech conversion
• To generate speech with the target speaker's prosody characteristics
  – Step 1: convert neutral speech from speaker s to speaker t
  – Step 2: generate expressive speech for speaker t
• Acoustic features of the target expressive speech
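The two steps above can be illustrated as a composition of two scalings on a single acoustic feature. The slide's own formula for the target features did not survive, so both steps here are assumed forms: step 1 rescales by the ratio of the speakers' neutral averages, and step 2 multiplies by the target speaker's perturbation ratio. All names and numbers are hypothetical.

```python
def convert_neutral(value_s, neutral_mean_s, neutral_mean_t):
    """Step 1 (assumed form): map speaker s's neutral feature into speaker t's range."""
    if neutral_mean_s <= 0:
        raise ValueError("neutral_mean_s must be positive")
    return value_s * (neutral_mean_t / neutral_mean_s)

def add_expressivity(value_t_neutral, ratio_t):
    """Step 2 (assumed form): apply speaker t's expressive-to-neutral ratio."""
    return value_t_neutral * ratio_t

def two_step_conversion(value_s, neutral_mean_s, neutral_mean_t, ratio_t):
    """Chain step 1 and step 2 for one feature of one syllable."""
    return add_expressivity(
        convert_neutral(value_s, neutral_mean_s, neutral_mean_t), ratio_t
    )

# Hypothetical numbers: a 180 Hz syllable from a speaker averaging 200 Hz is converted
# to a speaker averaging 120 Hz, then raised by a target ratio of 1.14.
target_f0 = two_step_conversion(180.0, 200.0, 120.0, 1.14)
```

Keeping the two steps as separate functions mirrors the slide's split between speaker conversion and expressivity generation, so either step can be replaced independently.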
Experiment 1
• 10 text phrases were randomly selected from our corpus
• 3 files were designed for each phrase
  – a) The neutral speech recording of a speaker
  – b) The expressive speech recording of the same speaker
  – c) The transformed speech from a)
• 15 native Mandarin speakers were invited as subjects
  – To listen to files played in the order a)-b)-c)-a)-b)-c)
  – To judge whether c) sounds similar to its counterpart b)
  – To give a MOS score from 1 to 5 indicating the level of similarity between c) and b)
Experiment 1
• The average score is 3.94
• About 90% of the c) files were judged more similar to b) than to a)
Experiment 2
• Another 10 phrases were selected from our corpus
• 4 files were designed for each phrase
  – d) the expressive speech recording of speaker 1
  – e) the expressive speech recording of speaker 2
  – f) the transformed speech from NEU using speaker 1's model
  – g) the transformed speech from NEU using speaker 2's model
  – NEU) the neutral speech recording of speaker 3
• 15 native Mandarin speakers were invited as subjects
  – To listen to the files in the order d)-e)-x), where x) might be f) or g)
  – To judge whether x) is imitating d) or e)
Experiment 2
• Results

    Speaker i     Accuracy
    Speaker 1     74.6%
    Speaker 2     72.7%

  – The proposed model can reflect the personalized features of the prosody patterns of different speakers.
  – The proposed method for personalized speech conversion is able to achieve good performance.
Conclusions
• Proposed a double-layer perturbation model for modeling the prosody patterns of expressive speech
  – Identify the core syllable and non-core syllables
  – Use the mean F0, F0 range, duration and RMS energy
• Applied the above model in personalized speech conversion
  – Propose a two-step method for generating personalized prosody patterns