
Modeling Prosody Pattern of Chinese Expressive Speech


  1. Modeling Prosody Pattern of Chinese Expressive Speech
     Application in Personalized Speech Conversion
     Zhang Zhang, Wu Zhiyong, Jia Jia, Cai Lianhong

  2. Outline
     • Background
     • Introduction
     • Modeling prosody pattern of expressive speech
     • Application in personalized speech conversion
     • Experiments
     • Conclusion

  3. Background
     • Neutral and expressive speech
       – Neutral vs. expressive
     • Personality of different speakers uttering the same expressive speech
       – Speaker 1 vs. Speaker 2
     • Related research
       – Meng, F. et al., Synthesizing Expressive Speech to Convey Focus
       – Li, K. et al., Automatic Lexical and Pitch Accent Detection
       – Yang, H. et al., Modeling the Acoustic Correlates of Expressive Elements

  4. Introduction
     • To model prosody pattern of expressive speech
       – Focus on pitch, intensity, duration of the speech
       – Identify the core and non-core syllables of a prosodic word
       – Propose a double-layer perturbation model
     • To apply the model in personalized speech conversion
       – Propose a two-step method to convert the speech

  5. Modeling prosody pattern
     • Corpus: text prompts
       – Text prompts are extracted from the Hong Kong Tourism Board
       – Each text prompt introduces the attractive features of a scenic spot
       – 25 utterances in total
         • 120 phrases, 416 prosodic words and 1231 syllables
     • Example prompt
       <Name of tourist spot> 太平山顶 (English: Victoria Peak)
       <Descriptive text> 太平山顶是香港最受欢迎的名胜景点之一,登临其间,可俯瞰山下鳞次栉比的摩天高楼和享誉全球的维多利亚港景色。
       (Victoria Peak is one of the most popular scenic spots in Hong Kong. When you climb up, you can overlook the row upon row of skyscrapers and the world-famous Victoria Harbor.)

  6. Modeling prosody pattern
     • Corpus: expressivity annotation
       – Adopt the PAD model (Mehrabian 1995) to describe the expressivity
       – Use the A (arousal-nonarousal) descriptor to measure the expressive degree (e.g. superlative, comparative, etc.)
       – A = 0.2, 0.4, 0.6, 0.8, 1.0
       – 272 prosodic words with A > 0
     • Corpus: contrastive speech recordings
       – Four native Mandarin speakers (two males and two females)
       – Record the text prompts twice: neutral and expressive
       – 50 files of speech recordings for each speaker
         • Saved in wav format (16 bit mono, sampled at 16 kHz)

  7. Modeling prosody pattern
     • Acoustic features
       – Mean F0
       – F0 Range
       – Duration
       – RMS Energy
     • Acoustic measurements
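
The four features listed on this slide can be computed per syllable from a frame-level F0 track and the waveform samples. The sketch below is a minimal illustration, not the authors' measurement procedure: the function name `syllable_features`, the convention that unvoiced frames carry F0 = 0, and the definition of F0 range as maximum minus minimum over voiced frames are all assumptions.

```python
import math


def syllable_features(f0_hz, samples, sample_rate):
    """Return mean F0, F0 range, duration and RMS energy for one syllable.

    f0_hz       -- frame-level F0 values in Hz; 0 marks an unvoiced frame (assumed convention)
    samples     -- waveform samples of the syllable, as floats
    sample_rate -- sampling rate in Hz (the corpus is sampled at 16 kHz)
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced or not samples:
        raise ValueError("need at least one voiced frame and one sample")
    mean_f0 = sum(voiced) / len(voiced)
    f0_range = max(voiced) - min(voiced)      # assumed definition: max - min
    duration = len(samples) / sample_rate     # in seconds
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return {
        "mean_f0": mean_f0,
        "f0_range": f0_range,
        "duration": duration,
        "rms_energy": rms,
    }


# Toy input: 0.1 s of a constant-amplitude signal with a rising F0 contour.
features = syllable_features(
    f0_hz=[0.0, 200.0, 210.0, 220.0, 0.0],
    samples=[0.5, -0.5] * 800,
    sample_rate=16000,
)
```

In practice the F0 track would come from a pitch tracker and the syllable boundaries from a forced alignment; neither is specified in the slides.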

  8. Modeling prosody pattern
     • Classification of core and non-core syllables
       – The acoustic variations (from neutral to expressive speech) of core syllables are more significant than those of non-core syllables.
     • Neutral and expressive speech
       – Neutral vs. expressive: 太平山顶是香港最受欢迎的名胜景点之一 …… (Victoria Peak is one of the most popular scenic spots in Hong Kong ……)

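  9. Modeling prosody pattern
     • Acoustic analysis of the core syllables
       – Core syllables (272): R by feature and A

       | Feature    | A = 0.2 | A = 0.4 | A = 0.6 | A = 0.8 | A = 1.0 |
       |------------|---------|---------|---------|---------|---------|
       | Mean F0    | 1.09    | 1.11    | 1.14    | 1.16    | 1.18    |
       | F0 Range   | 1.12    | 1.16    | 1.19    | 1.25    | 1.31    |
       | Duration   | 1.06    | 1.09    | 1.10    | 1.11    | 1.13    |
       | RMS Energy | 1.20    | 1.38    | 1.54    | 1.71    | 1.94    |

       – R is positively correlated with A.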
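
The positive correlation between R and A stated on this slide can be checked directly from the tabulated values. The sketch below computes a Pearson coefficient per feature; the names `A_LEVELS`, `CORE_R` and `pearson` are illustrative, and the numbers are copied from the core-syllable table. The slides do not preserve the definition of R, so the code treats it only as a tabulated quantity.

```python
import math

A_LEVELS = [0.2, 0.4, 0.6, 0.8, 1.0]

# R values for the 272 core syllables, copied from the slide.
CORE_R = {
    "mean_f0":    [1.09, 1.11, 1.14, 1.16, 1.18],
    "f0_range":   [1.12, 1.16, 1.19, 1.25, 1.31],
    "duration":   [1.06, 1.09, 1.10, 1.11, 1.13],
    "rms_energy": [1.20, 1.38, 1.54, 1.71, 1.94],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


correlations = {name: pearson(A_LEVELS, r) for name, r in CORE_R.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

With only five A levels per feature, the coefficients describe the trend in the table rather than a statistical test.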

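  10. Modeling prosody pattern
      • Acoustic analysis (R) of the non-core syllables
        – Non-core syllables (615): R by A and distance to the core syllable

        | Distance to the core syllable | A = 0.2 | A = 0.4 | A = 0.6 | A = 0.8 | A = 1.0 |
        |-------------------------------|---------|---------|---------|---------|---------|
        | 0                             | 1.20    | 1.38    | 1.54    | 1.71    | 1.94    |
        | 1                             | 1.11    | 1.30    | 1.48    | 1.68    | 1.88    |
        | 2                             | 1.03    | 1.17    | 1.31    | 1.42    | 1.58    |
        | 3                             | 0.97    | 1.07    | 1.16    | 1.27    | 1.55    |

        – R is bigger for core syllables than non-core syllables
        – R is negatively correlated with the distance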

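  11. Modeling prosody pattern
      • Acoustic analysis (R) of the non-core syllables, by A and distance (Dis.) to the core syllable

        Mean F0

        | Dis. | A = 0.2 | A = 0.4 | A = 0.6 | A = 0.8 | A = 1.0 |
        |------|---------|---------|---------|---------|---------|
        | 0    | 1.09    | 1.11    | 1.14    | 1.16    | 1.18    |
        | 1    | 1.07    | 1.08    | 1.08    | 1.09    | 1.09    |
        | 2    | 1.04    | 1.05    | 1.05    | 1.06    | 1.06    |
        | 3    | 1.02    | 1.02    | 1.03    | 1.03    | 1.04    |

        F0 Range

        | Dis. | A = 0.2 | A = 0.4 | A = 0.6 | A = 0.8 | A = 1.0 |
        |------|---------|---------|---------|---------|---------|
        | 0    | 1.12    | 1.16    | 1.19    | 1.25    | 1.31    |
        | 1    | 1.02    | 1.06    | 1.09    | 1.05    | 1.07    |
        | 2    | 1.01    | 1.02    | 1.03    | 1.04    | 1.05    |
        | 3    | 0.97    | 0.98    | 1.01    | 1.02    | 1.06    |

        Duration

        | Dis. | A = 0.2 | A = 0.4 | A = 0.6 | A = 0.8 | A = 1.0 |
        |------|---------|---------|---------|---------|---------|
        | 0    | 1.06    | 1.09    | 1.10    | 1.11    | 1.18    |
        | 1    | 1.05    | 1.06    | 1.06    | 1.07    | 1.07    |
        | 2    | 1.01    | 1.01    | 0.99    | 0.99    | 0.99    |
        | 3    | 0.93    | 0.92    | 0.90    | 0.89    | 0.88    |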
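
One way to read the tables on this slide is as a lookup from (feature, distance, A) to a scaling factor. The sketch below applies that reading to the syllables of a single prosodic word. It rests on several assumptions that the slides do not confirm: that R is a multiplicative ratio applied to the neutral value, that distance is counted in syllables on either side of the core syllable, and that distances beyond 3 are clamped to 3. The names `R_TABLE`, `lookup_r` and `perturb_word` are hypothetical.

```python
A_LEVELS = (0.2, 0.4, 0.6, 0.8, 1.0)

# R_TABLE[feature][distance] holds the values for A = 0.2 ... 1.0,
# copied from the three tables on this slide.
R_TABLE = {
    "mean_f0": {
        0: (1.09, 1.11, 1.14, 1.16, 1.18),
        1: (1.07, 1.08, 1.08, 1.09, 1.09),
        2: (1.04, 1.05, 1.05, 1.06, 1.06),
        3: (1.02, 1.02, 1.03, 1.03, 1.04),
    },
    "f0_range": {
        0: (1.12, 1.16, 1.19, 1.25, 1.31),
        1: (1.02, 1.06, 1.09, 1.05, 1.07),
        2: (1.01, 1.02, 1.03, 1.04, 1.05),
        3: (0.97, 0.98, 1.01, 1.02, 1.06),
    },
    "duration": {
        0: (1.06, 1.09, 1.10, 1.11, 1.18),
        1: (1.05, 1.06, 1.06, 1.07, 1.07),
        2: (1.01, 1.01, 0.99, 0.99, 0.99),
        3: (0.93, 0.92, 0.90, 0.89, 0.88),
    },
}


def lookup_r(feature, distance, a):
    """Return the tabulated R for a feature, a distance (0-3) and an annotated A level."""
    column = A_LEVELS.index(a)  # raises ValueError for an A level not in the table
    return R_TABLE[feature][distance][column]


def perturb_word(neutral_values, core_index, a, feature):
    """Scale each syllable's neutral value by the R tabulated for its distance to the core.

    Assumption: R multiplies the neutral value; distances above 3 are clamped to 3.
    """
    scaled = []
    for i, value in enumerate(neutral_values):
        distance = min(abs(i - core_index), 3)
        scaled.append(value * lookup_r(feature, distance, a))
    return scaled


# Three-syllable word, core syllable in the middle, durations in seconds, A = 0.6.
expressive_durations = perturb_word([0.20, 0.25, 0.18], core_index=1, a=0.6, feature="duration")
```

A table lookup only covers the five annotated A levels; intermediate values would need interpolation or a fitted function, which the slides do not describe here.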

  12. Modeling prosody pattern
      • Double-layer perturbation model
        – Core syllable
        – Non-core syllable

  13. Application in personalized speech conversion
      • Acoustic analysis of different speakers
        – Core syllables

  14. Application in personalized speech conversion
      • Acoustic analysis of different speakers
        – Non-core syllables

  15. Application in personalized speech conversion
      • To generate speech with target speaker’s prosody characteristics
        – Step 1: to convert neutral speech from speaker s to t
        – Step 2: to generate expressive speech for speaker t
      • Acoustic features of the target expressive speech
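
The two steps on this slide can be illustrated as a composition of two scalings of a single acoustic feature value. The formulas used on the slide are not preserved in this transcript, so the sketch below is only one plausible reading: step 1 is assumed to rescale by the ratio of the two speakers' neutral means, and step 2 is assumed to multiply by the target speaker's ratio R. All function and parameter names are hypothetical.

```python
def convert_neutral(value_s, neutral_mean_s, neutral_mean_t):
    """Step 1 (assumed form): map speaker s's neutral value into speaker t's range
    by the ratio of the two speakers' neutral means."""
    return value_s * (neutral_mean_t / neutral_mean_s)


def make_expressive(neutral_value_t, r_t):
    """Step 2 (assumed form): apply target speaker t's perturbation ratio R."""
    return neutral_value_t * r_t


def two_step_conversion(value_s, neutral_mean_s, neutral_mean_t, r_t):
    """Chain step 1 and step 2 for one acoustic feature of one syllable."""
    neutral_t = convert_neutral(value_s, neutral_mean_s, neutral_mean_t)
    return make_expressive(neutral_t, r_t)


# Toy numbers: a 120 Hz syllable from a speaker whose neutral mean F0 is 110 Hz,
# converted to a target speaker with a 220 Hz neutral mean and R = 1.14.
target_f0 = two_step_conversion(120.0, neutral_mean_s=110.0, neutral_mean_t=220.0, r_t=1.14)
```

Keeping the steps separate means speaker-specific R values can be swapped without changing the neutral mapping.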

  16. Experiment 1
      • 10 text phrases were randomly selected from our corpus
      • 3 files were designed for each phrase
        – a) The neutral speech recording of a speaker
        – b) The expressive speech recording of the same speaker
        – c) The transformed speech from a)
      • 15 native Mandarin speakers were invited as subjects
        – To listen to files played in the order of a)-b)-c)-a)-b)-c)
        – To judge whether c) sounds similar to its counterpart b)
        – To give a MOS score from 1 to 5 indicating the level of similarity between c) and b)

  17. Experiment 1
      • Average score is 3.94
      • About 90% of the c) files are more similar to b) than to a)

  18. Experiment 2
      • Another 10 phrases were selected from our corpus
      • 4 files were designed for each phrase
        – d) the expressive speech recording of speaker 1
        – e) the expressive speech recording of speaker 2
        – f) the transformed speech from NEU using speaker 1’s model
        – g) the transformed speech from NEU using speaker 2’s model
        – NEU) the neutral speech recording of speaker 3
      • 15 native Mandarin speakers were invited as subjects
        – To listen to the files in the order of d)-e)-x), where x) might be f) or g)
        – To judge whether x) is imitating d) or e)

  19. Experiment 2
      • Results

        | Speaker i | Accuracy |
        |-----------|----------|
        | Speaker 1 | 74.6%    |
        | Speaker 2 | 72.7%    |

        – The proposed model can reflect the personalized features of the prosody patterns of different speakers.
        – The proposed method for personalized speech conversion is able to achieve good performance.

  20. Conclusions
      • Proposed a double-layer perturbation model for modeling the prosody patterns of expressive speech
        – Identify the core syllable and non-core syllable
        – Use the mean F0, F0 range, duration and RMS energy
      • Applied the above model in personalized speech conversion
        – Propose a two-step method for generating personalized prosody patterns
