Modeling Prosody Pattern of Chinese Expressive Speech Application - PowerPoint PPT Presentation

Modeling Prosody Pattern of Chinese Expressive Speech Application in Personalized Speech Conversion
Zhang Zhang, Wu Zhiyong, Jia Jia, Cai Lianhong

Outline
• Background
• Introduction
• Modeling prosody pattern of expressive speech
• Application in personalized speech conversion
• Experiments
• Conclusion

Background
• Neutral and expressive speech
  – Neutral, Expressive
• Personality of different speakers uttering the same expressive speech
  – Speaker 1, Speaker 2
• Related research
  – Meng, F. et al., Synthesizing Expressive Speech to Convey Focus
  – Li, K. et al., Automatic Lexical and Pitch Accent Detection
  – Yang, H. et al., Modeling the Acoustic Correlates of Expressive Elements

Introduction
• To model the prosody pattern of expressive speech
  – Focus on the pitch, intensity and duration of the speech
  – Identify the core and non-core syllables of a prosodic word
  – Propose a double-layer perturbation model
• To apply the model in personalized speech conversion
  – Propose a two-step method to convert the speech

Modeling prosody pattern
• Corpus: Text prompts
  – Text prompts are extracted from the Hong Kong Tourism Board
  – Each text prompt introduces the attractive features of a scenic spot
  – 25 utterances in total
    • 120 phrases, 416 prosodic words and 1231 syllables

<Name of tourist spot> 太平山顶 (English: Victoria Peak)
<Descriptive text> 太平山顶是香港最受欢迎的名胜景点之一，登临其间，可俯瞰山下鳞次栉比的摩天高楼和享誉全球的维多利亚港景色。
(Victoria Peak is one of the most popular scenic spots in Hong Kong. When you climb up, you can overlook row upon row of skyscrapers and the world-famous Victoria Harbor.)

Modeling prosody pattern
• Corpus: Expressivity annotation
  – Adopt the PAD model (Mehrabian 1995) to describe the expressivity
  – Use the A (arousal-nonarousal) descriptor to measure the expressive degree (e.g. superlative, comparative, etc.)
  – A = 0.2, 0.4, 0.6, 0.8, 1.0
  – 272 prosodic words with A > 0
• Corpus: Contrastive speech recordings
  – Four native Mandarin speakers (two males and two females)
  – Record the text prompts twice: neutral and expressive
  – 50 files of speech recordings for each speaker
    • Saved in wav format (16 bit mono, sampled at 16 kHz)

Modeling prosody pattern
• Acoustic features
  – Mean F0
  – F0 Range
  – Duration
  – RMS Energy
• Acoustic measurements
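The four features listed above can be measured per syllable roughly as follows. This is a minimal sketch, not the authors' measurement procedure: the function name `syllable_features`, the convention that unvoiced frames carry F0 = 0, and the toy input are all assumptions made for illustration. Only the 16 kHz sampling rate comes from the corpus description.

```python
import numpy as np

def syllable_features(f0_hz, samples, sample_rate=16000):
    """Return mean F0, F0 range, duration and RMS energy for one syllable.

    f0_hz   : per-frame F0 values in Hz; 0 marks an unvoiced frame (assumed convention)
    samples : waveform samples of the syllable, scaled to [-1, 1]
    """
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0[f0 > 0]                      # ignore unvoiced frames
    x = np.asarray(samples, dtype=float)
    return {
        "mean_f0": float(voiced.mean()) if voiced.size else 0.0,
        "f0_range": float(voiced.max() - voiced.min()) if voiced.size else 0.0,
        "duration": x.size / sample_rate,    # seconds
        "rms_energy": float(np.sqrt(np.mean(x ** 2))) if x.size else 0.0,
    }

# Toy input (hypothetical): a 0.2 s, 200 Hz sine with a made-up F0 track.
t = np.arange(3200) / 16000
feats = syllable_features([0, 190, 200, 210, 0], 0.5 * np.sin(2 * np.pi * 200 * t))
print(feats)
```

In practice the F0 track would come from a pitch tracker and the syllable boundaries from a segmentation step, neither of which is specified on the slide.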

Modeling prosody pattern
• Classification of core and non-core syllables
  – The acoustic variations (from neutral to expressive speech) of core syllables are more significant than those of non-core syllables.
• Neutral and expressive speech
  – Neutral, Expressive

太平山顶是香港最受欢迎的名胜景点之一……
(Victoria Peak is one of the most popular scenic spots in Hong Kong……)

Modeling prosody pattern
• Acoustic analysis of the core syllables
  – Core syllables (272): R for each value of A

    Feature       A=0.2   A=0.4   A=0.6   A=0.8   A=1.0
    Mean F0       1.09    1.11    1.14    1.16    1.18
    F0 Range      1.12    1.16    1.19    1.25    1.31
    Duration      1.06    1.09    1.10    1.11    1.13
    RMS Energy    1.20    1.38    1.54    1.71    1.94

  – R is positively correlated with A.
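The positive correlation between R and A can be checked directly on the core-syllable values from the slide. The sketch below computes a Pearson coefficient per feature; the use of Pearson correlation is an assumption, since the slide does not say which measure the authors used.

```python
import numpy as np

# Values copied from the core-syllable table on the slide.
A = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
R_CORE = {
    "Mean F0":    [1.09, 1.11, 1.14, 1.16, 1.18],
    "F0 Range":   [1.12, 1.16, 1.19, 1.25, 1.31],
    "Duration":   [1.06, 1.09, 1.10, 1.11, 1.13],
    "RMS Energy": [1.20, 1.38, 1.54, 1.71, 1.94],
}

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])

correlations = {name: pearson(A, r) for name, r in R_CORE.items()}
for name, rho in correlations.items():
    print(f"{name:<10} r = {rho:.3f}")
```

With only five points per feature, the coefficients describe the trend in the table rather than a statistically tested result.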

Modeling prosody pattern
• Acoustic analysis (R) of the non-core syllables
  – Non-core syllables (615): R by A and distance to the core syllable

    Distance   A=0.2   A=0.4   A=0.6   A=0.8   A=1.0
    0          1.20    1.38    1.54    1.71    1.94
    1          1.11    1.30    1.48    1.68    1.88
    2          1.03    1.17    1.31    1.42    1.58
    3          0.97    1.07    1.16    1.27    1.55

  – R is larger for core syllables than for non-core syllables
  – R is negatively correlated with the distance
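The claim that R falls as the distance to the core syllable grows can be verified column by column on the table above. A small sketch, with the helper name `decreases_with_distance` chosen for illustration:

```python
# Values copied from the slide: rows are distances to the core syllable
# (0 = the core syllable itself), columns are A = 0.2, 0.4, 0.6, 0.8, 1.0.
A_LEVELS = [0.2, 0.4, 0.6, 0.8, 1.0]
R_BY_DISTANCE = {
    0: [1.20, 1.38, 1.54, 1.71, 1.94],
    1: [1.11, 1.30, 1.48, 1.68, 1.88],
    2: [1.03, 1.17, 1.31, 1.42, 1.58],
    3: [0.97, 1.07, 1.16, 1.27, 1.55],
}

def decreases_with_distance(table):
    """For each A level, report whether R strictly decreases as distance grows."""
    distances = sorted(table)
    columns = zip(*(table[d] for d in distances))
    return [all(a > b for a, b in zip(col, col[1:])) for col in columns]

result = decreases_with_distance(R_BY_DISTANCE)
print(dict(zip(A_LEVELS, result)))
```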

Modeling prosody pattern
• Acoustic analysis (R) of the non-core syllables

  Mean F0
    Dis.   A=0.2   A=0.4   A=0.6   A=0.8   A=1.0
    0      1.09    1.11    1.14    1.16    1.18
    1      1.07    1.08    1.08    1.09    1.09
    2      1.04    1.05    1.05    1.06    1.06
    3      1.02    1.02    1.03    1.03    1.04

  F0 Range
    Dis.   A=0.2   A=0.4   A=0.6   A=0.8   A=1.0
    0      1.12    1.16    1.19    1.25    1.31
    1      1.02    1.06    1.09    1.05    1.07
    2      1.01    1.02    1.03    1.04    1.05
    3      0.97    0.98    1.01    1.02    1.06

  Duration
    Dis.   A=0.2   A=0.4   A=0.6   A=0.8   A=1.0
    0      1.06    1.09    1.10    1.11    1.18
    1      1.05    1.06    1.06    1.07    1.07
    2      1.01    1.01    0.99    0.99    0.99
    3      0.93    0.92    0.90    0.89    0.88

Modeling prosody pattern
• Double-layer perturbation model
  – Core syllable
  – Non-core syllable
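The slide text does not preserve the model's equations, so the following is only one plausible reading: the core syllable is perturbed by a factor R that depends on A, and each non-core syllable by a factor that depends on A and its distance to the core. The multiplicative form, the function name `perturb`, and the toy neutral F0 values are assumptions; the R values are the Mean F0 table from the acoustic analysis.

```python
# Mean F0 ratios R from the acoustic analysis: rows are distance to the core
# syllable (0 = core), columns follow A_LEVELS.
A_LEVELS = (0.2, 0.4, 0.6, 0.8, 1.0)
MEAN_F0_R = {
    0: (1.09, 1.11, 1.14, 1.16, 1.18),  # layer 1: core syllable
    1: (1.07, 1.08, 1.08, 1.09, 1.09),  # layer 2: non-core syllables
    2: (1.04, 1.05, 1.05, 1.06, 1.06),
    3: (1.02, 1.02, 1.03, 1.03, 1.04),
}

def perturb(neutral_values, core_index, arousal, table=MEAN_F0_R):
    """Scale each syllable's neutral feature by R(A, distance to core).

    Assumes expressive = neutral * R, which is a guess at the model form.
    """
    if arousal not in A_LEVELS:
        raise ValueError(f"A must be one of {A_LEVELS}")
    col = A_LEVELS.index(arousal)
    out = []
    for i, value in enumerate(neutral_values):
        distance = abs(i - core_index)
        if distance not in table:
            raise ValueError(f"no R measured for distance {distance}")
        out.append(value * table[distance][col])
    return out

# Hypothetical prosodic word: three syllables, the middle one is the core.
expressive = perturb([200.0, 210.0, 190.0], core_index=1, arousal=0.6)
print(expressive)
```

The same lookup could be repeated with the F0 Range, Duration and RMS Energy tables to perturb the remaining features.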

Application in personalized speech conversion
• Acoustic analysis of different speakers
  – Core syllables

Application in personalized speech conversion
• Acoustic analysis of different speakers
  – Non-core syllables

Application in personalized speech conversion
• To generate speech with the target speaker's prosody characteristics
  – Step 1: convert neutral speech from speaker s to speaker t
  – Step 2: generate expressive speech for speaker t
• Acoustic features of the target expressive speech
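The two steps can be sketched for a single acoustic feature as below. The slide's formula for the target features is not preserved, so both steps are assumptions: step 1 rescales by the ratio of the two speakers' average neutral values, and step 2 multiplies by the target speaker's ratio R. The function name `convert` and all numeric inputs are hypothetical.

```python
def convert(neutral_s, mean_neutral_s, mean_neutral_t, r_target):
    """Two-step conversion of one feature value (illustrative only).

    neutral_s      : feature of a syllable in speaker s's neutral speech
    mean_neutral_s : speaker s's average neutral value of that feature
    mean_neutral_t : speaker t's average neutral value of that feature
    r_target       : speaker t's expressive/neutral ratio R for this syllable
    """
    # Step 1 (assumed form): map speaker s's neutral value into speaker t's range.
    neutral_t = neutral_s * (mean_neutral_t / mean_neutral_s)
    # Step 2 (assumed form): apply speaker t's perturbation to get expressive speech.
    return neutral_t * r_target

# Hypothetical numbers: a 120 Hz syllable from a speaker averaging 125 Hz,
# converted to a speaker averaging 250 Hz whose R for this syllable is 1.14.
target_f0 = convert(neutral_s=120.0, mean_neutral_s=125.0,
                    mean_neutral_t=250.0, r_target=1.14)
print(target_f0)
```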

Experiment 1
• 10 text phrases were randomly selected from our corpus
• 3 files were designed for each phrase
  – a) The neutral speech recording of a speaker
  – b) The expressive speech recording of the same speaker
  – c) The transformed speech from a)
• 15 native Mandarin speakers were invited as subjects
  – To listen to the files played in the order a)-b)-c)-a)-b)-c)
  – To judge whether c) sounds similar to its counterpart b)
  – To give a MOS score from 1 to 5 indicating the level of similarity between c) and b)

Experiment 1
• The average score is 3.94
• About 90% of the c) files are more similar to b) than to a)

Experiment 2
• Another 10 phrases were selected from our corpus
• 4 files were designed for each phrase
  – d) The expressive speech recording of speaker 1
  – e) The expressive speech recording of speaker 2
  – f) The transformed speech from NEU using speaker 1's model
  – g) The transformed speech from NEU using speaker 2's model
  – NEU) The neutral speech recording of speaker 3
• 15 native Mandarin speakers were invited as subjects
  – To listen to the files in the order d)-e)-x), where x) might be f) or g)
  – To judge whether x) is imitating d) or e)

Experiment 2
• Results

    Speaker i    Accuracy
    Speaker 1    74.6%
    Speaker 2    72.7%

  – The proposed model can reflect the personalized features of the prosody patterns of different speakers.
  – The proposed method for personalized speech conversion is able to achieve good performance.

Conclusions
• Proposed a double-layer perturbation model for modeling the prosody patterns of expressive speech
  – Identify the core syllable and non-core syllables
  – Use the mean F0, F0 range, duration and RMS energy
• Applied the above model in personalized speech conversion
  – Propose a two-step method for generating personalized prosody patterns
