A template-based approach for speech synthesis intonation generation - PowerPoint PPT Presentation

A template-based approach for speech synthesis intonation generation using LSTMs Srikanth Ronanki Gustav Zhizheng Simon

Introduction: Statistical speech synthesis
SPSS: generation-based approaches
[Figure: SPSS pipeline with blocks labelled Text, Front-end, Regression, Pitch Model, MGC, BAP, LF0, Vocoder, Waveform generator and Speech.]

Why a template-based approach?
• Lack of convincing intonation makes current parametric systems sound dull and lifeless.
• Typically, these systems predict F0 frame-by-frame using regression models.
• This approach leads to overly smooth pitch contours and fails to construct an appropriate prosodic structure.
• Templates retain the dynamic range of F0 within the segment.
• We propose a classification-based approach to automatic F0 generation.

Pitch contour

Pitch contour segmentation

Hierarchical clustering Pitch contours

How to determine number of clusters?

A set of templates (clusters)

Intonation reconstruction from templates
[Figure: the utterance "Goldilocks and the three bears" split into the segments /g@Ulwd/ /lI/ /Qks/ /pau/ /and/ /D@/ /Tri/ /bE@z/, labelled with the template sequence 2 4 5 4 4 4 2 5 and shown with mean pitch and duration.]

Intonation reconstruction
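The reconstruction step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each template is stored as a short zero-mean contour, which is stretched to the syllable duration (in frames) by linear interpolation and then shifted by the syllable's mean pitch. The names `stretch`, `reconstruct` and the toy templates are hypothetical.

```python
def stretch(template, n_frames):
    """Linearly interpolate a fixed-length template to n_frames samples."""
    last = len(template) - 1
    out = []
    for i in range(n_frames):
        # Map output frame i onto a (fractional) position in the template.
        pos = i * last / (n_frames - 1) if n_frames > 1 else last / 2
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * template[lo] + frac * template[hi])
    return out


def reconstruct(template_ids, means, durations, templates):
    """Concatenate one stretched, mean-shifted template per syllable."""
    f0 = []
    for tid, mean, n_frames in zip(template_ids, means, durations):
        f0.extend(mean + v for v in stretch(templates[tid], n_frames))
    return f0


# Toy templates (assumed): 0 is flat, 1 is a rise.
templates = {0: [0.0, 0.0, 0.0], 1: [-1.0, 0.0, 1.0]}
contour = reconstruct([0, 1], [100.0, 120.0], [3, 5], templates)
print(contour)
```

The result is a piecewise contour with discontinuities at syllable boundaries, which is why some smoothing is usually applied afterwards.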

Hierarchical clustering - Recap
• Interpolate the F0 contour of each utterance and segment it into syllables.
• Apply DCT-based decomposition: c0 represents the mean over the syllable, and c = [c1, …, cN-1] represents the shape of the contour.
• Perform top-down hierarchical clustering over the patterns (c).
[Figure: flow chart with boxes labelled Training Data, force-aligned durations, segmentation (syllable), duration normalisation, mean normalisation, Pitch patterns (DCT features) and Clustering (Hierarchical).]
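The DCT decomposition in the recap can be illustrated with a small sketch. It assumes linear resampling to a fixed number of points for duration normalisation (`n_points=8` is an arbitrary choice) and an orthonormal DCT-II, under which the syllable mean equals c0 / sqrt(N). The slides do not state which DCT scaling or resampling was used, and all function names here are hypothetical.

```python
import math


def resample(contour, n_points):
    """Duration normalisation: linear resampling to n_points (n_points >= 2)."""
    if len(contour) == 1:
        return [float(contour[0])] * n_points
    last = len(contour) - 1
    out = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * contour[lo] + frac * contour[hi])
    return out


def dct2(x):
    """Orthonormal DCT-II of a sequence."""
    n = len(x)
    coeffs = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        coeffs.append(s * math.sqrt((1.0 if k == 0 else 2.0) / n))
    return coeffs


def idct2(c):
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(c)
    return [
        c[0] * math.sqrt(1.0 / n)
        + sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
              for k in range(1, n))
        for t in range(n)
    ]


def decompose(syllable_f0, n_points=8):
    """Split a syllable contour into its mean (from c0) and shape [c1..cN-1]."""
    c = dct2(resample(syllable_f0, n_points))
    return c[0] / math.sqrt(n_points), c[1:]


mean, shape = decompose([5.0, 5.1, 5.3, 5.2])
print(mean, shape)
```

Because the mean lives entirely in c0, clustering over the shape vector alone groups contours by their movement rather than by their pitch level.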

Proposed approaches
Neural network classifiers:
• A hierarchical deep neural network classifier (HC).
  ‣ The first DNN chooses between flat and non-flat templates.
  ‣ The second DNN chooses among the remaining non-flat templates.
• A simplified LSTM with a CTC output layer (CTC).
  ‣ Connectionist temporal classification is coupled with an S-LSTM to predict the sequence of templates given the sequence of phonemes.
[Figure: syllable pitch contour, and a bar chart of template counts in the data with x-axis ticks 1 to 6.]
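The decision logic of the two classifiers can be sketched as below. The network outputs are replaced by hand-written probabilities, so this only illustrates how predictions might be turned into template labels. The flat template index, the 0.5 threshold, the blank index and the use of best-path CTC decoding are all assumptions; the slides do not say which decoding was used.

```python
def hierarchical_predict(p_flat, nonflat_probs, flat_id=0, threshold=0.5):
    """Two-stage decision: flat vs non-flat, then pick a non-flat template.

    p_flat        -- first-stage probability that the syllable is flat
    nonflat_probs -- dict {template_id: probability} from the second stage
    """
    if p_flat >= threshold:
        return flat_id
    return max(nonflat_probs, key=nonflat_probs.get)


def ctc_best_path(step_probs, blank=0):
    """Best-path CTC decoding: argmax per step, merge repeats, drop blanks."""
    path = [max(range(len(p)), key=p.__getitem__) for p in step_probs]
    decoded = []
    previous = None
    for label in path:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


print(hierarchical_predict(0.2, {2: 0.3, 3: 0.7}))
steps = [
    [0.1, 0.8, 0.1],    # template 1
    [0.1, 0.8, 0.1],    # template 1 again (merged with the previous step)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # template 1 (kept, since a blank separates it)
    [0.2, 0.1, 0.7],    # template 2
]
print(ctc_best_path(steps))
```

The blank symbol is what lets CTC emit fewer labels than input steps, which suits a sequence of phonemes mapped to a shorter sequence of syllable templates.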

[Figure: system architecture. Labels, read from input to output: Text; Text analysis; Linguistic features; Phone-level linguistic features; Duration S-LSTM and Intonation S-LSTM with a CTC output layer; Phone durations and Syllable templates; F0 reconstruction; Frame-level F0 and Frame-level linguistic features; Acoustic S-LSTM; Output layer; Acoustic features (MGC, BAP, F0); smoothing; Vocoder; Waveform.]

Results: systems
• Baseline system
  ‣ MSE - A frame-wise regression baseline predicting F0 using LSTMs.
• Proposed systems
  ‣ HC - A hierarchical deep neural network classifier.
  ‣ CTC - A simplified LSTM coupled with a CTC output layer.
  ‣ Oracle - An oracle system using templates derived from the natural F0 contour but with predicted F0 mean and duration.

Objective evaluation
• Classification measures
  ‣ Accuracy - percentage of templates correctly classified.
  ‣ F1 score - the harmonic mean of precision and recall.

Model   Accuracy   F1 score
HC      61.1%      0.590
CTC     63.8%      0.593
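For reference, the two classification measures can be computed as in this sketch. The slides do not say how F1 was averaged over templates, so the macro average used here is an assumption, and the labels are made-up toy data rather than results from the evaluation.

```python
def accuracy(y_true, y_pred):
    """Fraction of templates classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """F1 (harmonic mean of precision and recall) for one template label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores (averaging scheme assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


y_true = [1, 1, 2, 2, 3]
y_pred = [1, 2, 2, 2, 3]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```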

Objective evaluation
• F0 prediction measures
  ‣ RMSE - Root mean square error.
  ‣ CORR - Pearson correlation.
[Fig: RMSE of predicted F0 for MSE, HC, CTC and Oracle; axis ticks from 40 to 47.]
[Fig: Correlation of predicted F0 for MSE, HC, CTC and Oracle; axis ticks from 0 to 0.6.]
• Oracle templates + Oracle F0 mean - 0.89 (corr.)
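Both F0 measures are standard and can be written in a few lines. The sketch below compares two equal-length F0 sequences frame by frame; whether the evaluation was restricted to voiced frames is not stated in the slides, so any such filtering is left to the caller. The values are toy data.

```python
import math


def rmse(reference, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


natural = [100.0, 110.0, 120.0]
predicted = [102.0, 108.0, 123.0]
print(rmse(natural, predicted), pearson(natural, predicted))
```

The two measures are complementary: RMSE penalises errors in pitch level, while correlation only reflects whether the predicted contour moves in the same direction as the natural one.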

Subjective evaluation
• Reference systems
  ‣ MSE - A frame-wise regression baseline predicting F0 using LSTMs.
  ‣ BOT - A bottom line using piecewise-constant F0 per syllable (the mean natural F0).
  ‣ BMK - A benchmark system using force-aligned durations and natural F0 contours.
  ‣ VOC - A top line of vocoded speech (STRAIGHT in this work).
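The BOT bottom line is simple enough to illustrate directly. This sketch replaces the natural F0 inside each syllable with that syllable's mean. The boundary format (half-open frame index pairs) and the function name are assumptions made for illustration.

```python
def piecewise_constant_f0(f0, syllable_bounds):
    """Replace F0 within each syllable by the syllable's mean natural F0.

    syllable_bounds -- list of (start, end) frame indices, end exclusive
    """
    out = list(f0)
    for start, end in syllable_bounds:
        segment = f0[start:end]
        if segment:  # skip empty syllables
            mean = sum(segment) / len(segment)
            out[start:end] = [mean] * len(segment)
    return out


natural_f0 = [100.0, 110.0, 120.0, 200.0, 220.0]
print(piecewise_constant_f0(natural_f0, [(0, 3), (3, 5)]))
```

Such a contour keeps the correct pitch level per syllable but none of the within-syllable movement that the templates are meant to restore.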

Subjective evaluation: MUSHRA
• 20 listeners
• 20 out of 32 test stimuli
[Fig: Box plot of aggregate ranks from listening test for VOC, BOT, BMK, MSE, HC, CTC and Oracle; the axis shows subjective rank from 1 to 7 (highest is best). Red lines are medians, orange squares means.]

Summary and conclusions
• A classification approach to intonation prediction with syllable F0 templates.
• The proposed approach matches the performance of the conventional approach.
• It has the potential to exceed it once the issues with the oracle template system are overcome.
• Future work:
  ‣ Better smoothing techniques and word-level templates.
  ‣ Use the prediction probabilities as features for frame-level regression approaches.

Code
• Code for templates and clustering
  ‣ https://github.com/ronanki/Hybrid_prosody_model
• Code for training neural networks
  ‣ https://github.com/CSTR-Edinburgh/merlin
