Sub-Project I Prosody, Tones and Text-To-Speech Synthesis


  1. Sub-Project I Prosody, Tones and Text-To-Speech Synthesis
     - Sin-Horng Chen (PI), Chiu-yu Tseng (Co-PI), Yih-Ru Wang (Co-PI), Yuan-Fu Liao (Co-PI), Lin-shan Lee, Hsin-min Wang

  2. Outline
     - Members
     - Theme of Sub-project I
     - Research Roadmap
     - Current Achievements
     - Research Infrastructure
     - Future Direction

  3. Members
     - Sin-Horng Chen, Professor (PI), NCTU
     - Chiu-yu Tseng, Professor & Research Fellow (Co-PI), Academia Sinica
     - Yih-Ru Wang, Associate Professor (Co-PI), NCTU
     - Yuan-Fu Liao, Assistant Professor (Co-PI), NTUT
     - Lin-shan Lee, Professor, NTU
     - Hsin-min Wang, Associate Research Fellow, Academia Sinica

  4. Theme of Sub-Project I
     - Prosody Analysis and Modeling
       - Hierarchical modeling of fluent prosody
       - Latent factor-based pitch contour model
         - Mean model: $Z_n = (Y_n + \beta_{s_n})\,\gamma_{s_n}$, with $Y_n = X_n + \beta_{t_n} + \beta_{pt_n} + \beta_{ft_n} + \beta_{i_n} + \beta_{f_n} + \beta_{p_n}$
         - Shape model: $Z_n = X_n + b_{tc_n} + b_{q_n} + b_{s_n} + b_{i_n} + b_{f_n}$
     - Tone Behavior and Modeling
       - Prosodic model-based tone recognizer
       - Tone sandhi
     - Applications in Text-to-speech Synthesis
       - High performance TTS
     - Applications in Speech/Speaker Recognition
       - Speaker recognition
       - Keyword
     - [Figure: scatter plot over Dimension 1 and Dimension 2, annotated "Fast speakers" / "Slow speakers" and "Less breaks" / "More breaks".]

  5. Research Focus
     - How to analyze and model fluent speech prosody
       - Approach 1: Hierarchical modeling of fluent speech prosody
         - Develop a hierarchical prosody framework of fluent speech
         - Construct modular acoustic models for: (1) F0 contours, (2) duration patterns, (3) intensity distribution and (4) boundary breaks
       - Approach 2: Latent factor analysis-based modeling
         - Assume there are some latent affecting factors
         - Latent factor analysis for syllable duration, pitch contour, energy and inter-syllable coarticulation
         - Explore the relation between latent factors and syntactic information
     - How to integrate these two approaches and apply them to
       - Text-to-speech synthesis
       - Speech/tone/speaker recognition

  6. Research Roadmap
     - Roadmap items, leading through Current Achievements to Future Direction:
       - COSPRO corpus/Toolkits
       - Hierarchical modeling of fluent speech prosody
       - Investigation in relation to prosody organization: F0 range and reset, naturalness and measurement, voice quality
       - RNN/VQ-based prosodic modeling
       - Latent factor analysis: duration, pitch mean, shape, inter-syllable coarticulation
       - Automatic prosodic labeling
       - Prosodic phrase analysis
       - Model-based TTS
       - Corpus-based TTS: Mandarin, Min-south, Hakka
       - High performance TTS
       - Tone modeling and recognition, MLP/RNN
       - HMM
       - Model-based tone recognizer
       - Eigen prosody analysis-based speaker recognition
       - Prosodic model-based speaker recognition
       - Language model+pause, PM
       - Prosodic cues-dependent LM

  7. Hierarchical Prosody Framework of Fluent Speech (1/4)
     - Hierarchical framework of fluent speech prosody for multi-phrase speech paragraphs
       - Hierarchical cross-phrase patterns and contributions are found in all 4 acoustic dimensions.
       - Acoustic templates are derived for each prosody level:
         - F0 template
         - Syllable duration templates and temporal allocation patterns
         - Intensity distribution patterns
         - Boundary break patterns

  8. Hierarchical Prosody Framework of Fluent Speech (2/4)
     - The Prosody Hierarchy with Prosodic Boundaries
       - Prosodic Group (PG), bounded by B5
       - Breath Group, bounded by B4
       - Prosodic Phrase (PP): initial PP, middle PP, final PP, bounded by B3
       - Prosodic Word (PW), bounded by B2
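
The hierarchy on this slide can be read as a segmentation rule: a break of index k closes every unit whose boundary index is k or lower. The following is a minimal sketch under that reading, not the project's own tooling. The function name `segment`, the `BREAK_LEVEL` table, and the use of break index 1 for a plain within-word syllable boundary are assumptions made for illustration.

```python
from typing import List, Sequence

# Assumed mapping from unit name to the break index that closes it,
# read off the hierarchy diagram (B2 = PW, B3 = PPh, B4 = breath group, B5 = PG).
BREAK_LEVEL = {"PW": 2, "PPh": 3, "BG": 4, "PG": 5}


def segment(syllables: Sequence[str], breaks: Sequence[int], unit: str) -> List[List[str]]:
    """Group syllables into prosodic units of the requested level.

    breaks[i] is the break index observed after syllables[i]. A break whose
    index is at or above the unit's level closes the current unit.
    """
    if len(syllables) != len(breaks):
        raise ValueError("need exactly one break index per syllable")
    threshold = BREAK_LEVEL[unit]
    groups: List[List[str]] = []
    current: List[str] = []
    for syllable, brk in zip(syllables, breaks):
        current.append(syllable)
        if brk >= threshold:
            groups.append(current)
            current = []
    if current:  # trailing syllables without a closing break
        groups.append(current)
    return groups


# Toy input: break index 1 (assumed) marks a syllable boundary inside a word.
syllables = ["a", "b", "c", "d", "e", "f"]
breaks = [1, 2, 1, 3, 2, 5]
prosodic_words = segment(syllables, breaks, "PW")
prosodic_phrases = segment(syllables, breaks, "PPh")
```

With this toy input the same break sequence yields four prosodic words but only two prosodic phrases, which is the nesting the diagram describes.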

  9. Hierarchical Prosody Framework of Fluent Speech (3/4)
     - F0 cadence of multi-phrase PG (Prosodic Phrase Group)
     - Syllable duration cadence of multi-phrase PG
     - Tide over Wave and Ripple
     - [Figure: cadence templates at the PW level and the PPh level, with panels for PG-initial PPh, PG-medial PPh and PG-final PPh.]

  10. Hierarchical Prosody Framework of Fluent Speech (4/4)
      - Duration re-synthesis, F054C
      - F0 re-synthesis, F054C
      - Cross-speaker synthesis: manipulate Speaker A's duration parameters with Speaker B's
      - [Figure: original and modified contours, F0 (Hz) from 0 to 350, in panels for initial, medial (middle) and final positions.]

  11. Latent Factor Analysis-based Prosody Modeling (1/3)
      - Syllable duration model
        - Multiplicative model: $Z_n = X_n\,\gamma_{t_n}\,\gamma_{y_n}\,\gamma_{j_n}\,\gamma_{l_n}\,\gamma_{s_n}$
        - Additive model: $Z_n = X_n + \gamma_{t_n} + \gamma_{y_n} + \gamma_{j_n} + \gamma_{l_n} + \gamma_{s_n}$
      - Mean: 42.3 frames → 43.9 frames
      - Variance: 180 frame² → 2.52 frame²
      - RMSE: 1.93 frames (5 ms/frame)
      - Relations between prosodic state CFs of initial/final and syllable duration models
      - [Figure: companding factor of initial and final versus companding factor of syllable; series: initial, final.]
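
To make the two duration models on this slide concrete, here is a minimal numeric sketch. It only evaluates the stated formulas; it does not estimate the companding factors (CFs), which the slides do not describe. The factor labels `t, y, j, l, s` are copied from the slide subscripts and treated as opaque keys, and all numeric values are invented for illustration.

```python
import math
from typing import Dict

# Factor labels as they appear in the slide's subscripts; their linguistic
# meaning is not defined on the slide, so they are used here as opaque keys.
FACTORS = ("t", "y", "j", "l", "s")


def multiplicative_duration(x: float, cf: Dict[str, float]) -> float:
    """Z_n = X_n * gamma_t * gamma_y * gamma_j * gamma_l * gamma_s."""
    return x * math.prod(cf[k] for k in FACTORS)


def additive_duration(x: float, cf: Dict[str, float]) -> float:
    """Z_n = X_n + gamma_t + gamma_y + gamma_j + gamma_l + gamma_s."""
    return x + sum(cf[k] for k in FACTORS)


def normalize_multiplicative(z: float, cf: Dict[str, float]) -> float:
    """Invert the multiplicative model: recover X_n from an observed Z_n."""
    return z / math.prod(cf[k] for k in FACTORS)


# Hypothetical values: a 40-frame normalized duration and made-up factors.
scale_cf = {"t": 1.1, "y": 0.9, "j": 1.0, "l": 1.05, "s": 1.0}
offset_cf = {"t": 2.0, "y": -1.0, "j": 0.0, "l": 0.5, "s": 0.0}

observed = multiplicative_duration(40.0, scale_cf)
recovered = normalize_multiplicative(observed, scale_cf)
shifted = additive_duration(40.0, offset_cf)
```

Dividing an observed duration by its factors is the normalization step: if the factors explain most of the variability, the normalized durations should vary far less than the observed ones, which is the kind of variance reduction the slide reports.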

  12. Latent Factor Analysis-based Prosody Modeling (2/3)
      - Syllable pitch contour model
        - Mean model: $Z_n = (Y_n + \beta_{s_n})\,\gamma_{s_n}$, with $Y_n = X_n + \beta_{t_n} + \beta_{pt_n} + \beta_{ft_n} + \beta_{i_n} + \beta_{f_n} + \beta_{p_n}$
        - Shape model: $Z_n = X_n + b_{tc_n} + b_{q_n} + b_{s_n} + b_{i_n} + b_{f_n}$
      - The patterns of x-3-3
      - Reconstructed pitch mean
      - [Figure: pitch period (ms) versus frame; legend: 033, 133, 233, 333, 433, 533, 020, 030.]
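
The pitch mean and shape models on this slide can be evaluated the same way. The sketch below applies the formulas forward and inverts the mean model; it is an illustration under stated assumptions rather than the authors' estimation procedure. The factor keys mirror the slide subscripts, the three-element shape vector is an arbitrary choice for the example, and every number is made up.

```python
from typing import Dict, List

MEAN_FACTORS = ("t", "pt", "ft", "i", "f", "p")   # subscripts of the beta terms
SHAPE_FACTORS = ("tc", "q", "s", "i", "f")        # subscripts of the b terms


def pitch_mean(x: float, beta: Dict[str, float], beta_s: float, gamma_s: float) -> float:
    """Mean model: Y = X + sum of beta terms, then Z = (Y + beta_s) * gamma_s."""
    y = x + sum(beta[k] for k in MEAN_FACTORS)
    return (y + beta_s) * gamma_s


def normalized_pitch_mean(z: float, beta: Dict[str, float], beta_s: float, gamma_s: float) -> float:
    """Invert the mean model to recover X from an observed Z."""
    y = z / gamma_s - beta_s
    return y - sum(beta[k] for k in MEAN_FACTORS)


def pitch_shape(x: List[float], b: Dict[str, List[float]]) -> List[float]:
    """Shape model: Z = X + b_tc + b_q + b_s + b_i + b_f, element by element."""
    z = list(x)
    for k in SHAPE_FACTORS:
        z = [zi + bi for zi, bi in zip(z, b[k])]
    return z


# Hypothetical values chosen to be exact in binary floating point.
beta = {"t": 0.5, "pt": -0.25, "ft": 0.25, "i": 0.0, "f": 0.0, "p": 0.5}
z_mean = pitch_mean(8.0, beta, beta_s=1.0, gamma_s=0.5)
x_mean = normalized_pitch_mean(z_mean, beta, beta_s=1.0, gamma_s=0.5)

b = {k: [0.25, 0.0, -0.25] for k in SHAPE_FACTORS}
z_shape = pitch_shape([1.0, 0.0, -1.0], b)
```

Note the asymmetry between the two models as written on the slide: the mean model combines additive terms with a final additive and multiplicative pair indexed by s, while the shape model is purely additive.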

  13. Latent Factor Analysis-based Prosody Modeling (3/3)
      - Inter-syllable coarticulation pitch contour model
      - The relationship of syllable pitch contours and affecting factors
      - Reconstructed pitch contour

  14. Mandarin/Taiwanese TTS
      - Block diagram of TTS system: input Min-Nan or Chinese text → Text Analyzer → base-syllable sequence and linguistic feature → RNN-based Prosody Generator (prosodic parameters) and Acoustic Inventory (waveform sequence) → PSOLA Speech Synthesizer → synthetic speech
      - TTS samples
        - Model-based TTS: female 1 to female 5
        - Corpus-based TTS: female 1 to female 5
        - Taiwanese

  15. Tone Behavior Modeling and Recognition with Inter-Syllabic Features
      - Gabor-IFAS-based pitch detection
      - Four inter-syllabic features
        - Ratio of duration of adjacent syllables
        - Averaged pitch value over a syllable
        - Maximum pitch difference within a syllable
        - Averaged slope of the pitch contour over a syllable
      - Context-dependent tone behavior modeling
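
The four features listed on this slide can be computed from per-syllable pitch contours. The sketch below is one plausible reading, not the authors' exact definitions: it assumes duration is the number of pitch frames in a syllable, that the ratio is the current syllable's duration over the previous one (1.0 for the first syllable), and that the averaged slope is the mean frame-to-frame pitch change. The names `SyllableFeatures` and `syllable_features` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass
class SyllableFeatures:
    duration_ratio: float   # duration / previous syllable's duration (assumed direction)
    mean_pitch: float       # averaged pitch value over the syllable
    max_pitch_diff: float   # maximum pitch difference within the syllable
    mean_slope: float       # averaged slope, in pitch units per frame


def syllable_features(contours: Sequence[Sequence[float]]) -> List[SyllableFeatures]:
    """Compute the four features for each syllable's frame-level pitch contour."""
    features: List[SyllableFeatures] = []
    for i, f0 in enumerate(contours):
        if len(f0) < 2:
            raise ValueError("each syllable needs at least two pitch frames")
        # Assumption: the first syllable has no predecessor, so its ratio is 1.0.
        previous_length = len(contours[i - 1]) if i > 0 else len(f0)
        features.append(
            SyllableFeatures(
                duration_ratio=len(f0) / previous_length,
                mean_pitch=mean(f0),
                max_pitch_diff=max(f0) - min(f0),
                # The mean of successive differences telescopes to this expression.
                mean_slope=(f0[-1] - f0[0]) / (len(f0) - 1),
            )
        )
    return features


# Toy contours in Hz: a rising four-frame syllable, then a falling two-frame one.
contours = [[200.0, 210.0, 220.0, 230.0], [180.0, 170.0]]
feats = syllable_features(contours)
```

A tone recognizer would consume these values alongside the context-dependent tone models; how the features are normalized or combined is not stated on the slide.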
