Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis
M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi
School of Informatics, The University of Edinburgh
m.f.s.ribeiro@sms.ed.ac.uk
12 September 2016


  1. Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis
     M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi
     School of Informatics, The University of Edinburgh
     m.f.s.ribeiro@sms.ed.ac.uk
     12 September 2016, INTERSPEECH, San Francisco, United States

  2. Introduction
     • Speech synthesis and prosody
       • Synthetic speech may sound bland and monotonous.
       • A good understanding and modelling of prosody is essential for natural speech synthesis.
     • Prosody is inherently suprasegmental
       • Suprasegmental features are mostly associated with long-term variation.
     • Current features are very shallow (positional and POS/stress-related)
       • Most systems operate at frame/state level and rely heavily on segmental features.
     Idea: learn suprasegmental representations by pre-processing higher-level features separately.

  4. Earlier work
     • Hierarchical models
       • Cascaded and parallel deep neural networks [Yin et al. (2016)], [Ribeiro et al. (2016)]
       • Systems with hierarchical recurrences [Chen et al. (1998)]
     • Continuous representations of linguistic contexts
       • Segment level [Lu et al. (2013)], [Wu et al. (2015)]
       • Word level [Watts et al. (2014)], [Wang et al. (2015)]
       • Sentence level [Watts et al. (2015)]
     Contributions
     • A top-down hierarchical model at the syllable level (cascaded)
     • An investigation of additional features at the syllable and word levels

  6. Database
     • Database
       • Expressive audiobook data, ideal for exploring higher-level prosodic phenomena
       • A Tramp Abroad, available from LibriVox, processed according to [Braunschweiler et al. (2010)] and [Braunschweiler and Buchholz (2011)]
       • Training, development, and test sets of 4500, 300, and 100 utterances respectively
     • Baseline
       • Feedforward neural network: 6 hidden layers, each with 1024 nodes
       • Output features: 60-dimensional MCCs, 25 band aperiodicities, 1 log-F0, 1 voicing decision (with dynamic features)
       • Input features: linguistic contexts at state, phone, syllable, and word levels (594 features)
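The baseline described above can be sketched as a plain feedforward mapping from linguistic features to acoustic parameters. The following is an illustrative numpy sketch, not the authors' implementation: the tanh nonlinearity and random initialisation are assumptions, and the output dimension covers only the static parameters listed on the slide (dynamic features would roughly triple it).

```python
import numpy as np

IN_DIM = 594                # linguistic contexts (state/phone/syllable/word)
OUT_DIM = 60 + 25 + 1 + 1   # MCCs + band aperiodicities + log-F0 + voicing
HIDDEN = [1024] * 6         # 6 hidden layers, 1024 nodes each

rng = np.random.default_rng(0)
dims = [IN_DIM] + HIDDEN + [OUT_DIM]
weights = [rng.standard_normal((i, o)) * 0.01 for i, o in zip(dims[:-1], dims[1:])]
biases = [np.zeros(o) for o in dims[1:]]

def forward(x):
    """Propagate a batch of frame-level linguistic features through the net."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)              # hidden layers (activation assumed)
    return h @ weights[-1] + biases[-1]     # linear output layer

frames = rng.standard_normal((10, IN_DIM))  # 10 dummy frames
acoustic = forward(frames)
print(acoustic.shape)                       # (10, 87)
```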

  7. Hierarchical approach
     • Syllable-level network: triangular feedforward neural network
     • Output features: MCCs, BAPs, and continuous log-F0 averaged over the entire syllable
     • Input features: linguistic contexts defined at syllable and word levels (suprasegmental features)
     [Figure: cascade diagram. Suprasegmental features feed the syllable-level network, which predicts syllable-level acoustic parameters; its hidden representation is combined with segmental features to predict frame-level acoustic parameters.]
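The cascade above can be sketched as two small networks: a syllable-level net whose bottleneck activations are tiled over the frames of the syllable and concatenated with segmental features for the frame-level net. This is a minimal illustration under assumed layer sizes and feature splits, not the authors' exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

SUPRA_DIM, SEG_DIM, BOTTLENECK = 120, 474, 64   # assumed split of the 594 features
SYL_OUT = 60 + 25 + 1                            # syllable-averaged MCC/BAP/log-F0
FRM_OUT = 87                                     # frame-level acoustic parameters

def layer(i, o):
    return rng.standard_normal((i, o)) * 0.01, np.zeros(o)

# Syllable-level net: suprasegmental features -> bottleneck -> syllable acoustics
W1, b1 = layer(SUPRA_DIM, 256)
Wb, bb = layer(256, BOTTLENECK)
W2, b2 = layer(BOTTLENECK, SYL_OUT)

# Frame-level net: [segmental features ; bottleneck embedding] -> frame acoustics
W3, b3 = layer(SEG_DIM + BOTTLENECK, 1024)
W4, b4 = layer(1024, FRM_OUT)

def syllable_embedding(supra):
    h = np.tanh(supra @ W1 + b1)
    return np.tanh(h @ Wb + bb)          # bottleneck = learned representation

supra = rng.standard_normal((1, SUPRA_DIM))      # one syllable's context features
seg = rng.standard_normal((30, SEG_DIM))         # 30 frames in that syllable

emb = syllable_embedding(supra)
emb_tiled = np.repeat(emb, seg.shape[0], axis=0) # tile embedding over the frames
frames_in = np.concatenate([seg, emb_tiled], axis=1)
frame_acoustic = np.tanh(frames_in @ W3 + b3) @ W4 + b4
print(frame_acoustic.shape)                      # (30, 87)
```

The top-down ordering (syllable net first, frame net second) is what makes the cascade hierarchical: the frame-level net sees a learned syllable-level summary rather than raw suprasegmental features.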

  8. Embedding size
     • Effect of the bottleneck layer size on objective measures
     • Does it matter if segmental and suprasegmental features are unbalanced?
     [Figure: mel cepstral distortion, LF0-RMSE, and LF0 correlation for the baseline and bottleneck sizes d32 to d512]

  9. Additional features
     • Hypotheses
       • Hierarchical approaches will be able to leverage additional suprasegmental features.
       • The frame-level network will depend mostly on segmental features and ignore the new high-dimensional features.
     • Additional features
       • Syllable bag-of-phones
       • Text-based word embeddings (skip-gram model [Mikolov et al. (2013)])

  10. Syllable bag-of-phones
     • Syllables have a variable number of phones
     • A bag-of-phones allows us to represent them with fixed-size units
     • Syllable structure: onset, nucleus, coda
     • For each syllable component, define an n-hot encoding
     • Includes identity and articulatory features for phones
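The per-component n-hot encoding above can be sketched in a few lines. The phone inventory here is a toy example, and the articulatory features mentioned on the slide are omitted; only the identity-based encoding is shown.

```python
PHONES = ["p", "t", "k", "s", "r", "l", "ae", "ih", "iy"]   # toy inventory

def n_hot(phones):
    """n-hot vector over the inventory for one syllable component."""
    return [1 if p in phones else 0 for p in PHONES]

def bag_of_phones(onset, nucleus, coda):
    """Fixed-size syllable representation: one n-hot block per component."""
    return n_hot(onset) + n_hot(nucleus) + n_hot(coda)

# A toy syllable with onset /t r/, nucleus /ih/, coda /k s/:
vec = bag_of_phones(onset=["t", "r"], nucleus=["ih"], coda=["k", "s"])
print(len(vec))    # 27: three components x nine phones
```

Because each component gets its own fixed-length block, syllables with any number of phones map to vectors of the same dimensionality, which is what lets them be fed to a feedforward network.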

  11. Syllable bag-of-phones
     • Systems trained
       • frame: frame-level DNN
       • frame-BoP: frame-level DNN with syllable bag-of-phones
       • syl: cascaded DNN
       • syl-BoP: cascaded DNN with syllable bag-of-phones
     [Figure: mel-cepstral distortion, BAP distortion, LF0-RMSE, and LF0 correlation for the four systems]

  12. Word embeddings
     • Text-based word embeddings learned with the skip-gram model [Mikolov et al. (2013)]
     • English Wikipedia data (500 million words)
     • Embedding sizes: 100 and 300 dimensions

     System         MCD     BAP     F0-RMSE   F0-CORR
     frame          4.596   2.197   28.054    .449
     frame-w100     4.598   2.204   28.048    .448
     syl-BoP        4.557   2.176   27.095    .477
     syl-BoP-w100   4.550   2.177   27.086    .463
     syl-BoP-w300   4.565   2.178   26.850    .479

     • No real improvements, although previous work has suggested these embeddings are useful for text-to-speech [Wang et al. (2015)].
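For context, the skip-gram model used for these embeddings is trained on (centre, context) word pairs drawn from a sliding window. A minimal sketch of the pair extraction step is below; the actual embeddings were trained on 500 million words of Wikipedia text with a full Word2Vec implementation, which is not reproduced here.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (centre, context) training pairs within a symmetric window."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

sentence = "the tramp walked abroad".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:2])   # [('the', 'tramp'), ('tramp', 'the')]
```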

  13. Subjective evaluation
     • Preference test
       • 50 test utterances
       • 16 native listeners
       • 400 judgements per condition
     • Systems evaluated
       • Baseline: basic feedforward DNN with all available features
       • Syl: top-down hierarchical system with syllable bag-of-phones
       • Syl-w300: adds 300-dimensional word embeddings to syl

     ID    syl      syl-w300
     1     48.15%   43.18%
     2     60.87%   51.79%
     3     59.26%   59.09%
     4     56.52%   48.21%
     5     53.70%   54.55%
     6     63.04%   46.43%
     7     38.89%   56.00%
     8     56.52%   53.57%
     9     44.44%   50.00%
     10    65.38%   51.72%
     11    42.59%   43.18%
     12    52.17%   48.21%
     13    55.56%   63.64%
     14    65.22%   42.86%
     15    46.30%   47.73%
     16    45.65%   42.86%
     all   53.39%   50.19%

  14. What are listeners responding to?
     Listeners judge the samples primarily on f0 variation, which suggests the current methodology mostly affects f0.

  16. Speech samples
     http://homepages.inf.ed.ac.uk/s1250520/samples/interspeech16.html

  17. Summary
     Contributions
     1. A top-down hierarchical model at the syllable level (cascaded)
     2. An investigation of its usefulness with additional features at the syllable and word levels
     Main findings
     1. The hierarchical approach performs best when segmental and suprasegmental features are balanced.
     2. The syllable bag-of-phones gives minor improvements in objective scores.
     3. Text-based word embeddings have little effect.
     4. No significant results in the subjective evaluation, but clear differences in the predicted f0 contours.

  19. Future work
     • Most improvements derive from the hierarchical framework
     • This suggests it is working mostly as a feature extractor or denoiser

     Ribeiro, M. S., Watts, O., and Yamagishi, J. (2016). Parallel and cascaded deep neural networks for text-to-speech synthesis. In Proc. SSW, Sunnyvale, 2016.

  20. Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis
     Thank you for listening
     M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi
     School of Informatics, The University of Edinburgh
     m.f.s.ribeiro@sms.ed.ac.uk
     12 September 2016, San Francisco, United States
