Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis
M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi
School of Informatics, The University of Edinburgh
m.f.s.ribeiro@sms.ed.ac.uk
12 September 2016


  1. Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis
     M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi
     School of Informatics, The University of Edinburgh
     m.f.s.ribeiro@sms.ed.ac.uk
     12 September 2016, INTERSPEECH, San Francisco, United States

  2. Introduction
     • Speech synthesis and prosody
       • Synthetic speech may sound bland and monotonous.
       • A good understanding and modelling of prosody is essential for natural speech synthesis.
     • Prosody is inherently suprasegmental
       • Suprasegmental features are mostly associated with long-term variation.
     • Current features are very shallow (positional and POS/stress-related)
       • Most systems operate at frame/state level and rely heavily on segmental features.
     Idea: learn suprasegmental representations by pre-processing higher-level features separately.

  4. Earlier work
     • Hierarchical models
       • Cascaded and parallel deep neural networks [Yin et al. (2016)], [Ribeiro et al. (2016)]
       • Systems with hierarchical recurrences [Chen et al. (1998)]
     • Continuous representations of linguistic contexts
       • Segment level [Lu et al. (2013)], [Wu et al. (2015)]
       • Word level [Watts et al. (2014)], [Wang et al. (2015)]
       • Sentence level [Watts et al. (2015)]
     Contributions
     • A top-down hierarchical model at the syllable level (cascaded)
     • An investigation of additional features at the syllable and word levels

  6. Database
     • Database
       • Expressive audiobook data, ideal for exploring higher-level prosodic phenomena
       • A Tramp Abroad, available from LibriVox, processed according to [Braunschweiler et al. (2010)] and [Braunschweiler and Buchholz (2011)]
       • Training, development, and test sets of 4500, 300, and 100 utterances respectively
     • Baseline
       • Feedforward neural network: 6 hidden layers, each with 1024 nodes
       • Output features: 60-dimensional MCCs, 25 band aperiodicities, 1 log-F0, 1 voicing decision (with dynamic features)
       • Input features: linguistic contexts at state, phone, syllable, and word levels (594 features)
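The baseline described above can be sketched as a plain feedforward mapping from linguistic features to acoustic parameters. The following is an illustrative numpy sketch, not the authors' implementation: the tanh nonlinearity and random initialisation are assumptions, and the output dimension covers only the static parameters listed on the slide (dynamic features would roughly triple it).

```python
import numpy as np

IN_DIM = 594                # linguistic contexts (state/phone/syllable/word)
OUT_DIM = 60 + 25 + 1 + 1   # MCCs + band aperiodicities + log-F0 + voicing
HIDDEN = [1024] * 6         # 6 hidden layers, 1024 nodes each

rng = np.random.default_rng(0)
dims = [IN_DIM] + HIDDEN + [OUT_DIM]
weights = [rng.standard_normal((i, o)) * 0.01 for i, o in zip(dims[:-1], dims[1:])]
biases = [np.zeros(o) for o in dims[1:]]

def forward(x):
    """Propagate a batch of frame-level linguistic features through the net."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)              # hidden layers (activation assumed)
    return h @ weights[-1] + biases[-1]     # linear output layer

frames = rng.standard_normal((10, IN_DIM))  # 10 dummy frames
acoustic = forward(frames)
print(acoustic.shape)                       # (10, 87)
```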

  7. Hierarchical approach
     • Syllable-level network: triangular feedforward neural network
     • Output features: MCCs, BAPs, and continuous log-F0 averaged over the entire syllable
     • Input features: linguistic contexts defined at syllable and word levels (suprasegmental features)
     [Figure: cascade diagram. Suprasegmental features feed the syllable-level network, which predicts syllable-level acoustic parameters; its hidden representation is combined with segmental features to predict frame-level acoustic parameters.]
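The cascade above can be sketched as two small networks: a syllable-level net whose bottleneck activations are tiled over the frames of the syllable and concatenated with segmental features for the frame-level net. This is a minimal illustration under assumed layer sizes and feature splits, not the authors' exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

SUPRA_DIM, SEG_DIM, BOTTLENECK = 120, 474, 64   # assumed split of the 594 features
SYL_OUT = 60 + 25 + 1                            # syllable-averaged MCC/BAP/log-F0
FRM_OUT = 87                                     # frame-level acoustic parameters

def layer(i, o):
    return rng.standard_normal((i, o)) * 0.01, np.zeros(o)

# Syllable-level net: suprasegmental features -> bottleneck -> syllable acoustics
W1, b1 = layer(SUPRA_DIM, 256)
Wb, bb = layer(256, BOTTLENECK)
W2, b2 = layer(BOTTLENECK, SYL_OUT)

# Frame-level net: [segmental features ; bottleneck embedding] -> frame acoustics
W3, b3 = layer(SEG_DIM + BOTTLENECK, 1024)
W4, b4 = layer(1024, FRM_OUT)

def syllable_embedding(supra):
    h = np.tanh(supra @ W1 + b1)
    return np.tanh(h @ Wb + bb)          # bottleneck = learned representation

supra = rng.standard_normal((1, SUPRA_DIM))      # one syllable's context features
seg = rng.standard_normal((30, SEG_DIM))         # 30 frames in that syllable

emb = syllable_embedding(supra)
emb_tiled = np.repeat(emb, seg.shape[0], axis=0) # tile embedding over the frames
frames_in = np.concatenate([seg, emb_tiled], axis=1)
frame_acoustic = np.tanh(frames_in @ W3 + b3) @ W4 + b4
print(frame_acoustic.shape)                      # (30, 87)
```

The top-down ordering (syllable net first, frame net second) is what makes the cascade hierarchical: the frame-level net sees a learned syllable-level summary rather than raw suprasegmental features.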

  8. Embedding size
     • Effect of the bottleneck layer size on objective measures
     • Does it matter if segmental and suprasegmental features are unbalanced?
     [Figure: mel cepstral distortion, LF0-RMSE, and LF0 correlation for the baseline and bottleneck sizes d32 to d512]

  9. Additional features
     • Hypotheses
       • Hierarchical approaches will be able to leverage additional suprasegmental features.
       • The frame-level network will depend mostly on segmental features and ignore the new high-dimensional features.
     • Additional features
       • Syllable bag-of-phones
       • Text-based word embeddings (skip-gram model [Mikolov et al. (2013)])

  10. Syllable bag-of-phones
     • Syllables have a variable number of phones
     • A bag-of-phones allows us to represent them with fixed-size units
     • Syllable structure: onset, nucleus, coda
     • For each syllable component, define an n-hot encoding
     • Includes identity and articulatory features for phones
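The per-component n-hot encoding above can be sketched in a few lines. The phone inventory here is a toy example, and the articulatory features mentioned on the slide are omitted; only the identity-based encoding is shown.

```python
PHONES = ["p", "t", "k", "s", "r", "l", "ae", "ih", "iy"]   # toy inventory

def n_hot(phones):
    """n-hot vector over the inventory for one syllable component."""
    return [1 if p in phones else 0 for p in PHONES]

def bag_of_phones(onset, nucleus, coda):
    """Fixed-size syllable representation: one n-hot block per component."""
    return n_hot(onset) + n_hot(nucleus) + n_hot(coda)

# A toy syllable with onset /t r/, nucleus /ih/, coda /k s/:
vec = bag_of_phones(onset=["t", "r"], nucleus=["ih"], coda=["k", "s"])
print(len(vec))    # 27: three components x nine phones
```

Because each component gets its own fixed-length block, syllables with any number of phones map to vectors of the same dimensionality, which is what lets them be fed to a feedforward network.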

  11. Syllable bag-of-phones
     • Systems trained
       • frame: frame-level DNN
       • frame-BoP: frame-level DNN with syllable bag-of-phones
       • syl: cascaded DNN
       • syl-BoP: cascaded DNN with syllable bag-of-phones
     [Figure: mel-cepstral distortion, BAP distortion, LF0-RMSE, and LF0 correlation for the four systems]

  12. Word embeddings
     • Text-based word embeddings learned with the skip-gram model [Mikolov et al. (2013)]
     • English Wikipedia data (500 million words)
     • Embedding sizes: 100 and 300 dimensions

     System         MCD     BAP     F0-RMSE   F0-CORR
     frame          4.596   2.197   28.054    .449
     frame-w100     4.598   2.204   28.048    .448
     syl-BoP        4.557   2.176   27.095    .477
     syl-BoP-w100   4.550   2.177   27.086    .463
     syl-BoP-w300   4.565   2.178   26.850    .479

     • No real improvements, although previous work has suggested these embeddings are useful for text-to-speech [Wang et al. (2015)].
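For context, the skip-gram model used for these embeddings is trained on (centre, context) word pairs drawn from a sliding window. A minimal sketch of the pair extraction step is below; the actual embeddings were trained on 500 million words of Wikipedia text with a full Word2Vec implementation, which is not reproduced here.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (centre, context) training pairs within a symmetric window."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

sentence = "the tramp walked abroad".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:2])   # [('the', 'tramp'), ('tramp', 'the')]
```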

  13. Subjective evaluation
     • Preference test
       • 50 test utterances
       • 16 native listeners
       • 400 judgements per condition
     • Systems evaluated
       • Baseline: basic feedforward DNN with all available features
       • Syl: top-down hierarchical system with syllable bag-of-phones
       • Syl-w300: adds 300-dimensional word embeddings to syl

     ID    syl      syl-w300
     1     48.15%   43.18%
     2     60.87%   51.79%
     3     59.26%   59.09%
     4     56.52%   48.21%
     5     53.70%   54.55%
     6     63.04%   46.43%
     7     38.89%   56.00%
     8     56.52%   53.57%
     9     44.44%   50.00%
     10    65.38%   51.72%
     11    42.59%   43.18%
     12    52.17%   48.21%
     13    55.56%   63.64%
     14    65.22%   42.86%
     15    46.30%   47.73%
     16    45.65%   42.86%
     all   53.39%   50.19%

  14. What are listeners responding to?
     Listeners judge the samples primarily on f0 variation, which suggests the current methodology mostly affects f0.

  16. Speech samples
     http://homepages.inf.ed.ac.uk/s1250520/samples/interspeech16.html

  17. Summary
     Contributions
     1. A top-down hierarchical model at the syllable level (cascaded)
     2. An investigation of its usefulness with additional features at the syllable and word levels
     Main findings
     1. The hierarchical approach performs best when segmental and suprasegmental features are balanced.
     2. The syllable bag-of-phones gives minor improvements in objective scores.
     3. Text-based word embeddings have little effect.
     4. No significant results in the subjective evaluation, but clear differences in the predicted f0 contours.

  19. Future work
     • Most improvements derive from the hierarchical framework
     • This suggests it is working mostly as a feature extractor or denoiser

     Ribeiro, M. S., Watts, O., and Yamagishi, J. (2016). Parallel and cascaded deep neural networks for text-to-speech synthesis. In Proc. SSW, Sunnyvale, 2016.

  20. Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis
     Thank you for listening
     M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi
     School of Informatics, The University of Edinburgh
     m.f.s.ribeiro@sms.ed.ac.uk
     12 September 2016, San Francisco, United States
