Parallel and cascaded deep neural networks for text-to-speech synthesis

Parallel and cascaded deep neural networks for text-to-speech synthesis M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi School Of Informatics The University of Edinburgh m.f.s.ribeiro@sms.ed.ac.uk 14 September 2016 Speech Synthesis Workshop 9


  1. Parallel and cascaded deep neural networks for text-to-speech synthesis
     M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi
     School of Informatics, The University of Edinburgh
     m.f.s.ribeiro@sms.ed.ac.uk
     14 September 2016
     Speech Synthesis Workshop 9 - Sunnyvale, United States

  2. Introduction
     • Speech synthesis and prosody
        • Synthetic speech may sound bland and monotonous.
        • A good understanding and modelling of prosody is essential for natural speech synthesis.
     • Prosody is inherently suprasegmental
        • Suprasegmental features are mostly associated with long-term variation.
        • Current features are very shallow (positional and POS/stress-related).
        • Most systems operate at frame/state level and rely heavily on segmental features.
     Ideally we would have a framework that has good representations of contexts, but also the ability to exploit them.

  4. Earlier work
     • Hierarchical models
        • Cascaded and parallel deep neural networks
        • Superpositional model of f0 [Yin et al (2016)]
        • Systems with hierarchical recurrences [Chen et al (1998)]
     • Continuous representations of linguistic contexts
        • Segmental-level [Lu et al (2013)] [Wu et al (2015)]
        • Word-level [Watts et al (2014)] [Wang et al (2015)]
        • Sentence-level [Watts et al (2015)]
     Recent work
     Ribeiro et al (2016). Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis. Proceedings of Interspeech 2016.

  6. Ribeiro et al (2016)
     Contributions
        1. A top-down hierarchical model at syllable level (cascaded)
        2. An investigation of its usefulness with additional features at syllable and word level
     Main Findings
        1. The hierarchical approach performs best when segmental and suprasegmental features are balanced.
        2. Syllable bag-of-phones gives minor improvements on objective scores.
        3. Text-based word embeddings have little effect.
        4. No significant results in terms of subjective evaluation, but clear differences in terms of predicted f0 contours.

  8. Ribeiro et al (2016)
     • Most improvements derive from the hierarchical framework
     • This suggests it is working mostly as a feature extractor or denoiser
     Parallel and cascaded deep neural networks for text-to-speech synthesis
     Ribeiro, M. S., Watts, O. & Yamagishi, J. (2016). Parallel and cascaded deep neural networks for text-to-speech synthesis. In Proc. of SSW, Sunnyvale, 2016.

  9. Baseline Network
     • Feedforward deep neural network
     • 6 hidden layers, each with 1024 nodes
     • Output features
        • 60-dimensional MCCs, 25 band aperiodicities, 1 log-f0, 1 voicing decision (plus dynamic features)
     [Diagram not preserved. Labels: input features; frame-level acoustic parameters]
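As a rough check on the size of the baseline's output layer, the sketch below tallies the output streams listed on this slide. It assumes that "dynamic features" means delta and delta-delta coefficients appended to each continuous stream (MCC, band aperiodicity, log-f0) and that the binary voicing decision carries no dynamics. The slide does not state this, so the total is illustrative only. The names `STATIC_DIMS`, `output_dim`, and `layer_sizes` are hypothetical, and the input dimensionality is left as a parameter.

```python
# Static dimensionality of each continuous output stream, as listed on the slide.
STATIC_DIMS = {"mcc": 60, "bap": 25, "lf0": 1}
VUV_DIM = 1  # one voicing decision per frame


def output_dim(static_dims, vuv_dim, windows=3):
    """Total output size when every continuous stream is stored `windows` times.

    ASSUMPTION: windows=3 stands for static + delta + delta-delta, and the
    voicing decision is not given dynamic features.
    """
    return windows * sum(static_dims.values()) + vuv_dim


def layer_sizes(input_dim, out_dim, n_hidden=6, width=1024):
    """Layer widths of a feedforward net with 6 hidden layers of 1024 nodes."""
    return [input_dim] + [width] * n_hidden + [out_dim]


if __name__ == "__main__":
    out = output_dim(STATIC_DIMS, VUV_DIM)
    print("output dimensions:", out)
    # 500 is a placeholder input size, not a value from the slides.
    print("layer widths:", layer_sizes(500, out))
```

Under a different treatment of dynamic features (for example, statics only), passing `windows=1` gives the corresponding total.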

  10. Hierarchical Networks
      • Input features
         • Segmental: phone-level and below
         • Suprasegmental: syllable-level and above
      • Output features
         • Frame-level acoustic parameters averaged over the entire syllable
      • Architecture
         • 6-hidden-layer triangular networks
         • Top hidden layer used as bottleneck layer
      • Integration strategies
         • Cascaded strategy
         • Parallel strategy
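The syllable-level output targets described on this slide (frame-level acoustic parameters averaged over the entire syllable) can be sketched as follows. This is a minimal illustration, assuming each frame comes with the index of the syllable it belongs to; the function name `syllable_means` and the toy values are hypothetical and not taken from the slides.

```python
def syllable_means(frames, syllable_ids):
    """Average frame-level vectors over each syllable.

    frames:       list of equal-length lists of floats, one per frame
    syllable_ids: list of ints giving the syllable index of each frame
                  (ASSUMPTION: this alignment is available from the front end)
    Returns a dict mapping syllable index -> mean vector.
    """
    sums, counts = {}, {}
    for vec, syl in zip(frames, syllable_ids):
        if syl not in sums:
            sums[syl] = [0.0] * len(vec)
            counts[syl] = 0
        sums[syl] = [s + v for s, v in zip(sums[syl], vec)]
        counts[syl] += 1
    return {syl: [s / counts[syl] for s in total] for syl, total in sums.items()}


if __name__ == "__main__":
    # Three toy frames with two acoustic parameters each; the first two
    # frames belong to syllable 0, the last one to syllable 1.
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(syllable_means(frames, [0, 0, 1]))
```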

  11. Cascaded Network
      [Diagram not preserved. Labels: suprasegmental features; syllable-level acoustic parameters; hidden representation; segmental features; frame-level acoustic parameters]
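One plausible reading of the cascaded diagram is that the hidden (bottleneck) representation produced by the suprasegmental network is attached to the segmental features of every frame in the corresponding syllable, and the result is fed to the frame-level network. The sketch below shows only that input assembly. Per-frame concatenation is an assumption, since the slide preserves labels but not the wiring, and `cascaded_inputs` is a hypothetical name.

```python
def cascaded_inputs(segmental, bottleneck, syllable_ids):
    """Build frame-level inputs for the second network in a cascade.

    segmental:    per-frame segmental feature vectors (list of lists)
    bottleneck:   per-syllable hidden representations, indexed by syllable
    syllable_ids: syllable index of each frame
    ASSUMPTION: the two parts are joined by simple concatenation.
    """
    if len(segmental) != len(syllable_ids):
        raise ValueError("need one syllable index per frame")
    return [list(seg) + list(bottleneck[syl])
            for seg, syl in zip(segmental, syllable_ids)]


if __name__ == "__main__":
    segmental = [[1, 0], [0, 1], [1, 1]]   # toy 2-dimensional segmental features
    bottleneck = [[0.5], [0.9]]            # toy 1-dimensional syllable codes
    for row in cascaded_inputs(segmental, bottleneck, [0, 0, 1]):
        print(row)
```

Frames in the same syllable share the same suprasegmental code, so only the segmental part of the input varies within a syllable.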

  12. Parallel Network
      [Diagram not preserved. Labels: segmental features; suprasegmental features; syllable-level acoustic parameters; frame-level acoustic parameters (appears twice)]

  13. Linguistic Features
      • Segmental features
         • Constant for all systems
         • Phone and state-level features (352 dimensions)
      • Suprasegmental: full set
         • Standard set of features used for HMM-based speech synthesis
         • Derived from a common front end (Festival)
         • Syllable, word, phrase, utterance (roughly 1100 dimensions)
      • Suprasegmental: pruned set
         • Hand-selected set of features for DNN-based speech synthesis
         • Higher-level context was removed
         • Syllable, word (244 dimensions)

  14. Database
      • Expressive audiobook data
         • Ideal for exploring higher-level prosodic phenomena
      • A Tramp Abroad, available from LibriVox, processed according to
         • [Braunschweiler et al (2010)]
         • [Braunschweiler and Buchholz (2011)]
      • Training, development, and test sets consisting of 4500, 300, and 100 utterances, respectively.

  15. Systems
      • 3 network architectures
      • 2 sets of linguistic features
      • 6 systems trained
         1. Baseline - Hand-selected
         2. Cascaded - Hand-selected
         3. Parallel - Hand-selected
         4. Baseline - Standard
         5. Cascaded - Standard
         6. Parallel - Standard
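The six systems are the cross product of the three architectures and the two suprasegmental feature sets. The sketch below enumerates them in the order given on the slide and attaches the feature dimensions quoted on the Linguistic Features slide. The constant and function names are hypothetical, and the value 1100 is only the "roughly 1100" figure from that slide.

```python
from itertools import product

ARCHITECTURES = ["Baseline", "Cascaded", "Parallel"]
# Suprasegmental dimensionality per feature set; 1100 is approximate.
FEATURE_SETS = {"Hand-selected": 244, "Standard": 1100}
SEGMENTAL_DIM = 352  # constant for all systems


def list_systems():
    """Enumerate every (feature set, architecture) combination."""
    systems = []
    for (feat_name, supra_dim), arch in product(FEATURE_SETS.items(), ARCHITECTURES):
        systems.append({
            "name": f"{arch} - {feat_name}",
            "segmental_dim": SEGMENTAL_DIM,
            "suprasegmental_dim": supra_dim,
        })
    return systems


if __name__ == "__main__":
    for number, system in enumerate(list_systems(), start=1):
        print(number, system["name"], system["suprasegmental_dim"])
```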

  16. Hypotheses
      Addition of noisy suprasegmental features
         • Adding more (suprasegmental) features to a frame-level model will degrade its performance
      Hierarchical systems
         • Hierarchical systems will outperform non-hierarchical systems
         • Previous work has suggested hierarchical systems are beneficial for speech synthesis
      Parallel and cascaded networks
         • Parallel architectures will be preferred over cascaded architectures

  19. Listening tests
      • MUSHRA test
         • MUltiple Stimuli with Hidden Reference and Anchor
         • Simultaneous comparison of multiple speech samples
         • Listeners rank each system against all conditions and against a reference
      • Test setup
         • 20 native English listeners
         • Each listener rates 20 sets of stimuli
         • Total of 400 parallel comparisons
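The test-setup arithmetic and a simple way of summarising MUSHRA-style ratings can be sketched as follows. This is a minimal illustration, not the analysis used in the talk: the function names are hypothetical, the example scores are invented, and averaging per system is only one of several possible summaries.

```python
from statistics import mean


def total_comparisons(n_listeners, sets_per_listener):
    """Number of parallel comparisons collected across all listeners."""
    return n_listeners * sets_per_listener


def mean_scores(ratings):
    """Average score per system.

    ratings: list of dicts, one per (listener, stimulus set), mapping a
             system name to the score that listener gave it.
    """
    systems = ratings[0].keys()
    return {system: mean(r[system] for r in ratings) for system in systems}


def rank_systems(ratings):
    """System names ordered from highest to lowest mean score."""
    scores = mean_scores(ratings)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(total_comparisons(20, 20))
    # Invented scores on a 0-100 scale, for illustration only.
    example = [{"A": 80, "B": 60}, {"A": 70, "B": 50}]
    print(mean_scores(example), rank_systems(example))
```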

  20. Results (slides 20 to 30; the result plots were not preserved in this transcript)
      • Additional features (slides 21 and 22)
      • Hand-selected features (slides 23 to 25)
      • Standard features (slides 26 to 28)
      • Parallel networks (slides 29 and 30)

  31. Speech Samples (audio not preserved in this transcript)

  32. Summary
      Main Findings
         1. Adding high-dimensional representations of context to a frame-level network may be harmful
         2. Hierarchical systems (parallel or cascaded) can be useful if using noisy suprasegmental features
            • This suggests it may be operating as a feature extractor or denoiser
         3. Parallel networks outperform cascaded networks in all cases
            • Consistent with findings of [Yin et al (2016)], although tested under different circumstances

  33. Future work
      • Explore parallel approach with additional features
         • Syllable bag-of-phones, text-based word embeddings [Ribeiro et al (2016)]
         • Can these frameworks leverage new information?
      • Decoupling of linguistic levels with parallel approach (similar to [Yin et al (2016)])
      • Hierarchical systems with recurrent layers
      • Alternative acoustic features for suprasegmental network
