Parallel and cascaded deep neural networks for text-to-speech synthesis
M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi
School of Informatics, The University of Edinburgh
m.f.s.ribeiro@sms.ed.ac.uk
14 September 2016
Speech Synthesis Workshop 9 - Sunnyvale, United States

Introduction
• Speech synthesis and prosody
  • Synthetic speech may sound bland and monotonous
  • A good understanding and modelling of prosody is essential for natural speech synthesis
• Prosody is inherently suprasegmental
  • Suprasegmental features are mostly associated with long-term variation
  • Current features are very shallow (positional and POS/stress related)
  • Most systems operate at frame/state levels and rely heavily on segmental features
Ideally we would have a framework that has good representations of contexts, but also the ability to exploit them.

Earlier work
• Hierarchical models
  • Cascaded and parallel deep neural networks
  • Superpositional model of f0 [Yin et al (2016)]
  • Systems with hierarchical recurrences [Chen et al (1998)]
• Continuous representations of linguistic contexts
  • Segmental-level [Lu et al (2013)] [Wu et al (2015)]
  • Word-level [Watts et al (2014)] [Wang et al (2015)]
  • Sentence-level [Watts et al (2015)]
Recent work
Ribeiro et al (2016) Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis. Proceedings of Interspeech 2016.

Ribeiro et al (2016)
Contributions
1 A top-down hierarchical model at syllable-level (cascaded)
2 An investigation of its usefulness with additional features at syllable and word-level
Main Findings
1 The hierarchical approach performs best when segmental and suprasegmental features are balanced
2 Syllable bag-of-phones gives minor improvements on objective scores
3 Text-based word embeddings have little effect
4 No significant results in terms of subjective evaluation, but clear differences in terms of predicted f0 contours

Ribeiro et al (2016)
• Most improvements derive from the hierarchical framework
• This suggests it is working mostly as a feature extractor or denoiser
Parallel and cascaded deep neural networks for text-to-speech synthesis
Ribeiro, M. S., Watts, O. & Yamagishi, J. (2016) Parallel and cascaded deep neural networks for text-to-speech synthesis. In Proc. of SSW, Sunnyvale, 2016.

Baseline Network
• Feedforward deep neural network
• 6 hidden layers, each with 1024 nodes
• Output features
  • 60-dimensional MCCs, 25 band aperiodicities, 1 log-f0, 1 voicing decision (plus dynamic features)
[Diagram labels: input features, frame-level acoustic parameters]
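A minimal sketch of the baseline topology in plain Python. The layer count, layer width, and static stream sizes come from the slide. Everything else is an assumption made for illustration: that "plus dynamic features" means delta and delta-delta windows on each stream, the tanh hidden activation, the linear output layer, and the random initialisation. The function names are hypothetical.

```python
import math
import random

# Static output streams listed on the slide.
STATIC_DIMS = {"mcc": 60, "bap": 25, "lf0": 1}


def output_dim(static_dims, windows=3):
    # Assumption: static + delta + delta-delta (3 windows) for every stream,
    # plus a single voicing decision that carries no dynamic features.
    return windows * sum(static_dims.values()) + 1


def baseline_layer_sizes(input_dim, hidden_units=1024, n_hidden=6):
    # 6 hidden layers of 1024 nodes, as on the slide.
    return [input_dim] + [hidden_units] * n_hidden + [output_dim(STATIC_DIMS)]


def init_mlp(layer_sizes, seed=0):
    """Return one (weights, biases) pair per layer, with small random weights."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params


def forward(params, x):
    """Feedforward pass: tanh hidden layers (assumed), linear output layer."""
    for i, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(params) - 1:
            x = [math.tanh(v) for v in x]
    return x
```

For a quick check, `forward(init_mlp(baseline_layer_sizes(10, hidden_units=8)), [0.1] * 10)` runs the same topology at a reduced width; the full 1024-unit network is slow in pure Python and would normally be built with a numerical library.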

Hierarchical Networks
• Input features
  • Segmental: phone-level and below
  • Suprasegmental: syllable-level and above
• Output features
  • Frame-level acoustic parameters averaged over the entire syllable
• Architecture
  • 6-hidden-layer triangular networks
  • Top hidden layer used as bottleneck layer
• Integration strategies
  • Cascaded strategy
  • Parallel strategy
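The syllable-level targets described above (frame-level acoustic parameters averaged over each syllable) can be sketched as follows. The frame-to-syllable alignment format and the function name are hypothetical, and the sketch does not address how unvoiced frames are handled for log-f0.

```python
def syllable_targets(frames, syllable_ids):
    """Average frame-level vectors over each syllable.

    frames: list of equal-length lists of floats, one per frame.
    syllable_ids: list of ints giving the syllable index of each frame
                  (a hypothetical alignment format).
    Returns a dict mapping syllable index to its mean acoustic vector.
    """
    sums, counts = {}, {}
    for vec, syl in zip(frames, syllable_ids):
        if syl not in sums:
            sums[syl] = [0.0] * len(vec)
            counts[syl] = 0
        sums[syl] = [s + v for s, v in zip(sums[syl], vec)]
        counts[syl] += 1
    return {syl: [s / counts[syl] for s in total]
            for syl, total in sums.items()}
```

Every frame in a syllable then shares the same target vector when the suprasegmental network is trained.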

Cascaded Network
[Diagram labels: suprasegmental features, hidden representation, segmental features, syllable-level acoustic parameters, frame-level acoustic parameters]
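One reading of the cascaded strategy, as a forward-pass sketch: the suprasegmental network is run up to its top hidden (bottleneck) layer, and that hidden representation is appended to the segmental features as input to the frame-level network. Layer sizes, the tanh activation, the omission of biases, and all function names are assumptions for illustration. The suprasegmental output layer, used only when training against syllable-level targets, is left out.

```python
import math
import random


def dense(n_in, n_out, rng):
    """Random weight matrix for one layer (biases omitted for brevity)."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]


def layer(weights, x, activation=math.tanh):
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [activation(v) for v in out] if activation else out


def make_stack(sizes, rng):
    return [dense(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def run_hidden(stack, x):
    for weights in stack:
        x = layer(weights, x)
    return x


def cascaded_forward(supra_stack, frame_stack, frame_out, supra_feats, seg_feats):
    # Stage 1: suprasegmental features -> top hidden layer (bottleneck).
    bottleneck = run_hidden(supra_stack, supra_feats)
    # Stage 2: the bottleneck is appended to the segmental features and fed
    # to the frame-level network, which predicts frame-level parameters.
    hidden = run_hidden(frame_stack, seg_feats + bottleneck)
    return layer(frame_out, hidden, activation=None)
```

The two stages are trained one after the other in this reading, so the bottleneck is fixed by the time the frame-level network sees it.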

Parallel Network
[Diagram labels: segmental features, suprasegmental features, syllable-level acoustic parameters, frame-level acoustic parameters]
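A hedged forward-pass sketch of one possible parallel arrangement: segmental and suprasegmental features pass through separate hidden stacks, the two hidden representations are concatenated, and joint layers predict the frame-level parameters. The slide does not state where the concatenation happens or how the branches are trained, so those choices, along with the layer sizes, the tanh activation, the omission of biases, and the function names, are assumptions.

```python
import math
import random


def dense(n_in, n_out, rng):
    """Random weight matrix for one layer (biases omitted for brevity)."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]


def layer(weights, x, activation=math.tanh):
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [activation(v) for v in out] if activation else out


def make_stack(sizes, rng):
    return [dense(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def run_hidden(stack, x):
    for weights in stack:
        x = layer(weights, x)
    return x


def parallel_forward(seg_stack, supra_stack, joint_stack, frame_out,
                     seg_feats, supra_feats):
    # Each feature type is processed by its own branch.
    seg_hidden = run_hidden(seg_stack, seg_feats)
    supra_hidden = run_hidden(supra_stack, supra_feats)
    # Assumed merge point: concatenate the two hidden representations,
    # then apply joint layers and a linear frame-level output layer.
    joint = run_hidden(joint_stack, seg_hidden + supra_hidden)
    return layer(frame_out, joint, activation=None)
```

Unlike the cascaded sketch, the suprasegmental representation here never enters the input of the segmental branch; the two only meet at the merge point.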

Linguistic Features
• Segmental features
  • Constant for all systems
  • Phone and state-level features (352 dimensions)
• Suprasegmental - Full Set
  • Standard set of features used for HMM-based speech synthesis
  • Derived from a common front-end (Festival)
  • Syllable, word, phrase, utterance (roughly 1100 dimensions)
• Suprasegmental - Pruned Set
  • Hand-selected set of features for DNN-based speech synthesis
  • Higher-level context was removed
  • Syllable, word (244 dimensions)

Database
• Expressive audiobook data
  • Ideal for exploring higher-level prosodic phenomena
• A Tramp Abroad, available from Librivox, processed according to
  • [Braunschweiler et al (2010)]
  • [Braunschweiler and Buchholz (2011)]
• Training, development, and test sets consisting of 4500, 300, and 100 utterances, respectively

Systems
• 3 network architectures
• 2 sets of linguistic features
• 6 systems trained
1 Baseline - Hand-selected
2 Cascaded - Hand-selected
3 Parallel - Hand-selected
4 Baseline - Standard
5 Cascaded - Standard
6 Parallel - Standard

Hypotheses
Addition of noisy suprasegmental features
• Adding more (suprasegmental) features to a frame-level model will degrade its performance
Hierarchical systems
• Hierarchical systems will outperform non-hierarchical systems
• Previous work has suggested hierarchical systems are beneficial for speech synthesis
Parallel and cascaded networks
• Parallel architectures will be preferred over cascaded architectures

Listening tests
• MUSHRA test
  • MUltiple Stimuli with Hidden Reference and Anchor
  • Simultaneous comparison of multiple speech samples
  • Listeners rank each system against all conditions and against a reference
• Test setup
  • 20 native English listeners
  • Each rates 20 sets of stimuli
  • Total of 400 parallel comparisons

Results
[Figures not recovered. Slide titles: Results; Results - additional features; Results - hand-selected features; Results - standard features; Results - parallel networks]

Speech Samples
[Audio samples not recovered]

Summary
Main Findings
1 Adding high-dimensional representations of context to a frame-level network may be harmful
2 Hierarchical systems (parallel or cascaded) can be useful if using noisy suprasegmental features
  • This suggests the hierarchical framework may be operating as a feature extractor or denoiser
3 Parallel networks outperform cascaded networks in all cases
  • Consistent with findings of [Yin et al (2016)], although tested under different circumstances

Future work
• Explore parallel approach with additional features
  • Syllable bag-of-phones, text-based word embeddings [Ribeiro et al (2016)]
  • Can these frameworks leverage new information?
• Decoupling of linguistic levels with parallel approach (similar to [Yin et al (2016)])
• Hierarchical systems with recurrent layers
• Alternative acoustic features for suprasegmental network
