Where do the improvements come from in sequence-to-sequence neural TTS?


  1. Where do the improvements come from in sequence-to-sequence neural TTS? Oliver Watts ◆ Gustav Eje Henter ◆ Jason Fong ◆ Cassia Valentini-Botinhao

  2. 2013: 'Old paradigm' vs. 2017: 'New paradigm'. [Figures and excerpts from two papers: "Statistical Parametric Speech Synthesis Using Deep Neural Networks" (Heiga Zen, Andrew Senior, Mike Schuster, Google, 2013), in which a DNN maps text-analysis input features through hidden layers to statistics (mean and variance) of speech parameter vectors for waveform synthesis, replacing decision tree-clustered context-dependent HMMs; and "Tacotron: Towards End-to-End Speech Synthesis" (Yuxuan Wang, RJ Skerry-Ryan, et al., Google, arXiv:1703.10135), an end-to-end model trained from <text, audio> pairs that passes character embeddings through a pre-net and CBHG encoder, an attention-based seq2seq decoder with reduction factor r=3, and Griffin-Lim reconstruction of a linear-scale spectrogram, achieving a 3.82 mean opinion score on US English.]
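To make the 'old paradigm' half of that slide concrete, here is a minimal sketch, in the spirit of Zen et al. (2013), of a feed-forward DNN acoustic model: frame-level linguistic features in, acoustic parameters out. The layer sizes and feature dimensions are illustrative assumptions, not values from the slide or the papers.

```python
import numpy as np

# Illustrative sizes (assumptions for this sketch only)
LINGUISTIC_DIM = 400   # binary + numeric context features per frame
ACOUSTIC_DIM = 187     # e.g. mel-cepstra + aperiodicity + log-F0 (+ deltas)
HIDDEN = 1024

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Small random weights and zero biases for one fully connected layer."""
    return rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out)

layers = [init_layer(LINGUISTIC_DIM, HIDDEN),
          init_layer(HIDDEN, HIDDEN),
          init_layer(HIDDEN, HIDDEN),
          init_layer(HIDDEN, ACOUSTIC_DIM)]

def dnn_acoustic_model(linguistic_frames):
    """Map frame-level linguistic features to acoustic parameters.

    linguistic_frames: (T, LINGUISTIC_DIM) array, one row per frame.
    Returns a (T, ACOUSTIC_DIM) array of predicted speech parameters.
    """
    h = linguistic_frames
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:   # tanh on hidden layers, linear output layer
            h = np.tanh(h)
    return h

# Forward pass over 100 frames of (random) linguistic features
predicted = dnn_acoustic_model(rng.normal(size=(100, LINGUISTIC_DIM)))
print(predicted.shape)  # (100, 187)
```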

  3. Old paradigm New paradigm

  4. Old paradigm: Merlin (Wu et al. 2016), github.com/CSTR-Edinburgh/merlin. New paradigm: DCTTS (Tachibana et al. 2018), github.com/Kyubyong/dc_tts.

  5. [Bar chart: normalised subjective naturalness ratings (0-100) for systems M, 2, G1, G1H, G1TH, G1HA, G1THA.] Old paradigm: Merlin (Wu et al. 2016), github.com/CSTR-Edinburgh/merlin. New paradigm: DCTTS (Tachibana et al. 2018), github.com/Kyubyong/dc_tts.

  6. [The same bar chart of normalised subjective naturalness ratings, annotated with the question: Where do the improvements come from?] Old paradigm: Merlin (Wu et al. 2016), github.com/CSTR-Edinburgh/merlin. New paradigm: DCTTS (Tachibana et al. 2018), github.com/Kyubyong/dc_tts.
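The y-axis in these charts is a normalised subjective rating. The normalisation procedure is not stated on the slides; as a purely illustrative assumption, one common choice is to rescale each listener's raw scores to 0-100 before averaging per system:

```python
import numpy as np

def normalise_per_listener(ratings):
    """Rescale one listener's raw ratings to the range 0-100.

    ratings: 1-D array of that listener's scores across all stimuli.
    Per-listener min-max rescaling is an assumption for illustration,
    not necessarily the normalisation behind the slide's chart.
    """
    ratings = np.asarray(ratings, dtype=float)
    lo, hi = ratings.min(), ratings.max()
    if hi == lo:                      # guard against a constant rater
        return np.full_like(ratings, 50.0)
    return 100.0 * (ratings - lo) / (hi - lo)

# Two listeners rating the same four stimuli on different internal scales
print(normalise_per_listener([20, 35, 60, 80]))
print(normalise_per_listener([55, 60, 70, 75]))
```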

  7. Old paradigm

  8. Old paradigm: Front end

  9. Old paradigm: Front end → Duration model / forced alignment

  10. Old paradigm: Front end → Duration model / forced alignment → Acoustic model

  11. Old paradigm: Front end → Duration model / forced alignment → Acoustic model

  12. Old paradigm: Front end → Duration model / forced alignment → Acoustic model

  13. Old paradigm: Front end → Duration model / forced alignment → Acoustic model
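The old-paradigm slides build up a pipeline of separately trained modules: front end, duration model (forced alignment at training time), acoustic model, and a vocoder. A minimal sketch of how such a pipeline fits together at synthesis time is below; the function names, feature sizes, and placeholder implementations are assumptions for illustration, not Merlin's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_SHIFT_MS = 5  # assumption: 5 ms frame shift

def front_end(text):
    """Text analysis: text -> one linguistic feature vector per phone.
    Placeholder that treats each character as a 'phone' with 400 features."""
    return rng.normal(size=(len(text), 400))

def duration_model(linguistic):
    """Predict a duration (in frames) for each phone.
    At training time this alignment would come from forced alignment instead."""
    return np.full(len(linguistic), 20, dtype=int)  # assume 20 frames per phone

def upsample(linguistic, durations):
    """Repeat each phone's features for its predicted number of frames,
    giving the frame-level input the acoustic model expects."""
    return np.repeat(linguistic, durations, axis=0)

def acoustic_model(frames):
    """Frame-level regression from linguistic features to vocoder parameters
    (e.g. the DNN sketched earlier). Placeholder: a fixed random projection."""
    W = rng.normal(0.0, 0.01, (frames.shape[1], 187))
    return frames @ W

def vocoder(params):
    """Reconstruct a waveform from vocoder parameters (e.g. with WORLD).
    Placeholder: returns silence of the right length."""
    samples_per_frame = int(16000 * FRAME_SHIFT_MS / 1000)
    return np.zeros(len(params) * samples_per_frame)

# Old-paradigm synthesis: each stage is trained and run separately
ling = front_end("hello world")
dur = duration_model(ling)
acoustic = acoustic_model(upsample(ling, dur))
waveform = vocoder(acoustic)
print(ling.shape, dur.sum(), acoustic.shape, waveform.shape)
```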

  14. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm

  15. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder

  16. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder → attention

  17. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder + Audio encoder → attention

  18. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder + Audio encoder → attention → Audio decoder

  19. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder + Audio encoder → attention → Audio decoder
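Below is a minimal sketch of this new-paradigm structure: a text encoder and an audio encoder feeding an attention mechanism that drives an audio decoder. It is in the spirit of Tacotron/DCTTS but greatly simplified; the GRU layers, dot-product attention, and all sizes are illustrative assumptions rather than either system's actual design.

```python
import torch
import torch.nn as nn

class TinySeq2SeqTTS(nn.Module):
    """Toy attention-based TTS: character ids in, mel-spectrogram frames out."""

    def __init__(self, n_chars=40, d_model=128, n_mels=80):
        super().__init__()
        # Text encoder: replaces the hand-engineered front end
        self.char_emb = nn.Embedding(n_chars, d_model)
        self.text_enc = nn.GRU(d_model, d_model, batch_first=True)
        # Audio encoder: summarises previously generated frames (autoregression)
        self.audio_enc = nn.GRU(n_mels, d_model, batch_first=True)
        # Audio decoder: attention context + audio state -> next frames
        self.out = nn.Linear(2 * d_model, n_mels)

    def forward(self, chars, prev_frames):
        # chars: (B, N) integer character ids; prev_frames: (B, T, n_mels)
        text_h, _ = self.text_enc(self.char_emb(chars))        # (B, N, d)
        audio_h, _ = self.audio_enc(prev_frames)                # (B, T, d)
        # Dot-product attention: each decoder step attends over encoder states,
        # replacing the old paradigm's precomputed duration / forced alignment
        scores = torch.bmm(audio_h, text_h.transpose(1, 2))     # (B, T, N)
        align = torch.softmax(scores, dim=-1)
        context = torch.bmm(align, text_h)                      # (B, T, d)
        frames = self.out(torch.cat([context, audio_h], dim=-1))  # (B, T, n_mels)
        return frames, align

# One forward pass on dummy data
model = TinySeq2SeqTTS()
chars = torch.randint(0, 40, (1, 12))   # a 12-character "sentence"
prev = torch.zeros(1, 30, 80)           # 30 previously generated mel frames
frames, align = model(chars, prev)
print(frames.shape, align.shape)        # (1, 30, 80) and (1, 30, 12)
```

Everything is trained jointly from text and audio, which is what makes it hard to say which of the new ingredients is responsible for the quality gains the earlier charts showed.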

  20. T: Text encoder vs. front end (old and new paradigm pipelines shown side by side)

  21. T: Text encoder vs. front end. Attention vs. precomputed alignment (duration model / forced alignment)

  22. T: Text encoder vs. front end. Attention vs. precomputed alignment (duration model / forced alignment). Autoregression vs. frame independence
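One of the contrasts on these slides is autoregression (each new frame conditioned on previously generated frames) versus frame independence (every frame predicted from the aligned text features alone). A minimal sketch of the difference at generation time, with hypothetical predict_* functions standing in for trained models:

```python
import numpy as np

N_MELS, T = 80, 30
text_features = np.random.default_rng(0).normal(size=(T, 128))  # already aligned to frames

def predict_frame_independent(feat):
    """Hypothetical model: one frame from its own text features only."""
    return np.tanh(feat[:N_MELS])

def predict_autoregressive(feat, prev_frame):
    """Hypothetical model: next frame from text features AND the previous output."""
    return np.tanh(feat[:N_MELS] + 0.5 * prev_frame)

# Frame-independent decoding: every frame could be computed in parallel
independent = np.stack([predict_frame_independent(f) for f in text_features])

# Autoregressive decoding: an inherently sequential loop, each output fed back in
frames, prev = [], np.zeros(N_MELS)
for f in text_features:
    prev = predict_autoregressive(f, prev)
    frames.append(prev)
autoregressive = np.stack(frames)

print(independent.shape, autoregressive.shape)  # (30, 80) (30, 80)
```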

  23. T: Text encoder vs. front end. Attention vs. precomputed alignment (duration model / forced alignment). Autoregression vs. frame independence. Vocoder: WORLD vs. Griffin-Lim. Loss functions
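The vocoder comparison here is between WORLD, a parametric vocoder driven by spectral envelope, aperiodicity and F0, and Griffin-Lim, iterative phase reconstruction from a magnitude spectrogram as used by Tacotron and DCTTS. A minimal Griffin-Lim round trip using librosa; the input file name and STFT settings are placeholders:

```python
import librosa
import soundfile as sf

N_FFT, HOP = 1024, 256

# Placeholder input file; any mono wav will do
y, sr = librosa.load("speech.wav", sr=None)

# Magnitude spectrogram (the kind of target a seq2seq acoustic model predicts)
S = abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))

# Griffin-Lim: iteratively estimate a phase that is consistent with |S|
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=HOP, win_length=N_FFT)

sf.write("speech_griffinlim.wav", y_hat, sr)
```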

  24. T: Text encoder vs. front end. Attention vs. precomputed alignment (duration model / forced alignment). Autoregression vs. frame independence

  25. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder + Audio encoder → attention → Audio decoder

  26. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder + Audio encoder → attention → Audio decoder

  27. Front end, Duration model / forced alignment, Text encoder, Audio encoder, attention, Audio decoder

  28. Front end, Duration model / forced alignment, Text encoder, Audio encoder, attention, Audio decoder
