Where do the improvements come from in sequence-to-sequence neural TTS?


  1. Where do the improvements come from in sequence-to-sequence neural TTS? Oliver Watts ◆ Gustav Eje Henter ◆ Jason Fong ◆ Cassia Valentini-Botinhao

  2. 2013: 'Old paradigm' vs. 2017: 'New paradigm'. [Figures and excerpts from two papers: "Statistical Parametric Speech Synthesis Using Deep Neural Networks" (Heiga Zen, Andrew Senior, Mike Schuster, Google, 2013), in which a DNN maps text-analysis input features through hidden layers to statistics (mean and variance) of speech parameter vectors for waveform synthesis, replacing decision tree-clustered context-dependent HMMs; and "Tacotron: Towards End-to-End Speech Synthesis" (Yuxuan Wang, RJ Skerry-Ryan, et al., Google, arXiv:1703.10135), an end-to-end model trained from <text, audio> pairs that passes character embeddings through a pre-net and CBHG encoder, an attention-based seq2seq decoder with reduction factor r=3, and Griffin-Lim reconstruction of a linear-scale spectrogram, achieving a 3.82 mean opinion score on US English.]
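To make the 'old paradigm' half of that slide concrete, here is a minimal sketch, in the spirit of Zen et al. (2013), of a feed-forward DNN acoustic model: frame-level linguistic features in, acoustic parameters out. The layer sizes and feature dimensions are illustrative assumptions, not values from the slide or the papers.

```python
import numpy as np

# Illustrative sizes (assumptions for this sketch only)
LINGUISTIC_DIM = 400   # binary + numeric context features per frame
ACOUSTIC_DIM = 187     # e.g. mel-cepstra + aperiodicity + log-F0 (+ deltas)
HIDDEN = 1024

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Small random weights and zero biases for one fully connected layer."""
    return rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out)

layers = [init_layer(LINGUISTIC_DIM, HIDDEN),
          init_layer(HIDDEN, HIDDEN),
          init_layer(HIDDEN, HIDDEN),
          init_layer(HIDDEN, ACOUSTIC_DIM)]

def dnn_acoustic_model(linguistic_frames):
    """Map frame-level linguistic features to acoustic parameters.

    linguistic_frames: (T, LINGUISTIC_DIM) array, one row per frame.
    Returns a (T, ACOUSTIC_DIM) array of predicted speech parameters.
    """
    h = linguistic_frames
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:   # tanh on hidden layers, linear output layer
            h = np.tanh(h)
    return h

# Forward pass over 100 frames of (random) linguistic features
predicted = dnn_acoustic_model(rng.normal(size=(100, LINGUISTIC_DIM)))
print(predicted.shape)  # (100, 187)
```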

  3. Old paradigm New paradigm

  4. Old paradigm: Merlin (Wu et al. 2016), github.com/CSTR-Edinburgh/merlin. New paradigm: DCTTS (Tachibana et al. 2018), github.com/Kyubyong/dc_tts.

  5. [Bar chart: normalised subjective naturalness ratings (0-100) for systems M, 2, G1, G1H, G1TH, G1HA, G1THA.] Old paradigm: Merlin (Wu et al. 2016), github.com/CSTR-Edinburgh/merlin. New paradigm: DCTTS (Tachibana et al. 2018), github.com/Kyubyong/dc_tts.

  6. [The same bar chart of normalised subjective naturalness ratings, annotated with the question: Where do the improvements come from?] Old paradigm: Merlin (Wu et al. 2016), github.com/CSTR-Edinburgh/merlin. New paradigm: DCTTS (Tachibana et al. 2018), github.com/Kyubyong/dc_tts.
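The y-axis in these charts is a normalised subjective rating. The normalisation procedure is not stated on the slides; as a purely illustrative assumption, one common choice is to rescale each listener's raw scores to 0-100 before averaging per system:

```python
import numpy as np

def normalise_per_listener(ratings):
    """Rescale one listener's raw ratings to the range 0-100.

    ratings: 1-D array of that listener's scores across all stimuli.
    Per-listener min-max rescaling is an assumption for illustration,
    not necessarily the normalisation behind the slide's chart.
    """
    ratings = np.asarray(ratings, dtype=float)
    lo, hi = ratings.min(), ratings.max()
    if hi == lo:                      # guard against a constant rater
        return np.full_like(ratings, 50.0)
    return 100.0 * (ratings - lo) / (hi - lo)

# Two listeners rating the same four stimuli on different internal scales
print(normalise_per_listener([20, 35, 60, 80]))
print(normalise_per_listener([55, 60, 70, 75]))
```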

  7. Old paradigm

  8. Old paradigm: Front end

  9. Old paradigm: Front end → Duration model / forced alignment

  10. Old paradigm: Front end → Duration model / forced alignment → Acoustic model

  11. Old paradigm: Front end → Duration model / forced alignment → Acoustic model

  12. Old paradigm: Front end → Duration model / forced alignment → Acoustic model

  13. Old paradigm: Front end → Duration model / forced alignment → Acoustic model
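The old-paradigm slides build up a pipeline of separately trained modules: front end, duration model (forced alignment at training time), acoustic model, and a vocoder. A minimal sketch of how such a pipeline fits together at synthesis time is below; the function names, feature sizes, and placeholder implementations are assumptions for illustration, not Merlin's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_SHIFT_MS = 5  # assumption: 5 ms frame shift

def front_end(text):
    """Text analysis: text -> one linguistic feature vector per phone.
    Placeholder that treats each character as a 'phone' with 400 features."""
    return rng.normal(size=(len(text), 400))

def duration_model(linguistic):
    """Predict a duration (in frames) for each phone.
    At training time this alignment would come from forced alignment instead."""
    return np.full(len(linguistic), 20, dtype=int)  # assume 20 frames per phone

def upsample(linguistic, durations):
    """Repeat each phone's features for its predicted number of frames,
    giving the frame-level input the acoustic model expects."""
    return np.repeat(linguistic, durations, axis=0)

def acoustic_model(frames):
    """Frame-level regression from linguistic features to vocoder parameters
    (e.g. the DNN sketched earlier). Placeholder: a fixed random projection."""
    W = rng.normal(0.0, 0.01, (frames.shape[1], 187))
    return frames @ W

def vocoder(params):
    """Reconstruct a waveform from vocoder parameters (e.g. with WORLD).
    Placeholder: returns silence of the right length."""
    samples_per_frame = int(16000 * FRAME_SHIFT_MS / 1000)
    return np.zeros(len(params) * samples_per_frame)

# Old-paradigm synthesis: each stage is trained and run separately
ling = front_end("hello world")
dur = duration_model(ling)
acoustic = acoustic_model(upsample(ling, dur))
waveform = vocoder(acoustic)
print(ling.shape, dur.sum(), acoustic.shape, waveform.shape)
```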

  14. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm

  15. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder

  16. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder → attention

  17. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder + Audio encoder → attention

  18. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder + Audio encoder → attention → Audio decoder

  19. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder + Audio encoder → attention → Audio decoder
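Below is a minimal sketch of this new-paradigm structure: a text encoder and an audio encoder feeding an attention mechanism that drives an audio decoder. It is in the spirit of Tacotron/DCTTS but greatly simplified; the GRU layers, dot-product attention, and all sizes are illustrative assumptions rather than either system's actual design.

```python
import torch
import torch.nn as nn

class TinySeq2SeqTTS(nn.Module):
    """Toy attention-based TTS: character ids in, mel-spectrogram frames out."""

    def __init__(self, n_chars=40, d_model=128, n_mels=80):
        super().__init__()
        # Text encoder: replaces the hand-engineered front end
        self.char_emb = nn.Embedding(n_chars, d_model)
        self.text_enc = nn.GRU(d_model, d_model, batch_first=True)
        # Audio encoder: summarises previously generated frames (autoregression)
        self.audio_enc = nn.GRU(n_mels, d_model, batch_first=True)
        # Audio decoder: attention context + audio state -> next frames
        self.out = nn.Linear(2 * d_model, n_mels)

    def forward(self, chars, prev_frames):
        # chars: (B, N) integer character ids; prev_frames: (B, T, n_mels)
        text_h, _ = self.text_enc(self.char_emb(chars))        # (B, N, d)
        audio_h, _ = self.audio_enc(prev_frames)                # (B, T, d)
        # Dot-product attention: each decoder step attends over encoder states,
        # replacing the old paradigm's precomputed duration / forced alignment
        scores = torch.bmm(audio_h, text_h.transpose(1, 2))     # (B, T, N)
        align = torch.softmax(scores, dim=-1)
        context = torch.bmm(align, text_h)                      # (B, T, d)
        frames = self.out(torch.cat([context, audio_h], dim=-1))  # (B, T, n_mels)
        return frames, align

# One forward pass on dummy data
model = TinySeq2SeqTTS()
chars = torch.randint(0, 40, (1, 12))   # a 12-character "sentence"
prev = torch.zeros(1, 30, 80)           # 30 previously generated mel frames
frames, align = model(chars, prev)
print(frames.shape, align.shape)        # (1, 30, 80) and (1, 30, 12)
```

Everything is trained jointly from text and audio, which is what makes it hard to say which of the new ingredients is responsible for the quality gains the earlier charts showed.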

  20. T: Text encoder vs. front end (old and new paradigm pipelines shown side by side)

  21. T: Text encoder vs. front end. Attention vs. precomputed alignment (duration model / forced alignment)

  22. T: Text encoder vs. front end. Attention vs. precomputed alignment (duration model / forced alignment). Autoregression vs. frame independence
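One of the contrasts on these slides is autoregression (each new frame conditioned on previously generated frames) versus frame independence (every frame predicted from the aligned text features alone). A minimal sketch of the difference at generation time, with hypothetical predict_* functions standing in for trained models:

```python
import numpy as np

N_MELS, T = 80, 30
text_features = np.random.default_rng(0).normal(size=(T, 128))  # already aligned to frames

def predict_frame_independent(feat):
    """Hypothetical model: one frame from its own text features only."""
    return np.tanh(feat[:N_MELS])

def predict_autoregressive(feat, prev_frame):
    """Hypothetical model: next frame from text features AND the previous output."""
    return np.tanh(feat[:N_MELS] + 0.5 * prev_frame)

# Frame-independent decoding: every frame could be computed in parallel
independent = np.stack([predict_frame_independent(f) for f in text_features])

# Autoregressive decoding: an inherently sequential loop, each output fed back in
frames, prev = [], np.zeros(N_MELS)
for f in text_features:
    prev = predict_autoregressive(f, prev)
    frames.append(prev)
autoregressive = np.stack(frames)

print(independent.shape, autoregressive.shape)  # (30, 80) (30, 80)
```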

  23. T: Text encoder vs. front end. Attention vs. precomputed alignment (duration model / forced alignment). Autoregression vs. frame independence. Vocoder: WORLD vs. Griffin-Lim. Loss functions
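The vocoder comparison here is between WORLD, a parametric vocoder driven by spectral envelope, aperiodicity and F0, and Griffin-Lim, iterative phase reconstruction from a magnitude spectrogram as used by Tacotron and DCTTS. A minimal Griffin-Lim round trip using librosa; the input file name and STFT settings are placeholders:

```python
import librosa
import soundfile as sf

N_FFT, HOP = 1024, 256

# Placeholder input file; any mono wav will do
y, sr = librosa.load("speech.wav", sr=None)

# Magnitude spectrogram (the kind of target a seq2seq acoustic model predicts)
S = abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))

# Griffin-Lim: iteratively estimate a phase that is consistent with |S|
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=HOP, win_length=N_FFT)

sf.write("speech_griffinlim.wav", y_hat, sr)
```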

  24. T: Text encoder vs. front end. Attention vs. precomputed alignment (duration model / forced alignment). Autoregression vs. frame independence

  25. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder + Audio encoder → attention → Audio decoder

  26. Old paradigm: Front end → Duration model / forced alignment → Acoustic model. New paradigm: Text encoder + Audio encoder → attention → Audio decoder

  27. Front end, Duration model / forced alignment, Text encoder, Audio encoder, attention, Audio decoder

  28. Front end, Duration model / forced alignment, Text encoder, Audio encoder, attention, Audio decoder
