
DNN Based TTS Systems



  1. DNN Based TTS Systems

  2. TTS Architecture: Traditional Pipeline • A typical statistical parametric TTS system contains several separate modules: • a text frontend that extracts various linguistic features • a duration model • an acoustic feature prediction model • a complex signal-processing-based vocoder

  3. End-to-End Approach • An integrated end-to-end TTS system that can be trained on <text, audio> pairs has several advantages: • First, such a system alleviates the need for laborious feature engineering, which may involve heuristics and brittle design choices. • Second, it more easily allows for rich conditioning on various attributes, such as speaker or language, or on high-level features like sentiment. • Similarly, adaptation to new data might also be easier. • Finally, a single model is likely to be more robust than a multi-stage model in which each component's errors can compound.

  4. Challenges • TTS is a large-scale inverse problem: a highly compressed source (text) is “decompressed” into audio. Since the same text can correspond to different pronunciations or speaking styles, this is a particularly difficult learning task for an end-to-end model: it must cope with large variations at the signal level for a given input. • Unlike end-to-end speech recognition or machine translation, TTS outputs are continuous, and output sequences are usually much longer than input sequences. These attributes cause prediction errors to accumulate quickly.
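The length mismatch mentioned above can be made concrete with a little arithmetic. The sketch below is only illustrative: the 22,050 Hz sample rate, the hop length of 256 samples per spectrogram frame, and the 3-second duration are assumptions, not values taken from the slides.

```python
def sequence_lengths(text, seconds, sample_rate=22050, hop_length=256):
    """Compare input length (characters) with output lengths (frames, samples).

    sample_rate and hop_length are assumed, commonly used values.
    """
    n_chars = len(text)
    n_samples = int(seconds * sample_rate)  # raw waveform samples
    n_frames = n_samples // hop_length      # complete spectrogram hops
    return n_chars, n_frames, n_samples


text = "The quick brown fox jumps over the lazy dog."
n_chars, n_frames, n_samples = sequence_lengths(text, seconds=3.0)
print(f"{n_chars} characters -> {n_frames} frames -> {n_samples} samples")
print(f"samples per character: {n_samples / n_chars:.0f}")
```

Even at the frame level the output is several times longer than the input, and at the sample level it is longer by three orders of magnitude, which is why small per-step errors have many opportunities to accumulate.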

  5. Tacotron • An end-to-end generative TTS model based on the sequence-to-sequence (seq2seq) with attention paradigm. • Tacotron takes characters as input and outputs a raw spectrogram. • It does not require phoneme-level alignment, so it can easily scale to large amounts of acoustic data with transcripts. • With a simple waveform synthesis technique, Tacotron achieves a mean opinion score (MOS) of 3.82 on a US English eval set, outperforming a production parametric system in terms of naturalness.
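The reason no phoneme-level alignment is needed is that the attention mechanism learns a soft alignment between decoder steps and input characters. The following is a minimal NumPy sketch of one additive (tanh) attention step, not Tacotron's actual implementation; the dimensions, the random weights, and the names `W_q`, `W_m`, and `v` are all hypothetical.

```python
import numpy as np


def additive_attention(query, memory, W_q, W_m, v):
    """One attention step: score each encoder state against the decoder query.

    query:  (d_q,)    current decoder state
    memory: (T, d_m)  encoder outputs, one row per input character
    Returns the context vector (d_m,) and the alignment weights (T,).
    """
    scores = np.tanh(memory @ W_m + query @ W_q) @ v  # (T,)
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over characters
    context = weights @ memory                        # weighted sum of encoder states
    return context, weights


rng = np.random.default_rng(0)
T, d_m, d_q, d_a = 12, 8, 6, 5  # hypothetical sizes
memory = rng.normal(size=(T, d_m))
query = rng.normal(size=d_q)
W_m = rng.normal(size=(d_m, d_a))
W_q = rng.normal(size=(d_q, d_a))
v = rng.normal(size=d_a)

context, weights = additive_attention(query, memory, W_q, W_m, v)
print(weights.round(3))  # soft alignment over the 12 input positions
```

In a trained model the weights would concentrate on the characters being spoken at the current decoder step; here they are arbitrary because the parameters are random.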

  6. Tacotron II • Entirely neural • Uses WaveNet as the vocoder • Achieves a MOS of 4.53, comparable to the MOS of 4.58 for professionally recorded speech.
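For readers unfamiliar with the metric, a MOS is the average of listener ratings on a 1 to 5 scale, usually reported with a confidence interval. The sketch below uses made-up ratings and a normal-approximation 95% interval; the evaluation protocol behind the numbers on the slides may differ.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% confidence half-width) for a list of 1-5 ratings."""
    mos = statistics.fmean(ratings)
    # Standard error of the mean; 1.96 is the normal-approximation factor.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, 1.96 * sem


# Hypothetical listener ratings for one system (not real evaluation data).
ratings = [5, 4, 5, 4, 4, 5, 3, 5]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

When two systems' intervals overlap, as they may for scores as close as 4.53 and 4.58, the difference should be read with care.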

  7. Tacotron II Architecture

  8. Samples (columns: Sentence, Tacotron II, Tacotron I) • Generative adversarial network or variational auto-encoder. • He has read the whole thing. • He reads books. • Thisss isrealy awhsome. • This is your personal assistant, Google Home. • This is your personal assistant Google Home. • The buses aren't the problem, they actually provide a solution. • The buses aren't the PROBLEM, they actually provide a SOLUTION. • The quick brown fox jumps over the lazy dog. • Does the quick brown fox jump over the lazy dog?

  9. End-to-End Tacotron II Samples (Persian, translated) • Hello, I am Hamed Shadbash. • Iran is bordered to the north by the Republic of Azerbaijan, Armenia, and Turkmenistan, to the east by Afghanistan and Pakistan, and to the west by Turkey and Iraq, and it also borders the Caspian Sea to the north and the Gulf … to the south. • Apparently all the problems have been solved, and until Caesar arrives only one official announcement remains.

  10. Vocoders • WaveNet • Parallel WaveNet • WaveGlow • MelGAN • WaveRNN • LPCNet • Etc.

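  11. WaveNet • Fully convolutional, autoregressive model • Fast to train, because all time steps can be computed in parallel from the ground-truth audio, but slow at inference time, because samples must be generated one at a time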
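The training/inference asymmetry can be illustrated with a toy autoregressive model. This is a deliberately simplified sketch: a single linear causal filter with hypothetical coefficients stands in for WaveNet's stack of dilated causal convolutions, and it predicts values directly rather than a distribution over samples.

```python
import numpy as np


def predict_all(x, kernel):
    """Training-style pass: predict every sample from its past in one shot.

    pred[t] = sum_j kernel[j-1] * x[t-j], with missing history treated as 0.
    Possible because the whole ground-truth signal x is known in advance.
    """
    full = np.convolve(x, kernel)  # full[n] = sum_m x[m] * kernel[n-m]
    return np.concatenate([[0.0], full[: len(x) - 1]])  # shift by one: no peeking at x[t]


def generate(n, kernel, seed=1.0):
    """Inference-style pass: each sample needs the previously generated ones."""
    k = len(kernel)
    y = np.zeros(n)
    y[0] = seed
    for t in range(1, n):                    # inherently sequential loop
        past = y[max(0, t - k):t][::-1]      # newest sample first
        y[t] = past @ kernel[: len(past)]
    return y


kernel = np.array([0.5, 0.3, -0.1])  # hypothetical filter coefficients
y = generate(16, kernel)
# The parallel pass reproduces the sequential result when given the full signal.
print(np.allclose(predict_all(y, kernel)[1:], y[1:]))
```

`predict_all` is one vectorized operation over the whole sequence, while `generate` cannot start step t until step t-1 is finished; at audio sample rates that loop runs tens of thousands of times per second of speech.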

  12. WaveGlow • WaveGlow combines insights from Glow and WaveNet. • Generates audio at a rate of more than 500 kHz, that is, more than 500,000 samples per second, on an NVIDIA V100 GPU.
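To put the 500 kHz figure in perspective, it can be converted into a real-time factor. The calculation below assumes the output audio uses a 22,050 Hz sample rate, which is not stated on the slide.

```python
def realtime_factor(generated_samples_per_second, audio_sample_rate):
    """How many seconds of audio are produced per second of compute."""
    return generated_samples_per_second / audio_sample_rate


# 500 kHz generation rate from the slide; 22,050 Hz playback rate is an assumption.
rtf = realtime_factor(500_000, 22_050)
print(f"about {rtf:.1f}x faster than real time")
```

Under that assumption, one second of GPU time yields more than 20 seconds of speech.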

  13. Tacotron 2 + WaveGlow Samples

  14. MelGAN • A non-autoregressive, feed-forward convolutional architecture that generates audio waveforms in a GAN setup. • MelGAN is substantially faster than other mel-spectrogram inversion alternatives. In particular, it is 10 times faster than the fastest model available at the time of its publication, without considerable degradation in audio quality.

  15. MelGAN Generator

  16. MelGAN Discriminator

  17. Losses

  18. Tacotron 2 + MelGAN Samples

  19. Questions?
