DNN Based TTS Systems TTS Architecture: Traditional Pipeline - PowerPoint PPT Presentation

DNN Based TTS Systems

TTS Architecture: Traditional Pipeline
• A typical statistical parametric TTS system contains several modules:
  • a text frontend that extracts various linguistic features
  • a duration model
  • an acoustic feature prediction model
  • a complex signal-processing-based vocoder

End-to-End Approach
• An integrated end-to-end TTS system that can be trained on <text, audio> pairs has several advantages:
  • First, such a system alleviates the need for laborious feature engineering, which may involve heuristics and brittle design choices.
  • Second, it more easily allows for rich conditioning on various attributes, such as speaker or language, or on high-level features such as sentiment.
  • Similarly, adaptation to new data might also be easier.
  • Finally, a single model is likely to be more robust than a multi-stage model in which each component's errors can compound.

Challenges
• TTS is a large-scale inverse problem: a highly compressed source (text) is "decompressed" into audio. Since the same text can correspond to different pronunciations or speaking styles, this is a particularly difficult learning task for an end-to-end model: it must cope with large variations at the signal level for a given input.
• Unlike end-to-end speech recognition or machine translation, TTS outputs are continuous, and output sequences are usually much longer than input sequences. These attributes cause prediction errors to accumulate quickly.
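To make the length mismatch concrete, the sketch below counts spectrogram frames and waveform samples for one short sentence. The 24 kHz sampling rate, the 12.5 ms frame shift, and the 3-second duration are illustrative assumptions, not figures taken from these slides.

```python
def output_lengths(text, duration_s, sample_rate=24000, frame_shift_ms=12.5):
    """Return (input characters, spectrogram frames, waveform samples).

    sample_rate and frame_shift_ms are assumed values for illustration.
    """
    n_chars = len(text)
    n_frames = int(duration_s * 1000 / frame_shift_ms)
    n_samples = int(duration_s * sample_rate)
    return n_chars, n_frames, n_samples


# A 28-character sentence, assumed to take about 3 seconds to speak.
chars, frames, samples = output_lengths("He has read the whole thing.", duration_s=3.0)
print(chars, frames, samples)  # 28 240 72000
```

Under these assumptions the decoder emits roughly nine frames per input character, and a vocoder emits thousands of samples per character, so even a small per-step error has many steps over which to compound.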

Tacotron
• An end-to-end generative TTS model based on the sequence-to-sequence (seq2seq) with attention paradigm.
• Tacotron takes characters as input and outputs a raw spectrogram.
• It does not require phoneme-level alignment, so it can easily scale to large amounts of acoustic data with transcripts.
• With a simple waveform synthesis technique, Tacotron achieves a 3.82 mean opinion score (MOS) on a US English evaluation set, outperforming a production parametric system in terms of naturalness.
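The attention step at the heart of a seq2seq decoder can be sketched as follows. This is a minimal illustration, not Tacotron's implementation: it swaps in plain dot-product scoring for brevity, whereas Tacotron uses a learned content-based (tanh) scoring function, and the encoder states and query vector are made-up values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """One decoder step: weight each encoder state by its similarity to the query.

    Dot-product scoring is an assumption made here for brevity.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Three hypothetical encoder states, one per input character.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([0.0, 5.0], encoder_states)
print(weights)
print(context)
```

Because the weights are computed afresh at every decoder step, the model learns which characters each output frame depends on, which is why no phoneme-level alignment has to be supplied.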

Tacotron II
• Entirely neural
• Uses WaveNet as the vocoder
• Achieves a MOS of 4.53, comparable to the MOS of 4.58 for professionally recorded speech.

Tacotron II Architecture

Samples
Sentence | Tacotron II | Tacotron I
• Generative adversarial network or variational auto-encoder.
• He has read the whole thing.
• He reads books.
• Thisss isrealy awhsome.
• This is your personal assistant, Google Home.
• This is your personal assistant Google Home.
• The buses aren't the problem, they actually provide a solution.
• The buses aren't the PROBLEM, they actually provide a SOLUTION.
• The quick brown fox jumps over the lazy dog.
• Does the quick brown fox jump over the lazy dog?

End-to-End Tacotron II Samples (Persian)
• Hello, I am Hamed Shadbash.
• Iran borders the Republic of Azerbaijan, Armenia, and Turkmenistan to the north, Afghanistan and Pakistan to the east, and Turkey and Iraq to the west; it also borders the Caspian Sea to the north and the Gulf to the south.
• Apparently all the problems have been solved, and until Caesar arrives only one official announcement remains.

Vocoders
• WaveNet
• Parallel WaveNet
• WaveGlow
• MelGAN
• WaveRNN
• LPCNet
• Etc.

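WaveNet
• Fully convolutional, autoregressive model
• Fast to train, because all time steps are computed in parallel from the ground-truth waveform, but slow at inference, because samples are generated one at a time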
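The training/inference asymmetry can be illustrated with a toy stack of causal dilated convolutions. This is a sketch under strong simplifications: the fixed 0.5 weights, the kernel size of 2, and the dilation lists are hypothetical, and a real WaveNet adds gated activations, residual connections, and sampling from a predicted distribution.

```python
def causal_dilated_conv(x, dilation, w_past=0.5, w_now=0.5):
    """Kernel-size-2 causal convolution: y[t] = w_past * x[t - dilation] + w_now * x[t]."""
    return [
        w_now * x[t] + (w_past * x[t - dilation] if t >= dilation else 0.0)
        for t in range(len(x))
    ]


def stack(x, dilations):
    """Training-style pass: outputs for every time step are computed in one sweep."""
    h = list(x)
    for d in dilations:
        h = causal_dilated_conv(h, d)
    return h


def receptive_field(dilations, kernel_size=2):
    """Number of past-and-present samples that influence one output."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def generate(n_steps, dilations, seed=1.0):
    """Inference: each new sample needs its own pass over the samples so far."""
    samples = [seed]
    for _ in range(n_steps):
        samples.append(stack(samples, dilations)[-1])
    return samples


print(receptive_field([1, 2, 4, 8]))  # 16
print(generate(5, [1, 2]))
```

During training the whole ground-truth waveform is available, so `stack` runs once per utterance. During generation the loop in `generate` runs once per output sample, which is what makes sample-rate audio expensive to synthesize.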

WaveGlow
• WaveGlow combines insights from Glow and WaveNet.
• Produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU.
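A generation rate quoted in kHz is easier to judge as a multiple of real time. The short calculation below assumes 22.05 kHz output audio; that sampling rate is an assumption for illustration and is not stated on the slide.

```python
def realtime_factor(samples_per_second, sample_rate_hz):
    """How many seconds of audio are produced per second of compute."""
    return samples_per_second / sample_rate_hz


# 500 kHz is the generation rate from the slide; 22,050 Hz is an assumed audio rate.
factor = realtime_factor(500_000, 22_050)
seconds_for_10s_clip = 10.0 / factor
print(round(factor, 1), round(seconds_for_10s_clip, 3))
```

Under that assumption the model runs a little over 22 times faster than real time, so a 10-second clip takes under half a second to synthesize.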

Tacotron 2 + WaveGlow Samples

MelGAN
• A non-autoregressive, feed-forward convolutional architecture that performs audio waveform generation in a GAN setup.
• MelGAN is substantially faster than other mel-spectrogram inversion alternatives. In particular, it is 10 times faster than the fastest available model to date, without considerable degradation in audio quality.
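Because the generator is feed-forward, the output length follows directly from the number of mel frames and the upsampling factors of its layers. The sketch below assumes factors of 8, 8, 2, and 2 (256 samples per frame in total); these values are an assumption for illustration and do not appear on the slide.

```python
from math import prod


def generator_output_length(n_mel_frames, upsample_factors=(8, 8, 2, 2)):
    """Waveform samples produced from n_mel_frames in one forward pass.

    upsample_factors is an assumed configuration, not taken from the slides.
    """
    return n_mel_frames * prod(upsample_factors)


n_samples = generator_output_length(100)
print(n_samples)  # 25600
```

An autoregressive vocoder would need one sequential step for each of those samples, whereas the feed-forward generator produces all of them in a single pass, which is where the speed advantage comes from.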

MelGAN Generator

MelGAN Discriminator

Losses

Tacotron 2 + MelGAN Samples

Questions?
