TEXT-TO-SPEECH SYNTHESIS USING TACOTRON 2 AND WAVEGLOW WITH TENSOR CORES
Rafael Valle, Ryan Prenger and Yang Zhang
OUTLINE
1. Text to Speech Synthesis
2. Tacotron 2
3. WaveGlow
4. TTS and Tensor Cores
TEXT TO SPEECH SYNTHESIS (TTS)
[Chart: Global TTS Market Value in USD billions, 2016 vs. 2022, source: https://www.marketsandmarkets.com/PressReleases/text-to-speech.asp]
Human to ? interaction: Apple Siri, Microsoft Cortana, Nuance Vocalizer, Amazon Alexa / Polly, Google TTS
APPLICATIONS OF TTS
Smart home devices, health care, audio books, vocaloids, self-driving cars, video games
TEXT TO SPEECH SYNTHESIS
[Diagram: text input "Forty percent of a ..." converted to a speech output waveform ("for - ty per - c - ent of a")]
SPEECH SYNTHESIS: THE VODER (1939)
PARAMETRIC SPEECH SYNTHESIS
Pneumatic speech synthesizer developed by von Kempelen in 1791.
Voder speech synthesizer developed by Homer Dudley in 1939.
CONCATENATIVE TTS SYNTHESIS
[Diagram: speech units ("for", "ty", "per", "c", "ent", "of", "a") selected from a database and concatenated]
First practical application in 1936: the British phone company's Talking Clock
CONCATENATIVE TTS SYNTHESIS
• Requires collecting speech units
• Requires designing cost heuristics
• Requires acoustic processing
https://wezs.com/~danguy/monguy/TTS.html
PARAMETRIC (DEEP LEARNING) TTS SYNTHESIS
[Diagram: text input "Forty percent of a ..." converted to audio output by a deep learning model]
DEEP LEARNING TTS SYNTHESIS
[Diagram: a first network (1º) maps the text input "Forty percent of a ..." to linguistic or acoustic features X; a second network (2º) maps those features to the audio output]
TEXT TO (MEL) SPECTROGRAM WITH TACOTRON
Tacotron: CBHG encoder
• Convolution bank (k = [1, 2, 4, 8, ...])
• Convolution stack (n-gram like)
• Highway
• Bi-directional GRU
Tacotron 2: location-sensitive attention, i.e. attend to (sketched in code below):
• Memory (encoder output)
• Query (decoder output)
• Location (attention weights)
• Cumulative attention weights (+=)
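As a rough illustration, location-sensitive attention can be written as a small PyTorch module. The layer names, sizes, and the concatenation of previous plus cumulative attention weights below are illustrative assumptions, not the exact hyperparameters of the NVIDIA tacotron2 repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Sketch: score each encoder step from the decoder query, the encoder
    memory, and "location" features computed from attention weights."""
    def __init__(self, query_dim=1024, memory_dim=512, attn_dim=128,
                 loc_filters=32, loc_kernel=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)
        # convolve over [previous, cumulative] attention weights -> location features
        self.location_conv = nn.Conv1d(2, loc_filters, loc_kernel,
                                       padding=(loc_kernel - 1) // 2, bias=False)
        self.location_dense = nn.Linear(loc_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, attn_weights_cat):
        # query: [B, query_dim] decoder state, memory: [B, T, memory_dim],
        # attn_weights_cat: [B, 2, T] previous and cumulative attention weights
        loc = self.location_dense(self.location_conv(attn_weights_cat).transpose(1, 2))
        energies = self.v(torch.tanh(self.query_layer(query).unsqueeze(1)
                                     + self.memory_layer(memory)
                                     + loc)).squeeze(-1)              # [B, T]
        weights = F.softmax(energies, dim=1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)  # [B, memory_dim]
        return context, weights
```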
IMPLEMENTATIONS
• https://github.com/NVIDIA/tacotron2/
• https://github.com/NVIDIA/OpenSeq2Seq/
Deep Learning Framework and Libraries
• PyTorch
• TensorFlow
• NVIDIA's Automatic Mixed Precision
Training Setup
• NVIDIA's Tesla V100
• Good results in less than a day starting fresh
• Good results in a few hours warm-starting
TTS DATASET
LJS (Linda Johnson: single native speaker, ~24 hours)
• 7 non-fiction books
• "All of my recordings were done from the sofa in my family room!"
• "All of my recordings were done on a MacBook Pro."
• https://keithito.com/LJ-Speech-Dataset/
• https://librivox.org/reader/11049
• Input is sometimes raw text, other times ARPAbet (see the sketch below)
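For reference, a minimal sketch of turning raw text into the symbol sequence the model consumes, assuming the `text` module from https://github.com/NVIDIA/tacotron2 (originally Keith Ito's frontend) is importable; the cleaner name is that repo's English default.

```python
import torch
from text import text_to_sequence  # from the tacotron2 repo (assumption)

# Encode raw text as integer symbol ids (the same frontend also accepts ARPAbet).
sequence = text_to_sequence("Forty percent of a ...", ["english_cleaners"])
sequence = torch.LongTensor(sequence).unsqueeze(0)  # shape [1, T]
```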
MEL TO AUDIO WITH WAVENET
Sampling rates: 44100 Hz, 22050 Hz, 16000 Hz
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
WAVENET IMPLEMENTATION DETAILS
Inference speed:
• Naïve PyTorch: 20 samples per second
• PyTorch on Volta: 200 samples per second
• nv-wavenet: 20,000 samples per second
MEAN OPINION SCORES: TACOTRON AND WAVENET
[Table: mean opinion scores from Shen, J. et al., https://arxiv.org/abs/1712.05884]
WAVENET IS THE BOTTLENECK
[Diagrams: the Tacotron 2 and Deep Voice 3 pipelines, with WaveNet as the final vocoder stage]
Ping, W. et al. Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. https://arxiv.org/abs/1710.07654
Shen, J. et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. https://arxiv.org/abs/1712.05884
AUTO-REGRESSION IS INHERENTLY SERIAL
Q(y_0, y_1, y_2, ...) = Q(y_0) Q(y_1 | y_0) Q(y_2 | y_1, y_0) ...
NV-WaveNet: https://github.com/NVIDIA/nv-wavenet
van den Oord, A. et al. WaveNet: A Generative Model for Raw Audio. https://arxiv.org/pdf/1609.03499.pdf
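To make the serial dependence concrete, here is a minimal sketch of autoregressive sampling; `wavenet_step`, its arguments, and the output distribution are hypothetical placeholders, not an actual API.

```python
import torch

def generate(wavenet_step, mel, n_samples):
    # Each y_t is drawn from Q(y_t | y_{<t}, mel), so step t cannot begin until
    # sample t-1 exists: one network evaluation per audio sample, in order.
    samples = []
    y_prev = torch.zeros(1, dtype=torch.long)          # e.g. silence / start value
    for _ in range(n_samples):
        probs = wavenet_step(y_prev, mel)              # categorical over sample values
        y_prev = torch.multinomial(probs, 1).view(1)   # draw the next sample
        samples.append(y_prev)
    return torch.cat(samples)                          # [n_samples]
```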
TRANSFORMING WHITE NOISE TO AUDIO IS PARALLEL
[Diagram: a network maps Gaussian noise to audio in parallel, conditioned on the mel-spectrogram]
AUTO-ENCODER (APPROXIMATING LIKELIHOOD)
[Diagram: an auto-encoder between audio and Gaussian noise, conditioned on the mel-spectrogram and trained with two loss terms (Loss 1, Loss 2)]
INVERTIBLE NETWORK (EXACT LIKELIHOOD)
[Diagram: a single invertible network between audio and Gaussian noise, conditioned on the mel-spectrogram and trained with one loss term]
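The exact likelihood is the standard change-of-variables objective for flows; a sketch, assuming an invertible mapping z = f_θ(x | mel) onto a zero-mean Gaussian prior with variance σ² and affine coupling scales s (other invertible layers would add their own log-determinants):

```latex
\log p_\theta(x \mid \mathrm{mel})
  = \log p_Z\big(f_\theta(x \mid \mathrm{mel})\big)
  + \log\left|\det \frac{\partial f_\theta(x \mid \mathrm{mel})}{\partial x}\right|
  \;=\; -\frac{\lVert z \rVert^2}{2\sigma^2}
  \;+\; \sum_{\text{couplings}} \sum \log s
  \;+\; \mathrm{const.}
```

Training minimizes the negative of this quantity; no second reconstruction loss is needed because the network is exactly invertible.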
HOW TO MAKE A NETWORK INVERTIBLE
[Diagrams: groups of audio samples pass through the network; a coupling network outputs (s, b) pairs, the forward pass applies s * x + b element-wise, and the inverse pass applies (y - b) / s, so the transformation can be undone exactly]
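A minimal PyTorch sketch of that affine coupling step; `coupling_net` stands in for WaveGlow's conditioning network, and the half/half channel split is illustrative rather than the repo's exact layout.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Sketch of an invertible affine coupling step."""
    def __init__(self, coupling_net):
        super().__init__()
        # maps (x_a, mel) -> a tensor with twice x_b's channels: (log_s, b)
        self.net = coupling_net

    def forward(self, x, mel):
        x_a, x_b = x.chunk(2, dim=1)               # only half the channels are transformed
        log_s, b = self.net(x_a, mel).chunk(2, dim=1)
        y_b = torch.exp(log_s) * x_b + b           # forward: y = s * x + b
        log_det = log_s.sum()                      # exact log-determinant for the likelihood
        return torch.cat([x_a, y_b], dim=1), log_det

    def inverse(self, y, mel):
        y_a, y_b = y.chunk(2, dim=1)               # x_a is unchanged, so (s, b) are recomputable
        log_s, b = self.net(y_a, mel).chunk(2, dim=1)
        x_b = (y_b - b) / torch.exp(log_s)         # inverse: x = (y - b) / s
        return torch.cat([y_a, x_b], dim=1)
```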
https://github.com/NVIDIA/waveglow
DECREASING TEMPERATURE CAN HELP
[Diagram: audio sampled from Gaussian noise scaled down to τ ≈ 0.8, conditioned on the mel-spectrogram; a sketch follows below]
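In code, lowering the temperature just means scaling the standard deviation of the noise fed to the inverse network; `waveglow_inverse` and `z_shape` below are placeholders, not the repo's exact interface.

```python
import torch

def sample_with_temperature(waveglow_inverse, mel, z_shape, sigma=0.8):
    # z ~ N(0, sigma^2 I): sigma < 1 trades a little sample diversity
    # for audio with fewer noisy artifacts.
    z = sigma * torch.randn(z_shape)
    return waveglow_inverse(z, mel)
```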
PARALLEL SOLUTION WORKS
[Plot: loss]
• NV-WaveNet: 24-48 kHz (1.2x - 2.4x real time)
• WaveGlow (published): 520 kHz (24.5x real time)
• WaveGlow (internal, smaller): 1,500 kHz (70x real time)
RELATED WORK
Parallel WaveNet / ClariNet
• Very similar network / inference
• Very different training procedure
WaveRNN
• More like an optimized auto-regressive model
• Can get some parallelism with the subscale trick
INFERENCE SPEED UP with Tensor Cores (Automatic Mixed Precision)
On DGX-1, 1 Tesla V100 GPU, batch size 1
[Bar chart: inference throughput (samples/s in MHz) without vs. with Tensor Cores: 1.8x speed up]
[Bar chart: the same runs relative to real time: 1x (real time), 70x without Tensor Cores, 125x with Tensor Cores]
TENSOR CORES SPEED UP MATRIX MULTIPLICATIONS
[Diagram: FP16 x FP16 matrix multiply accumulated in FP32; a small sketch follows below]
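A small sketch of the pattern Tensor Cores accelerate: FP16 inputs to a matrix multiply, with FP32 used for accumulation and for the values you keep around. Whether a given matmul is actually dispatched to Tensor Cores depends on the GPU, cuBLAS, and the matrix sizes, so treat this as illustrative; it requires a CUDA device.

```python
import torch

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
c = torch.matmul(a, b)   # FP16 x FP16 GEMM, eligible for Tensor Cores on Volta
c = c.float()            # keep downstream results (and master weights) in FP32
```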
2X FASTER INFERENCE WITH TENSOR CORES
Inference time: 29 ms without Tensor Cores vs. 15 ms with Tensor Cores
TRAINING SPEED UP with Tensor Cores (Automatic Mixed Precision)
On DGX-1, 1 Tesla V100 GPU, over 1000 epochs
[Bar chart: training time in hours, FP32 vs. Tensor Cores: Tensor Cores are 1.9x faster]
TRAINING WITH TENSOR CORES
[Plot: training loss over 0k-100k iterations for FP32 and Tensor Cores: Tensor Cores achieve similar training loss]
USING TENSOR CORES WITH AMP
Automatic Mixed Precision: a library that enables Tensor Cores transparently
• Manages type conversions and master weights
• Automatic loss scaling to prevent gradient underflow
• Different levels of optimization
• White/black lists allow the user to enforce precision
• Easy code adjustment
INFERENCE WITH AMP IS EASY
[Code example: FP32 inference vs. Tensor Cores with AMP, 1x vs. 1.8x; a sketch follows below]
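A minimal sketch of what AMP inference can look like with apex; `tacotron2`, `waveglow`, and `text_sequence` are assumed to exist already, and the opt_level choice plus the `inference()` / `infer()` entry points follow the public repos as assumptions, not the presenters' exact script.

```python
import torch
from apex import amp

# Assumed: `tacotron2` and `waveglow` are trained models from the NVIDIA repos,
# and `text_sequence` is an encoded input tensor on the GPU.
tacotron2 = amp.initialize(tacotron2.cuda().eval(), opt_level="O1")
waveglow = amp.initialize(waveglow.cuda().eval(), opt_level="O1")

with torch.no_grad():
    _, mel_postnet, _, _ = tacotron2.inference(text_sequence)  # text -> mel
    audio = waveglow.infer(mel_postnet, sigma=0.8)             # mel -> audio
```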
TRAINING WITH AMP IS EASY
[Code example: FP32 training vs. Tensor Cores with AMP, 1.9x speed up; a sketch follows below]
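And a minimal sketch of AMP training with apex, assuming `model`, `optimizer`, `criterion`, and a `data_loader` already exist; `amp.initialize` and `amp.scale_loss` follow apex's documented usage, but this is a sketch rather than the presenters' training script.

```python
from apex import amp

model, optimizer = amp.initialize(model.cuda(), optimizer, opt_level="O1")

for text, mel_target in data_loader:
    optimizer.zero_grad()
    loss = criterion(model(text), mel_target)
    # Loss scaling keeps small FP16 gradients from underflowing to zero.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```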
CONCLUSION
• Tensor Cores achieve close to 2x faster inference and training on WaveGlow
• AMP enables Tensor Cores transparently for training and inference
• Code available on NGC and GitHub:
  • https://ngc.nvidia.com/catalog/model-scripts/
  • https://github.com/NVIDIA/tacotron2
  • https://github.com/NVIDIA/waveglow
  • https://github.com/NVIDIA/apex/tree/master/apex/amp