TEXT-TO-SPEECH SYNTHESIS USING TACOTRON 2 AND WAVEGLOW WITH TENSOR CORES
Rafael Valle, Ryan Prenger and Yang Zhang
OUTLINE
1. Text to Speech Synthesis
2. Tacotron 2
3. WaveGlow
4. TTS and Tensor Cores
TEXT TO SPEECH SYNTHESIS (TTS)
[Chart: Global TTS Market Value in USD billions, 2016 vs. 2022, source: https://www.marketsandmarkets.com/PressReleases/text-to-speech.asp]
Human to ? interaction: Apple Siri, Microsoft Cortana, Nuance Vocalizer, Amazon Alexa / Polly, Google TTS
APPLICATIONS OF TTS
Smart home devices, health care, audio books, vocaloids, self-driving cars, video games
TEXT TO SPEECH SYNTHESIS
[Diagram: text input "Forty percent of a ..." converted to a speech output waveform ("for - ty per - c - ent of a")]
SPEECH SYNTHESIS: THE VODER (1939)
PARAMETRIC SPEECH SYNTHESIS
Pneumatic speech synthesizer developed by von Kempelen in 1791.
Voder speech synthesizer developed by Homer Dudley in 1939.
CONCATENATIVE TTS SYNTHESIS
[Diagram: speech units ("for", "ty", "per", "c", "ent", "of", "a") selected from a database and concatenated]
First practical application in 1936: the British phone company's Talking Clock
CONCATENATIVE TTS SYNTHESIS
• Requires collecting speech units
• Requires designing cost heuristics
• Requires acoustic processing
https://wezs.com/~danguy/monguy/TTS.html
PARAMETRIC (DEEP LEARNING) TTS SYNTHESIS
[Diagram: text input "Forty percent of a ..." converted to audio output by a deep learning model]
DEEP LEARNING TTS SYNTHESIS
[Diagram: a first network (1º) maps the text input "Forty percent of a ..." to linguistic or acoustic features X; a second network (2º) maps those features to the audio output]
TEXT TO (MEL) SPECTROGRAM WITH TACOTRON
Tacotron: CBHG encoder
• Convolution bank (k = [1, 2, 4, 8, ...])
• Convolution stack (n-gram like)
• Highway
• Bi-directional GRU
Tacotron 2: location-sensitive attention, i.e. attend to (sketched in code below):
• Memory (encoder output)
• Query (decoder output)
• Location (attention weights)
• Cumulative attention weights (+=)
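As a rough illustration, location-sensitive attention can be written as a small PyTorch module. The layer names, sizes, and the concatenation of previous plus cumulative attention weights below are illustrative assumptions, not the exact hyperparameters of the NVIDIA tacotron2 repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Sketch: score each encoder step from the decoder query, the encoder
    memory, and "location" features computed from attention weights."""
    def __init__(self, query_dim=1024, memory_dim=512, attn_dim=128,
                 loc_filters=32, loc_kernel=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)
        # convolve over [previous, cumulative] attention weights -> location features
        self.location_conv = nn.Conv1d(2, loc_filters, loc_kernel,
                                       padding=(loc_kernel - 1) // 2, bias=False)
        self.location_dense = nn.Linear(loc_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, attn_weights_cat):
        # query: [B, query_dim] decoder state, memory: [B, T, memory_dim],
        # attn_weights_cat: [B, 2, T] previous and cumulative attention weights
        loc = self.location_dense(self.location_conv(attn_weights_cat).transpose(1, 2))
        energies = self.v(torch.tanh(self.query_layer(query).unsqueeze(1)
                                     + self.memory_layer(memory)
                                     + loc)).squeeze(-1)              # [B, T]
        weights = F.softmax(energies, dim=1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)  # [B, memory_dim]
        return context, weights
```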
IMPLEMENTATIONS
• https://github.com/NVIDIA/tacotron2/
• https://github.com/NVIDIA/OpenSeq2Seq/
Deep Learning Framework and Libraries
• PyTorch
• TensorFlow
• NVIDIA's Automatic Mixed Precision
Training Setup
• NVIDIA's Tesla V100
• Good results in less than a day starting fresh
• Good results in a few hours warm-starting
TTS DATASET
LJS (Linda Johnson: single native speaker, ~24 hours)
• 7 non-fiction books
• "All of my recordings were done from the sofa in my family room!"
• "All of my recordings were done on a MacBook Pro."
• https://keithito.com/LJ-Speech-Dataset/
• https://librivox.org/reader/11049
• Input is sometimes raw text, other times ARPAbet (see the sketch below)
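For reference, a minimal sketch of turning raw text into the symbol sequence the model consumes, assuming the `text` module from https://github.com/NVIDIA/tacotron2 (originally Keith Ito's frontend) is importable; the cleaner name is that repo's English default.

```python
import torch
from text import text_to_sequence  # from the tacotron2 repo (assumption)

# Encode raw text as integer symbol ids (the same frontend also accepts ARPAbet).
sequence = text_to_sequence("Forty percent of a ...", ["english_cleaners"])
sequence = torch.LongTensor(sequence).unsqueeze(0)  # shape [1, T]
```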
MEL TO AUDIO WITH WAVENET
Sampling rates: 44100 Hz, 22050 Hz, 16000 Hz
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
WAVENET IMPLEMENTATION DETAILS
Inference speed:
• Naïve PyTorch: 20 samples per second
• PyTorch on Volta: 200 samples per second
• nv-wavenet: 20,000 samples per second
MEAN OPINION SCORES: TACOTRON AND WAVENET
[Table: mean opinion scores from Shen, J. et al., https://arxiv.org/abs/1712.05884]
WAVENET IS THE BOTTLENECK
[Diagrams: the Tacotron 2 and Deep Voice 3 pipelines, with WaveNet as the final vocoder stage]
Ping, W. et al. Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. https://arxiv.org/abs/1710.07654
Shen, J. et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. https://arxiv.org/abs/1712.05884
AUTO-REGRESSION IS INHERENTLY SERIAL
Q(y_0, y_1, y_2, ...) = Q(y_0) Q(y_1 | y_0) Q(y_2 | y_1, y_0) ...
NV-WaveNet: https://github.com/NVIDIA/nv-wavenet
van den Oord, A. et al. WaveNet: A Generative Model for Raw Audio. https://arxiv.org/pdf/1609.03499.pdf
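To make the serial dependence concrete, here is a minimal sketch of autoregressive sampling; `wavenet_step`, its arguments, and the output distribution are hypothetical placeholders, not an actual API.

```python
import torch

def generate(wavenet_step, mel, n_samples):
    # Each y_t is drawn from Q(y_t | y_{<t}, mel), so step t cannot begin until
    # sample t-1 exists: one network evaluation per audio sample, in order.
    samples = []
    y_prev = torch.zeros(1, dtype=torch.long)          # e.g. silence / start value
    for _ in range(n_samples):
        probs = wavenet_step(y_prev, mel)              # categorical over sample values
        y_prev = torch.multinomial(probs, 1).view(1)   # draw the next sample
        samples.append(y_prev)
    return torch.cat(samples)                          # [n_samples]
```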
TRANSFORMING WHITE NOISE TO AUDIO IS PARALLEL
[Diagram: a network maps Gaussian noise to audio in parallel, conditioned on the mel-spectrogram]
AUTO-ENCODER (APPROXIMATING LIKELIHOOD)
[Diagram: an auto-encoder between audio and Gaussian noise, conditioned on the mel-spectrogram and trained with two loss terms (Loss 1, Loss 2)]
INVERTIBLE NETWORK (EXACT LIKELIHOOD)
[Diagram: a single invertible network between audio and Gaussian noise, conditioned on the mel-spectrogram and trained with one loss term]
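The exact likelihood is the standard change-of-variables objective for flows; a sketch, assuming an invertible mapping z = f_θ(x | mel) onto a zero-mean Gaussian prior with variance σ² and affine coupling scales s (other invertible layers would add their own log-determinants):

```latex
\log p_\theta(x \mid \mathrm{mel})
  = \log p_Z\big(f_\theta(x \mid \mathrm{mel})\big)
  + \log\left|\det \frac{\partial f_\theta(x \mid \mathrm{mel})}{\partial x}\right|
  \;=\; -\frac{\lVert z \rVert^2}{2\sigma^2}
  \;+\; \sum_{\text{couplings}} \sum \log s
  \;+\; \mathrm{const.}
```

Training minimizes the negative of this quantity; no second reconstruction loss is needed because the network is exactly invertible.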
HOW TO MAKE A NETWORK INVERTIBLE
[Diagrams: groups of audio samples pass through the network; a coupling network outputs (s, b) pairs, the forward pass applies s * x + b element-wise, and the inverse pass applies (y - b) / s, so the transformation can be undone exactly]
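A minimal PyTorch sketch of that affine coupling step; `coupling_net` stands in for WaveGlow's conditioning network, and the half/half channel split is illustrative rather than the repo's exact layout.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Sketch of an invertible affine coupling step."""
    def __init__(self, coupling_net):
        super().__init__()
        # maps (x_a, mel) -> a tensor with twice x_b's channels: (log_s, b)
        self.net = coupling_net

    def forward(self, x, mel):
        x_a, x_b = x.chunk(2, dim=1)               # only half the channels are transformed
        log_s, b = self.net(x_a, mel).chunk(2, dim=1)
        y_b = torch.exp(log_s) * x_b + b           # forward: y = s * x + b
        log_det = log_s.sum()                      # exact log-determinant for the likelihood
        return torch.cat([x_a, y_b], dim=1), log_det

    def inverse(self, y, mel):
        y_a, y_b = y.chunk(2, dim=1)               # x_a is unchanged, so (s, b) are recomputable
        log_s, b = self.net(y_a, mel).chunk(2, dim=1)
        x_b = (y_b - b) / torch.exp(log_s)         # inverse: x = (y - b) / s
        return torch.cat([y_a, x_b], dim=1)
```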
https://github.com/NVIDIA/waveglow
DECREASING TEMPERATURE CAN HELP
[Diagram: audio sampled from Gaussian noise scaled down to τ ≈ 0.8, conditioned on the mel-spectrogram; a sketch follows below]
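In code, lowering the temperature just means scaling the standard deviation of the noise fed to the inverse network; `waveglow_inverse` and `z_shape` below are placeholders, not the repo's exact interface.

```python
import torch

def sample_with_temperature(waveglow_inverse, mel, z_shape, sigma=0.8):
    # z ~ N(0, sigma^2 I): sigma < 1 trades a little sample diversity
    # for audio with fewer noisy artifacts.
    z = sigma * torch.randn(z_shape)
    return waveglow_inverse(z, mel)
```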
PARALLEL SOLUTION WORKS
[Plot: loss]
• NV-WaveNet: 24-48 kHz (1.2x - 2.4x real time)
• WaveGlow (published): 520 kHz (24.5x real time)
• WaveGlow (internal, smaller): 1,500 kHz (70x real time)
RELATED WORK
Parallel WaveNet / ClariNet
• Very similar network / inference
• Very different training procedure
WaveRNN
• More like an optimized auto-regressive model
• Can get some parallelism with the subscale trick
INFERENCE SPEED UP with Tensor Cores (Automatic Mixed Precision)
On DGX-1, 1 Tesla V100 GPU, batch size 1
[Bar chart: inference throughput (samples/s in MHz) without vs. with Tensor Cores: 1.8x speed up]
[Bar chart: the same runs relative to real time: 1x (real time), 70x without Tensor Cores, 125x with Tensor Cores]
TENSOR CORES SPEED UP MATRIX MULTIPLICATIONS
[Diagram: FP16 x FP16 matrix multiply accumulated in FP32; a small sketch follows below]
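A small sketch of the pattern Tensor Cores accelerate: FP16 inputs to a matrix multiply, with FP32 used for accumulation and for the values you keep around. Whether a given matmul is actually dispatched to Tensor Cores depends on the GPU, cuBLAS, and the matrix sizes, so treat this as illustrative; it requires a CUDA device.

```python
import torch

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
c = torch.matmul(a, b)   # FP16 x FP16 GEMM, eligible for Tensor Cores on Volta
c = c.float()            # keep downstream results (and master weights) in FP32
```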
2X FASTER INFERENCE WITH TENSOR CORES
Inference time: 29 ms without Tensor Cores vs. 15 ms with Tensor Cores
TRAINING SPEED UP with Tensor Cores (Automatic Mixed Precision)
On DGX-1, 1 Tesla V100 GPU, over 1000 epochs
[Bar chart: training time in hours, FP32 vs. Tensor Cores: Tensor Cores are 1.9x faster]
TRAINING WITH TENSOR CORES
[Plot: training loss over 0k-100k iterations for FP32 and Tensor Cores: Tensor Cores achieve similar training loss]
USING TENSOR CORES WITH AMP
Automatic Mixed Precision: a library that enables Tensor Cores transparently
• Manages type conversions and master weights
• Automatic loss scaling to prevent gradient underflow
• Different levels of optimization
• White/black lists allow the user to enforce precision
• Easy code adjustment
INFERENCE WITH AMP IS EASY
[Code example: FP32 inference vs. Tensor Cores with AMP, 1x vs. 1.8x; a sketch follows below]
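A minimal sketch of what AMP inference can look like with apex; `tacotron2`, `waveglow`, and `text_sequence` are assumed to exist already, and the opt_level choice plus the `inference()` / `infer()` entry points follow the public repos as assumptions, not the presenters' exact script.

```python
import torch
from apex import amp

# Assumed: `tacotron2` and `waveglow` are trained models from the NVIDIA repos,
# and `text_sequence` is an encoded input tensor on the GPU.
tacotron2 = amp.initialize(tacotron2.cuda().eval(), opt_level="O1")
waveglow = amp.initialize(waveglow.cuda().eval(), opt_level="O1")

with torch.no_grad():
    _, mel_postnet, _, _ = tacotron2.inference(text_sequence)  # text -> mel
    audio = waveglow.infer(mel_postnet, sigma=0.8)             # mel -> audio
```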
TRAINING WITH AMP IS EASY
[Code example: FP32 training vs. Tensor Cores with AMP, 1.9x speed up; a sketch follows below]
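And a minimal sketch of AMP training with apex, assuming `model`, `optimizer`, `criterion`, and a `data_loader` already exist; `amp.initialize` and `amp.scale_loss` follow apex's documented usage, but this is a sketch rather than the presenters' training script.

```python
from apex import amp

model, optimizer = amp.initialize(model.cuda(), optimizer, opt_level="O1")

for text, mel_target in data_loader:
    optimizer.zero_grad()
    loss = criterion(model(text), mel_target)
    # Loss scaling keeps small FP16 gradients from underflowing to zero.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```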
CONCLUSION
• Tensor Cores achieve close to 2x faster inference and training on WaveGlow
• AMP enables Tensor Cores transparently for training and inference
• Code available on NGC and GitHub:
  • https://ngc.nvidia.com/catalog/model-scripts/
  • https://github.com/NVIDIA/tacotron2
  • https://github.com/NVIDIA/waveglow
  • https://github.com/NVIDIA/apex/tree/master/apex/amp