Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks
Interspeech 2019, Graz, Austria
Ryan Eloff, André Nortje, Benjamin van Niekerk, Avashna Govender, Leanne Nortje, Arnu Pretorius, Elan van Biljon, Ewald van der Westhuizen, Lisa van Staden, Herman Kamper
Stellenbosch University, South Africa & University of Edinburgh, UK
https://github.com/kamperh/suzerospeech2019
Advances in speech recognition

• Addiction to text: 2000 hours of transcribed speech audio; ∼350M/560M words of text [Xiong et al., TASLP'17]
• Sometimes not possible, e.g. for unwritten languages
• Very different from the way human infants learn language
Zero-Resource Speech Challenges (ZRSC)
ZRSC 2019: Text-to-speech without text

[Figure: instead of a text transcription ("the dog ate the ball"), an unsupervised acoustic model maps input speech to discovered symbols (e.g. 11 7 26 11 31), which a waveform generator then synthesises in a target voice.]
What do we get for training?

No labels :)

Figure adapted from: http://zerospeech.com/2019
Approach: Compress, decode and synthesise

[Architecture: MFCCs x_{1:T} → Encoder → h_{1:N} → Discretise → z_{1:N} → Decoder (conditioned on an embedded speaker ID) → filterbanks ŷ_{1:T} → FFTNet vocoder → waveform. The encoder, discretisation layer and decoder form the compression model; the decoder and vocoder form the symbol-to-speech module. The speaker embedding is the training speaker's ID during training and the target speaker's ID at synthesis time.]
Discretisation methods

• Straight-through estimation (STE) binarisation: threshold each dimension, z_k = 1 if h_k ≥ 0, otherwise z_k = −1
      h:  0.9  −0.1   0.3   0.7  −0.8
      z:    1    −1     1     1    −1

• Categorical variational autoencoder (CatVAE): Gumbel-softmax with noise g_k and temperature τ,
      z_k = exp((h_k + g_k)/τ) / Σ_{j=1}^{K} exp((h_j + g_j)/τ)
      h:  0.9  −0.1   0.3   0.7  −0.8
      z: 0.86  0.01  0.02  0.11  0.00

• Vector-quantised variational autoencoder (VQ-VAE): choose the closest embedding e from a learned codebook
      h:  0.9  −0.1   0.3   0.7  −0.8
      z:  0.8  −0.2   0.3   0.5  −0.6
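The forward pass of these three bottlenecks can be illustrated with a minimal NumPy sketch (not the authors' implementation). The temperature τ = 1 and the K = 512 VQ-VAE codebook are assumed values for the example; the 5-dimensional h is the toy vector from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([0.9, -0.1, 0.3, 0.7, -0.8])      # encoder output for one frame (toy example)

# 1) STE binarisation: hard {-1, +1} threshold in the forward pass
z_ste = np.where(h >= 0, 1.0, -1.0)

# 2) CatVAE: Gumbel-softmax (for the toy example, treat the 5 values of h as logits)
tau = 1.0                                      # assumed temperature
g = rng.gumbel(size=h.shape)                   # Gumbel(0, 1) noise
logits = (h + g) / tau
z_cat = np.exp(logits) / np.exp(logits).sum()  # soft one-hot over the categories

# 3) VQ-VAE: replace h with its nearest codebook embedding
codebook = rng.normal(size=(512, h.shape[0]))  # assumed K = 512 learned codewords
idx = np.argmin(np.linalg.norm(codebook - h, axis=1))
z_vq = codebook[idx]

print(z_ste, z_cat.round(2), idx)
```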
Neural network architectures

• Encoder: convolutional layers, each with a stride of 2
• Decoder: transposed convolutions mirroring the encoder (sketch below)
• Waveform generation: FFTNet autoregressive vocoder
• Also experimented with WaveNet: sometimes gave noisy output
• Bitrate: set by the number of symbols K and the number of striding layers
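A minimal PyTorch sketch of what such a strided encoder and mirrored transposed-convolution decoder might look like; the channel sizes, kernel widths, MFCC/filterbank dimensions and the number of striding layers are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """1-D convolutional encoder; each stride-2 layer halves the time resolution."""
    def __init__(self, in_dim=13, hidden=128, latent=64, n_stride_layers=2):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(n_stride_layers):
            layers += [nn.Conv1d(dim, hidden, kernel_size=4, stride=2, padding=1), nn.ReLU()]
            dim = hidden
        layers.append(nn.Conv1d(dim, latent, kernel_size=3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):           # x: (batch, in_dim, T) MFCC frames
        return self.net(x)          # h: (batch, latent, T / 2**n_stride_layers)

class Decoder(nn.Module):
    """Transposed convolutions mirroring the encoder, back up to the frame rate."""
    def __init__(self, latent=64, hidden=128, out_dim=80, n_stride_layers=2):
        super().__init__()
        layers, dim = [], latent
        for _ in range(n_stride_layers):
            layers += [nn.ConvTranspose1d(dim, hidden, kernel_size=4, stride=2, padding=1), nn.ReLU()]
            dim = hidden
        layers.append(nn.Conv1d(dim, out_dim, kernel_size=3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, z):           # z: discretised codes at the reduced rate
        return self.net(z)          # filterbank frames for the vocoder

x = torch.randn(1, 13, 100)         # 100 MFCC frames
h = Encoder()(x)                    # (1, 64, 25): two stride-2 layers give x4 downsampling
y = Decoder()(h)                    # (1, 80, 100): reconstructed filterbanks
print(h.shape, y.shape)
```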
Evaluation

Human evaluation metrics:
• Mean opinion score (MOS)
• Character error rate (CER)
• Similarity to the target speaker's voice

Objective evaluation metrics:
• ABX discrimination
• Bitrate

Two evaluation languages:
• English: used for development
• Indonesian: held-out "surprise language"
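As a rough illustration of the bitrate metric, the challenge scores a discovered symbol sequence by its empirical entropy per second. The sketch below assumes that entropy-based definition; the symbol sequences and duration are made up for the example.

```python
import math
from collections import Counter

def bitrate(symbol_seqs, total_duration_s):
    """Entropy-based bitrate: (symbols per second) x (entropy per symbol, in bits).

    Assumes the ZeroSpeech-style definition; symbol_seqs is a list of
    discovered-unit sequences and total_duration_s the audio length in seconds.
    """
    symbols = [s for seq in symbol_seqs for s in seq]
    counts = Counter(symbols)
    n = len(symbols)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return (n / total_duration_s) * entropy

# Toy example: two short utterances, 3 seconds of audio in total
print(bitrate([[11, 7, 26, 11, 31], [7, 7, 26]], total_duration_s=3.0))
```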
ABX on English with speaker conditioning

[Bar chart: ABX error (%) for STE, VQ-VAE and CatVAE, with and without speaker conditioning; conditioning on the speaker lowers the ABX error for all three discretisation methods.]
ABX on English for different compression rates

[Bar chart: ABX error (%) for STE, VQ-VAE and CatVAE with codebook sizes of 64, 256 and 512, comparing no downsampling, ×4 downsampling and ×8 downsampling; each bar is annotated with its bitrate.]
Official evaluation results

Model            CER (%)   MOS [1, 5]   Similarity [1, 5]   ABX (%)   Bitrate
English:
DPGMM-Merlin        75        2.50            2.97            35.6       72
VQ-VAE-x8           75        2.31            2.49            25.1       88
VQ-VAE-x4           67        2.18            2.51            23.0      173
Supervised          44        2.77            2.99            29.9       38
Indonesian:
DPGMM-Merlin        62        2.07            3.41            27.5       75
VQ-VAE-x8           58        1.94            1.95            17.6       69
VQ-VAE-x4           60        1.96            1.76            14.5      140
Supervised          28        3.92            3.95            16.1       35
Synthesised examples

[Audio examples played during the talk: for English and Indonesian utterances, the input, the VQ-VAE-x4 and VQ-VAE-x4-new synthesised outputs, and the target speaker's reference.]
Conclusions

• Speaker conditioning consistently improves performance
• Different discretisation methods perform similarly (VQ-VAE slightly better)
• Different models are difficult to compare because their bitrates differ
• Future work: does discretisation actually benefit feature learning?
Why do we have ten authors on this paper?

Ryan Eloff, André Nortje, Benjamin van Niekerk, Avashna Govender, Leanne Nortje, Arnu Pretorius, Elan van Biljon, Ewald van der Westhuizen, Lisa van Staden, Herman Kamper
https://github.com/kamperh/suzerospeech2019 (Update coming soon)
Straight-through estimation (STE) binarisation

• STE binarisation: z_k = 1 if h_k ≥ 0, otherwise z_k = −1
      h:  0.9  −0.1   0.3   0.7  −0.8
      z:    1    −1     1     1    −1
• For backpropagation we need ∂J/∂h
• For a single element: ∂J/∂h_k = (∂z_k/∂h_k)(∂J/∂z_k)
• What is ∂z_k/∂h_k with z_k = threshold(h_k)? Cannot solve directly: the threshold has zero gradient almost everywhere
• Idea: if z_k ≈ h_k, then we could use ∂J/∂h_k ≈ ∂J/∂z_k
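In an autograd framework, this idea is commonly implemented with the "detach trick": apply the hard threshold in the forward pass but let gradients flow as if z = h. A minimal PyTorch sketch of that pattern (not the paper's exact code):

```python
import torch

def ste_binarise(h):
    """Forward: hard {-1, +1} threshold. Backward: identity (straight-through)."""
    z_hard = torch.where(h >= 0, torch.ones_like(h), -torch.ones_like(h))
    # h + (z_hard - h).detach() equals z_hard in the forward pass,
    # but its gradient w.r.t. h is 1, so dJ/dh ≈ dJ/dz.
    return h + (z_hard - h).detach()

h = torch.tensor([0.9, -0.1, 0.3, 0.7, -0.8], requires_grad=True)
z = ste_binarise(h)
z.sum().backward()
print(z)        # tensor([ 1., -1.,  1.,  1., -1.], grad_fn=...)
print(h.grad)   # tensor([1., 1., 1., 1., 1.]): gradients passed straight through
```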
Straight-through estimation (STE) binarisation

As an example, let us say h_k = 0.7.

[Figure: a number line from −1 to 1 with h_k = 0.7 marked.]
Straight-through estimation (STE) binarisation

Instead of direct thresholding, let us set z_k = 1 with probability 0.85 and z_k = −1 with probability 0.15.

[Figure: samples of z_k on the same number line as h_k = 0.7.]

Estimated mean of z_k over 500 samples: 0.668
Straight-through estimation (STE) binarisation

• So, instead of direct thresholding, we set z_k = h_k + ε, where ε is sampled noise:

      ε = 1 − h_k     with probability (1 + h_k)/2
      ε = −h_k − 1    with probability (1 − h_k)/2

• Since ε is zero-mean, the derivative of the expected value of z_k is:

      ∂E[z_k]/∂h_k = 1

• Therefore, gradients are passed unchanged through the thresholding operation:

      ∂J/∂h ≈ ∂J/∂z
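A quick numerical check of the last two slides (a sketch; the seed and sample count are arbitrary): sampling z_k = h_k + ε with the probabilities above gives values in {−1, +1} whose mean is close to h_k, matching the ≈0.668 estimate shown earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
h_k = 0.7

# z_k = h_k + eps is +1 with probability (1 + h_k)/2 and -1 with probability (1 - h_k)/2
n_samples = 500
p_plus = (1 + h_k) / 2                           # 0.85 for h_k = 0.7
z = np.where(rng.random(n_samples) < p_plus, 1.0, -1.0)

print(np.unique(z))   # [-1.  1.]: the discretised values
print(z.mean())       # close to h_k = 0.7, i.e. E[z_k] = h_k
```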