Automatic Speech Recognition (CS753)
Lecture 25: Speech synthesis (Concluding lecture)
Instructor: Preethi Jyothi
Nov 6, 2017
Recall: SPSS framework
(Pipeline: speech analysis → model training; text analysis → parameter generation → speech synthesis)
Training
• Estimate the acoustic model λ given speech utterances (O) and word sequences (W):
  λ̂ = arg max_λ p(O | W, λ)
Synthesis
• Find the most probable acoustic features ô from λ̂ and a given word sequence w to be synthesised:
  ô = arg max_o p(o | w, λ̂)
• Synthesize speech from ô
Synthesis using duration models
• Context-dependent HMMs and context-dependent state duration densities are combined to build the sentence HMM for the input text
• State durations d₁, d₂, …, d_T are drawn from the duration densities
• Delta features are used so that the generated mel-cepstrum trajectory c₁, c₂, …, c_T is smooth
• Mel-cepstrum and pitch parameters drive an MLSA filter to produce the synthetic speech
Image from Yoshimura et al., “Duration modelling for HMM-based speech synthesis”, ICSLP ‘98
Transforming voice characteristics
• We studied speaker adaptation techniques for ASR:
  • Maximum a posteriori (MAP) estimation
  • Maximum Likelihood Linear Regression (MLLR)
• These can also be applied to speech synthesis
• MLLR: estimate a set of linear transforms (one per regression class) that map an existing general model into an adapted model s.t. the likelihood of the adaptation data is maximized (see the sketch below)
• For limited adaptation data, MLLR is more effective than MAP
Image from Zen et al., “Statistical Parametric Speech Synthesis”, SPECOM 2009
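A minimal sketch of the MLLR mean update, assuming the Gaussian means of one regression class are stacked in a numpy array; the transform A, bias b, and dimensions are illustrative placeholders, not values from the slides.

```python
import numpy as np

def mllr_adapt_means(means, A, b):
    """Apply one MLLR mean transform to all Gaussian means in a regression class.

    means: (num_gaussians, dim) array of HMM-GMM mean vectors
    A:     (dim, dim) linear transform estimated on adaptation data
    b:     (dim,) bias term
    Returns the adapted means  mu' = A @ mu + b  for every Gaussian.
    """
    return means @ A.T + b

# Toy usage: a single regression class with 3 Gaussians of dimension 4.
rng = np.random.default_rng(0)
means = rng.normal(size=(3, 4))
A = np.eye(4) * 0.9          # in practice, A and b are estimated by maximizing
b = np.full(4, 0.1)          # the likelihood of the adaptation data
adapted = mllr_adapt_means(means, A, b)
print(adapted.shape)         # (3, 4)
```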
Transforming voice characteristics
• What if no adaptation data is available?
• HMM parameters can be interpolated across existing speaker models λ₁, λ₂, …, λ_N to obtain a new model λ′
• This lets us synthesize speech with varying voice characteristics not encountered during training
Image from Zen et al., “Statistical Parametric Speech Synthesis”, SPECOM 2009
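A toy sketch of interpolating Gaussian mean vectors from two speakers' models; the vectors and weights are made up for illustration (real systems interpolate the full parameter sets, with weights chosen per target voice).

```python
import numpy as np

def interpolate_means(speaker_means, weights):
    """Weighted interpolation of corresponding per-speaker Gaussian mean vectors.

    speaker_means: (num_speakers, dim) stack of corresponding means
    weights:       (num_speakers,) non-negative interpolation weights
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize to sum to 1
    return weights @ speaker_means             # (dim,)

# Blend two voices 70/30 to get a characteristic between them.
mu_speaker1 = np.array([1.0, 2.0, 3.0])
mu_speaker2 = np.array([0.0, 0.5, 1.0])
mu_new = interpolate_means(np.stack([mu_speaker1, mu_speaker2]), [0.7, 0.3])
print(mu_new)
```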
GMM-based voice conversion
• Parallel training data: align source and target speech frame-by-frame (vocoder analysis followed by DTW alignment)
• Estimate a joint-density GMM (JD-GMM) to model the joint PDF between source and target features
• At conversion time, predict the most likely converted acoustic features given a source acoustic feature sequence, then synthesize the converted speech with a vocoder
Image from Ling et al., “Deep Learning for Acoustic Modeling in Parametric Speech Generation”, 2015
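A hedged sketch of the JD-GMM idea with scikit-learn: fit a GMM on stacked, DTW-aligned source/target frames, then convert a source frame via the per-component conditional means. The feature arrays, dimensions, and the simple frame-wise MMSE conversion are illustrative simplifications of the full trajectory-based conversion.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

D = 24                                        # per-frame feature dimension (assumed)
rng = np.random.default_rng(0)
X_src = rng.normal(size=(2000, D))            # stand-ins for DTW-aligned frames
Y_tgt = 0.8 * X_src + rng.normal(scale=0.1, size=(2000, D))

# 1) Train the joint-density GMM on stacked [source; target] vectors.
Z = np.hstack([X_src, Y_tgt])                 # (num_frames, 2D)
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(Z)

def convert_frame(x):
    """MMSE conversion of one source frame using the joint GMM."""
    log_resp = np.zeros(gmm.n_components)
    y_cond = np.zeros((gmm.n_components, D))
    for m in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[m, :D], gmm.means_[m, D:]
        Sxx = gmm.covariances_[m][:D, :D]
        Syx = gmm.covariances_[m][D:, :D]
        diff = x - mu_x
        _, logdet = np.linalg.slogdet(Sxx)
        # Unnormalized log posterior of mixture m given the source part only.
        log_resp[m] = np.log(gmm.weights_[m]) - 0.5 * (logdet + diff @ np.linalg.solve(Sxx, diff))
        # Conditional mean of the target part given the source part.
        y_cond[m] = mu_y + Syx @ np.linalg.solve(Sxx, diff)
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()
    return resp @ y_cond                      # expected converted frame

y_hat = convert_frame(X_src[0])               # converted acoustic features for frame 0
```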
Neural approaches to speech generation
Recall: DNN-based speech synthesis
• Pipeline: text analysis → input feature extraction → DNN → parameter generation → waveform synthesis
• Input features (binary & numeric, per frame) describe linguistic contexts: numeric values such as # of words, duration of the phoneme, etc.
• Output features are statistics (mean & variance) of the speech parameter vectors: spectral and excitation parameters and their delta values
Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014
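A minimal PyTorch sketch of the frame-level mapping described above: linguistic context features in, acoustic parameters (plus deltas) out. The layer sizes and feature dimensions are assumed placeholders, not the ones used in the cited paper.

```python
import torch
import torch.nn as nn

LINGUISTIC_DIM = 400     # binary + numeric context features per frame (assumed)
ACOUSTIC_DIM = 75        # e.g. spectral + excitation params and their deltas (assumed)

class DNNSynthesis(nn.Module):
    """Feedforward acoustic model: one output vector per input frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LINGUISTIC_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, ACOUSTIC_DIM),   # spectral + excitation parameters (+ deltas)
        )

    def forward(self, x):                    # x: (num_frames, LINGUISTIC_DIM)
        return self.net(x)

model = DNNSynthesis()
frames = torch.randn(100, LINGUISTIC_DIM)    # 100 frames of linguistic features
acoustic = model(frames)                     # (100, ACOUSTIC_DIM); fed to parameter
print(acoustic.shape)                        # generation + vocoder downstream
```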
Recall: RNN-based speech synthesis
• Access long-range context in both forward and backward directions using biLSTMs
• Inference is expensive; bidirectional models inherently have large latency, since the whole utterance must be seen before any output is produced
• Pipeline: text analysis → input feature extraction → biLSTM → output features → vocoder → waveform
Image from Fan et al., “TTS synthesis with BLSTM-based RNNs”, 2014
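A hedged sketch of the bidirectional variant: same frame-level inputs/outputs as the DNN above, but every output frame depends on the entire utterance in both directions, which is why inference cannot start until all input frames are available. Dimensions are placeholders.

```python
import torch
import torch.nn as nn

class BLSTMSynthesis(nn.Module):
    """Bidirectional LSTM acoustic model (dimensions are assumed placeholders)."""
    def __init__(self, in_dim=400, hidden=512, out_dim=75):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):            # x: (batch, num_frames, in_dim)
        h, _ = self.blstm(x)         # needs the *entire* utterance before any output
        return self.proj(h)          # (batch, num_frames, out_dim)

model = BLSTMSynthesis()
y = model(torch.randn(1, 200, 400))  # one utterance of 200 frames
print(y.shape)                       # torch.Size([1, 200, 75])
```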
Frame-synchronous streaming speech synthesis
• Duration LSTM-RNN: predicts phoneme durations d̂(i) from phoneme-level linguistic features x(1), …, x(N)
• Acoustic LSTM-RNN (with a recurrent output layer): predicts acoustic features ŷ(i) from frame-level linguistic features derived using the predicted durations
• A vocoder converts the acoustic features to the waveform frame by frame, enabling low-latency streaming synthesis
Image from Zen & Sak, “Unidirectional LSTM RNNs for low-latency speech synthesis”, 2015
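A rough sketch of the two-stage unidirectional pipeline in the figure: a duration LSTM predicts how many frames each phone lasts, the predicted durations expand phone-level features into frame-level features, and a unidirectional acoustic LSTM emits vocoder parameters frame by frame. All dimensions, the rounding of durations, and the single position feature are simplifications I have assumed.

```python
import torch
import torch.nn as nn

dur_lstm = nn.LSTM(300, 256, batch_first=True)   # phone-level linguistic features in
dur_out = nn.Linear(256, 1)                      # predicted duration (frames) out
ac_lstm = nn.LSTM(301, 512, batch_first=True)    # frame-level linguistic features in
ac_out = nn.Linear(512, 75)                      # vocoder parameters out

phone_feats = torch.randn(1, 20, 300)            # 20 phones (placeholder features)

# Stage 1: predict per-phone durations (in frames).
h, _ = dur_lstm(phone_feats)
durations = dur_out(h).squeeze(-1).clamp(min=1).round().long()   # (1, 20)

# Stage 2: repeat each phone's features for its predicted number of frames and
# append a within-phone position feature, then run the acoustic LSTM over frames.
frame_feats = []
for p in range(phone_feats.size(1)):
    d = int(durations[0, p])
    pos = torch.linspace(0, 1, d).unsqueeze(1)                   # frame position in phone
    frame_feats.append(torch.cat([phone_feats[0, p].expand(d, -1), pos], dim=1))
frames = torch.cat(frame_feats).unsqueeze(0)     # (1, total_frames, 301)

h, _ = ac_lstm(frames)                           # unidirectional: can run as frames arrive
vocoder_params = ac_out(h)                       # (1, total_frames, 75)
print(vocoder_params.shape)
```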
Deep generative models
• A deep generative model maps a code sampled from a simple distribution (Gaussian, uniform, etc.) to realistic data (images, sounds, etc.)
• Example: Autoregressive models (Wavenet)
Image from https://blog.openai.com/generative-models/
Wavenet
• Speech synthesis using an auto-regressive generative model
• Generates the waveform sample by sample, at a 16 kHz sampling rate
Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Wavenet
• Wavenet uses “dilated convolutions”
• Main limitation: very slow generation rate [Oct 2017: Wavenet deployed in Google Assistant¹]
Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/
¹ https://techcrunch.com/2017/10/04/googles-wavenet-machine-learning-based-speech-synthesis-comes-to-assistant/
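A small sketch of one dilated causal convolution layer of the kind WaveNet stacks, with dilation doubling per layer so the receptive field grows exponentially with depth. The channel counts and the toy four-layer stack are assumptions, and the real model adds gated activations, residual/skip connections, and conditioning on linguistic features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """Causal 1-D convolution: pad only on the left so output[t] depends on input[<=t]."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

# Toy stack with dilations 1, 2, 4, 8: depth k covers ~2^k past samples,
# which is how long audio contexts are covered with few layers.
layers = nn.Sequential(*[DilatedCausalConv1d(32, dilation=2 ** i) for i in range(4)])
x = torch.randn(1, 32, 16000)                  # one second of features at 16 kHz
print(layers(x).shape)                         # torch.Size([1, 32, 16000])
```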
Wavenet
• Reduced the gap between state-of-the-art systems and human performance by > 50%
• [Audio examples played in class: Recording 1, Recording 2, Recording 3 — which of the three recordings sounded most natural?]
Deep generative models
• A deep generative model maps a code sampled from a simple distribution (Gaussian, uniform, etc.) to true data (images, sounds, etc.)
• Example: Generative Adversarial Networks (GANs)
Image from https://blog.openai.com/generative-models/
GANs
• The training process is formulated as a game between a generator network and a discriminator network
• Objective of the generator: create samples that seem to come from the same distribution as the training data
• Objective of the discriminator: examine samples and distinguish fake (generated) samples from real ones
• The solution to this game is an equilibrium between the generator and the discriminator
• Refer to [Goodfellow16] for a detailed tutorial on GANs
[Goodfellow16]: https://arxiv.org/pdf/1701.00160.pdf
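A compact, generic GAN training step (not specific to speech), included to make the two objectives concrete; the tiny networks, data, and hyperparameters are stand-ins I have assumed.

```python
import torch
import torch.nn as nn

# G maps noise -> samples; D maps samples -> probability of "real".
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.randn(32, 2) + 3.0               # stand-in for a batch of real data

# Discriminator step: label real data as 1 and generated data as 0.
fake = G(torch.randn(32, 16)).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make D label generated samples as real.
fake = G(torch.randn(32, 16))
loss_g = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```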
GANs for speech synthesis
• The generator produces synthesised speech from linguistic features (plus random noise), which the discriminator (a binary classifier) tries to distinguish from natural samples
• Per the figure, the generator is trained with an MSE loss against the natural samples in addition to the adversarial loss from the discriminator
• During synthesis, random noise + linguistic features are fed to the generator to produce speech
Image from Yang et al., “SPSS using GANs”, 2017
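A hedged sketch of the combined generator objective suggested by the figure (MSE against natural acoustic features plus an adversarial term); the tiny stand-in discriminator, the feature dimension, and the trade-off weight are all assumptions, not values from the cited paper.

```python
import torch
import torch.nn as nn

ACOUSTIC_DIM = 75                              # assumed vocoder-parameter dimension
D = nn.Sequential(nn.Linear(ACOUSTIC_DIM, 64), nn.ReLU(),
                  nn.Linear(64, 1), nn.Sigmoid())   # stand-in discriminator
mse, bce = nn.MSELoss(), nn.BCELoss()
adv_weight = 0.25                              # assumed trade-off between the two terms

natural = torch.randn(32, ACOUSTIC_DIM)                        # frames from real speech
generated = torch.randn(32, ACOUSTIC_DIM, requires_grad=True)  # stand-in generator output

# Generator objective from the figure: match the natural features (MSE) AND
# fool the discriminator (adversarial term pushes D(generated) toward "real").
loss_g = mse(generated, natural) + adv_weight * bce(D(generated), torch.ones(32, 1))
loss_g.backward()
```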
Course conclusion
Topics covered
• Formalism: Finite State Transducers
• Acoustic models (phones): Hidden Markov Models, hybrid HMM-DNN systems, deep neural network models
• Speaker adaptation and discriminative training
• Pronunciation models and G2P; properties of speech sounds
• Acoustic feature generation and signal processing of the speech signal O
• Language models: Ngram/RNN LMs
• Search algorithms: decoding the most likely word sequence W*
Topics covered
• End-to-end models: mapping acoustic signals O directly to word sequences W*, together with Ngram/RNN LMs
• Also briefly covered: conversational agents, speech synthesis
Exciting time to do speech research
Remaining coursework
Final Exam Syllabus 1. WFST algorithms + WFSTs used in ASR 2. EM algorithm 3. HMMs + Tied state Triphone HMMs 4. DNN/RNN-based acoustic models 5. N-gram/RNN language models 6. CTC end-to-end ASR 7. Pronunciation models 8. Search & decoding 9. Discriminative training for HMMs 10. Basics of speaker adaptation 11. HMM-based speech synthesis models In the final exam, questions can be asked on any of the 11 topics listed above. Weightage of topics will be shared later on Moodle.
Final Project Deliverables
• 5-8 page final report:
  ✓ Task definition, methodology, prior work, implementation details, experimental setup, experiments and discussion, error analysis (if any), summary
• Short talk summarizing the project:
  ✓ Each team will get 10 minutes for their presentation and 10 minutes for Q/A
  ✓ Clearly demarcate which team member worked on which part
Final Project Schedule
• Presentations will tentatively be held on Nov 27th and Nov 28th
• The final report in PDF format should be sent to pjyothi@cse.iitb.ac.in before Nov 27th
• The order of presentations will be decided by lottery and shared via Moodle by Nov 18th
Final Project Grading
• Break-up of 20 points:
  • 6 points for the report
  • 4 points for the presentation
  • 6 points for Q/A
  • 4 points for overall evaluation of the project