  1. Automatic Speech Recognition (CS753), Lecture 25: Speech synthesis (concluding lecture). Instructor: Preethi Jyothi. Nov 6, 2017


  2. Recall: SPSS framework
 [Diagram: speech passes through speech analysis to give observations O, and text through text analysis to give word sequences W; training estimates the model λ; at synthesis time, parameter generation produces ô, which is synthesized into speech.]
 Training
 • Estimate the acoustic model λ given speech utterances (O) and word sequences (W):
   λ̂ = arg max_λ p(O | W, λ)
 Synthesis
 • Find the most probable ô from λ̂ and a given word sequence w to be synthesised:
   ô = arg max_o p(o | w, λ̂)
 • Synthesize speech from ô

  3. Synthesis using duration models
 [Diagram: TEXT is converted to a sentence HMM built from context-dependent HMMs; state durations d1, d2, ... are drawn from context-dependent duration densities clustered with a decision tree; the mel-cepstrum c1 ... cT and pitch are generated and passed through an MLSA filter to produce SYNTHETIC SPEECH.]
 • Use delta features for smooth trajectories (a toy duration-to-frames sketch follows)
 Image from Yoshimura et al., “Duration modelling for HMM-based speech synthesis”, ICSLP ‘98
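To make the role of the duration model concrete, here is a minimal sketch (not from the lecture) in which a Gaussian duration density per HMM state decides how many frames that state emits; the state means then give a piecewise-constant trajectory that maximum-likelihood parameter generation with delta features would smooth. All dimensions and numbers below are made up.

```python
import numpy as np

# Hypothetical per-state duration densities (mean, std in frames) and
# per-state mel-cepstral mean vectors -- illustrative values only.
rng = np.random.default_rng(0)
state_durations = [(8.0, 2.0), (12.0, 3.0), (6.0, 1.5)]           # Gaussian duration models
state_means = [rng.standard_normal(25) for _ in state_durations]  # 25-dim mel-cepstra

frames = []
for (mu_d, sigma_d), c_mean in zip(state_durations, state_means):
    d = max(1, int(np.round(rng.normal(mu_d, sigma_d))))  # sampled state duration in frames
    frames.extend([c_mean] * d)                           # emit d frames from this state

trajectory = np.stack(frames)   # (T, 25) piecewise-constant trajectory; in the actual
print(trajectory.shape)         # system, delta features + ML parameter generation smooth it
```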

  4. Transforming voice characteristics
 • We studied speaker adaptation techniques for ASR: maximum a posteriori (MAP) estimation and maximum likelihood linear regression (MLLR)
 • These can also be applied to speech synthesis
 • MLLR: estimate a set of linear transforms that map an existing model into an adapted model such that the likelihood of the adaptation data is maximized (see the sketch after this slide)
 • For limited adaptation data, MLLR is more effective than MAP
 [Diagram: linear transforms tied through a regression-class tree map the general model to the transformed model.]
 Image from Zen et al., “Statistical Parametric Speech Synthesis”, SPECOM 2009
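A minimal numpy sketch of the MLLR idea: each regression class r has an affine transform (A_r, b_r) applied to the Gaussian means of the general model. The transforms below are placeholders; in MLLR they would be estimated to maximize the likelihood of the adaptation data.

```python
import numpy as np

D = 3
rng = np.random.default_rng(1)
means = rng.standard_normal((10, D))       # 10 Gaussian means from the general model
reg_class = rng.integers(0, 2, size=10)    # regression class of each Gaussian
A = [np.eye(D) * 1.1, np.eye(D) * 0.9]     # one transform per class (placeholder values)
b = [np.full(D, 0.2), np.full(D, -0.1)]

# Adapted mean for each Gaussian: mu_hat = A_r @ mu + b_r
adapted = np.array([A[r] @ mu + b[r] for mu, r in zip(means, reg_class)])
```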

  5. Transforming voice characteristics
 • What if no adaptation data is available?
 [Diagram: a new model λ′ is obtained from existing models λ1, λ2, λ3, λ4 with interpolation ratios I(λ′, λ1), ..., I(λ′, λ4).]
 • HMM parameters can be interpolated (a toy sketch follows)
 • Synthesize speech with varying voice characteristics not encountered during training
 Image from Zen et al., “Statistical Parametric Speech Synthesis”, SPECOM 2009
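As an illustration only, a toy sketch of the simplest form of this idea: linearly interpolating the Gaussian output parameters of several speaker-dependent models with weights that sum to one. The cited work uses more careful interpolation criteria; the arrays here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
# Means/variances of the same Gaussian in four speaker-dependent models (placeholders).
mus = rng.standard_normal((4, 25))
vars_ = rng.uniform(0.5, 1.5, size=(4, 25))
a = np.array([0.4, 0.3, 0.2, 0.1])          # interpolation weights, sum to 1

mu_new = (a[:, None] * mus).sum(axis=0)     # interpolated mean of the new "voice"
var_new = (a[:, None] * vars_).sum(axis=0)  # one simple choice of variance combination
```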

  6. GMM-based voice conversion
 • Parallel training data: vocoder analysis of source and target speech, aligned frame-by-frame with DTW
 • Estimate a joint-distribution GMM (JD-GMM) to model the joint PDF of source and target features (a toy sketch follows the slide)
 • At conversion time, predict the most likely converted acoustic features given a source acoustic feature sequence, then synthesize with a vocoder
 [Diagram: training pipeline (source/target vocoder analysis, DTW alignment, JD-GMM training) and conversion pipeline (source analysis, GMM acoustic parameter conversion, vocoder synthesis).]
 Image from Ling et al., “Deep Learning for Acoustic Modeling in Parametric Speech Generation”, 2015
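A toy JD-GMM sketch, assuming the source/target frames are already DTW-aligned: fit a GMM on stacked [x; y] vectors with scikit-learn and convert a source frame by the per-frame conditional expectation E[y | x]. The full method also uses dynamic features and trajectory-level conversion; data, dimensions, and mixture count below are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
D = 4                                   # feature dimension per speaker
X = rng.standard_normal((500, D))       # aligned source frames (placeholder data)
Y = X @ rng.standard_normal((D, D)) + 0.1 * rng.standard_normal((500, D))  # aligned targets
Z = np.hstack([X, Y])                   # joint vectors z = [x; y]

gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=0).fit(Z)

def convert(x):
    """E[y | x] under the joint-density GMM."""
    post = np.zeros(gmm.n_components)
    cond = np.zeros((gmm.n_components, D))
    for m in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[m][:D], gmm.means_[m][D:]
        S = gmm.covariances_[m]
        Sxx, Sxy = S[:D, :D], S[:D, D:]
        diff = x - mu_x
        # responsibility of mixture m given x, via the x-marginal of the joint GMM
        post[m] = gmm.weights_[m] * np.exp(-0.5 * diff @ np.linalg.solve(Sxx, diff)) \
                  / np.sqrt(np.linalg.det(2 * np.pi * Sxx))
        # conditional mean of y given x under mixture m
        cond[m] = mu_y + Sxy.T @ np.linalg.solve(Sxx, diff)
    post /= post.sum()
    return post @ cond

y_hat = convert(X[0])   # converted acoustic feature vector for one source frame
```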

  7. Neural approaches to speech generation

  8. Recall: DNN-based speech synthesis
 [Diagram: TEXT goes through text analysis and input feature extraction; frame-level input features pass through the input layer, hidden layers, and output layer of a DNN; the outputs are statistics (mean & var) of the speech parameter vector sequence, from which parameter generation and waveform synthesis produce SPEECH.]
 • Input features (binary & numeric, one vector per frame): linguistic contexts and numeric values (# of words, duration of the phoneme, etc.)
 • Output features: spectral and excitation parameters and their delta values (a minimal model sketch follows)
 Image from Zen et al., “Statistical Parametric Speech Synthesis using DNNs”, 2014
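A minimal PyTorch sketch of this kind of frame-level DNN acoustic model, mapping a linguistic feature vector to an acoustic feature vector with an MSE loss. Layer sizes, feature dimensions, and the activation are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

lin_dim, out_dim = 300, 75          # assumed input/output feature dimensions
model = nn.Sequential(
    nn.Linear(lin_dim, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, out_dim),       # predicts the acoustic feature vector per frame
)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

x = torch.randn(32, lin_dim)        # a batch of frame-level linguistic features (placeholder)
y = torch.randn(32, out_dim)        # corresponding acoustic targets (placeholder)
loss = loss_fn(model(x), y)         # one MSE training step
opt.zero_grad(); loss.backward(); opt.step()
```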

  9. Recall: RNN-based speech synthesis
 [Diagram: text analysis and input feature extraction produce input features; biLSTM layers map them to output features, which a vocoder turns into the waveform.]
 • Access long-range context in both forward and backward directions using biLSTMs
 • Inference is expensive and inherently has large latency (see the sketch below)
 Image from Fan et al., “TTS synthesis with BLSTM-based RNNs”, 2014
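A minimal BLSTM sketch (dimensions are placeholders): because the backward recurrence needs the whole utterance before any output frame can be emitted, this architecture cannot stream, which motivates the next slide.

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    def __init__(self, lin_dim=300, hidden=256, out_dim=75):
        super().__init__()
        self.rnn = nn.LSTM(lin_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)   # forward + backward hidden states

    def forward(self, x):            # x: (batch, T, lin_dim) linguistic features
        h, _ = self.rnn(x)
        return self.proj(h)          # (batch, T, out_dim) acoustic features

y = BLSTMAcousticModel()(torch.randn(2, 100, 300))   # needs the full sequence up front
```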

  10. Frame-synchronous streaming speech synthesis
 [Diagram: TEXT goes through text analysis and linguistic feature extraction; phoneme-level linguistic features x(1) ... x(N) feed a duration LSTM-RNN that predicts phoneme durations d̂(i); frame-level linguistic features then feed an acoustic LSTM-RNN with a recurrent output layer that emits acoustic features ŷ(i), which a vocoder turns into the waveform.]
 Image from Zen & Sak, “Unidirectional LSTM RNNs for low-latency speech synthesis”, 2015
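A rough sketch of the two-stage unidirectional setup, with made-up feature dimensions and a faked frame-level input construction, just to show how a duration LSTM and an acoustic LSTM fit together for streaming synthesis.

```python
import torch
import torch.nn as nn

class UniLSTM(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, batch_first=True)  # unidirectional: streamable
        self.out = nn.Linear(hidden, out_dim)
    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h)

duration_model = UniLSTM(in_dim=200, out_dim=1)    # phoneme-level features -> duration
acoustic_model = UniLSTM(in_dim=205, out_dim=75)   # frame-level features -> acoustic params

phone_feats = torch.randn(1, 20, 200)              # 20 phonemes (placeholder features)
durations = duration_model(phone_feats).clamp(min=1).round()   # frames per phoneme
# Frame-level inputs would repeat each phoneme's features for its predicted duration
# plus positional features; here we simply fake the resulting tensor:
frame_feats = torch.randn(1, int(durations.sum()), 205)
acoustic = acoustic_model(frame_feats)             # can be emitted frame by frame
```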

  11. Deep generative models
 [Diagram: a code (Gaussian, Uniform, etc.) is mapped by a deep generative model to real data (images, sounds, etc.).]
 Example: Autoregressive models (Wavenet)
 Image from https://blog.openai.com/generative-models/

  12. Wavenet
 • Speech synthesis using an auto-regressive generative model
 • Generates the waveform sample by sample at a 16 kHz sampling rate (a conceptual sampling loop is sketched below)
 Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/
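Conceptually, sample-by-sample generation looks like the loop below. `net` stands in for a trained WaveNet that outputs logits over 256 quantised amplitude levels; here it just returns random logits, and the history length is arbitrary.

```python
import torch

def net(context):                       # hypothetical model: past samples -> logits
    return torch.randn(256)

samples = []
context = torch.zeros(1600)             # receptive-field-sized history of past samples
for t in range(16000):                  # one second of audio at 16 kHz, one sample at a time
    probs = torch.softmax(net(context), dim=0)
    s = torch.multinomial(probs, 1)     # sample the next quantised amplitude
    samples.append(s.item())
    context = torch.cat([context[1:], s.float()])   # slide the history window forward
```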

  13. Wavenet
 • Wavenet uses “dilated convolutions” (a sketch of a dilated causal stack follows)
 • Main limitation: very slow generation rate [Oct 2017: Wavenet deployed in Google Assistant 1]
 Gif from https://deepmind.com/blog/wavenet-generative-model-raw-audio/
 1 https://techcrunch.com/2017/10/04/googles-wavenet-machine-learning-based-speech-synthesis-comes-to-assistant/
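A minimal sketch of stacked dilated causal 1-D convolutions: doubling the dilation each layer makes the receptive field grow exponentially with depth. The gated activations, residual/skip connections, and channel sizes of the real WaveNet are omitted or invented here.

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, n_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(n_layers)            # dilations 1, 2, 4, ..., 128
        )

    def forward(self, x):                       # x: (batch, channels, T)
        for conv in self.layers:
            d = conv.dilation[0]
            pad = nn.functional.pad(x, (d, 0))  # left-pad so the convolution stays causal
            x = torch.relu(conv(pad))
        return x                                # same length T, much larger receptive field

out = DilatedCausalStack()(torch.randn(1, 32, 1000))
```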

  14. Wavenet
 • Reduced the gap between the previous state of the art and human performance by > 50%
 • [Audio demos: Recording 1, Recording 2, Recording 3] Which of the three recordings sounded most natural?

  15. Deep generative models
 [Diagram: a code (Gaussian, Uniform, etc.) is mapped by a deep generative model to true data (images, sounds, etc.).]
 Example: Generative Adversarial Networks (GANs)
 Image from https://blog.openai.com/generative-models/

  16. GANs
 • The training process is formulated as a game between a generator network and a discriminator network
 • Objective of the generator: create samples that seem to be from the same distribution as the training data
 • Objective of the discriminator: examine a sample and distinguish between fake and real samples
 • The solution to this game is an equilibrium between the generator and the discriminator (a minimal training-step sketch follows)
 • Refer to [Goodfellow16] for a detailed tutorial on GANs
 [Goodfellow16]: https://arxiv.org/pdf/1701.00160.pdf
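To make the game concrete, a toy PyTorch training step with the standard binary cross-entropy losses; the networks, dimensions, and data are placeholders unrelated to speech.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))   # code -> sample
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))    # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 32)                       # a batch of "real" samples (placeholder)
z = torch.randn(8, 16)                          # a batch of codes

# Discriminator step: label real samples 1, generated samples 0.
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(G(z).detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator label generated samples as real.
g_loss = bce(D(G(z)), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```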

  17. GANs for speech synthesis
 • Generator: during synthesis, random noise + linguistic features generate the predicted speech samples; training combines an MSE term with the adversarial term (a loss sketch follows)
 • Discriminator: a binary classifier that distinguishes the synthesised samples from natural (real) speech
 [Diagram: linguistic features and noise feed the generator, which outputs predicted samples; predicted and natural samples feed the discriminator.]
 Image from Yang et al., “SPSS using GANs”, 2017
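Assuming the combined objective suggested by the figure (an MSE term plus an adversarial term with a weighting factor), a minimal sketch of the generator loss; `w_adv`, the function name, and all tensors are hypothetical.

```python
import torch
import torch.nn as nn

bce, mse = nn.BCEWithLogitsLoss(), nn.MSELoss()
w_adv = 0.1   # assumed weight on the adversarial term

def generator_loss(predicted, natural, disc_logits_on_predicted):
    # Adversarial term: the generator wants the discriminator to call its output "real".
    adv = bce(disc_logits_on_predicted, torch.ones_like(disc_logits_on_predicted))
    # MSE term against the natural acoustic features, plus weighted adversarial term.
    return mse(predicted, natural) + w_adv * adv

loss = generator_loss(torch.randn(8, 75), torch.randn(8, 75), torch.randn(8, 1))
```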

  18. Course conclusion

  19. Topics covered
 [Course overview diagram of the ASR pipeline: the speech signal O passes through acoustic signal processing / feature generation into acoustic models (hidden Markov models over phones, deep neural networks, hybrid HMM-DNN systems, speaker adaptation, discriminative training), and search combines these with pronunciation models (G2P) and language models (n-gram/RNN LMs) to produce the word sequence W*. Also covered: the formalism of finite state transducers, properties of speech sounds, and search algorithms.]

  20. Topics covered
 [Diagram: end-to-end models, together with n-gram/RNN LMs, map the acoustic signal O directly to the word sequence W*.]
 Also, briefly covered: Conversational Agents and Speech Synthesis

  21. Exciting time to do speech research

  22. Remaining coursework

  23. Final Exam Syllabus
 1. WFST algorithms + WFSTs used in ASR
 2. EM algorithm
 3. HMMs + tied-state triphone HMMs
 4. DNN/RNN-based acoustic models
 5. N-gram/RNN language models
 6. CTC end-to-end ASR
 7. Pronunciation models
 8. Search & decoding
 9. Discriminative training for HMMs
 10. Basics of speaker adaptation
 11. HMM-based speech synthesis models
 In the final exam, questions can be asked on any of the 11 topics listed above. Weightage of topics will be shared later on Moodle.

  24. Final Project Deliverables
 • 5-8 page final report:
 ✓ Task definition, Methodology, Prior work, Implementation Details, Experimental Setup, Experiments and Discussion, Error Analysis (if any), Summary
 • Short talk summarizing the project:
 ✓ Each team will get 10 minutes for their presentation and 10 minutes for Q/A
 ✓ Clearly demarcate which team member worked on what part

  25. Final Project Schedule
 • Presentations will tentatively be held on Nov 27th and Nov 28th
 • The final report in pdf format should be sent to pjyothi@cse.iitb.ac.in before Nov 27th
 • The order of presentations will be decided on a lottery basis and shared via Moodle by Nov 18th

  26. Final Project Grading
 • Break-up of 20 points:
 • 6 points for the report
 • 4 points for the presentation
 • 6 points for Q/A
 • 4 points for overall evaluation of the project
