End-to-end approach to ASR, TTS and Speech Translation


  1. End-to-end approach to ASR, TTS and Speech Translation
Satoshi Nakamura (1,2), with Sakriani Sakti (1,2), Andros Tjandra (1,2), Takatomo Kano, and Quoc Truong Do
(1) Nara Institute of Science & Technology, Japan; (2) RIKEN Center for Advanced Intelligence Project AIP, Japan
Satoshi Nakamura @ AHCLab, NAIST, Japan | Aug. 14th, 2019, CCF MTG, Xining, China

  2. Outline
• Machine Speech Chain
  • Machine Speech Chain: Listening while Speaking
    • Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, “Listening while Speaking: Speech Chain by Deep Learning”, ASRU 2017
  • Speech Chain with One-shot Speaker Adaptation
    • Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, “Machine Speech Chain with One-shot Speaker Adaptation”, Proceedings of INTERSPEECH 2018
  • End-to-end Feedback Loss in Speech Chain Framework via Straight-through Estimator
    • A. Tjandra, S. Sakti, S. Nakamura, “End-to-end Feedback Loss in Speech Chain Framework via Straight-through Estimator”, in Proc. ICASSP, 2019
• End-to-end Speech-to-speech Translation
  • Structure Based Curriculum Learning for End-to-end English-Japanese Speech Translation
    • Takatomo Kano, Sakriani Sakti, Satoshi Nakamura, “Structure Based Curriculum Learning for End-to-end English-Japanese Speech Translation”, INTERSPEECH 2017


  4. Motivation / Background
• In human communication, the closed-loop speech chain relies on a critical auditory feedback mechanism.
• Children who lose their hearing often have difficulty producing clear speech.
(Figure: the speech chain loop: speaking via motor nerves, listening via sensory nerves, closed by auditory feedback.)

  5. Speech Chain: Denes & Pinson, 1973

  6. Delayed Auditory Feedback *1,2
• DAF: a device that lets a user speak into a microphone and then hear his or her own voice in headphones a fraction of a second later.
• Effects in people who stutter
  • Those who stutter had an abnormal speech-auditory feedback loop that was corrected or bypassed while speaking under DAF.
• Effects in normal speakers
  • DAF is applied to non-stutterers to see what it can reveal about the structure of the auditory and verbal pathways in the brain.
  • Indirect effects of DAF in non-stutterers include a reduced speech rate, increased intensity, and increased fundamental frequency, as speakers try to overcome the feedback. Direct effects include repetition of syllables, mispronunciations, omissions, and omitted word endings.
*1 Bernard S. Lee, “Delayed Speech Feedback”, The Journal of the Acoustical Society of America 22, 824 (1950). *2 Wikipedia, “Delayed Auditory Feedback”.

  7. Human-Machine Interaction
• Modality in human-machine interaction
• Providing a technology with the ability to listen and speak
(Figure: the human speech chain paired with a machine: speech recognition listens to “Good afternoon” and outputs the recognized words, and speech synthesis speaks the reply “How are you?”.)

  8. Machine Speech Chain
• Proposed method:
  • Develop a closed-loop speech chain model based on deep learning
  • The first deep learning model that integrates human speech perception & production behaviors
  • Not only has the capability to listen and speak, but can also listen while speaking
(Figure: a human speech chain, “Good afternoon” / “How are you?”, next to its machine counterpart, both closed by auditory feedback.)

  9. Motivation / Background
• Despite the close relationship between speech perception & production, ASR and TTS research has progressed independently.

Property        | ASR                                      | TTS
----------------|------------------------------------------|-----------------------------------------------
Speech features | MFCC, Mel-fbank                          | MGC, log F0, Voiced/Unvoiced, BAP
Text features   | Phoneme, Character                       | Phoneme + POS + LEX + … (full context label)
Model           | GMM-HMM, hybrid DNN/HMM, end-to-end ASR  | GMM-HSMM, DNN-HSMM, end-to-end TTS

  10. Machine Speech Chain
• Definition:
  • y = original speech, z = original text
  • ŷ = predicted speech, ẑ = predicted text
  • ASR(y): y → ẑ (a seq2seq model that transforms speech to text)
  • TTS(z): z → ŷ (a seq2seq model that transforms text to speech)
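To make the notation concrete, here is a minimal sketch assuming PyTorch. SimpleASR and SimpleTTS are hypothetical stand-ins for the attention-based seq2seq models detailed on slides 14 and 15; for brevity each emits one output per input step, whereas the real models decode autoregressively with attention.

```python
import torch
import torch.nn as nn

class SimpleASR(nn.Module):
    """ASR(y): y -> z_hat; maps speech feature frames to character logits."""
    def __init__(self, feat_dim=80, vocab_size=30, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)

    def forward(self, y):             # y: (batch, T, feat_dim)
        h, _ = self.encoder(y)        # encoder states h^f_{1..T}
        return self.classifier(h)     # (batch, T, vocab_size) logits

class SimpleTTS(nn.Module):
    """TTS(z): z -> y_hat; maps character ids to speech feature frames."""
    def __init__(self, vocab_size=30, feat_dim=80, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.projector = nn.Linear(hidden, feat_dim)

    def forward(self, z):             # z: (batch, U) character ids
        h, _ = self.decoder(self.embed(z))
        return self.projector(h)      # (batch, U, feat_dim) frames
```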

  11. Machine Speech Chain, Case #1: Supervised Learning with Speech-Text Data
• Given a paired speech-text example (y, z)
• Train ASR and TTS with supervised learning
• Directly optimize:
  • ASR by minimizing L_ASR(z, ẑ)
  • TTS by minimizing L_TTS(y, ŷ)
• Update both ASR and TTS independently (sketched below)
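A minimal sketch of one Case #1 update, reusing the hypothetical SimpleASR/SimpleTTS modules above. The one-character-per-frame alignment is a toy assumption made so the tensor shapes line up; it is not part of the original method.

```python
import torch
import torch.nn.functional as F

def supervised_step(asr, tts, y, z, asr_opt, tts_opt):
    """y: (batch, T, feat_dim) speech; z: (batch, T) character ids."""
    # ASR branch: minimize L_ASR(z, z_hat) via cross-entropy
    asr_opt.zero_grad()
    z_logits = asr(y)                                    # (batch, T, vocab)
    loss_asr = F.cross_entropy(z_logits.transpose(1, 2), z)
    loss_asr.backward()
    asr_opt.step()

    # TTS branch: minimize the reconstruction loss L_TTS(y, y_hat)
    tts_opt.zero_grad()
    y_hat = tts(z)                                       # (batch, T, feat_dim)
    loss_tts = F.mse_loss(y_hat, y)
    loss_tts.backward()
    tts_opt.step()
    return loss_asr.item(), loss_tts.item()
```

Each model keeps its own optimizer (e.g. torch.optim.Adam over asr.parameters() and tts.parameters() separately), matching the independent updates on the slide.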

  12. Machine Speech Chain, Case #2: Unsupervised Learning with Text Only
• Given the unlabeled text features z:
  1. TTS generates speech features ŷ
  2. Based on ŷ, ASR tries to reconstruct the text features ẑ
  3. Calculate L_ASR(z, ẑ) between the original text features z and the predicted ẑ
→ It is possible to improve ASR with text only, with the support of TTS (sketched below).
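A minimal sketch of the Case #2 loop under the same toy assumptions: the TTS is frozen and only synthesizes, so the gradient of L_ASR reaches the ASR parameters only.

```python
import torch
import torch.nn.functional as F

def text_only_step(asr, tts, z, asr_opt):
    """z: (batch, T) character ids from a text-only corpus."""
    with torch.no_grad():           # TTS only synthesizes; it is not updated
        y_hat = tts(z)              # step 1: generate speech features y_hat
    asr_opt.zero_grad()
    z_logits = asr(y_hat)           # step 2: ASR tries to recover the text
    loss = F.cross_entropy(z_logits.transpose(1, 2), z)  # step 3: L_ASR(z, z_hat)
    loss.backward()                 # gradient reaches ASR parameters only
    asr_opt.step()
    return loss.item()
```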

  13. Machine Speech Chain, Case #3: Unsupervised Learning with Speech Only
• Given the unlabeled speech features y:
  1. ASR predicts the most probable transcription ẑ
  2. Based on ẑ, TTS tries to reconstruct the speech features ŷ
  3. Calculate L_TTS(y, ŷ) between the original speech features y and the predicted ŷ
→ It is possible to improve TTS with speech only, with the support of ASR (sketched below).
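A minimal sketch of the Case #3 loop: the ASR is frozen and supplies a pseudo-transcription (greedy argmax here is a stand-in for whatever decoding the real system uses), so the gradient of L_TTS reaches the TTS parameters only.

```python
import torch
import torch.nn.functional as F

def speech_only_step(asr, tts, y, tts_opt):
    """y: (batch, T, feat_dim) speech from a speech-only corpus."""
    with torch.no_grad():                  # ASR only transcribes; not updated
        z_hat = asr(y).argmax(dim=-1)      # step 1: greedy pseudo-transcription
    tts_opt.zero_grad()
    y_hat = tts(z_hat)                     # step 2: TTS reconstructs the speech
    loss = F.mse_loss(y_hat, y)            # step 3: L_TTS(y, y_hat)
    loss.backward()                        # gradient reaches TTS parameters only
    tts_opt.step()
    return loss.item()
```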

  14. Sequence-to-Sequence ASR
Input & output:
• y = [y_1, ..., y_T] (speech features)
• z = [z_1, ..., z_U] (text)
Model states:
• h^f_{1..T} = encoder states
• h^e_u = decoder state at time u
• a_{u,t} = attention probability over encoder step t at decoder time u:
  a_{u,t} = \mathrm{Align}(h^f_t, h^e_u) = \frac{\exp(\mathrm{Score}(h^f_t, h^e_u))}{\sum_{s=1}^{T} \exp(\mathrm{Score}(h^f_s, h^e_u))}
• c_u = \sum_{t=1}^{T} a_{u,t} \, h^f_t (expected context)
Loss function:
  \mathcal{L}_{ASR}(z, p_z) = -\frac{1}{U} \sum_{u=1}^{U} \sum_{c \in [1..C]} \mathbb{1}(z_u = c) \log p_{z_u}[c]
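The slide leaves Score(·, ·) abstract; the sketch below picks a plain dot product as one possible choice and computes the attention weights a_{u,t} and the expected context c_u for a single decoder step.

```python
import torch

def attend(h_f, h_e_u):
    """h_f: (T, d) encoder states; h_e_u: (d,) decoder state at step u."""
    scores = h_f @ h_e_u                    # Score(h^f_t, h^e_u) for every t
    a_u = torch.softmax(scores, dim=0)      # a_{u,t}: the Align(.) probabilities
    c_u = (a_u.unsqueeze(1) * h_f).sum(0)   # c_u: expected context vector
    return a_u, c_u
```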

  15. Sequence-to-Sequence TTS
Input & output:
• y^R = [y^R_1, ..., y^R_T] (linear spectrogram features)
• y^M = [y^M_1, ..., y^M_T] (mel spectrogram features)
• b = [b_1, ..., b_T] (end-of-speech prediction)
• z = [z_1, ..., z_U] (text)
Model states:
• h^f_{1..U} = encoder states
• h^e_t = decoder state at time t
• a_{t,u} = attention probability at decoder time t
• c_t = \sum_{u=1}^{U} a_{t,u} \, h^f_u (expected context)
Loss function:
  \mathcal{L}_{TTS1}(y, \hat{y}) = \frac{1}{T} \sum_{t=1}^{T} (y^M_t - \hat{y}^M_t)^2 + (y^R_t - \hat{y}^R_t)^2
  \mathcal{L}_{TTS2}(b, \hat{b}) = -\frac{1}{T} \sum_{t=1}^{T} b_t \log \hat{b}_t + (1 - b_t) \log(1 - \hat{b}_t)
  \mathcal{L}_{TTS}(y, \hat{y}, b, \hat{b}) = \mathcal{L}_{TTS1}(y, \hat{y}) + \mathcal{L}_{TTS2}(b, \hat{b})
(Architecture notes: CBHG = Convolution Bank + Highway + bi-GRU; fully connected layers project the decoder output to the spectrogram and end-of-speech predictions.)
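A minimal sketch of the combined loss, assuming the model has already produced mel and linear spectrogram predictions plus per-frame end-of-speech probabilities; mean squared error and binary cross-entropy mirror L_TTS1 and L_TTS2 above.

```python
import torch
import torch.nn.functional as F

def tts_loss(y_mel, y_mel_hat, y_lin, y_lin_hat, b, b_hat):
    """Spectrograms: (batch, T, dim); b, b_hat: (batch, T) values in [0, 1]."""
    # L_TTS1: frame-wise squared error on mel and linear spectrograms
    loss1 = F.mse_loss(y_mel_hat, y_mel) + F.mse_loss(y_lin_hat, y_lin)
    # L_TTS2: binary cross-entropy on the end-of-speech flag
    loss2 = F.binary_cross_entropy(b_hat, b)
    return loss1 + loss2
```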

