Unifying Speech Recognition and Generation with Machine Speech Chain


  1. Unifying Speech Recognition and Generation with Machine Speech Chain. Andros Tjandra, Sakriani Sakti, Satoshi Nakamura. Nara Institute of Science & Technology, Nara, Japan; RIKEN AIP, Japan.

  2. Outline • Motivation • Machine Speech Chain • Sequence-to-Sequence ASR • Sequence-to-Sequence TTS • Experimental Setup & Results • Conclusion

  3. Motivation • Research on ASR and TTS has progressed largely independently, without much mutual influence between the two fields.
  | Property | ASR | TTS |
  | Speech features | MFCC, Mel-fbank | MGC, log F0, Voiced/Unvoiced, BAP |
  | Text features | Phoneme, Character | Phoneme + POS + LEX (full-context label) |
  | Model | GMM-HMM, Hybrid DNN/HMM, End-to-end ASR | GMM-HSMM, DNN-HSMM, End-to-end TTS |

  4. Motivation (2) • In human communication, the closed-loop speech chain includes a critical auditory feedback mechanism. • Children who lose their hearing often have difficulty producing clear speech.

  5. This paper proposes … • A closed-loop speech chain model based on deep learning • Benefits of the closed-loop architecture: • Trains the ASR & TTS models together • Allows us to combine labeled and unlabeled speech & text (semi-supervised learning) • At inference time, the ASR & TTS modules can still be used independently

  6. Machine Speech Chain • Definition: • y = original speech, z = original text • ŷ = predicted speech, ẑ = predicted text • ASR(y): y → ẑ (a seq2seq model that transforms speech into text) • TTS(z): z → ŷ (a seq2seq model that transforms text into speech)
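
To make the notation concrete, here is a minimal sketch of the two mappings as Python type signatures; the `Speech` and `Text` aliases are illustrative assumptions, not part of the slides, and the real modules are the attention-based encoder-decoders described on slides 10-11.

```python
# Minimal sketch of the two seq2seq interfaces in the slide's notation.
# The aliases below are assumptions for illustration only.
from typing import Callable, List

Speech = List[List[float]]      # y: T frames of acoustic feature vectors
Text = List[int]                # z: U character ids

ASR = Callable[[Speech], Text]      # ASR(y) -> z_hat (speech to text)
TTS = Callable[[Text], Speech]      # TTS(z) -> y_hat (text to speech)
```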

  7. Machine Speech Chain (2) • Case #1: Supervised training • We have a speech-text pair (y, z) • Therefore we can directly optimize ASR by minimizing Loss_ASR(z, ẑ) • and TTS by minimizing Loss_TTS(y, ŷ)
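
As a hedged sketch of how the two supervised losses could be computed per utterance (PyTorch-style), assuming `asr(y_mel, z)` returns per-character log-probabilities under teacher forcing and `tts(z)` returns predicted mel frames, linear frames, and an end-of-speech probability; these interfaces are assumptions for illustration, not the authors' code:

```python
import torch
import torch.nn.functional as F

def supervised_step(asr, tts, y_mel, y_lin, z, b):
    # ASR loss: character-level cross entropy between z and p(z | y)
    log_p_z = asr(y_mel, z)                        # (U, C) log-probabilities
    loss_asr = F.nll_loss(log_p_z, z)              # -1/U * sum_u log p(z_u)

    # TTS loss: L2 on mel + linear spectrograms, BCE on the end-of-speech flag
    mel_hat, lin_hat, b_hat = tts(z)
    loss_tts = (F.mse_loss(mel_hat, y_mel)
                + F.mse_loss(lin_hat, y_lin)
                + F.binary_cross_entropy(b_hat, b))
    return loss_asr, loss_tts
```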

  8. Machine Speech Chain (2) • Case #2: Unsupervised training with speech only • 1. Given unlabeled speech features y • 2. ASR predicts the most probable transcription ẑ • 3. TTS, based on ẑ, tries to reconstruct the speech features ŷ • 4. Calculate Loss_TTS(y, ŷ) between the original speech features y and the predicted ŷ
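
A hedged sketch of this speech-only loop, with the same assumed interfaces as above plus a hypothetical `asr.greedy_decode` method standing in for step 2:

```python
import torch
import torch.nn.functional as F

def unsup_speech_step(asr, tts, y_mel, y_lin):
    with torch.no_grad():                      # the transcription is a fixed pseudo-label
        z_hat = asr.greedy_decode(y_mel)       # step 2: most probable transcription
    mel_hat, lin_hat, _ = tts(z_hat)           # step 3: reconstruct speech from z_hat
    # step 4: reconstruction loss between original and predicted speech features
    return F.mse_loss(mel_hat, y_mel) + F.mse_loss(lin_hat, y_lin)
```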

  9. Machine Speech Chain (2) • Case #3: Unsupervised training with text only • 1. Given unlabeled text z • 2. TTS generates speech features ŷ • 3. ASR, given ŷ, tries to reconstruct the text ẑ • 4. Calculate Loss_ASR(z, ẑ) between the original text z and the predicted ẑ
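
A hedged sketch of the text-only loop, mirroring the previous one (assumed interfaces, not the authors' code):

```python
import torch
import torch.nn.functional as F

def unsup_text_step(asr, tts, z):
    with torch.no_grad():
        mel_hat, _, _ = tts(z)                 # step 2: synthesize speech features
    log_p_z = asr(mel_hat, z)                  # step 3: try to recover the text from y_hat
    return F.nll_loss(log_p_z, z)              # step 4: Loss_ASR(z, z_hat)
```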

  10. Sequence-to-Sequence ASR
  Input & output:
  • y = [y_1, …, y_T] (speech features)
  • z = [z_1, …, z_U] (text)
  Model states:
  • h^e_{1..T} = encoder states
  • h^d_u = decoder state at time u
  • a_u = attention probability at time u: a_u[t] = Align(h^e_t, h^d_u) = exp(Score(h^e_t, h^d_u)) / Σ_{t=1}^{T} exp(Score(h^e_t, h^d_u))
  • c_u = Σ_{t=1}^{T} a_u[t] · h^e_t (expected context)
  Loss function:
  • L_ASR(z, p_z) = −(1/U) Σ_{u=1}^{U} Σ_{c ∈ [1..C]} 1(z_u = c) · log p_{z_u}[c]
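
The attention terms above can be checked numerically with a small sketch; a dot-product Score(·,·) is used here as one common choice, since the slide does not fix the scoring function:

```python
import torch

T, U, dim, C = 7, 5, 16, 30                 # frames, characters, state dim, charset size
h_enc = torch.randn(T, dim)                 # h^e_1..T  (encoder states)
h_dec = torch.randn(U, dim)                 # h^d_1..U  (decoder states)

score = h_dec @ h_enc.T                     # Score(h^e_t, h^d_u), shape (U, T)
a = torch.softmax(score, dim=1)             # a_u[t] = exp(Score) / sum_t exp(Score)
c = a @ h_enc                               # c_u = sum_t a_u[t] * h^e_t (expected context)

# L_ASR: -1/U * sum_u log p_{z_u}[z_u], with random stand-ins for decoder outputs
logits = torch.randn(U, C)
z = torch.randint(0, C, (U,))               # reference character ids
log_p = torch.log_softmax(logits, dim=1)
loss_asr = -(log_p[torch.arange(U), z]).mean()
```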

  11. Sequence-to-Sequence TTS
  Input & output:
  • y^R = [y^R_1, …, y^R_T] (linear spectrogram features)
  • y^M = [y^M_1, …, y^M_T] (mel spectrogram features)
  • z = [z_1, …, z_U] (text)
  Model states:
  • h^e_{1..U} = encoder states
  • h^d_t = decoder state at time t
  • a_t = attention probability at time t
  • c_t = Σ_{u=1}^{U} a_t[u] · h^e_u (expected context)
  Loss functions (b_t = end-of-speech flag at frame t):
  • L_TTS1(y, ŷ) = (1/T) Σ_{t=1}^{T} ( ‖y^M_t − ŷ^M_t‖² + ‖y^R_t − ŷ^R_t‖² )
  • L_TTS2(b, b̂) = −(1/T) Σ_{t=1}^{T} ( b_t log b̂_t + (1 − b_t) log(1 − b̂_t) )
  • L_TTS(y, ŷ, b, b̂) = L_TTS1(y, ŷ) + L_TTS2(b, b̂)
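
A small numerical sketch of the two TTS losses; shapes and values are illustrative only:

```python
import torch
import torch.nn.functional as F

T, n_mel, n_lin = 50, 80, 1024
y_mel, y_lin = torch.rand(T, n_mel), torch.rand(T, n_lin)        # reference spectrograms
y_mel_h, y_lin_h = torch.rand(T, n_mel), torch.rand(T, n_lin)    # predicted spectrograms
b = torch.zeros(T); b[-1] = 1.0                                  # end-of-speech labels
b_h = torch.rand(T)                                              # predicted end probabilities

loss_tts1 = F.mse_loss(y_mel_h, y_mel) + F.mse_loss(y_lin_h, y_lin)
loss_tts2 = F.binary_cross_entropy(b_h, b)
loss_tts = loss_tts1 + loss_tts2
```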

  12. Settings • Features • Speech: • 80-band Mel spectrogram (used by ASR & TTS) • 1024-dim linear magnitude spectrogram (STFT) (used by TTS) • TTS reconstructs the speech waveform by using Griffin-Lim to estimate the phase, followed by an inverse STFT • Text: • Character-based prediction • a-z (26 letters) • 6 punctuation marks (,:'?. -) • 3 special tags <s> </s> <spc> (start, end, space)
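
As a rough illustration of this feature pipeline with librosa; the synthetic test signal and the FFT/hop sizes are assumptions, not taken from the slides:

```python
import numpy as np
import librosa

sr = 16000
wav = np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr).astype(np.float32)  # 1 s stand-in signal

n_fft, hop = 2048, 256                                            # assumed analysis settings
lin = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop))      # linear magnitude spectrogram
mel = librosa.feature.melspectrogram(S=lin**2, sr=sr, n_mels=80)  # 80-band mel spectrogram

# TTS side: recover a waveform from a (predicted) linear magnitude spectrogram
wav_hat = librosa.griffinlim(lin, hop_length=hop)                 # phase estimation + inverse STFT
```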

  13. Experiment on Single-Speaker
  Dataset:
  • BTEC corpus (text); speech generated by Google TTS (via the gTTS library)
  • Supervised training: 10,000 utterances (paired text & speech)
  • Unsupervised training: 40,000 utterances (unpaired text & speech)
  Results:
  | Data | α | β | gen. mode | ASR CER (%) | TTS Mel | TTS Raw | TTS Acc (%) |
  | Paired (10k) | - | - | - | 10.06 | 7.07 | 9.38 | 97.7 |
  | +Unpaired (40k) | 0.25 | 1 | greedy | 5.83 | 6.21 | 8.49 | 98.4 |
  | +Unpaired (40k) | 0.5 | 1 | greedy | 5.75 | 6.25 | 8.42 | 98.4 |
  | +Unpaired (40k) | 0.25 | 1 | beam 5 | 5.44 | 6.24 | 8.44 | 98.3 |
  | +Unpaired (40k) | 0.5 | 1 | beam 5 | 5.77 | 6.20 | 8.44 | 98.3 |

  14. Experiment on Multi-Speaker Task
  Dataset:
  • BTEC ATR-EDB corpus (text & speech), 25 male + 25 female speakers
  • Supervised training: 80 utterances / speaker (paired text & speech)
  • Unsupervised training: 360 utterances / speaker (unpaired text & speech)
  Results:
  | Data | α | β | gen. mode | ASR CER (%) | TTS Mel | TTS Raw | TTS Acc (%) |
  | Paired (80 utt/spk) | - | - | - | 26.47 | 10.21 | 13.18 | 98.6 |
  | +Unpaired (remaining) | 0.25 | 1 | greedy | 23.03 | 9.14 | 12.86 | 98.7 |
  | +Unpaired (remaining) | 0.5 | 1 | greedy | 20.91 | 9.31 | 12.88 | 98.6 |
  | +Unpaired (remaining) | 0.25 | 1 | beam 5 | 22.55 | 9.36 | 12.77 | 98.6 |
  | +Unpaired (remaining) | 0.5 | 1 | beam 5 | 19.99 | 9.20 | 12.84 | 98.6 |

  15. Conclusion • Proposed a speech chain framework based on deep learning • Explored its application to single-speaker and multi-speaker tasks • Results: ASR & TTS improved each other's performance by teaching each other with unpaired data • Future work: move toward real-time feedback mechanisms, closer to the human speech chain

  16. Thank you for listening
