Multilingual Speech Recognition With A Single End-To-End Model


  1. Multilingual Speech Recognition With A Single End-To-End Model
Shubham Toshniwal¹, Tara N. Sainath², Ron J. Weiss², Bo Li², Pedro Moreno², Eugene Weinstein², and Kanishka Rao²
¹TTI Chicago  ²Google
April 18, 2017

  2. Why Multilingual Speech Recognition Models?
◮ Remarkable progress in speech recognition in the past few years
◮ Most of this success is restricted to high-resource languages, e.g. English
◮ Google Voice Search supports ∼120 of the world's 7000 languages
◮ Multilingual models:
  ◮ Utilize knowledge transfer across languages, and thus alleviate data requirements
  ◮ Successful in Neural Machine Translation (Google NMT)
  ◮ Easier to deploy and maintain

  3. Conventional ASR Systems
◮ Traditional ASR systems are modular
◮ Require expert-curated resources
[Diagram: Speech → Feature Extraction → Feature Vectors → Decoder (Acoustic Model, Pronunciation Dictionary, Language Model) → Words, e.g. ''recognize speech'']

  4. Conventional ASR Systems
◮ Traditional ASR systems are modular
◮ Require expert-curated resources
[Same pipeline diagram as slide 3]
◮ Multilingual models:
  ◮ Focus on just the acoustic model (Lin, 2009; Ghoshal, 2013)
  ◮ A separate language model and pronunciation model is still required for each language

  5. End-to-End ASR Models
◮ Encoder-decoder models achieved state-of-the-art results on the Google Voice Search task (Chiu et al. 2018)
◮ Encoder-decoder models are appealing because:
  ◮ Conceptually simple: they subsume the acoustic model, pronunciation model, and language model in a single model
  ◮ No need for expert-curated resources!
[Diagram: acoustic features x_1 .. x_T encoded into states h_1 .. h_T; the decoder emits ''recognize speech'' between the GO and EOS symbols]

  6. End-to-End Multilingual ASR Models
[Diagram: attention-based encoder-decoder; encoder states h_1 .. h_T, decoder state d_t, attention weights α_t, context vector c_t, output y_t]
Attention layer:
  u_it = v^T tanh(W_1 h_i + W_2 d_t)
  α_t = softmax(u_t)
  c_t = Σ_{i=1}^{T} α_it h_i
◮ We use attention-based encoder-decoder models
◮ The decoder outputs one character per time step
◮ For multilingual models, we take the union over the character sets
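The attention layer on the slide can be sketched in a few lines of NumPy. This is a minimal illustration of additive attention as written in the equations above; the weight shapes and the helper names (`softmax`, `attention_context`) are my own, not the paper's.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(h, d_t, W1, W2, v):
    """One decoder step of additive attention.

    h   : (T, H) encoder states h_1 .. h_T
    d_t : (D,)   decoder state at step t
    Returns the context vector c_t and the weights alpha_t.
    """
    # u_it = v^T tanh(W_1 h_i + W_2 d_t), computed for all i at once
    u = np.tanh(h @ W1.T + d_t @ W2.T) @ v   # (T,)
    alpha = softmax(u)                        # attention weights over encoder steps
    c = alpha @ h                             # c_t = sum_i alpha_it h_i
    return c, alpha

# Tiny demo with random values
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))       # T=5 encoder states of size H=4
d_t = rng.normal(size=3)          # decoder state of size D=3
W1 = rng.normal(size=(6, 4))      # attention dim A=6
W2 = rng.normal(size=(6, 3))
v = rng.normal(size=6)
c_t, alpha_t = attention_context(h, d_t, W1, W2, v)
```

The weights sum to one, so `c_t` is a convex combination of the encoder states.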

  7. Multilingual Encoder-Decoder Models
Model       | Training       | Inference
Joint model | No language ID | No language ID
◮ Naive model; unaware of the multilingual nature of the data
◮ Can potentially handle code-switching

  8. Multilingual Encoder-Decoder Models
Model           | Training       | Inference
Joint model     | No language ID | No language ID
Multitask model | Language ID    | No language ID
◮ Trained to jointly recognize the language ID and the speech

  9. Multilingual Encoder-Decoder Models
Model             | Training       | Inference
Joint model       | No language ID | No language ID
Multitask model   | Language ID    | No language ID
Conditioned model | Language ID    | Language ID
◮ A learnt embedding of the language ID is fed as input to condition the model
◮ The language ID embedding can be fed into: (a) the encoder, (b) the decoder, or (c) both encoder and decoder

  10. Encoder-Conditioned Model
[Diagram: encoder of the encoder-conditioned model; the language embedding e_L is fed in alongside each acoustic feature frame x_1 .. x_T, producing states h_1 .. h_T]
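One simple way to realize the diagram above is to append the language embedding e_L to every acoustic frame before the encoder sees it. This is a sketch under that assumption (the slides say the embedding is "fed as input" but do not pin down the combination operation); the dimensions and the function name are illustrative.

```python
import numpy as np

NUM_LANGS, EMB_DIM, FEAT_DIM = 9, 8, 80   # illustrative sizes, not the paper's

rng = np.random.default_rng(0)
# Language-embedding table; in training these rows would be learnt by backprop.
lang_embeddings = rng.normal(size=(NUM_LANGS, EMB_DIM))

def condition_encoder_input(features, lang_id):
    """Append the language embedding e_L to every acoustic frame x_1 .. x_T,
    as in the encoder-conditioned model."""
    T = features.shape[0]
    e_L = lang_embeddings[lang_id]                    # (EMB_DIM,)
    tiled = np.tile(e_L, (T, 1))                      # repeat e_L for each frame
    return np.concatenate([features, tiled], axis=1)  # (T, FEAT_DIM + EMB_DIM)

x = rng.normal(size=(120, FEAT_DIM))      # 120 frames of 80-dim features
x_cond = condition_encoder_input(x, lang_id=2)
```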

  11. Task
◮ Recognize 9 Indian languages with a single model
◮ Very little script overlap, except between Hindi and Marathi
◮ The union of the character sets is close to 1000 characters!
◮ But the languages have a large overlap in phonetic space (Lavanya et al. 2005)
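Taking the union over character sets, as the task requires, is straightforward to sketch. The transcripts below are toy samples I made up for illustration (the real data is the dictated-query corpora), and the GO/EOS symbol names follow the decoder diagram earlier in the deck.

```python
# Toy per-language transcript samples; real data would be the dictated queries.
transcripts = {
    "hindi":   ["नमस्ते", "मौसम कैसा है"],
    "marathi": ["नमस्कार"],          # shares the Devanagari script with Hindi
    "tamil":   ["வணக்கம்"],
}

# Union of the per-language character sets.
charset = set()
for utterances in transcripts.values():
    for utt in utterances:
        charset.update(utt)

# Special symbols plus a stable ordering give the decoder's output vocabulary.
vocab = ["<go>", "<eos>"] + sorted(charset)
char_to_id = {c: i for i, c in enumerate(vocab)}
```

Because Hindi and Marathi share Devanagari, their overlapping characters are counted only once, which is why the full 9-language union stays near 1000 characters rather than growing linearly with the number of languages.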

  12. Experimental Setup
◮ Training data consists of dictated queries
◮ On average 230K queries (∼170 hrs) per language
[Bar chart: fraction of total training data per language; per-language query counts range from 164K to 364K]
◮ Baseline: encoder-decoder models trained on individual languages

  13. Joint vs Individual
[Bar chart: per-language WER (%) of the joint model vs the individually trained models, plus weighted average]
◮ The joint model outperforms the individual models on all languages!!
◮ The joint model is not even language-aware at test time
◮ Overall, a 21% relative reduction in Word Error Rate (WER)
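"Relative reduction" means the fraction of the baseline's errors that the new system eliminates, not a subtraction of WER points. A small worked example, with illustrative numbers that are not taken from the paper:

```python
def relative_wer_reduction(wer_baseline, wer_new):
    """Relative WER reduction in percent: the fraction of the
    baseline's errors that the new system eliminates."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline

# Illustrative only: going from 30.0% to 23.7% WER is a 21% relative
# reduction, even though the absolute drop is only 6.3 points.
print(round(relative_wer_reduction(30.0, 23.7), 1))  # 21.0
```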

  14. Picking the Right Script
[Confusion matrix: true language (rows) vs script of the model's output (columns), over Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, and Urdu; every diagonal entry is ≥ 0.93 and the off-diagonal entries are near zero]
◮ Rarely confused between languages

  15. Joint vs Multitask
[Bar chart: per-language WER (%) of the joint vs multitask models, plus weighted average]
◮ Insignificant gains from multitask training

  16. Joint vs Conditioned Models
[Bar chart: per-language WER (%) of the joint, decoder-conditioned, and encoder-conditioned models over Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, and Urdu, plus weighted average]
◮ As expected, conditioning the model on the language ID of the speech helps
◮ Encoder conditioning:
  ◮ Performs better than decoder conditioning
  ◮ Potentially some acoustic model adaptation is happening

  17. Magic of Conditioning
[Confusion matrix for the encoder-conditioned model: a perfect diagonal; the output script always matches the supplied language ID]

  18. Testing the Limits: Code-Switching
◮ Can the joint model code-switch between 2 Indian languages (having been trained to recognize them separately)?

  19. Testing the Limits: Code-Switching
◮ Can the joint model code-switch between 2 Indian languages (having been trained to recognize them separately)?
◮ Artificial test set of 1000 utterances: a Tamil query followed by a Hindi query, with 50 ms of silence in between
◮ The model does not code-switch :(
  ◮ It picks one of the two scripts and sticks with it
◮ From manual inspection:
  ◮ It transcribes either the Hindi or the Tamil part in the corresponding script
  ◮ Transliteration in rare cases
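Constructing such a test utterance amounts to splicing two waveforms around a short silence. A minimal sketch, assuming raw waveforms as NumPy arrays and a 16 kHz sample rate (the slides do not state the sample rate):

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed; not stated on the slides

def make_code_switch_utterance(tamil_audio, hindi_audio, gap_ms=50):
    """Concatenate a Tamil query and a Hindi query with gap_ms of
    silence in between, mirroring the artificial test set above."""
    gap = np.zeros(int(SAMPLE_RATE * gap_ms / 1000), dtype=tamil_audio.dtype)
    return np.concatenate([tamil_audio, gap, hindi_audio])

# Toy example: 0.1 s and 0.2 s dummy "queries"
tamil = np.ones(1600, dtype=np.float32)
hindi = np.ones(3200, dtype=np.float32)
utt = make_code_switch_utterance(tamil, hindi)
```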

  20. Feeding the Wrong Language ID
◮ Does the model obey the acoustics, or is it faithful to the language ID?

  21. Feeding the Wrong Language ID
◮ Does the model obey the acoustics, or is it faithful to the language ID?
◮ Artificial dataset of 1000 Urdu queries tagged as Hindi
◮ The model transliterates the Urdu queries into Hindi's script
◮ It learns to disentangle the acoustic-phonetic content from the language identity
◮ A transliterator as a byproduct!

  22. Conclusion
◮ Encoder-decoder models:
  ◮ An elegant and simple framework for multilingual models
  ◮ Outperform models trained for specific languages
  ◮ Are rarely confused between individual languages
  ◮ Fail at code-switching
◮ Recent work along similar lines has also obtained promising results (Kim, 2017; Watanabe, 2017; Tong, 2018; Dalmia, 2018)
◮ Questions?

  23. Conditioning the Encoder is Enough
[Bar chart: per-language WER (%) of the encoder-conditioned vs encoder+decoder-conditioned models, plus weighted average]
◮ Conditioning the decoder on top of conditioning the encoder doesn't buy us much
◮ Possibly because the attention mechanism already feeds information from the encoder to the decoder
