Multilingual Speech Recognition With A Single End-To-End Model


  1. Multilingual Speech Recognition With A Single End-To-End Model
Shubham Toshniwal¹, Tara N. Sainath², Ron J. Weiss², Bo Li², Pedro Moreno², Eugene Weinstein², and Kanishka Rao²
¹TTI Chicago  ²Google
April 18, 2017

  2. Why Multilingual Speech Recognition Models?
◮ Remarkable progress in speech recognition in the past few years
◮ Most of this success is restricted to high-resource languages, e.g. English
◮ Google Voice Search supports ∼120 of the world's 7000 languages
◮ Multilingual models:
  ◮ Utilize knowledge transfer across languages, and thus alleviate data requirements
  ◮ Successful in Neural Machine Translation (Google NMT)
  ◮ Easier to deploy and maintain

  3. Conventional ASR Systems
◮ Traditional ASR systems are modular
◮ Require expert-curated resources
[Diagram: Speech → Feature Extraction → Feature Vectors → Decoder (Acoustic Model, Pronunciation Dictionary, Language Model) → Words, e.g. ''recognize speech'']

  4. Conventional ASR Systems
◮ Traditional ASR systems are modular
◮ Require expert-curated resources
[Same pipeline diagram as slide 3]
◮ Multilingual models:
  ◮ Focus on just the acoustic model (Lin, 2009; Ghoshal, 2013)
  ◮ A separate language model and pronunciation model is still required for each language

  5. End-to-End ASR Models
◮ Encoder-decoder models achieved state-of-the-art results on the Google Voice Search task (Chiu et al. 2018)
◮ Encoder-decoder models are appealing because:
  ◮ Conceptually simple: they subsume the acoustic model, pronunciation model, and language model in a single model
  ◮ No need for expert-curated resources!
[Diagram: acoustic features x_1 .. x_T encoded into states h_1 .. h_T; the decoder emits ''recognize speech'' between the GO and EOS symbols]

  6. End-to-End Multilingual ASR Models
[Diagram: attention-based encoder-decoder; encoder states h_1 .. h_T, decoder state d_t, attention weights α_t, context vector c_t, output y_t]
Attention layer:
  u_it = v^T tanh(W_1 h_i + W_2 d_t)
  α_t = softmax(u_t)
  c_t = Σ_{i=1}^{T} α_it h_i
◮ We use attention-based encoder-decoder models
◮ The decoder outputs one character per time step
◮ For multilingual models, we take the union over the character sets
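The attention layer on the slide can be sketched in a few lines of NumPy. This is a minimal illustration of additive attention as written in the equations above; the weight shapes and the helper names (`softmax`, `attention_context`) are my own, not the paper's.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(h, d_t, W1, W2, v):
    """One decoder step of additive attention.

    h   : (T, H) encoder states h_1 .. h_T
    d_t : (D,)   decoder state at step t
    Returns the context vector c_t and the weights alpha_t.
    """
    # u_it = v^T tanh(W_1 h_i + W_2 d_t), computed for all i at once
    u = np.tanh(h @ W1.T + d_t @ W2.T) @ v   # (T,)
    alpha = softmax(u)                        # attention weights over encoder steps
    c = alpha @ h                             # c_t = sum_i alpha_it h_i
    return c, alpha

# Tiny demo with random values
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))       # T=5 encoder states of size H=4
d_t = rng.normal(size=3)          # decoder state of size D=3
W1 = rng.normal(size=(6, 4))      # attention dim A=6
W2 = rng.normal(size=(6, 3))
v = rng.normal(size=6)
c_t, alpha_t = attention_context(h, d_t, W1, W2, v)
```

The weights sum to one, so `c_t` is a convex combination of the encoder states.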

  7. Multilingual Encoder-Decoder Models
Model       | Training       | Inference
Joint model | No language ID | No language ID
◮ Naive model; unaware of the multilingual nature of the data
◮ Can potentially handle code-switching

  8. Multilingual Encoder-Decoder Models
Model           | Training       | Inference
Joint model     | No language ID | No language ID
Multitask model | Language ID    | No language ID
◮ Trained to jointly recognize the language ID and the speech

  9. Multilingual Encoder-Decoder Models
Model             | Training       | Inference
Joint model       | No language ID | No language ID
Multitask model   | Language ID    | No language ID
Conditioned model | Language ID    | Language ID
◮ A learnt embedding of the language ID is fed as input to condition the model
◮ The language ID embedding can be fed into: (a) the encoder, (b) the decoder, or (c) both encoder and decoder

  10. Encoder-Conditioned Model
[Diagram: encoder of the encoder-conditioned model; the language embedding e_L is fed in alongside each acoustic feature frame x_1 .. x_T, producing states h_1 .. h_T]
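One simple way to realize the diagram above is to append the language embedding e_L to every acoustic frame before the encoder sees it. This is a sketch under that assumption (the slides say the embedding is "fed as input" but do not pin down the combination operation); the dimensions and the function name are illustrative.

```python
import numpy as np

NUM_LANGS, EMB_DIM, FEAT_DIM = 9, 8, 80   # illustrative sizes, not the paper's

rng = np.random.default_rng(0)
# Language-embedding table; in training these rows would be learnt by backprop.
lang_embeddings = rng.normal(size=(NUM_LANGS, EMB_DIM))

def condition_encoder_input(features, lang_id):
    """Append the language embedding e_L to every acoustic frame x_1 .. x_T,
    as in the encoder-conditioned model."""
    T = features.shape[0]
    e_L = lang_embeddings[lang_id]                    # (EMB_DIM,)
    tiled = np.tile(e_L, (T, 1))                      # repeat e_L for each frame
    return np.concatenate([features, tiled], axis=1)  # (T, FEAT_DIM + EMB_DIM)

x = rng.normal(size=(120, FEAT_DIM))      # 120 frames of 80-dim features
x_cond = condition_encoder_input(x, lang_id=2)
```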

  11. Task
◮ Recognize 9 Indian languages with a single model
◮ Very little script overlap, except between Hindi and Marathi
◮ The union of the character sets is close to 1000 characters!
◮ But the languages have a large overlap in phonetic space (Lavanya et al. 2005)
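Taking the union over character sets, as the task requires, is straightforward to sketch. The transcripts below are toy samples I made up for illustration (the real data is the dictated-query corpora), and the GO/EOS symbol names follow the decoder diagram earlier in the deck.

```python
# Toy per-language transcript samples; real data would be the dictated queries.
transcripts = {
    "hindi":   ["नमस्ते", "मौसम कैसा है"],
    "marathi": ["नमस्कार"],          # shares the Devanagari script with Hindi
    "tamil":   ["வணக்கம்"],
}

# Union of the per-language character sets.
charset = set()
for utterances in transcripts.values():
    for utt in utterances:
        charset.update(utt)

# Special symbols plus a stable ordering give the decoder's output vocabulary.
vocab = ["<go>", "<eos>"] + sorted(charset)
char_to_id = {c: i for i, c in enumerate(vocab)}
```

Because Hindi and Marathi share Devanagari, their overlapping characters are counted only once, which is why the full 9-language union stays near 1000 characters rather than growing linearly with the number of languages.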

  12. Experimental Setup
◮ Training data consists of dictated queries
◮ On average 230K queries (∼170 hrs) per language
[Bar chart: fraction of total training data per language; per-language query counts range from 164K to 364K]
◮ Baseline: encoder-decoder models trained on individual languages

  13. Joint vs Individual
[Bar chart: per-language WER (%) of the joint model vs the individually trained models, plus weighted average]
◮ The joint model outperforms the individual models on all languages!!
◮ The joint model is not even language-aware at test time
◮ Overall, a 21% relative reduction in Word Error Rate (WER)
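"Relative reduction" means the fraction of the baseline's errors that the new system eliminates, not a subtraction of WER points. A small worked example, with illustrative numbers that are not taken from the paper:

```python
def relative_wer_reduction(wer_baseline, wer_new):
    """Relative WER reduction in percent: the fraction of the
    baseline's errors that the new system eliminates."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline

# Illustrative only: going from 30.0% to 23.7% WER is a 21% relative
# reduction, even though the absolute drop is only 6.3 points.
print(round(relative_wer_reduction(30.0, 23.7), 1))  # 21.0
```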

  14. Picking the Right Script
[Confusion matrix: true language (rows) vs script of the model's output (columns), over Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, and Urdu; every diagonal entry is ≥ 0.93 and the off-diagonal entries are near zero]
◮ Rarely confused between languages

  15. Joint vs Multitask
[Bar chart: per-language WER (%) of the joint vs multitask models, plus weighted average]
◮ Insignificant gains from multitask training

  16. Joint vs Conditioned Models
[Bar chart: per-language WER (%) of the joint, decoder-conditioned, and encoder-conditioned models over Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, and Urdu, plus weighted average]
◮ As expected, conditioning the model on the language ID of the speech helps
◮ Encoder conditioning:
  ◮ Performs better than decoder conditioning
  ◮ Potentially some acoustic model adaptation is happening

  17. Magic of Conditioning
[Confusion matrix for the encoder-conditioned model: a perfect diagonal; the output script always matches the supplied language ID]

  18. Testing the Limits: Code-Switching
◮ Can the joint model code-switch between 2 Indian languages (having been trained to recognize them separately)?

  19. Testing the Limits: Code-Switching
◮ Can the joint model code-switch between 2 Indian languages (having been trained to recognize them separately)?
◮ Artificial test set of 1000 utterances: a Tamil query followed by a Hindi query, with 50 ms of silence in between
◮ The model does not code-switch :(
  ◮ It picks one of the two scripts and sticks with it
◮ From manual inspection:
  ◮ It transcribes either the Hindi or the Tamil part in the corresponding script
  ◮ Transliteration in rare cases
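Constructing such a test utterance amounts to splicing two waveforms around a short silence. A minimal sketch, assuming raw waveforms as NumPy arrays and a 16 kHz sample rate (the slides do not state the sample rate):

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed; not stated on the slides

def make_code_switch_utterance(tamil_audio, hindi_audio, gap_ms=50):
    """Concatenate a Tamil query and a Hindi query with gap_ms of
    silence in between, mirroring the artificial test set above."""
    gap = np.zeros(int(SAMPLE_RATE * gap_ms / 1000), dtype=tamil_audio.dtype)
    return np.concatenate([tamil_audio, gap, hindi_audio])

# Toy example: 0.1 s and 0.2 s dummy "queries"
tamil = np.ones(1600, dtype=np.float32)
hindi = np.ones(3200, dtype=np.float32)
utt = make_code_switch_utterance(tamil, hindi)
```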

  20. Feeding the Wrong Language ID
◮ Does the model obey the acoustics, or is it faithful to the language ID?

  21. Feeding the Wrong Language ID
◮ Does the model obey the acoustics, or is it faithful to the language ID?
◮ Artificial dataset of 1000 Urdu queries tagged as Hindi
◮ The model transliterates the Urdu queries into Hindi's script
◮ It learns to disentangle the acoustic-phonetic content from the language identity
◮ A transliterator as a byproduct!

  22. Conclusion
◮ Encoder-decoder models:
  ◮ An elegant and simple framework for multilingual models
  ◮ Outperform models trained for specific languages
  ◮ Are rarely confused between individual languages
  ◮ Fail at code-switching
◮ Recent work along similar lines has also obtained promising results (Kim, 2017; Watanabe, 2017; Tong, 2018; Dalmia, 2018)
◮ Questions?

  23. Conditioning the Encoder is Enough
[Bar chart: per-language WER (%) of the encoder-conditioned vs encoder+decoder-conditioned models, plus weighted average]
◮ Conditioning the decoder on top of conditioning the encoder doesn't buy us much
◮ Possibly because the attention mechanism already feeds information from the encoder to the decoder
