  1. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, Sharon Goldwater

  2. Current systems: Spanish audio → ? → English text

  3. Current systems: Spanish audio → Automatic Speech Recognition → Spanish text ("ola mi nombre es hodor") → ? → English text

  4. Current systems: Spanish audio → Automatic Speech Recognition → Spanish text ("ola mi nombre es hodor") → Machine Translation → English text ("hi my name is hodor")

  5. ~100 languages supported by Google Translate ...

  6. Unwritten languages. Mboshi: Bantu language, Republic of Congo, ~160K speakers. ~3000 languages have no writing system, so Automatic Speech Recognition → Mboshi text is not available

  7. Unwritten languages. Mboshi: paired with French translations (Godard et al. 2018). ~3000 languages have no writing system. Efforts to collect speech and translations using mobile apps: ○ Aikuma: Bird et al. 2014 ○ LIG-Aikuma: Blachon et al. 2016

  8. Haiti earthquake, 2010. Survivors sent text messages to a helpline: "Moun kwense nan Sakre Kè nan Pòtoprens" / "People trapped in Sacred Heart Church, PauP" ● International rescue teams faced a language barrier ● No automated tools were available ● Volunteers from the global Haitian diaspora helped create parallel text corpora in a short time [Munro 2010]

  9. Are we better prepared in 2019? Voice messages: "Moun kwense nan Sakre Kè nan Pòtoprens" → "People trapped in Sacred Heart Church, PauP"

  10. Can we build a speech-to-text translation (ST) system given only source audio paired with translations as training data? ● Tens of hours of speech paired with text translations ● No source text available

  11. Neural models ... Weiss et al. (2017): a sequence-to-sequence model that directly translates speech. Spanish audio → English text ("hi my name is hodor")

  12. Spanish speech to English text. Spanish audio: ● telephone speech (unscripted) ● realistic noise conditions ● multiple speakers and dialects ● crowdsourced English text translations … closer to real-world conditions. [diagram: Spanish audio → Encoder → Attention → Decoder → English text]
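The slides do not say how the audio is represented before it reaches the encoder; the sketch below assumes the common choice of log-mel filterbank features, so the function and its hyperparameters are illustrative rather than the authors':

```python
# A minimal front-end sketch, assuming log-mel filterbank input features;
# the deck does not specify the feature pipeline, so treat this as illustrative.
import torch
import torchaudio

def load_features(wav_path: str) -> torch.Tensor:
    """Turn a waveform file into a (time, n_mels) log-mel feature matrix."""
    waveform, sample_rate = torchaudio.load(wav_path)   # (channels, samples)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_mels=80,                                      # 80 bins: a common choice
    )(waveform)
    log_mel = torch.log(mel + 1e-6)                     # log compression
    return log_mel.squeeze(0).transpose(0, 1)           # (time, n_mels)
```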

  13. Spanish speech to English text. Weiss et al.: good performance if trained on 100+ hours (*for comparison: text-to-text MT = 58 BLEU)

  14. But ... Weiss et al.: poor performance in low-resource settings (*for comparison: text-to-text MT = 58 BLEU)

  15. Goal: to improve translation performance

  16. Goal: to improve translation performance … without labeling more low-resource speech

  17. 100s of hours of monolingual speech paired with text are available … typically used to train ASR systems: English text (English audio), French text (French audio). Key idea: leverage monolingual data from a different high-resource language

  18. 100s of hours of monolingual speech paired with text are available … typically used to train ASR systems: English text (English audio), French text (French audio). Can they help a sequence-to-sequence ST model (Spanish audio → English text) trained on only ~20 hours of Spanish-English?

  19. 100s of hours of monolingual speech paired with text … prior work instead uses Spanish text (Spanish audio): Weiss et al. 2017, Anastasopoulos and Chiang 2018, Bérard et al. 2018, Sperber et al. 2019. [diagram: sequence-to-sequence ST, Spanish audio → English text, ~20 hours of Spanish-English]

  20. Our setting has no Spanish transcripts: can English or French ASR data help the sequence-to-sequence ST model (Spanish audio → English text, ~20 hours of Spanish-English)?

  21. Why Spanish-English?

  22. Why Spanish-English? To simulate low-resource settings and test our method

  23. Why Spanish-English? To simulate low-resource settings and test our method. Later: results on a truly low-resource pair, Mboshi to French

  24. Method: the same model architecture for ASR and ST [diagram: audio → Encoder → Attention → Decoder → text] (*randomly initialized parameters)
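A minimal PyTorch sketch of such an encoder-attention-decoder model follows. The layer types and sizes are assumptions for illustration (the actual model is deeper and handles attention per decoding step); the point is that one class serves both ASR and ST, with only the training pairs differing:

```python
# Sketch of the shared sequence-to-sequence architecture; sizes are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Audio encoder + attention + text decoder. The same architecture is
    trained for ASR (audio, transcript) and ST (audio, translation)."""
    def __init__(self, n_feats: int = 80, hidden: int = 256, vocab: int = 1000):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.attention = nn.MultiheadAttention(2 * hidden, num_heads=1,
                                               batch_first=True)
        self.embed = nn.Embedding(vocab, 2 * hidden)
        self.decoder = nn.LSTM(2 * hidden, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab)

    def forward(self, feats, prev_tokens):
        enc, _ = self.encoder(feats)                    # (B, T, 2H) audio states
        dec, _ = self.decoder(self.embed(prev_tokens))  # (B, U, 2H) text states
        ctx, _ = self.attention(dec, enc, enc)          # decoder attends to audio
        return self.out(dec + ctx)                      # (B, U, vocab) logits
```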

  25. Pretrain on high-resource English ASR: 300 hours of English audio and text [diagram: English audio → Encoder → Attention → Decoder → English text] (*train until convergence)

  26. Fine-tune on low-resource 20 hours of Spanish-English [diagram: the English ASR model (English audio → Encoder → Attention → Decoder → English text) transfers all parameters to the ST model (Spanish audio → Encoder → Attention → Decoder → English text)]

  27. Fine-tune on low-resource 20 hours of Spanish-English [diagram: Spanish audio → Encoder → Attention → Decoder → English text] (*train until convergence)
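The pretrain-then-fine-tune recipe amounts to reusing the ASR checkpoint as the initialization for ST training. A sketch, assuming the Seq2Seq class above and hypothetical data loaders (english_asr_loader, spanish_english_st_loader); since both tasks output English text, every parameter, including the decoder, can be transferred:

```python
# Sketch of the transfer recipe; the data loaders are hypothetical placeholders.
import torch

def train(model, loader, epochs):
    opt = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, prev_tokens, targets in loader:
            logits = model(feats, prev_tokens)           # (B, U, vocab)
            loss = loss_fn(logits.transpose(1, 2), targets)
            opt.zero_grad(); loss.backward(); opt.step()

model = Seq2Seq(vocab=1000)                              # English output vocabulary
train(model, english_asr_loader, epochs=20)              # 300 h English ASR
torch.save(model.state_dict(), "english_asr.pt")

# Fine-tune: start from the ASR weights instead of a random initialization.
model.load_state_dict(torch.load("english_asr.pt"))
train(model, spanish_english_st_loader, epochs=20)       # ~20 h Spanish-English ST
```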

  28. Will this work?

  29. Spanish-English BLEU scores [chart: baseline] (*for comparison: Weiss et al. = 47.3)

  30. Spanish-English BLEU scores [chart: baseline vs. pretraining] (*for comparison: Weiss et al. = 47.3)

  31. Spanish-English BLEU scores [chart: baseline vs. pretraining] ● +9 BLEU (*for comparison: Weiss et al. = 47.3)

  32. Spanish-English BLEU scores [chart: baseline vs. pretraining] ● better performance with half the data (*for comparison: Weiss et al. = 47.3)

  33. Further analysis [chart: pretraining vs. baseline, 20 hours Spanish-English] (*for comparison: Weiss et al. = 47.3)

  34. Faster training time [chart: pretraining vs. baseline]

  35. Faster training time ● potentially useful in time-critical scenarios [chart: pretraining ~2 hours vs. baseline ~20 hours of training]

  36. Ablation: model parameters [diagram: English ASR model → Spanish ST model]. Spanish to English, N = 20 hours: baseline 10.8 BLEU; +English ASR 19.9 BLEU

  37. Ablation: model parameters [diagram: encoder randomly initialized]. Spanish to English, N = 20 hours: baseline 10.8; +English ASR 19.9; +English ASR: decoder 10.5

  38. Ablation: model parameters [diagram: decoder randomly initialized]. Spanish to English, N = 20 hours: baseline 10.8; +English ASR 19.9; +English ASR: decoder 10.5; +English ASR: encoder 16.6

  39. Ablation: model parameters. Spanish to English, N = 20 hours: baseline 10.8; +English ASR 19.9; +English ASR: decoder 10.5; +English ASR: encoder 16.6 … transferring encoder-only parameters works well!

  40. Ablation: model parameters. Spanish to English, N = 20 hours: baseline 10.8; +English ASR 19.9; +English ASR: decoder 10.5; +English ASR: encoder 16.6 … so we can pretrain on a language different from both the source and target of the ST pair
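Because the encoder carries most of the gain, the pretraining language need not match either side of the ST pair: one can copy just the encoder weights into a model with a different target vocabulary. A sketch, again assuming the Seq2Seq class and the checkpoint file from the earlier snippets:

```python
# Sketch of encoder-only transfer; attention and decoder stay randomly initialized.
import torch

pretrained = torch.load("english_asr.pt")            # full English ASR checkpoint
encoder_only = {k: v for k, v in pretrained.items()
                if k.startswith("encoder.")}

st_model = Seq2Seq(vocab=8000)                       # new target vocabulary
# strict=False leaves the non-encoder parameters at their random initialization
st_model.load_state_dict(encoder_only, strict=False)
```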

  41. Pretraining on French [diagram: French ASR encoder → Spanish ST, decoder randomly initialized]. Spanish to English, N = 20 hours: baseline 10.8; +English ASR 19.9; +English ASR: encoder 16.6; +French ASR: encoder ? (*only 20 hours of French ASR)

  42. Pretraining on French. Spanish to English, N = 20 hours: baseline 10.8; +English ASR 19.9; +English ASR: encoder 16.6; +French ASR: encoder 12.5. French ASR helps Spanish-English ST

  43. Takeaways ● Pretraining on a different language helps ● Transfer all model parameters for the best gains ● Encoder parameters account for most of these gains … useful when the target vocabulary differs

  44. … Mboshi-French ST

  45. Mboshi-French ST ● ST data by Godard et al. 2018 ○ ~4 hours of speech, paired with French translations ● Mboshi ○ Bantu language, Republic of Congo ○ unwritten ○ ~160K speakers

  46. Mboshi-French: results [diagram: Mboshi audio → Encoder → Attention → Decoder → French text]. Mboshi to French, N = 4 hours: baseline ?

  47. Mboshi-French: results. Mboshi to French, N = 4 hours: baseline 3.5 BLEU (*outperformed by a naive baseline)

  48. Pretraining on French ASR [diagram: transfer all parameters from the French ASR model]. Mboshi to French, N = 4 hours: baseline 3.5; +French ASR: all ?

  49. Pretraining on French ASR. Mboshi to French, N = 4 hours: baseline 3.5; +French ASR: all 5.9. French ASR helps Mboshi-French ST

  51. Pretraining on English ASR [diagram: English ASR encoder → Mboshi ST, decoder randomly initialized]. Mboshi to French, N = 4 hours: baseline 3.5; +French ASR: all 5.9; +English ASR: encoder ? … using an encoder trained on much more data

  52. Pretraining on English ASR. Mboshi to French, N = 4 hours: baseline 3.5; +French ASR: all 5.9; +English ASR: encoder 5.3. English ASR helps Mboshi-French ST
