Pre-training on high-resource speech recognition improves low-resource speech-to-text translation
Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, Sharon Goldwater
Current systems
Spanish audio → Automatic Speech Recognition → Spanish text ("ola mi nombre es hodor") → Machine Translation → English text ("hi my name is hodor")
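In code, a current system is simply two components composed. A minimal sketch, where `transcribe` and `translate` are hypothetical stand-ins for off-the-shelf ASR and MT systems, not any specific library's API:

```python
# Hypothetical stand-ins for off-the-shelf ASR and MT components.
def transcribe(spanish_audio):
    """ASR: Spanish speech -> Spanish text."""
    return "ola mi nombre es hodor"   # placeholder output

def translate(spanish_text):
    """MT: Spanish text -> English text."""
    return "hi my name is hodor"      # placeholder output

def speech_to_translation(spanish_audio):
    # The cascade: errors made by ASR propagate into MT.
    return translate(transcribe(spanish_audio))
```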
~100 languages supported by Google Translate ...
Unwritten languages
Mboshi: Bantu language, Republic of Congo, ~160K speakers
~3000 languages have no writing system
Automatic Speech Recognition is not an option: Mboshi text is not available
Unwritten languages
Mboshi speech has been paired with French translations (Godard et al. 2018)
Efforts to collect speech and translations using mobile apps:
○ Aikuma: Bird et al. 2014; LIG-Aikuma: Blachon et al. 2016
Haiti Earthquake, 2010
Survivors sent text messages to a helpline, e.g. "Moun kwense nan Sakre Kè nan Pòtoprens" → "People trapped in Sacred Heart Church, PauP"
● International rescue teams faced a language barrier
● No automated tools were available
● Volunteers from the global Haitian diaspora helped create parallel text corpora in a short time [Munro 2010]
Are we better prepared in 2019?
What if survivors had sent voice messages instead? Spoken "Moun kwense nan Sakre Kè nan Pòtoprens" → "People trapped in Sacred Heart Church, PauP"
Can we build a speech-to-text translation (ST) system?
... given as training data: source audio paired with translations
● Tens of hours of speech paired with text translations
● No source-language text available
Neural models
Weiss et al. (2017): sequence-to-sequence models directly translate speech
Spanish audio → Sequence-to-Sequence → English text ("hi my name is hodor")
Spanish speech to English text
Architecture: Spanish audio → Encoder → Attention → Decoder → English text
● telephone speech (unscripted)
● realistic noise conditions
● multiple speakers and dialects
● crowdsourced English text translations
Closer to real-world conditions
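As a rough illustration of this kind of architecture (a minimal sketch, not the authors' exact model; all layer types and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Seq2SeqST(nn.Module):
    """Attention-based encoder-decoder: speech features -> text tokens."""
    def __init__(self, n_mel=80, hid=256, vocab=1000):
        super().__init__()
        # Encoder: bidirectional LSTM over speech frames (e.g. log-mel features)
        self.encoder = nn.LSTM(n_mel, hid, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.enc_proj = nn.Linear(2 * hid, hid)  # merge the two LSTM directions
        # Decoder: LSTM over embeddings of target tokens
        self.embed = nn.Embedding(vocab, hid)
        self.decoder = nn.LSTM(hid, hid, batch_first=True)
        # Attention: decoder states attend over encoder states
        self.attn = nn.MultiheadAttention(hid, num_heads=1, batch_first=True)
        self.out = nn.Linear(2 * hid, vocab)     # decoder state + context -> vocab

    def forward(self, speech, tokens):
        enc, _ = self.encoder(speech)              # (B, T_audio, 2*hid)
        enc = self.enc_proj(enc)                   # (B, T_audio, hid)
        dec, _ = self.decoder(self.embed(tokens))  # (B, T_text, hid)
        ctx, _ = self.attn(dec, enc, enc)          # attention context vectors
        return self.out(torch.cat([dec, ctx], dim=-1))

# Usage: a batch of 4 utterances, 500 frames of 80-dim features, 20 target tokens
model = Seq2SeqST()
logits = model(torch.randn(4, 500, 80), torch.randint(0, 1000, (4, 20)))
```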
Spanish speech to English text
Weiss et al.: good performance when trained on 100+ hours
(*for comparison, text-to-text MT = 58)
But ...
Weiss et al.: poor performance in low-resource settings
(*for comparison, text-to-text MT = 58)
Goal: improve translation performance ... without labeling more low-resource speech
Hundreds of hours of monolingual speech paired with text are available ... typically used to train ASR systems
e.g. English audio + English text, French audio + French text
Our setting: Spanish audio → Sequence-to-Sequence → English text, with only ~20 hours of Spanish-English training data
Prior work leverages source-language data, i.e. Spanish audio + Spanish text (Weiss et al. 2017; Anastasopoulos and Chiang 2018; Bérard et al. 2018; Sperber et al. 2019)
Key idea: instead, leverage monolingual data from a different high-resource language
Why Spanish-English?
To simulate low-resource settings and test our method.
Later: results on a truly low-resource pair, Mboshi to French.
Method
Same model architecture for ASR and ST: audio → Encoder → Attention → Decoder → text
*parameters randomly initialized
Pretrain on high-resource English ASR
300 hours of English audio and text: English audio → Encoder → Attention → Decoder → English text
*train until convergence
Fine-tune on low-resource data: 20 hours of Spanish-English
Transfer all parameters (Encoder, Attention, Decoder) from the English ASR model, then fine-tune: Spanish audio → Encoder → Attention → Decoder → English text
*train until convergence
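A minimal sketch of this two-step recipe, reusing the `Seq2SeqST` sketch above; `asr_batches` and `st_batches` are hypothetical data loaders yielding (speech features, target token ids), and the shared target vocabulary (English text in both steps) is what makes full-parameter transfer possible:

```python
import torch
import torch.nn as nn

def train(model, batches, epochs=10, lr=1e-3):
    """Standard cross-entropy training with teacher forcing."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for speech, tokens in batches:
            # Predict tokens[1:] from the shifted prefix tokens[:-1]
            logits = model(speech, tokens[:, :-1])
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()

model = Seq2SeqST()
# Step 1: pretrain on high-resource English ASR (English audio -> English text)
train(model, asr_batches)
# Step 2: fine-tune the SAME parameters, with no re-initialization, on
# low-resource ST (Spanish audio -> English text)
train(model, st_batches)
```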
Will this work?
Spanish-English BLEU scores
[plot: BLEU vs. hours of ST training data, for the baseline and with pretraining]
● pretraining gives +9 BLEU
● pretraining gives better performance with half the data
(*for comparison, Weiss et al. = 47.3)
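For reference, BLEU scores like those above are typically computed with a standard tool; a minimal example using sacrebleu, with illustrative strings rather than actual system output:

```python
import sacrebleu

hypotheses = ["hi my name is hodor"]     # system translations
references = [["hi my name is hodor"]]   # one reference stream
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```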
Further analysis
Zoom in on the 20-hour Spanish-English setting, pretraining vs. baseline
(*for comparison, Weiss et al. = 47.3)
Faster training time
[plot: translation performance vs. training time, for the baseline and with pretraining]
● the pretrained model reaches the baseline's performance after ~2 hours of training, vs. ~20 hours for the baseline
● potentially useful in time-critical scenarios
Ablation: model parameters
Which pretrained English ASR parameters matter? Transfer some and randomly initialize the rest.

Spanish to English, N = 20 hours     BLEU
baseline                             10.8
+English ASR (all parameters)        19.9
+English ASR: decoder only           10.5
+English ASR: encoder only           16.6

... transferring only the encoder parameters works well!
... so we can pretrain on a language different from both the source and the target of the ST pair
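A minimal sketch of encoder-only transfer, again assuming the hypothetical `Seq2SeqST`, `train`, and `st_batches` from the earlier sketches: copy just the encoder weights from the pretrained ASR model and leave attention and decoder randomly initialized.

```python
asr_model = Seq2SeqST()   # assume this was pretrained on English ASR
st_model = Seq2SeqST()    # fresh, randomly initialized ST model

# Copy only the encoder parameters (the encoder LSTM and its projection
# in this sketch); everything else keeps its random initialization.
encoder_weights = {name: tensor
                   for name, tensor in asr_model.state_dict().items()
                   if name.startswith(("encoder", "enc_proj"))}
st_model.load_state_dict(encoder_weights, strict=False)

train(st_model, st_batches)   # fine-tune on the 20-hour ST data
```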
Pretraining on French

Spanish to English, N = 20 hours     BLEU
baseline                             10.8
+English ASR (all parameters)        19.9
+English ASR: encoder only           16.6
+French ASR: encoder only            12.5
(*only 20 hours of French ASR)

French ASR helps Spanish-English ST
Takeaways
● Pretraining on a different language helps
● Transfer all model parameters for the best gains
● Encoder parameters account for most of the gains ... useful when the target vocabulary differs between pretraining and ST
... Mboshi-French ST
● ST data from Godard et al. 2018
○ ~4 hours of speech, paired with French translations
● Mboshi
○ Bantu language, Republic of Congo
○ Unwritten
○ ~160K speakers
Mboshi-French: Results

Mboshi to French, N = 4 hours        BLEU
baseline                             3.5

*outperformed by a naive baseline
Pretraining on French ASR
Transfer all parameters from a French ASR model.

Mboshi to French, N = 4 hours        BLEU
baseline                             3.5
+French ASR: all parameters          5.9

French ASR helps Mboshi-French ST
Pretraining on English ASR
Transfer only the encoder, trained on much more data (300 hours of English ASR); attention and decoder are randomly initialized.

Mboshi to French, N = 4 hours        BLEU
baseline                             3.5
+French ASR: all parameters          5.9
+English ASR: encoder only           5.3

English ASR helps Mboshi-French ST