  1. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, Sharon Goldwater

  2. Current systems: Spanish audio → ? → English text

  3. Current systems: Spanish audio → Automatic Speech Recognition → Spanish text ("ola mi nombre es hodor") → ? → English text

  4. Current systems: Spanish audio → Automatic Speech Recognition → Spanish text ("ola mi nombre es hodor") → Machine Translation → English text ("hi my name is hodor")

  5. ~100 languages supported by Google Translate ...

  6. Unwritten languages. Mboshi: Bantu language, Republic of Congo, ~160K speakers. ~3000 languages have no writing system, so Automatic Speech Recognition → Mboshi text is not available

  7. Unwritten languages. Mboshi: paired with French translations (Godard et al. 2018). ~3000 languages have no writing system. Efforts to collect speech and translations using mobile apps: ○ Aikuma: Bird et al. 2014 ○ LIG-Aikuma: Blachon et al. 2016

  8. Haiti earthquake, 2010. Survivors sent text messages to a helpline: "Moun kwense nan Sakre Kè nan Pòtoprens" / "People trapped in Sacred Heart Church, PauP" ● International rescue teams faced a language barrier ● No automated tools were available ● Volunteers from the global Haitian diaspora helped create parallel text corpora in a short time [Munro 2010]

  9. Are we better prepared in 2019? Voice messages: "Moun kwense nan Sakre Kè nan Pòtoprens" → "People trapped in Sacred Heart Church, PauP"

  10. Can we build a speech-to-text translation (ST) system given only source audio paired with translations as training data? ● Tens of hours of speech paired with text translations ● No source text available

  11. Neural models ... Weiss et al. (2017): a sequence-to-sequence model that directly translates speech. Spanish audio → English text ("hi my name is hodor")

  12. Spanish speech to English text. Spanish audio: ● telephone speech (unscripted) ● realistic noise conditions ● multiple speakers and dialects ● crowdsourced English text translations … closer to real-world conditions. [diagram: Spanish audio → Encoder → Attention → Decoder → English text]
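The slides do not say how the audio is represented before it reaches the encoder; the sketch below assumes the common choice of log-mel filterbank features, so the function and its hyperparameters are illustrative rather than the authors':

```python
# A minimal front-end sketch, assuming log-mel filterbank input features;
# the deck does not specify the feature pipeline, so treat this as illustrative.
import torch
import torchaudio

def load_features(wav_path: str) -> torch.Tensor:
    """Turn a waveform file into a (time, n_mels) log-mel feature matrix."""
    waveform, sample_rate = torchaudio.load(wav_path)   # (channels, samples)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_mels=80,                                      # 80 bins: a common choice
    )(waveform)
    log_mel = torch.log(mel + 1e-6)                     # log compression
    return log_mel.squeeze(0).transpose(0, 1)           # (time, n_mels)
```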

  13. Spanish speech to English text. Weiss et al.: good performance if trained on 100+ hours (*for comparison: text-to-text MT = 58 BLEU)

  14. But ... Weiss et al.: poor performance in low-resource settings (*for comparison: text-to-text MT = 58 BLEU)

  15. Goal: to improve translation performance

  16. Goal: to improve translation performance … without labeling more low-resource speech

  17. 100s of hours of monolingual speech paired with text are available … typically used to train ASR systems: English text (English audio), French text (French audio). Key idea: leverage monolingual data from a different high-resource language

  18. 100s of hours of monolingual speech paired with text are available … typically used to train ASR systems: English text (English audio), French text (French audio). Can they help a sequence-to-sequence ST model (Spanish audio → English text) trained on only ~20 hours of Spanish-English?

  19. 100s of hours of monolingual speech paired with text … prior work instead uses Spanish text (Spanish audio): Weiss et al. 2017, Anastasopoulos and Chiang 2018, Bérard et al. 2018, Sperber et al. 2019. [diagram: sequence-to-sequence ST, Spanish audio → English text, ~20 hours of Spanish-English]

  20. Our setting has no Spanish transcripts: can English or French ASR data help the sequence-to-sequence ST model (Spanish audio → English text, ~20 hours of Spanish-English)?

  21. Why Spanish-English?

  22. Why Spanish-English? To simulate low-resource settings and test our method

  23. Why Spanish-English? To simulate low-resource settings and test our method. Later: results on a truly low-resource pair, Mboshi to French

  24. Method: the same model architecture for ASR and ST [diagram: audio → Encoder → Attention → Decoder → text] (*randomly initialized parameters)
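A minimal PyTorch sketch of such an encoder-attention-decoder model follows. The layer types and sizes are assumptions for illustration (the actual model is deeper and handles attention per decoding step); the point is that one class serves both ASR and ST, with only the training pairs differing:

```python
# Sketch of the shared sequence-to-sequence architecture; sizes are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Audio encoder + attention + text decoder. The same architecture is
    trained for ASR (audio, transcript) and ST (audio, translation)."""
    def __init__(self, n_feats: int = 80, hidden: int = 256, vocab: int = 1000):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.attention = nn.MultiheadAttention(2 * hidden, num_heads=1,
                                               batch_first=True)
        self.embed = nn.Embedding(vocab, 2 * hidden)
        self.decoder = nn.LSTM(2 * hidden, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab)

    def forward(self, feats, prev_tokens):
        enc, _ = self.encoder(feats)                    # (B, T, 2H) audio states
        dec, _ = self.decoder(self.embed(prev_tokens))  # (B, U, 2H) text states
        ctx, _ = self.attention(dec, enc, enc)          # decoder attends to audio
        return self.out(dec + ctx)                      # (B, U, vocab) logits
```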

  25. Pretrain on high-resource English ASR: 300 hours of English audio and text [diagram: English audio → Encoder → Attention → Decoder → English text] (*train until convergence)

  26. Fine-tune on low-resource 20 hours of Spanish-English [diagram: the English ASR model (English audio → Encoder → Attention → Decoder → English text) transfers all parameters to the ST model (Spanish audio → Encoder → Attention → Decoder → English text)]

  27. Fine-tune on low-resource 20 hours of Spanish-English [diagram: Spanish audio → Encoder → Attention → Decoder → English text] (*train until convergence)
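The pretrain-then-fine-tune recipe amounts to reusing the ASR checkpoint as the initialization for ST training. A sketch, assuming the Seq2Seq class above and hypothetical data loaders (english_asr_loader, spanish_english_st_loader); since both tasks output English text, every parameter, including the decoder, can be transferred:

```python
# Sketch of the transfer recipe; the data loaders are hypothetical placeholders.
import torch

def train(model, loader, epochs):
    opt = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, prev_tokens, targets in loader:
            logits = model(feats, prev_tokens)           # (B, U, vocab)
            loss = loss_fn(logits.transpose(1, 2), targets)
            opt.zero_grad(); loss.backward(); opt.step()

model = Seq2Seq(vocab=1000)                              # English output vocabulary
train(model, english_asr_loader, epochs=20)              # 300 h English ASR
torch.save(model.state_dict(), "english_asr.pt")

# Fine-tune: start from the ASR weights instead of a random initialization.
model.load_state_dict(torch.load("english_asr.pt"))
train(model, spanish_english_st_loader, epochs=20)       # ~20 h Spanish-English ST
```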

  28. Will this work?

  29. Spanish-English BLEU scores [chart: baseline] (*for comparison: Weiss et al. = 47.3)

  30. Spanish-English BLEU scores [chart: baseline vs. pretraining] (*for comparison: Weiss et al. = 47.3)

  31. Spanish-English BLEU scores [chart: baseline vs. pretraining] ● +9 BLEU (*for comparison: Weiss et al. = 47.3)

  32. Spanish-English BLEU scores [chart: baseline vs. pretraining] ● better performance with half the data (*for comparison: Weiss et al. = 47.3)

  33. Further analysis [chart: pretraining vs. baseline, 20 hours Spanish-English] (*for comparison: Weiss et al. = 47.3)

  34. Faster training time [chart: pretraining vs. baseline]

  35. Faster training time ● potentially useful in time-critical scenarios [chart: pretraining ~2 hours vs. baseline ~20 hours of training]

  36. Ablation: model parameters [diagram: English ASR model → Spanish ST model]. Spanish to English, N = 20 hours: baseline 10.8 BLEU; +English ASR 19.9 BLEU

  37. Ablation: model parameters [diagram: encoder randomly initialized]. Spanish to English, N = 20 hours: baseline 10.8; +English ASR 19.9; +English ASR: decoder 10.5

  38. Ablation: model parameters [diagram: decoder randomly initialized]. Spanish to English, N = 20 hours: baseline 10.8; +English ASR 19.9; +English ASR: decoder 10.5; +English ASR: encoder 16.6

  39. Ablation: model parameters. Spanish to English, N = 20 hours: baseline 10.8; +English ASR 19.9; +English ASR: decoder 10.5; +English ASR: encoder 16.6 … transferring encoder-only parameters works well!

  40. Ablation: model parameters. Spanish to English, N = 20 hours: baseline 10.8; +English ASR 19.9; +English ASR: decoder 10.5; +English ASR: encoder 16.6 … so we can pretrain on a language different from both the source and target of the ST pair
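Because the encoder carries most of the gain, the pretraining language need not match either side of the ST pair: one can copy just the encoder weights into a model with a different target vocabulary. A sketch, again assuming the Seq2Seq class and the checkpoint file from the earlier snippets:

```python
# Sketch of encoder-only transfer; attention and decoder stay randomly initialized.
import torch

pretrained = torch.load("english_asr.pt")            # full English ASR checkpoint
encoder_only = {k: v for k, v in pretrained.items()
                if k.startswith("encoder.")}

st_model = Seq2Seq(vocab=8000)                       # new target vocabulary
# strict=False leaves the non-encoder parameters at their random initialization
st_model.load_state_dict(encoder_only, strict=False)
```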

  41. Pretraining on French [diagram: French ASR encoder → Spanish ST, decoder randomly initialized]. Spanish to English, N = 20 hours: baseline 10.8; +English ASR 19.9; +English ASR: encoder 16.6; +French ASR: encoder ? (*only 20 hours of French ASR)

  42. Pretraining on French. Spanish to English, N = 20 hours: baseline 10.8; +English ASR 19.9; +English ASR: encoder 16.6; +French ASR: encoder 12.5. French ASR helps Spanish-English ST

  43. Takeaways ● Pretraining on a different language helps ● Transfer all model parameters for the best gains ● Encoder parameters account for most of these gains … useful when the target vocabulary differs

  44. … Mboshi-French ST

  45. Mboshi-French ST ● ST data by Godard et al. 2018 ○ ~4 hours of speech, paired with French translations ● Mboshi ○ Bantu language, Republic of Congo ○ unwritten ○ ~160K speakers

  46. Mboshi-French: results [diagram: Mboshi audio → Encoder → Attention → Decoder → French text]. Mboshi to French, N = 4 hours: baseline ?

  47. Mboshi-French: results. Mboshi to French, N = 4 hours: baseline 3.5 BLEU (*outperformed by a naive baseline)

  48. Pretraining on French ASR [diagram: transfer all parameters from the French ASR model]. Mboshi to French, N = 4 hours: baseline 3.5; +French ASR: all ?

  49. Pretraining on French ASR. Mboshi to French, N = 4 hours: baseline 3.5; +French ASR: all 5.9. French ASR helps Mboshi-French ST

  51. Pretraining on English ASR [diagram: English ASR encoder → Mboshi ST, decoder randomly initialized]. Mboshi to French, N = 4 hours: baseline 3.5; +French ASR: all 5.9; +English ASR: encoder ? … using an encoder trained on much more data

  52. Pretraining on English ASR. Mboshi to French, N = 4 hours: baseline 3.5; +French ASR: all 5.9; +English ASR: encoder 5.3. English ASR helps Mboshi-French ST
