Cross-lingual language model pretraining




  1. Cross-lingual Language Model Pretraining. Alexis Conneau and Guillaume Lample, Facebook AI Research

  2. Why learn cross-lingual representations? [Figure: the same sentence in English, French, and German: “This is great.” / “C’est super.” / “Das ist toll.”]

  3. Cross-lingual language models

  4. Multilingual Masked Language Modeling (MLM). Similar to BERT, we pretrain a Transformer model with MLM, but on many languages: multilingual representations emerge from a single MLM model trained on many languages. [Figure: multilingual masked language modeling pretraining] Devlin et al. – BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (+ mBERT)
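
A minimal sketch of the masking step behind multilingual MLM, assuming a BERT-style 80/10/10 recipe; the [MASK] symbol, the toy sentences, and the high masking rate in the demo are illustrative assumptions, not the actual XLM code:

    import random

    MASK = "[MASK]"  # assumed mask symbol; the real tokenizer's special tokens may differ

    def mask_for_mlm(tokens, vocab, mask_prob=0.15, seed=0):
        """BERT-style masking: of the selected positions, 80% become MASK,
        10% become a random vocabulary token, 10% are left unchanged."""
        rng = random.Random(seed)
        inputs, targets = list(tokens), [None] * len(tokens)
        for i, tok in enumerate(tokens):
            if rng.random() < mask_prob:
                targets[i] = tok                   # the model must predict the original token
                r = rng.random()
                if r < 0.8:
                    inputs[i] = MASK
                elif r < 0.9:
                    inputs[i] = rng.choice(vocab)  # random replacement
        return inputs, targets

    # One shared model is trained on monolingual streams from many languages.
    streams = {
        "en": "this is great".split(),
        "fr": "c'est super".split(),
        "de": "das ist toll".split(),
    }
    vocab = [t for toks in streams.values() for t in toks]
    for lang, toks in streams.items():
        # mask_prob raised to 0.5 only so the tiny demo visibly masks something
        print(lang, mask_for_mlm(toks, vocab, mask_prob=0.5))

Because every language passes through the same model (and, in XLM, a shared subword vocabulary), the objective itself is unchanged from BERT; the cross-lingual behaviour comes from the shared parameters.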

  5. Translation Language Modeling (TLM). Multilingual MLM is unsupervised; with TLM we also leverage parallel data: pairs of parallel sentences are concatenated and masked together, to encourage the model to leverage cross-lingual context when making predictions. [Figure: translation language modeling (TLM) pretraining]
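
As a companion sketch, here is the TLM variant under the same toy setup: a translation pair is concatenated and masked on both sides, so a masked word can be recovered from context in either language. The "</s>" separator is an assumption, and the language embeddings and position resets described in the paper are omitted:

    import random

    def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.3, seed=0):
        """Concatenate a parallel sentence pair and mask tokens on both sides,
        so that a masked English word can be predicted from the French context
        and vice versa."""
        rng = random.Random(seed)
        pair = src_tokens + ["</s>"] + tgt_tokens      # "</s>" separator is an assumption
        inputs, targets = list(pair), [None] * len(pair)
        for i, tok in enumerate(pair):
            if tok != "</s>" and rng.random() < mask_prob:
                inputs[i], targets[i] = "[MASK]", tok
        return inputs, targets

    inputs, targets = make_tlm_example("this is great".split(), "c'est super".split())
    print(inputs)   # which positions are masked depends on the seed
    print(targets)  # non-None entries are the tokens the model must predict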

  6. Results on XLU (cross-lingual understanding) benchmarks

  7. Results on Cross-lingual Classification (XNLI). The pretrained encoder is fine-tuned on the English XNLI(*) training data and then tested on the 15 XNLI languages for zero-shot cross-lingual classification. Average XNLI accuracy over the 15 languages:
     XNLI baseline: 65.6 | mBERT: 66.3 | LASER: 70.2 | XLM (MLM): 71.5 | XLM (MLM+TLM): 75.1
     (*) Conneau et al. – XNLI: Evaluating Cross-lingual Sentence Representations (EMNLP 2018)
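
To make the zero-shot protocol concrete, here is a minimal sketch assuming a pretrained cross-lingual sentence encoder is available: a classification head is fine-tuned on English data only and then applied unchanged to other languages. The bag-of-embeddings encoder, toy vocabulary, single-sentence inputs, and labels below are stand-ins so the example runs (XNLI actually classifies premise-hypothesis pairs into three classes):

    import torch
    import torch.nn as nn

    VOCAB = {w: i for i, w in enumerate(
        "this is great bad c'est super mauvais das ist toll schlecht".split())}

    class StandInEncoder(nn.Module):
        """Stand-in for the pretrained cross-lingual encoder: mean of word embeddings."""
        def __init__(self, dim=16):
            super().__init__()
            self.emb = nn.Embedding(len(VOCAB), dim)
        def forward(self, tokens):
            ids = torch.tensor([VOCAB[t] for t in tokens])
            return self.emb(ids).mean(dim=0)

    encoder, head = StandInEncoder(), nn.Linear(16, 3)   # 3 NLI classes
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=0.05)

    # 1) Fine-tune on English training data only (toy labels).
    english_train = [("this is great".split(), 0), ("this is bad".split(), 2)]
    for _ in range(100):
        for toks, label in english_train:
            loss = nn.functional.cross_entropy(head(encoder(toks)).unsqueeze(0),
                                               torch.tensor([label]))
            opt.zero_grad(); loss.backward(); opt.step()

    # 2) Zero-shot evaluation: no French or German labels were ever seen.
    #    This only works to the extent that the encoder maps translations close
    #    together, which the randomly initialized stand-in does not guarantee.
    for sent in ["c'est super", "das ist toll"]:
        print(sent, "->", head(encoder(sent.split())).argmax().item())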

  8. Results on Unsupervised Machine Translation. Initialization is key in unsupervised MT to bootstrap the iterative back-translation (BT) process. Embedding-layer initialization is essential for neural unsupervised MT(*); initializing the full Transformer model significantly improves performance (+7 BLEU).
     BLEU: Embeddings pretrained: 27.3 | Full model pretrained (CLM): 30.5 | Full model pretrained (MLM): 34.3 | Supervised 2016 SOTA (Edinburgh): 36.2
     (*) Lample et al. – Phrase-based and neural unsupervised machine translation (EMNLP 2018)
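
A sketch of what "full model pretrained" means for the MT system: both the encoder and the decoder of the translation model are initialized from the pretrained cross-lingual encoder, while the decoder's encoder-decoder cross-attention has no pretrained counterpart and stays randomly initialized. Using torch.nn Transformer modules and relying on their matching parameter names is an assumption for illustration, not the actual XLM code:

    import torch.nn as nn

    d_model, nhead, num_layers = 512, 8, 6

    def build_encoder():
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers)

    # Pretend this state dict is the MLM-pretrained cross-lingual encoder checkpoint.
    pretrained_state = build_encoder().state_dict()

    encoder = build_encoder()
    decoder = nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)

    # Encoder of the MT model: load the pretrained weights directly.
    encoder.load_state_dict(pretrained_state)

    # Decoder: copy the tensors that exist in both (self-attention, feed-forward,
    # layer norms); the cross-attention and its layer norm keep their random init.
    dec_state = decoder.state_dict()
    copied = {k: v for k, v in pretrained_state.items()
              if k in dec_state and dec_state[k].shape == v.shape}
    dec_state.update(copied)
    decoder.load_state_dict(dec_state)
    print(f"{len(copied)}/{len(dec_state)} decoder tensors initialized from pretraining")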

  9. Results on Supervised Machine Translation. We also show the importance of pretraining for generation:
     • Pretraining both the encoder and the decoder improves the BLEU score
     • MLM pretraining is better than CLM (language model) pretraining
     • Back-translation + pretraining leads to the best BLEU score
     • Pretraining is more important when the supervised data is small
     [Chart: BLEU for no pretraining vs. full model pretrained (CLM) vs. full model pretrained (MLM), with and without back-translation]
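
Since back-translation appears in both the unsupervised and the supervised results, here is a toy sketch of the iterative back-translation idea: a translation direction is trained on synthetic pairs produced by the opposite direction from monolingual data. The word-for-word tables, translate, and train_step are stand-ins so the loop runs (and only one direction is updated, for brevity); this is not the actual NMT training procedure:

    def translate(sentence, table):
        """Stub model: word-by-word lookup, copying unknown words unchanged."""
        return [table.get(w, w) for w in sentence]

    def train_step(table, source, target):
        """Stub update: memorize the word pairs seen in this synthetic pair."""
        table.update(dict(zip(source, target)))

    fr_en = {"ceci": "this", "est": "is", "super": "great"}   # fr->en "model"
    en_fr = {}                                                # en->fr starts untrained
    fr_mono = [["ceci", "est", "super"]]                      # monolingual French only

    for _ in range(2):                                        # iterative BT rounds
        for fr in fr_mono:
            synthetic_en = translate(fr, fr_en)     # back-translate fr -> en
            train_step(en_fr, synthetic_en, fr)     # train en->fr on (synthetic en, real fr)

    print(translate("this is great".split(), en_fr))          # ['ceci', 'est', 'super']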

  10. Conclusion
     • Cross-lingual language model pretraining is very effective for XLU
     • New state of the art for cross-lingual classification on XNLI
     • Reduces the gap between unsupervised and supervised MT
     • Recent developments have improved XLM/mBERT models

  11. Thank you! Code and models available at github.com/facebookresearch/XLM. Lample & Conneau – Cross-lingual Language Model Pretraining (NeurIPS 2019)
