Parameter Sharing Methods for Multilingual Self-Attentional Translation Models



  1. Parameter Sharing Methods for Multilingual Self-Attentional Translation Models
     Devendra Sachan¹, Graham Neubig²
     ¹ Data Solutions Team, Petuum Inc, USA
     ² Language Technologies Institute, Carnegie Mellon University, USA
     Conference on Machine Translation, Nov 2018

  2. Multilingual Machine Translation
     [Diagram: English, German, Dutch, and Japanese sentences flowing into and out of a single Multilingual Machine Translation System]
     ◮ Goal: train a machine learning system to translate from multiple source languages to multiple target languages.
     ◮ Multilingual models follow the multi-task learning (MTL) paradigm (see the training-loop sketch below):
       1. Models are jointly trained on data from several language pairs.
       2. They incorporate some degree of parameter sharing.
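
To make the joint-training idea concrete, here is a minimal sketch of the kind of round-robin training loop the MTL paradigm implies, written in PyTorch. The interfaces assumed here (a `model(src, tgt)` call that returns a scalar loss, and per-pair batch iterators) are illustrative assumptions, not the actual training code behind the paper.

```python
import itertools
import torch

def train_multilingual(model, batch_iters, steps=100_000, lr=1e-4):
    """Round-robin joint training over several language pairs (illustrative sketch).

    `model` is any encoder-decoder returning a scalar loss via `model(src, tgt)`;
    `batch_iters` maps a language pair such as "en-de" to a batch iterator.
    Both interfaces are assumptions made for this sketch.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    pair_cycle = itertools.cycle(batch_iters)      # cycles over pair names, e.g. "en-de", "en-nl"
    for _, pair in zip(range(steps), pair_cycle):
        src, tgt = next(batch_iters[pair])         # one mini-batch for this language pair
        loss = model(src, tgt)                     # e.g. cross-entropy over target tokens
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                           # shared parameters receive updates from every pair
```

Because every pair's batches update the same parameter set, whatever is shared between languages is trained jointly on all pairs.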

  3. One-to-Many Multilingual Translation
     [Diagram: English input to a Multilingual Machine Translation System producing German and Dutch output]
     ◮ Translation from a common source language ("En") to multiple target languages ("De" and "Nl").
     ◮ This is a difficult task, since the system must translate into (i.e., generate) multiple target languages.

  4. Previous Approach: Separate Decoders
     [Diagram: a shared encoder for the source language "En", with Decoder 1 producing target language 1 "De" and Decoder 2 producing target language 2 "Nl"]
     ◮ One shared encoder and one decoder per target language¹ (sketched in code below).
     ◮ Advantage: ability to model each target language separately.
     ◮ Disadvantages:
       1. Slower training.
       2. Increased memory requirements.
     ¹ Multi-Task Learning for Multiple Language Translation, ACL 2015
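
One plausible way to organize the separate-decoders architecture in PyTorch is a single shared encoder plus a `ModuleDict` holding one decoder per target language. This is an illustrative sketch under assumed `encoder`/`decoder_factory` interfaces, not the cited paper's implementation.

```python
import torch.nn as nn

class SharedEncoderMultiDecoder(nn.Module):
    """Shared encoder, one decoder per target language (illustrative sketch)."""

    def __init__(self, encoder: nn.Module, decoder_factory, tgt_langs):
        super().__init__()
        self.encoder = encoder                           # shared across all language pairs
        self.decoders = nn.ModuleDict(                   # separate parameters per target language
            {lang: decoder_factory() for lang in tgt_langs}
        )

    def forward(self, src, tgt, tgt_lang: str):
        memory = self.encoder(src)                       # encode once, reusable for any target
        return self.decoders[tgt_lang](tgt, memory)      # route to the language-specific decoder
```

Parameter count and memory grow roughly linearly with the number of target languages, which is the memory disadvantage noted above.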

  5. Previous Approach: Shared Decoder
     [Diagram: a shared encoder and a single shared decoder, translating source language "En" into target languages "De" and "Nl"]
     ◮ Single unified model: a shared encoder and a shared decoder for all language pairs² (see the target-token sketch below).
     ◮ Advantages:
       ◮ Trivially implementable using a standard bilingual translation model.
       ◮ Constant number of trainable parameters.
     ◮ Disadvantage: the decoder's ability to model multiple target languages can be significantly reduced.
     ² Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation, ACL 2017
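
How does a single shared decoder know which language to generate? In the cited Google system this is handled by prepending an artificial target-language token to the source sentence, so a standard bilingual model can be reused unchanged. The tiny sketch below illustrates that preprocessing step; the `<2de>`-style tag format and the toy sentence pairs are illustrative only.

```python
def add_target_token(src_tokens, tgt_lang):
    """Prepend a target-language tag, e.g. ['<2de>', 'hello', 'world'].

    The tag tells the single shared decoder which language to generate;
    everything else about the model stays a standard bilingual setup.
    """
    return [f"<2{tgt_lang}>"] + src_tokens

# The same shared model then trains on examples from all pairs mixed together:
examples = [
    (add_target_token(["hello", "world"], "de"), ["hallo", "welt"]),
    (add_target_token(["hello", "world"], "nl"), ["hallo", "wereld"]),
]
```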

  6. Our Proposed Approach: Partial Sharing
     [Diagram: a shared encoder for source language "En", with Decoder 1 ("De") and Decoder 2 ("Nl") connected through a block of shareable parameters]
     ◮ Share some, but not all, parameters (see the parameter-tying sketch below).
     ◮ Generalizes the previous approaches.
     ◮ We focus on the self-attentional Transformer model.
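
A sketch of what partial sharing could look like in practice: start from two independent decoders and tie a chosen subset of their submodules (here, hypothetically, the self-attention sublayers) so both decoders reuse one set of weights. Which subsets to share is exactly the design space the talk explores; the attribute names (`.layers`, `.self_attn`) and the particular choice below are assumptions, not the paper's configuration.

```python
import copy
import torch.nn as nn

def build_partially_shared_decoders(base_decoder: nn.Module):
    """Two decoders that share their self-attention sublayers (illustrative sketch).

    Assumes a Transformer-style decoder exposing `.layers`, each with a
    `.self_attn` submodule; these attribute names are hypothetical.
    """
    decoder_de = base_decoder
    decoder_nl = copy.deepcopy(base_decoder)         # start from an independent copy
    for layer_de, layer_nl in zip(decoder_de.layers, decoder_nl.layers):
        layer_nl.self_attn = layer_de.self_attn      # tie: both decoders update one set of weights
    return decoder_de, decoder_nl                    # untied sublayers stay language-specific
```

Sharing every sublayer recovers the fully shared decoder, and sharing none recovers the separate-decoders baseline, which is the sense in which partial sharing generalizes both previous approaches.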

  7. Transformer Model³
     [Diagram: the Transformer decoder stack. An embedding layer (W_E) with position encoding feeds N stacked layers; each layer has a masked self-attention sublayer, an encoder-decoder inter-attention sublayer over the encoder hidden states, and a feed-forward network sublayer (linear, ReLU, linear), each followed by layer normalization; the output passes through a linear layer tied to the embedding matrix (W_E transposed).]
     ◮ Embedding Layer
     ◮ Encoder Layer (2 sublayers):
       1. Self-attention
       2. Feed-forward network
     ◮ Decoder Layer (3 sublayers, as shown in the diagram; a minimal sketch in code follows):
       1. Masked self-attention
       2. Encoder-decoder attention
       3. Feed-forward network
     ³ Attention Is All You Need, NIPS 2017
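
For reference, here is a compact sketch of the three-sublayer decoder layer listed above, using PyTorch's built-in multi-head attention. The dimensions, post-norm placement, and omission of dropout are illustrative choices, not the exact configuration used in the talk.

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One Transformer decoder layer: three sublayers (illustrative dimensions)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.enc_dec_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, x, memory, tgt_mask=None):
        # 1. Masked self-attention over the target prefix.
        a, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norms[0](x + a)
        # 2. Encoder-decoder attention over the encoder's hidden states.
        a, _ = self.enc_dec_attn(x, memory, memory)
        x = self.norms[1](x + a)
        # 3. Position-wise feed-forward network.
        return self.norms[2](x + self.ffn(x))
```

The encoder layer has the same structure minus the encoder-decoder attention sublayer, which is why it is listed with two sublayers above.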
