
Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context



  1. Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context
     Zhen Yang, Wei Chen, Feng Wang, Bo Xu (Chinese Academy of Sciences)
     Jing Ye (NLP Lab, Soochow University)

  2. Outline
     • Task Definition
     • Challenges and Solution
     • Approach
     • Experiments
     • Conclusion

  3. Task Definition
     • Unsupervised NMT
       - Unsupervised neural machine translation (NMT) is an approach to machine translation
       - It does not use any labeled (parallel) data during training
     • Translation Tasks
       - English-German
       - English-French
       - Chinese-English

  4. Challenges and Solution
     • Challenges
       - Only one encoder is shared by the source and target languages
       - A shared encoder is weak at keeping the uniqueness and internal characteristics of each language (style, terminology, sentence structure)
     • Solution
       - Two independent encoders and two independent decoders, one for each language
       - A weight-sharing constraint on the two auto-encoders (AEs)
       - Two different kinds of generative adversarial networks (GANs)

  5. Approach
     • Model Architecture
       - Encoders: Enc_s, Enc_t
       - Decoders: Dec_s, Dec_t
       - Local discriminator: D_l
       - Global discriminators: D_g1, D_g2

  6. Approach
     • Model Architecture
       - Encoders: Enc_s, Enc_t
       - Each encoder is composed of a stack of four identical layers
       - Each layer consists of a multi-head self-attention sub-layer and a simple position-wise fully connected feed-forward network
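
A minimal PyTorch sketch of such an encoder stack follows (an illustration, not the authors' code): d_model = 512 and 8 attention heads are taken from the hyper-parameter slide later in the deck, while the feed-forward width and the residual/layer-norm placement are assumptions.

import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention + position-wise feed-forward network."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        a, _ = self.self_attn(x, x, x, attn_mask=attn_mask)   # sub-layer 1: self-attention
        x = self.norm1(x + self.drop(a))                      # residual connection + layer norm
        x = self.norm2(x + self.drop(self.ffn(x)))            # sub-layer 2: feed-forward
        return x

class Encoder(nn.Module):
    """Stack of four identical layers; Enc_s and Enc_t would each be one such stack."""
    def __init__(self, n_layers=4, **kw):
        super().__init__()
        self.layers = nn.ModuleList(EncoderLayer(**kw) for _ in range(n_layers))

    def forward(self, x, attn_mask=None):
        for layer in self.layers:
            x = layer(x, attn_mask=attn_mask)
        return x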

  7. Approach
     • Model Architecture
       - Decoders: Dec_s, Dec_t
       - The decoder is composed of a stack of four identical layers
       - In addition to the two sub-layers of each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack
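
Continuing the sketch above, a decoder layer adds the inserted multi-head attention over the encoder output; again an illustration under the same assumed dimensions, not the paper's exact code.

import torch.nn as nn

class DecoderLayer(nn.Module):
    """Decoder layer: masked self-attention, attention over the encoder output, feed-forward."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.enc_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))
        self.drop = nn.Dropout(dropout)

    def forward(self, y, memory, tgt_mask=None):
        a, _ = self.self_attn(y, y, y, attn_mask=tgt_mask)   # self-attention over the decoded prefix
        y = self.norms[0](y + self.drop(a))
        a, _ = self.enc_attn(y, memory, memory)              # inserted sub-layer: attention over encoder output
        y = self.norms[1](y + self.drop(a))
        y = self.norms[2](y + self.drop(self.ffn(y)))        # position-wise feed-forward
        return y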

  8. Approach
     • Model Architecture
       - Local discriminator D_l: a multi-layer perceptron
       - Global discriminators D_g1, D_g2: convolutional neural networks (CNNs)
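
Hedged sketches of the two discriminator types follow; the hidden sizes, filter widths and filter counts are illustrative choices, not values from the paper.

import torch
import torch.nn as nn

class LocalDiscriminator(nn.Module):
    """MLP that scores an encoder representation as coming from the source vs. the target encoder."""
    def __init__(self, d_model=512, d_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.LeakyReLU(0.2),
            nn.Linear(d_hidden, d_hidden), nn.LeakyReLU(0.2),
            nn.Linear(d_hidden, 1))

    def forward(self, h):          # h: (batch, d_model), e.g. mean-pooled encoder states
        return self.net(h)         # raw logit; apply a sigmoid / BCE loss outside

class GlobalDiscriminator(nn.Module):
    """CNN over the embeddings of a (true or generated) sentence, in the style of CNN text classifiers."""
    def __init__(self, d_emb=512, n_filters=64, widths=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv1d(d_emb, n_filters, w) for w in widths)
        self.out = nn.Linear(n_filters * len(widths), 1)

    def forward(self, e):          # e: (batch, seq_len, d_emb)
        x = e.transpose(1, 2)      # Conv1d expects (batch, channels, length)
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.out(torch.cat(feats, dim=1))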

  9. Approach
     • Model Architecture (overall architecture diagram)

  10. Approach
     • Directional Self-attention
       - The forward mask M_f: a later token only makes attention connections to the earlier tokens in the sequence
       - The backward mask M_b: symmetrically, a token only attends to the later tokens in the sequence
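
One way to build the two masks (additive masks of 0 / -inf, as used by standard attention implementations; whether the paper lets a token attend to itself is not stated on the slide, so the diagonal is kept here so that every row has at least one allowed position):

import torch

def directional_masks(seq_len):
    """Forward mask M_f: position i may attend only to positions j <= i (earlier tokens).
    Backward mask M_b: position i may attend only to positions j >= i (later tokens).
    Blocked positions get -inf so they vanish after the softmax."""
    allowed_f = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    allowed_b = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool))
    m_f = torch.zeros(seq_len, seq_len).masked_fill(~allowed_f, float("-inf"))
    m_b = torch.zeros(seq_len, seq_len).masked_fill(~allowed_b, float("-inf"))
    return m_f, m_b

# Usage with the encoder sketch above, e.g. enc(x, attn_mask=m_f) for the forward direction.
m_f, m_b = directional_masks(5)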

  11. Approach
     • Weight sharing
       - Share the weights of the last few layers of Enc_s and Enc_t, which are responsible for extracting high-level representations of the input sentences
       - Share the weights of the first few layers of Dec_s and Dec_t, which are expected to decode the high-level representations that are vital for reconstructing the input sentences
       - The shared weights map the hidden features extracted by the independent weights into the shared latent space
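
A sketch of how this constraint can be realized for the encoders, reusing the EncoderLayer sketch above: the shared layers are literally the same module objects, so their weights are tied. The symmetric construction for the decoders (sharing the first layers of Dec_s and Dec_t) follows the same idea; n_shared = 1 matches the best setting reported in the results slides.

import torch.nn as nn

class SharedEncoderPair(nn.Module):
    """Enc_s and Enc_t keep their lower layers independent and share the top n_shared layers."""
    def __init__(self, n_layers=4, n_shared=1, **kw):
        super().__init__()
        self.shared = nn.ModuleList(EncoderLayer(**kw) for _ in range(n_shared))
        self.private_s = nn.ModuleList(EncoderLayer(**kw) for _ in range(n_layers - n_shared))
        self.private_t = nn.ModuleList(EncoderLayer(**kw) for _ in range(n_layers - n_shared))

    def encode(self, x, lang, attn_mask=None):
        private = self.private_s if lang == "s" else self.private_t
        for layer in list(private) + list(self.shared):   # independent layers first, shared layers last
            x = layer(x, attn_mask=attn_mask)
        return x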

  12. Approach
     • Embedding reinforced encoder
       - Uses pretrained cross-lingual embeddings in the encoders; the embeddings are kept fixed during training
       - E = (e_1, ..., e_n): the input sequence of embedding vectors
       - H = (h_1, ..., h_n): the initial output sequence of the encoder stack
       - H_r: the final output sequence of the encoder, obtained by combining H and E through a gate
       - g is a gate unit; W_1, W_2 and b are trainable parameters shared by the two encoders
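
A sketch of the gate, with the combination reconstructed from the quantities defined on this slide (the per-dimension sigmoid gate and the exact shapes of W_1 and W_2 are assumptions, since the slide's equation did not survive the transcript):

import torch
import torch.nn as nn

class EmbeddingReinforcedOutput(nn.Module):
    """Mixes the fixed cross-lingual embeddings E with the encoder stack output H:
        g   = sigmoid(E W_1 + H W_2 + b)
        H_r = g * H + (1 - g) * E
    W_1, W_2 and b are trainable and would be shared by the two encoders."""
    def __init__(self, d_model=512):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_model, bias=False)
        self.w2 = nn.Linear(d_model, d_model, bias=True)   # its bias term plays the role of b

    def forward(self, E, H):       # both (batch, seq_len, d_model)
        g = torch.sigmoid(self.w1(E) + self.w2(H))
        return g * H + (1 - g) * E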


  14. Approach
     • Local GAN
       - The local discriminator is trained to classify between the encodings of source sentences and the encodings of target sentences
       - Local discriminator loss: θ_{D_l}, θ_{Enc_s} and θ_{Enc_t} represent the parameters of the local discriminator and the two encoders
       - Encoder loss: the encoders are trained adversarially against D_l
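
The slide's loss equations did not survive the transcript; the sketch below assumes standard binary cross-entropy GAN losses as a stand-in. d_l can be the LocalDiscriminator sketch from slide 8, and pooled_s / pooled_t stand for pooled outputs of Enc_s / Enc_t.

import torch
import torch.nn.functional as F

def local_gan_losses(d_l, pooled_s, pooled_t):
    """Returns (loss over theta_{D_l}, adversarial loss over theta_{Enc_s} and theta_{Enc_t})."""
    ones = torch.ones(pooled_s.size(0), 1)
    zeros = torch.zeros(pooled_t.size(0), 1)

    # D_l tries to label source encodings as 1 and target encodings as 0.
    d_loss = F.binary_cross_entropy_with_logits(d_l(pooled_s.detach()), ones) \
           + F.binary_cross_entropy_with_logits(d_l(pooled_t.detach()), zeros)

    # The encoders try to fool D_l, pushing both languages toward one shared latent space.
    enc_loss = F.binary_cross_entropy_with_logits(d_l(pooled_s), zeros) \
             + F.binary_cross_entropy_with_logits(d_l(pooled_t), ones)
    return d_loss, enc_loss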

  15. Approach
     • Global GAN
       - Aims to distinguish between the true sentences and the generated sentences
       - Updates all of the parameters of the proposed model, including the parameters of the encoders and decoders
       - In GAN_1, Enc_t and Dec_s act as the generator, which generates the sentence x̃ from the input sentence x
       - D_g1, implemented with a CNN, assesses whether the generated x̃ is a true sentence in the decoder's output language or a generated one
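
A rough sketch of one global GAN loss pair under the same stand-in assumptions (BCE losses; how gradients flow through the discrete decoding step is not covered by this slide). d_g can be the GlobalDiscriminator CNN sketch from slide 8; true_emb / gen_emb are embeddings of real and generated sentences in the decoder's output language.

import torch
import torch.nn.functional as F

def global_gan_losses(d_g, true_emb, gen_emb):
    """Returns (discriminator loss for D_g, generator loss for the Enc/Dec pair)."""
    ones = torch.ones(true_emb.size(0), 1)
    zeros = torch.zeros(gen_emb.size(0), 1)

    # D_g distinguishes true sentences from generated ones.
    d_loss = F.binary_cross_entropy_with_logits(d_g(true_emb), ones) \
           + F.binary_cross_entropy_with_logits(d_g(gen_emb.detach()), zeros)

    # The encoder-decoder generator is updated to make its output look like a true sentence.
    g_loss = F.binary_cross_entropy_with_logits(d_g(gen_emb), ones)
    return d_loss, g_loss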


  17. Experiments Setup
     • Datasets
       - English-French: WMT14, a parallel corpus of about 30M sentence pairs; the English sentences are taken from 15M random pairs and the French sentences from the complementary set, so the two monolingual sets contain no parallel sentence pairs
       - English-German: WMT16, two monolingual training sets of 1.8M sentences each
       - Chinese-English: LDC, 1.6M sentence pairs randomly extracted from the LDC corpora

  18. Experiments Setup
     • Tools
       - The embeddings for each language are trained independently using word2vec (Mikolov et al., 2013)
       - The public implementation of the method proposed by Artetxe et al. (2017a) is applied to map these embeddings into a shared latent space
     • Hyper-parameters
       - Word embedding size: 512
       - Dropout: 0.1
       - α: 0.6
       - Number of attention heads: 8
       - Beam size: 4
       - Hardware: four K80 GPUs
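
A sketch of the monolingual embedding step, assuming gensim's word2vec implementation; the cross-lingual mapping step uses the public implementation of Artetxe et al. (2017a), whose interface is not reproduced here because it is not given on the slide.

from gensim.models import Word2Vec

def train_monolingual_embeddings(tokenized_corpus, out_path, dim=512):
    """tokenized_corpus: iterable of token lists for one language; one model is trained per language."""
    model = Word2Vec(sentences=tokenized_corpus, vector_size=dim,
                     window=5, min_count=5, sg=1, workers=4)
    model.wv.save_word2vec_format(out_path)   # plain-text vectors, consumable by the mapping tool
    return model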

  19. Experiments Setup
     • Model selection
       - Training is stopped when the model shows no improvement for ten consecutive evaluations on the development set
       - Development set: 3,000 source and target sentences extracted randomly from the monolingual training corpora
     • Evaluation metrics
       - BLEU (Papineni et al., 2002)
       - Chinese-English: the script mteval-v11b.pl
       - English-German, English-French: the script multi-bleu.pl
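
A small sketch of that stopping rule (evaluation frequency and tie handling are assumptions):

def should_stop(dev_bleu_history, patience=10):
    """Stop once no dev-set evaluation in the last `patience` rounds beats the best score seen before them."""
    if len(dev_bleu_history) <= patience:
        return False
    best_so_far = max(dev_bleu_history)
    return max(dev_bleu_history[-patience:]) < best_so_far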

  20. Experiments Setup
     • Baseline Systems
       - Word-by-word translation (WBW): translates a sentence word by word, replacing each word with its nearest neighbor in the other language
       - Lample et al. (2017): uses the same training and test sets as this paper; encoder: a Bi-LSTM, decoder: a forward LSTM
       - Supervised training: the same model as ours, but trained using the standard cross-entropy loss on the original parallel sentences
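
A sketch of the WBW baseline using cross-lingual embeddings and cosine nearest neighbours (the data structures here are illustrative, not from the paper):

import numpy as np

def word_by_word_translate(sentence, src_vecs, tgt_vecs, tgt_words):
    """src_vecs: dict word -> vector; tgt_vecs: (V, d) matrix row-aligned with tgt_words.
    Each source word is replaced by its nearest target-language neighbour; unknown words are copied."""
    tgt_norm = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    out = []
    for w in sentence.split():
        v = src_vecs.get(w)
        if v is None:
            out.append(w)                                  # copy through out-of-vocabulary words
            continue
        sims = tgt_norm @ (v / np.linalg.norm(v))
        out.append(tgt_words[int(np.argmax(sims))])
    return " ".join(out)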

  21. Experiments Results
     • Number of weight-sharing layers
       - We find that the number of weight-sharing layers has a considerable effect on translation performance

  22. Experiments Results
     • Number of weight-sharing layers
       - The best translation performance is achieved when only one layer is shared in our system

  23. Experiments Results
     • Number of weight-sharing layers
       - When all four layers are shared, we get poor translation performance on all three translation tasks

  24. Experiments Results
     • Number of weight-sharing layers
       - Using two completely independent encoders also results in poor translation performance

  25. Experiments Results
     • Translation Results
       - The proposed approach obtains significant improvements over the word-by-word baseline system, from at least +5.01 BLEU points in English-to-German translation up to +13.37 BLEU points in English-to-French translation

  26. Experiments Results
     • Translation Results
       - Compared to the work of Lample et al. (2017), our model achieves up to +1.92 BLEU points of improvement on the English-to-French translation task
       - However, there is still considerable room for improvement compared to the supervised upper bound

  27. Experiments Results
     • Ablation Study
       - The best performance is obtained with the simultaneous use of all the tested elements
       - The weight-sharing constraint is vital for mapping sentences of different languages into the shared latent space

  28. Experiments Results
     • Ablation Study
       - The embedding-reinforced encoder, directional self-attention, local GANs and global GANs are all important components of the proposed system

  29. Conclusion
     • We propose a weight-sharing constraint for unsupervised NMT
     • We also introduce embedding-reinforced encoders, a local GAN and a global GAN into the proposed system
     • The experimental results show that our approach achieves significant improvements
     • However, there is still considerable room for improvement compared to supervised NMT

  30. Q&A
     Thank you!
