Unsupervised Neural Machine Translation with Weight Sharing
Zhen Yang, Wei Chen, Feng Wang, Bo Xu (Chinese Academy of Sciences)
Presented by Jing Ye, NLP Lab, Soochow University
Outline
- Task Definition
- Challenges and Solution
- Approach
- Experiments
- Conclusion
Task Definition
- NMT
  - Unsupervised neural machine translation (NMT) is an approach to machine translation
  - It does not use any labeled (parallel) data during training
- Translation Tasks
  - English-German
  - English-French
  - Chinese-English
Challenges and Solution
- Challenges
  - Only one encoder is shared by the source and the target language
  - This is weak at keeping the uniqueness and internal characteristics of each language (style, terminology, sentence structure)
- Solution
  - An independent encoder and decoder for each language
  - A weight-sharing constraint on the two auto-encoders (AEs)
  - Two different generative adversarial networks (GANs)
Approach: Model Architecture
- Encoders: Enc_s, Enc_t
- Decoders: Dec_s, Dec_t
- Local discriminator: D_l
- Global discriminators: D_g1, D_g2
Approach: Model Architecture
- Encoders: Enc_s, Enc_t
  - Each encoder is composed of a stack of four identical layers
  - Each layer consists of a multi-head self-attention sub-layer and a simple position-wise fully connected feed-forward network
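A minimal PyTorch sketch of one such encoder layer, just to make the structure concrete; the class name and dimension defaults are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention + position-wise feed-forward network."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # x: (seq_len, batch, d_model)
        a, _ = self.self_attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + self.drop(a))               # residual connection + layer norm
        return self.norm2(x + self.drop(self.ffn(x)))  # position-wise feed-forward sub-layer

# a four-layer encoder stack, as described on the slide
encoder_stack = nn.ModuleList([EncoderLayer() for _ in range(4)])
```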
Approach: Model Architecture
- Decoders: Dec_s, Dec_t
  - Each decoder is also composed of a stack of four identical layers
  - In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack
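Continuing the sketch above, a decoder layer with the extra encoder-decoder attention sub-layer might look like this (again an illustrative sketch, not the reference implementation):

```python
class DecoderLayer(nn.Module):
    """Decoder layer: masked self-attention, attention over the encoder output, feed-forward."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, y, enc_out, self_mask=None):
        # y: target-side states, enc_out: output of the encoder stack
        a, _ = self.self_attn(y, y, y, attn_mask=self_mask)
        y = self.norms[0](y + a)
        a, _ = self.cross_attn(y, enc_out, enc_out)    # third sub-layer: attend over encoder output
        y = self.norms[1](y + a)
        return self.norms[2](y + self.ffn(y))
```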
Approach: Model Architecture
- Local discriminator: D_l
  - A multi-layer perceptron (MLP)
- Global discriminators: D_g1, D_g2
  - Convolutional neural networks (CNNs)
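Rough sketches of such discriminators, continuing the same PyTorch imports; the hidden sizes, filter widths and pooling here are assumptions for illustration only:

```python
class LocalDiscriminator(nn.Module):
    """MLP that scores an encoder hidden state (e.g. mean-pooled) as source- or target-side."""
    def __init__(self, d_model=512, d_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.LeakyReLU(0.2),
            nn.Linear(d_hidden, d_hidden), nn.LeakyReLU(0.2),
            nn.Linear(d_hidden, 1))

    def forward(self, h):              # h: (batch, d_model)
        return self.net(h)             # raw logit

class GlobalDiscriminator(nn.Module):
    """CNN over word vectors that scores a whole sentence as real or generated."""
    def __init__(self, d_emb=512, n_filters=64, widths=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv1d(d_emb, n_filters, w) for w in widths])
        self.out = nn.Linear(n_filters * len(widths), 1)

    def forward(self, e):              # e: (batch, d_emb, seq_len)
        feats = [torch.relu(c(e)).max(dim=2).values for c in self.convs]
        return self.out(torch.cat(feats, dim=1))
```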
Approach: Model Architecture
[Figure: overall model architecture]
Approach: Directional Self-Attention
- The forward mask M_f
  - A later token only makes attention connections to the earlier tokens in the sequence
- The backward mask M_b
  - The reverse: an earlier token only attends to the later tokens
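A small sketch of how such additive attention masks can be built, continuing the PyTorch sketch above (0 keeps a connection, -inf removes it; whether a token may attend to itself is my assumption):

```python
def directional_masks(seq_len):
    """Forward mask M_f: position i attends only to positions j <= i (earlier tokens).
    Backward mask M_b: position i attends only to positions j >= i (later tokens)."""
    base = torch.full((seq_len, seq_len), float("-inf"))
    m_f = torch.triu(base, diagonal=1)    # block the strictly upper triangle -> look at the past
    m_b = torch.tril(base, diagonal=-1)   # block the strictly lower triangle -> look at the future
    return m_f, m_b

m_f, m_b = directional_masks(5)           # usable as attn_mask in the encoder sketch above
```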
Approach: Weight Sharing
- Share the weights of the last few layers of Enc_s and Enc_t
  - These layers are responsible for extracting high-level representations of the input sentences
- Share the first few layers of Dec_s and Dec_t, which are expected to decode the high-level representations that are vital for reconstructing the input sentences
- The shared weights are utilized to map the hidden features extracted by the independent weights into the shared latent space
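One simple way to realize this constraint is to let both encoders reuse the very same layer objects for their topmost layer(s), so the weights are tied by construction. A sketch under that assumption (the 3 private + 1 shared split mirrors the best setting reported in the results):

```python
n_layers, n_shared = 4, 1

# language-specific (independent) lower layers
enc_s_private = [EncoderLayer() for _ in range(n_layers - n_shared)]
enc_t_private = [EncoderLayer() for _ in range(n_layers - n_shared)]

# the last n_shared layers are the *same* module objects, so their weights are tied
shared_top = [EncoderLayer() for _ in range(n_shared)]

enc_s_layers = nn.ModuleList(enc_s_private + shared_top)
enc_t_layers = nn.ModuleList(enc_t_private + shared_top)
# The decoders mirror this: their *first* few layers are shared, the remaining ones are independent.
```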
Approach: Embedding-Reinforced Encoder
- Use pretrained cross-lingual embeddings in the encoders; they are kept fixed during training
- E = (e_1, ..., e_t): the input sequence of embedding vectors
- H = (h_1, ..., h_t): the initial output sequence of the encoder stack
- The final output sequence of the encoder combines H and E (see below)
- g is a gate unit (W_1, W_2 and b are trainable parameters shared by the two encoders)
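The gate equation itself is not legible on the slide. A plausible reconstruction consistent with the definitions above is g = σ(W_1 H + W_2 E + b) and H_r = g ⊙ H + (1 − g) ⊙ E, where σ is the element-wise sigmoid and H_r (my notation) denotes the final encoder output; the exact form in the paper may differ.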
Approach: Local GAN
- The local discriminator is trained to classify between the encoding of source sentences and the encoding of target sentences
- Local discriminator loss: θ_Dl, θ_Encs and θ_Enct denote the parameters of the local discriminator and the two encoders
- Encoder loss
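The loss formulas are not legible on the slide. A standard binary cross-entropy instantiation consistent with the description (my notation; the paper's exact objective may differ) is:

L_Dl  = - E_{x_s}[ log D_l(Enc_s(x_s)) ] - E_{x_t}[ log(1 - D_l(Enc_t(x_t))) ]
L_Enc = - E_{x_t}[ log D_l(Enc_t(x_t)) ] - E_{x_s}[ log(1 - D_l(Enc_s(x_s))) ]

where the first loss is minimized with respect to θ_Dl and the second with respect to θ_Encs and θ_Enct, so that the two encoders learn to produce language-indistinguishable encodings.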
Approach: Global GAN
- Aims to distinguish between the true sentences and the generated sentences
- Updates the whole set of parameters of the proposed model, including the parameters of the encoders and decoders
- In one of the two global GANs, Enc_t and Dec_s act as the generator, which generates the sentence x̃_s from x_t
- The corresponding global discriminator, implemented as a CNN, assesses whether the generated sentence x̃_s is a true source-language sentence or a generated one
Experiments: Setup
- Datasets
  - English-French: WMT14
    - A parallel corpus of about 30M sentence pairs
    - English sentences are selected from 15M random pairs, and French sentences from the complementary set
  - English-German: WMT16
    - Two monolingual training sets of 1.8M sentences each
  - Chinese-English: LDC
    - 1.6M sentence pairs randomly extracted from the LDC corpora
Experiments: Setup
- Tools
  - Train the embeddings for each language independently using word2vec (Mikolov et al., 2013)
  - Apply the public implementation of the method proposed by Artetxe et al. (2017a) to map these embeddings to a shared latent space
- Hyper-parameters
  - Word embedding dimension: 512
  - Dropout: 0.1
  - Number of attention heads: 8
  - Beam size: 4
  - α: 0.6
  - Hardware: four K80 GPUs
Experiments: Setup
- Model selection
  - Stop training when the model achieves no improvement for the tenth evaluation on the development set
  - Development set: 3,000 source and target sentences extracted randomly from the monolingual training corpora
- Evaluation metric
  - BLEU (Papineni et al., 2002)
  - Chinese-English: the script mteval-v11b.pl
  - English-German, English-French: the script multi-bleu.pl
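The stopping rule amounts to patience-based early stopping; a minimal sketch (the function names are placeholders, not from the authors' code):

```python
def train_with_early_stopping(train_one_round, evaluate_bleu, patience=10):
    """Stop once dev BLEU has failed to improve for `patience` consecutive evaluations."""
    best_bleu, bad_evals = float("-inf"), 0
    while bad_evals < patience:
        train_one_round()            # e.g. a fixed number of update steps between evaluations
        bleu = evaluate_bleu()       # BLEU on the 3,000-sentence development set
        if bleu > best_bleu:
            best_bleu, bad_evals = bleu, 0
        else:
            bad_evals += 1
    return best_bleu
```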
Experiments: Setup
- Baseline Systems
  - Word-by-word translation (WBW): translates a sentence word by word, replacing each word with its nearest neighbor in the other language (see the sketch below)
  - Lample et al. (2017): uses the same training and testing sets as this paper; encoder: Bi-LSTM; decoder: a forward LSTM
  - Supervised training: the same model as ours, but trained with the standard cross-entropy loss on the original parallel sentences
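For intuition, the WBW baseline can be sketched with cross-lingual embeddings and cosine nearest neighbours; every name below is illustrative, and the real baseline may differ in details such as OOV handling:

```python
import numpy as np

def word_by_word_translate(sentence, src_vocab, tgt_words, src_emb, tgt_emb):
    """Replace each source word with its nearest target word (cosine similarity)
    in the shared cross-lingual embedding space.
    src_vocab: dict word -> row index of src_emb; tgt_words: list of target words,
    row-aligned with tgt_emb."""
    tgt_norm = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    out = []
    for word in sentence.split():
        if word not in src_vocab:
            out.append(word)                            # copy unknown words unchanged
            continue
        v = src_emb[src_vocab[word]]
        scores = tgt_norm @ (v / np.linalg.norm(v))     # cosine similarity to every target word
        out.append(tgt_words[int(scores.argmax())])
    return " ".join(out)
```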
Experiments: Results
- Number of weight-sharing layers
  - The number of weight-sharing layers has a noticeable effect on translation performance
  - The best translation performance is achieved when only one layer is shared in our system
  - When all four layers are shared, we get poor translation performance on all three translation tasks
  - Using two completely independent encoders also results in poor translation performance
Experiments: Results
- Translation Results
  - The proposed approach obtains significant improvements over the word-by-word baseline system, with at least +5.01 BLEU points on English-to-German translation and up to +13.37 BLEU points on English-to-French translation
  - Compared to the work of Lample et al. (2017), our model achieves up to +1.92 BLEU points of improvement on the English-to-French translation task
  - However, there is still large room for improvement compared to the supervised upper bound
Experiments: Results
- Ablation Study
  - The best performance is obtained with the simultaneous use of all the tested elements
  - The weight-sharing constraint is vital for mapping sentences of different languages into the shared latent space
  - The embedding-reinforced encoder, directional self-attention, local GANs and global GANs are all important components of the proposed system
Conclusion
- We propose a weight-sharing constraint for unsupervised NMT
- We also introduce embedding-reinforced encoders, a local GAN and global GANs into the proposed system
- The experimental results reveal that our approach achieves significant improvements
- However, there is still large room for improvement compared to supervised NMT
Q&A
Thank you!