Unsupervised Neural Machine Translation with Weight Sharing
Zhen Yang, Wei Chen, Feng Wang, Bo Xu (Chinese Academy of Sciences)
Presented by Jing Ye, NLP Lab, Soochow University
Outline
- Task Definition
- Challenges and Solution
- Approach
- Experiments
- Conclusion
Task Definition
- Unsupervised neural machine translation (NMT) is an approach to machine translation that uses no labeled (parallel) data during training.
- Translation tasks: English-German, English-French, Chinese-English
Challenges and Solution
Challenges
- Prior work shares a single encoder between the source and target languages.
- A fully shared encoder is weak at preserving the uniqueness and internal characteristics of each language (style, terminology, sentence structure).
Solution
- An independent encoder and decoder for each language.
- A weight-sharing constraint on the two auto-encoders (AEs).
- Two different kinds of generative adversarial networks (GANs): local and global.
Approach: Model Architecture
- Encoders: Enc_s, Enc_t
- Decoders: Dec_s, Dec_t
- Local discriminator: D_l
- Global discriminators: D_g1, D_g2
Approach: Model Architecture
Encoders: Enc_s, Enc_t
- Each encoder is composed of a stack of four identical layers.
- Each layer consists of a multi-head self-attention sub-layer and a simple position-wise fully connected feed-forward network.
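The encoder described here is a standard Transformer encoder stack. A minimal PyTorch sketch with four layers, using the word-embedding size (512) and head number (8) listed in the hyper-parameter table later; the feed-forward width of 2048 is an assumption not stated on the slide.

```python
import torch
import torch.nn as nn

# Four identical layers, each with multi-head self-attention plus a
# position-wise feed-forward sub-layer (dim_feedforward=2048 is assumed).
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                           dim_feedforward=2048, dropout=0.1)
enc_s = nn.TransformerEncoder(encoder_layer, num_layers=4)  # source encoder

src = torch.randn(20, 2, 512)  # (seq_len, batch, d_model) embedded tokens
memory = enc_s(src)            # encoder stack output, same shape as src
```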
Approach: Model Architecture
Decoders: Dec_s, Dec_t
- Each decoder is composed of four identical layers.
- In addition to the two sub-layers of each encoder layer, each decoder layer inserts a third sub-layer that performs multi-head attention over the output of the encoder stack.
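Correspondingly, a minimal sketch of one four-layer decoder whose extra sub-layer attends over the encoder output; the dimensions follow the same assumptions as the encoder sketch above.

```python
import torch
import torch.nn as nn

decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8,
                                           dim_feedforward=2048, dropout=0.1)
dec_t = nn.TransformerDecoder(decoder_layer, num_layers=4)  # target decoder

memory = torch.randn(20, 2, 512)  # output of the encoder stack
tgt = torch.randn(15, 2, 512)     # embedded (shifted) target tokens
# The third sub-layer of each decoder layer performs multi-head attention
# over `memory`, i.e. over the output of the encoder stack.
out = dec_t(tgt, memory)
```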
Approach: Model Architecture
- Local discriminator D_l: a multi-layer perceptron (MLP).
- Global discriminators D_g1, D_g2: convolutional neural networks (CNNs).
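A minimal sketch of the two discriminator families: an MLP over sentence encodings for D_l, and a one-dimensional CNN over embedded sentences for D_g1/D_g2. Hidden sizes, kernel size, and the pooling choice are assumptions; the slide only names the model types.

```python
import torch
import torch.nn as nn

class LocalDiscriminator(nn.Module):
    """D_l: a multi-layer perceptron that scores a sentence encoding."""
    def __init__(self, d_model: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, hidden), nn.LeakyReLU(),
                                 nn.Linear(hidden, hidden), nn.LeakyReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, enc):          # enc: (batch, d_model)
        return self.net(enc)         # one logit per sentence

class GlobalDiscriminator(nn.Module):
    """D_g1 / D_g2: a CNN that scores a whole (embedded) sentence."""
    def __init__(self, d_model: int = 512, channels: int = 256):
        super().__init__()
        self.conv = nn.Conv1d(d_model, channels, kernel_size=3, padding=1)
        self.out = nn.Linear(channels, 1)

    def forward(self, sent):                             # sent: (batch, seq_len, d_model)
        h = torch.relu(self.conv(sent.transpose(1, 2)))  # (batch, channels, seq_len)
        h = h.max(dim=2).values                          # max-pool over time
        return self.out(h)                               # one logit per sentence

d_l = LocalDiscriminator()
d_g1 = GlobalDiscriminator()
```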
Approach: Model Architecture (overall architecture figure)
Approach: Directional Self-Attention
- The forward mask M_f: a later token only makes attention connections to earlier tokens in the sequence.
- The backward mask M_b: an earlier token only makes attention connections to later tokens in the sequence.
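A minimal sketch of how the two directional masks can be built. The additive -inf form and keeping self-connections on the diagonal are assumptions; the slide does not give the mask values.

```python
import torch

def directional_masks(seq_len: int):
    """Additive attention masks for directional self-attention.

    M_f (forward): position i may attend only to positions j <= i.
    M_b (backward): position i may attend only to positions j >= i.
    Disallowed connections are set to -inf so the attention softmax gives
    them zero weight.
    """
    allow_f = torch.tril(torch.ones(seq_len, seq_len)) > 0  # lower triangle
    allow_b = torch.triu(torch.ones(seq_len, seq_len)) > 0  # upper triangle
    m_f = torch.zeros(seq_len, seq_len).masked_fill(~allow_f, float("-inf"))
    m_b = torch.zeros(seq_len, seq_len).masked_fill(~allow_b, float("-inf"))
    return m_f, m_b

m_f, m_b = directional_masks(5)
# m_f can be passed as the attention mask of a self-attention layer so that
# a later token only attends to earlier tokens; m_b is the mirror image.
```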
Approach: Weight Sharing
- Share the weights of the last few layers of Enc_s and Enc_t, which are responsible for extracting high-level representations of the input sentences.
- Share the weights of the first few layers of Dec_s and Dec_t, which are expected to decode the high-level representations that are vital for reconstructing the input sentences.
- The shared weights map the hidden features extracted by the independent weights into the shared latent space.
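A minimal sketch of sharing encoder layers in PyTorch: the two encoders simply reference the same layer object, so its parameters are shared. One shared top layer is shown because the experiments later report that sharing one layer works best; how many layers to share is a hyper-parameter.

```python
import torch.nn as nn

d_model, nhead, ffn = 512, 8, 2048  # ffn width is an assumption

def layer():
    return nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=ffn)

# Three language-specific layers per encoder, plus one layer whose
# parameters are shared by Enc_s and Enc_t (the "last few layers").
shared_top = layer()
enc_s = nn.Sequential(layer(), layer(), layer(), shared_top)
enc_t = nn.Sequential(layer(), layer(), layer(), shared_top)

# Both encoders reference the same object, so gradients from either
# language update the same shared parameters.
assert enc_s[-1] is enc_t[-1]
```

The first few decoder layers of Dec_s and Dec_t can be shared in exactly the same way.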
Approach: Embedding-Reinforced Encoder
- Pre-trained cross-lingual embeddings are used in the encoders and kept fixed during training.
- E = (e_1, ..., e_n): the input sequence of embedding vectors.
- H = (h_1, ..., h_n): the initial output sequence of the encoder stack.
- The final output sequence of the encoder combines E and H through a gate unit g (W_1, W_2 and b are trainable parameters shared by the two encoders).
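A minimal sketch of the gate unit. The exact mixing formula is an assumption consistent with the definitions above: g = sigmoid(E W_1 + H W_2 + b), output = g * H + (1 - g) * E.

```python
import torch
import torch.nn as nn

class EmbeddingReinforcedGate(nn.Module):
    """Combine fixed cross-lingual embeddings E with the encoder stack output H
    through a learned gate g (the exact formula is an assumption, see above).
    Instantiate this module once and reuse it in both Enc_s and Enc_t so that
    W_1, W_2 and b are shared by the two encoders.
    """
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_model, bias=False)
        self.w2 = nn.Linear(d_model, d_model, bias=True)  # carries the bias b

    def forward(self, emb: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.w1(emb) + self.w2(hidden))
        return gate * hidden + (1.0 - gate) * emb

gate_unit = EmbeddingReinforcedGate()
e = torch.randn(10, 2, 512)   # (seq_len, batch, d_model) fixed embeddings E
h = torch.randn(10, 2, 512)   # encoder stack output H
final_output = gate_unit(e, h)
```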
Approach: Local GAN
- The local discriminator is trained to classify between the encoding of source sentences and the encoding of target sentences.
- Local discriminator loss: θ_{D_l}, θ_{Enc_s} and θ_{Enc_t} denote the parameters of the local discriminator and the two encoders.
- Encoder (adversarial) loss: the encoders are trained to fool the local discriminator.
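A minimal sketch of the two local GAN objectives. The exact loss form used in the paper is not shown on the slide; standard binary cross-entropy is assumed.

```python
import torch
import torch.nn.functional as F

def local_gan_losses(d_l, enc_src_states, enc_tgt_states):
    """d_l            : MLP discriminator mapping an encoding to a logit
       enc_src_states : encodings of source sentences, shape (batch, d_model)
       enc_tgt_states : encodings of target sentences, shape (batch, d_model)
    """
    src_logits = d_l(enc_src_states)
    tgt_logits = d_l(enc_tgt_states)
    ones, zeros = torch.ones_like(src_logits), torch.zeros_like(tgt_logits)

    # Discriminator loss: label source encodings 1, target encodings 0.
    loss_d = (F.binary_cross_entropy_with_logits(src_logits, ones)
              + F.binary_cross_entropy_with_logits(tgt_logits, zeros))

    # Encoder (adversarial) loss: the encoders try to make the two
    # distributions indistinguishable, i.e. they flip the labels.
    loss_enc = (F.binary_cross_entropy_with_logits(src_logits, zeros)
                + F.binary_cross_entropy_with_logits(tgt_logits, ones))
    return loss_d, loss_enc
```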
Approach: Global GAN
- Aims to distinguish between true sentences and generated sentences.
- Updates all parameters of the proposed model, including the parameters of the encoders and decoders.
- In GAN_g1, Enc_t and Dec_s act as the generator, which generates the sentence x̃_s from x_t.
- D_g1, implemented as a CNN, assesses whether the generated sentence x̃_s is a true sentence or a generated one.
Experiments: Setup
Datasets
- English-French: WMT14, a parallel corpus of about 30M sentence pairs; the English sentences are selected from 15M random pairs and the French sentences from the complementary set, so the two monolingual corpora do not overlap.
- English-German: WMT16, two monolingual training sets of 1.8M sentences each.
- Chinese-English: 1.6M sentence pairs randomly extracted from LDC corpora.
Experiments: Setup
Tools
- Train the embeddings for each language independently using word2vec (Mikolov et al., 2013), as sketched below.
- Apply the public implementation of the method proposed by Artetxe et al. (2017a) to map these embeddings to a shared latent space.
Hyper-parameters
- Word embedding: 512
- Dropout: 0.1
- Head number: 8
- Beam size: 4
- α: 0.6
- Hardware: four K80 GPUs
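A minimal sketch of the monolingual embedding step with gensim (gensim 4.x API assumed). The corpus path and the word2vec hyper-parameters other than the 512-dimensional vector size are illustrative assumptions, not values from the slide.

```python
from gensim.models import Word2Vec

# Tokenized monolingual corpus, one sentence per line ("mono.tok.en" is a
# placeholder path, not a file from the paper).
with open("mono.tok.en", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# 512 dimensions to match the model's word-embedding size.
model = Word2Vec(sentences, vector_size=512, window=5, min_count=5,
                 sg=1, workers=4)
model.wv.save_word2vec_format("embeddings.en.vec")
# The monolingual embeddings are then mapped into a shared space with the
# public implementation of Artetxe et al. (2017a); that step is not shown here.
```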
Experiments: Setup
Model selection
- Training stops when the model shows no improvement for ten consecutive evaluations on the development set (see the sketch after this list).
- Development set: 3,000 source and target sentences extracted randomly from the monolingual training corpora.
Evaluation metrics
- BLEU (Papineni et al., 2002)
- Chinese-English: the script mteval-v11b.pl
- English-German, English-French: the script multi-bleu.pl
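A minimal sketch of the patience-based stopping rule described above. The two callbacks are placeholders, not functions from the paper's code.

```python
def train_with_patience(evaluate_bleu, train_one_interval, patience: int = 10):
    """Keep training until dev BLEU fails to improve for `patience`
    consecutive evaluations. `train_one_interval` runs training up to the
    next evaluation point; `evaluate_bleu` scores the current model on the
    development set.
    """
    best_bleu, bad_evals = float("-inf"), 0
    while bad_evals < patience:
        train_one_interval()
        bleu = evaluate_bleu()
        if bleu > best_bleu:
            best_bleu, bad_evals = bleu, 0
        else:
            bad_evals += 1
    return best_bleu
```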
Experiments: Setup
Baseline systems
- Word-by-word translation (WBW): translates a sentence word by word, replacing each word with its nearest neighbor in the other language's embedding space (see the sketch below).
- Lample et al. (2017): uses the same training and testing sets as this paper; encoder: Bi-LSTM; decoder: a forward LSTM.
- Supervised training: the same model as ours, but trained with the standard cross-entropy loss on the original parallel sentences.
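A minimal sketch of the WBW baseline: each source word is replaced by its nearest target-language neighbor in a shared cross-lingual embedding space. All names are illustrative.

```python
import numpy as np

def word_by_word_translate(sentence, src_vecs, tgt_vecs, tgt_words):
    """src_vecs : dict mapping source words to embedding vectors
       tgt_vecs : (V, d) matrix of target-word vectors, aligned with tgt_words
    """
    tgt_norm = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    out = []
    for word in sentence.split():
        if word not in src_vecs:
            out.append(word)  # copy unknown words unchanged
            continue
        v = src_vecs[word]
        sims = tgt_norm @ (v / np.linalg.norm(v))  # cosine similarity
        out.append(tgt_words[int(np.argmax(sims))])
    return " ".join(out)
```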
Experiments: Results
Number of weight-sharing layers
- The number of weight-sharing layers has a large effect on translation performance.
- The best translation performance is achieved when only one layer is shared in our system.
- When all four layers are shared, we get poor translation performance on all three translation tasks.
- Using two completely independent encoders (no shared layers) also results in poor translation performance.
Experiments: Results
Translation results
- The proposed approach obtains significant improvements over the word-by-word baseline system, with at least +5.01 BLEU points on English-to-German translation and up to +13.37 BLEU points on English-to-French translation.
- Compared to the work of Lample et al. (2017), our model achieves up to +1.92 BLEU points of improvement on the English-to-French translation task.
- However, there is still large room for improvement compared to the supervised upper bound.
Experiments: Results
Ablation study
- The best performance is obtained with the simultaneous use of all the tested components.
- The weight-sharing constraint is vital for mapping sentences of different languages into the shared latent space.
- The embedding-reinforced encoder, directional self-attention, local GAN and global GANs are all important components of the proposed system.
Conclusion
- We propose a weight-sharing constraint for unsupervised NMT.
- We also introduce embedding-reinforced encoders, a local GAN and global GANs into the proposed system.
- The experimental results show that our approach achieves significant improvements.
- However, there is still large room for improvement compared to supervised NMT.
Q&A
Thank you!