SwitchOut: An Efficient Data Augmentation for Neural Machine Translation

Xinyi Wang*, Hieu Pham*, Zihang Dai, Graham Neubig
November 2, 2018
*: equal contribution
Data Augmentation

Neural models are data hungry, while collecting data is expensive.
Prevalent in computer vision [image source: Medium].
More difficult for natural language:
◮ Discrete vocabulary
◮ NMT is sensitive to arbitrary noise
Existing Strategies

Word replacement:
◮ Dictionary [Fadaee et al., 2017]
◮ Word dropout [Sennrich et al., 2016a]
◮ Reward Augmented Maximum Likelihood (RAML) [Norouzi et al., 2016]

→ Can we characterize all of these approaches in a single framework?
Existing Strategies: RAML

RAML [Norouzi et al., 2016]
Motivation: at test time, NMT must extend its own imperfect partial translations, but it is trained only on gold-standard targets.
Solution: sample corrupted targets during training.
Gold target $y$, corrupted target $\tilde{y}$, similarity measure $r_y$:
\[
  q^*(\tilde{y} \mid y; \tau)
    = \frac{\exp\{ r_y(\tilde{y}, y) / \tau \}}
           {\sum_{\tilde{y}'} \exp\{ r_y(\tilde{y}', y) / \tau \}}
\]
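As a concrete instance (an illustration; the slide itself does not spell this out): with the negative Hamming distance as the reward $r_y$, every corrupted target at distance $n$ from the gold $y$ receives unnormalized weight
\[
  q^*(\tilde{y} \mid y; \tau)
    \propto \exp\bigl\{ -d_{\mathrm{Ham}}(\tilde{y}, y) / \tau \bigr\}
    = e^{-n/\tau},
\]
so $\tau \to 0$ recovers ordinary maximum likelihood on the gold target, while larger $\tau$ spreads probability mass over increasingly corrupted targets.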
Formalizing Data Augmentation

Real data distribution: $x, y \sim p(X, Y)$
Observed data distribution: $x, y \sim \hat{p}(X, Y)$
→ Problem: $p(X, Y)$ and $\hat{p}(X, Y)$ might have a large discrepancy.
Data augmentation: $\tilde{x}, \tilde{y} \sim q(\tilde{X}, \tilde{Y})$
Design a Good $q(\tilde{X}, \tilde{Y})$

$q$: a function of the observed pair $(x, y)$.
How should $q$ approximate $p$?
◮ Diversity: a larger support covering all valid data pairs
  ⋆ the entropy $H\bigl[ q(\tilde{x}, \tilde{y} \mid x, y) \bigr]$ is large
◮ Smoothness: similar data pairs receive similar probabilities
  ⋆ $q$ maximizes the similarity measures $r_x(x, \tilde{x})$ and $r_y(y, \tilde{y})$
$\tau$ controls the effect of diversity; $q$ should maximize
\[
  J(q) = \tau \cdot H\bigl[ q(\tilde{x}, \tilde{y} \mid x, y) \bigr]
       + \mathbb{E}_{\tilde{x}, \tilde{y} \sim q}\bigl[ r_x(x, \tilde{x}) + r_y(y, \tilde{y}) \bigr]
\]
The Mathematically Optimal $q$

\[
  J(q) = \tau \cdot H\bigl[ q(\tilde{x}, \tilde{y} \mid x, y) \bigr]
       + \mathbb{E}_{\tilde{x}, \tilde{y} \sim q}\bigl[ r_x(x, \tilde{x}) + r_y(y, \tilde{y}) \bigr]
\]

Solving for the best $q$:
\[
  q^*(\tilde{x}, \tilde{y} \mid x, y)
    = \frac{\exp\{ s(\tilde{x}, \tilde{y}; x, y) / \tau \}}
           {\sum_{\tilde{x}', \tilde{y}'} \exp\{ s(\tilde{x}', \tilde{y}'; x, y) / \tau \}}
\]

Decomposing over $x$ and $y$:
\[
  q^*(\tilde{x}, \tilde{y} \mid x, y)
    = \frac{\exp\{ r_x(\tilde{x}, x) / \tau_x \}}
           {\sum_{\tilde{x}'} \exp\{ r_x(\tilde{x}', x) / \tau_x \}}
    \times
      \frac{\exp\{ r_y(\tilde{y}, y) / \tau_y \}}
           {\sum_{\tilde{y}'} \exp\{ r_y(\tilde{y}', y) / \tau_y \}}
\]

This formulation covers the existing methods:
◮ Dictionary: acts jointly on x and y, but deterministic and not diverse
◮ Word dropout: only the x side, with a null token
◮ RAML: only the y side
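The step from $J(q)$ to $q^*$ is the standard maximum-entropy argument; a sketch (our addition, not on the slides), writing $s(\tilde{x}, \tilde{y}; x, y) := r_x(x, \tilde{x}) + r_y(y, \tilde{y})$ and adding a Lagrange multiplier for normalization:
\[
  \mathcal{L}(q, \lambda)
    = \tau H[q] + \mathbb{E}_{\tilde{x}, \tilde{y} \sim q}\bigl[ s(\tilde{x}, \tilde{y}; x, y) \bigr]
    + \lambda \Bigl( \textstyle\sum_{\tilde{x}, \tilde{y}} q(\tilde{x}, \tilde{y} \mid x, y) - 1 \Bigr)
\]
Setting $\partial \mathcal{L} / \partial q(\tilde{x}, \tilde{y} \mid x, y) = -\tau \bigl( \log q + 1 \bigr) + s + \lambda = 0$ gives $q^* \propto \exp\{ s / \tau \}$. Because $s$ is the sum of an $x$-only term and a $y$-only term, the softmax factorizes into the two independent distributions above (shown there with separate temperatures $\tau_x$, $\tau_y$; the derivation uses a single $\tau$).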
Formulating SwitchOut

Augment both $x$ and $y$!
Sample $\tilde{x}$ and $\tilde{y}$ independently.
Define $r_x(\tilde{x}, x)$ and $r_y(\tilde{y}, y)$:
◮ Negative Hamming distance, following RAML (written out below)
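For concreteness (our addition; the slide only names the measure), the negative Hamming distance simply counts the positions at which two equal-length sequences disagree:
\[
  r_x(\tilde{x}, x) = -\sum_{i=1}^{|x|} \mathbb{1}\bigl[ \tilde{x}_i \neq x_i \bigr],
  \qquad
  r_y(\tilde{y}, y) = -\sum_{j=1}^{|y|} \mathbb{1}\bigl[ \tilde{y}_j \neq y_j \bigr]
\]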
SwitchOut: Sampling Efficiently

Given a sentence $s = \{ s_1, s_2, \ldots, s_{|s|} \}$:
1. How many words to corrupt? Under the simplifying assumption that each word can be swapped to only one other token,
\[
  P(n) \propto \exp(-n / \tau)
\]
2. What is the corrupted sentence? Swap each position independently:
\[
  P(\text{randomly swap } s_i \text{ for another word}) = \frac{n}{|s|}
\]

See Appendix: efficient batch implementations in PyTorch and TensorFlow (a simplified sketch follows below).
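The deck defers to the paper's appendix for the batched PyTorch/TensorFlow code; below is a minimal, unbatched Python sketch of the two sampling steps. It is an illustration under stated assumptions, not the authors' implementation: the function name, the `skip_ids` parameter for special tokens, and the integer-token representation are all ours.

```python
import math
import random

def sample_switchout(sentence, vocab_size, tau=1.0, skip_ids=(0, 1)):
    """Corrupt `sentence` (a list of token ids) following the two steps above.

    Illustrative sketch only -- not the authors' batched implementation.
    Assumes vocab_size >= 2 so a *different* replacement token always exists.
    """
    length = len(sentence)

    # Step 1: sample the number of swaps n with P(n) proportional to exp(-n / tau).
    weights = [math.exp(-n / tau) for n in range(length + 1)]
    n = random.choices(range(length + 1), weights=weights, k=1)[0]

    # Step 2: swap each position independently with probability n / |s|,
    # replacing it with a uniformly sampled different token.
    corrupted = list(sentence)
    for i, token in enumerate(sentence):
        if token in skip_ids:          # leave special tokens (e.g. BOS/EOS) intact
            continue
        if random.random() < n / length:
            new_token = random.randrange(vocab_size)
            while new_token == token:  # resample until it actually differs
                new_token = random.randrange(vocab_size)
            corrupted[i] = new_token
    return corrupted

# Example: corrupt a toy 6-token sentence over a 100-word vocabulary.
print(sample_switchout([0, 17, 42, 7, 23, 1], vocab_size=100, tau=1.0))
```

In training, this would be applied once per sentence pair, sampling $\tilde{x}$ and $\tilde{y}$ independently as the previous slide describes.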
Experiments

Datasets:
◮ en-vi: IWSLT 2015
◮ de-en: IWSLT 2016
◮ en-de: WMT 2015
Models:
◮ Transformer
◮ Word-based vocabulary, standard preprocessing
Results: RAML and Word Dropout

SwitchOut on the source side > word dropout.
SwitchOut on both source and target > RAML.

BLEU scores:

src          trg    en-de    de-en    en-vi
N/A          N/A    21.73    29.81    27.97
WordDropout  N/A    20.63    29.97    28.56
SwitchOut    N/A    22.78†   29.94    28.67†
N/A          RAML   22.83    30.66    28.88
WordDropout  RAML   20.69    30.79    28.86
SwitchOut    RAML   23.13†   30.98†   29.09
Where Does SwitchOut Help?

More gain for sentences that are more different from the training data.

[Figure: gain in BLEU (y-axis) against the top-K sentences most different from the training data (x-axis). Left: IWSLT 16 de-en. Right: IWSLT 15 en-vi.]
Final Thoughts

SwitchOut sampling is efficient and easy to use.
It works with any NMT architecture.
Our formulation of data augmentation encompasses existing work and suggests future directions.

Thanks a lot for listening! Questions?
References

Norouzi et al. (2016). Reward Augmented Maximum Likelihood for Neural Structured Prediction. In NIPS.
Sennrich et al. (2016a). Edinburgh Neural Machine Translation Systems for WMT 16. In WMT.
Sennrich et al. (2016b). Improving Neural Machine Translation Models with Monolingual Data. In ACL.
Currey et al. (2017). Copied Monolingual Data Improves Low-Resource Neural Machine Translation. In WMT.
Fadaee et al. (2017). Data Augmentation for Low-Resource Neural Machine Translation. In ACL.