SwitchOut: An Efficient Data Augmentation for Neural Machine Translation

Xinyi Wang*, Hieu Pham*, Zihang Dai, Graham Neubig
November 2, 2018
*: equal contribution
Data Augmentation

Neural models are data hungry, while collecting data is expensive.
Prevalent in computer vision [image source: Medium].
More difficult for natural language:
◮ Discrete vocabulary
◮ NMT is sensitive to arbitrary noise
Existing Strategies

Word replacement:
◮ Dictionary [Fadaee et al., 2017]
◮ Word dropout [Sennrich et al., 2016a]
◮ Reward Augmented Maximum Likelihood (RAML) [Norouzi et al., 2016]

→ Can we characterize all of these approaches in a single framework?
Existing Strategies: RAML

RAML [Norouzi et al., 2016]
Motivation: at test time, NMT must extend its own imperfect partial translations, but it is trained only on gold-standard targets.
Solution: sample corrupted targets during training.
Gold target $y$, corrupted target $\tilde{y}$, similarity measure $r_y$:
\[
  q^*(\tilde{y} \mid y; \tau)
    = \frac{\exp\{ r_y(\tilde{y}, y) / \tau \}}
           {\sum_{\tilde{y}'} \exp\{ r_y(\tilde{y}', y) / \tau \}}
\]
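As a concrete instance (an illustration; the slide itself does not spell this out): with the negative Hamming distance as the reward $r_y$, every corrupted target at distance $n$ from the gold $y$ receives unnormalized weight
\[
  q^*(\tilde{y} \mid y; \tau)
    \propto \exp\bigl\{ -d_{\mathrm{Ham}}(\tilde{y}, y) / \tau \bigr\}
    = e^{-n/\tau},
\]
so $\tau \to 0$ recovers ordinary maximum likelihood on the gold target, while larger $\tau$ spreads probability mass over increasingly corrupted targets.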
Formalizing Data Augmentation

Real data distribution: $x, y \sim p(X, Y)$
Observed data distribution: $x, y \sim \hat{p}(X, Y)$
→ Problem: $p(X, Y)$ and $\hat{p}(X, Y)$ might have a large discrepancy.
Data augmentation: $\tilde{x}, \tilde{y} \sim q(\tilde{X}, \tilde{Y})$
Design a Good $q(\tilde{X}, \tilde{Y})$

$q$: a function of the observed pair $(x, y)$.
How should $q$ approximate $p$?
◮ Diversity: a larger support covering all valid data pairs
  ⋆ the entropy $H\bigl[ q(\tilde{x}, \tilde{y} \mid x, y) \bigr]$ is large
◮ Smoothness: similar data pairs receive similar probabilities
  ⋆ $q$ maximizes the similarity measures $r_x(x, \tilde{x})$ and $r_y(y, \tilde{y})$
$\tau$ controls the effect of diversity; $q$ should maximize
\[
  J(q) = \tau \cdot H\bigl[ q(\tilde{x}, \tilde{y} \mid x, y) \bigr]
       + \mathbb{E}_{\tilde{x}, \tilde{y} \sim q}\bigl[ r_x(x, \tilde{x}) + r_y(y, \tilde{y}) \bigr]
\]
The Mathematically Optimal $q$

\[
  J(q) = \tau \cdot H\bigl[ q(\tilde{x}, \tilde{y} \mid x, y) \bigr]
       + \mathbb{E}_{\tilde{x}, \tilde{y} \sim q}\bigl[ r_x(x, \tilde{x}) + r_y(y, \tilde{y}) \bigr]
\]

Solving for the best $q$:
\[
  q^*(\tilde{x}, \tilde{y} \mid x, y)
    = \frac{\exp\{ s(\tilde{x}, \tilde{y}; x, y) / \tau \}}
           {\sum_{\tilde{x}', \tilde{y}'} \exp\{ s(\tilde{x}', \tilde{y}'; x, y) / \tau \}}
\]

Decomposing over $x$ and $y$:
\[
  q^*(\tilde{x}, \tilde{y} \mid x, y)
    = \frac{\exp\{ r_x(\tilde{x}, x) / \tau_x \}}
           {\sum_{\tilde{x}'} \exp\{ r_x(\tilde{x}', x) / \tau_x \}}
    \times
      \frac{\exp\{ r_y(\tilde{y}, y) / \tau_y \}}
           {\sum_{\tilde{y}'} \exp\{ r_y(\tilde{y}', y) / \tau_y \}}
\]

This formulation covers the existing methods:
◮ Dictionary: acts jointly on x and y, but deterministic and not diverse
◮ Word dropout: only the x side, with a null token
◮ RAML: only the y side
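The step from $J(q)$ to $q^*$ is the standard maximum-entropy argument; a sketch (our addition, not on the slides), writing $s(\tilde{x}, \tilde{y}; x, y) := r_x(x, \tilde{x}) + r_y(y, \tilde{y})$ and adding a Lagrange multiplier for normalization:
\[
  \mathcal{L}(q, \lambda)
    = \tau H[q] + \mathbb{E}_{\tilde{x}, \tilde{y} \sim q}\bigl[ s(\tilde{x}, \tilde{y}; x, y) \bigr]
    + \lambda \Bigl( \textstyle\sum_{\tilde{x}, \tilde{y}} q(\tilde{x}, \tilde{y} \mid x, y) - 1 \Bigr)
\]
Setting $\partial \mathcal{L} / \partial q(\tilde{x}, \tilde{y} \mid x, y) = -\tau \bigl( \log q + 1 \bigr) + s + \lambda = 0$ gives $q^* \propto \exp\{ s / \tau \}$. Because $s$ is the sum of an $x$-only term and a $y$-only term, the softmax factorizes into the two independent distributions above (shown there with separate temperatures $\tau_x$, $\tau_y$; the derivation uses a single $\tau$).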
Formulating SwitchOut

Augment both $x$ and $y$!
Sample $\tilde{x}$ and $\tilde{y}$ independently.
Define $r_x(\tilde{x}, x)$ and $r_y(\tilde{y}, y)$:
◮ Negative Hamming distance, following RAML (written out below)
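For concreteness (our addition; the slide only names the measure), the negative Hamming distance simply counts the positions at which two equal-length sequences disagree:
\[
  r_x(\tilde{x}, x) = -\sum_{i=1}^{|x|} \mathbb{1}\bigl[ \tilde{x}_i \neq x_i \bigr],
  \qquad
  r_y(\tilde{y}, y) = -\sum_{j=1}^{|y|} \mathbb{1}\bigl[ \tilde{y}_j \neq y_j \bigr]
\]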
SwitchOut: Sampling Efficiently

Given a sentence $s = \{ s_1, s_2, \ldots, s_{|s|} \}$:
1. How many words to corrupt? Under the simplifying assumption that each word can be swapped to only one other token,
\[
  P(n) \propto \exp(-n / \tau)
\]
2. What is the corrupted sentence? Swap each position independently:
\[
  P(\text{randomly swap } s_i \text{ for another word}) = \frac{n}{|s|}
\]

See Appendix: efficient batch implementations in PyTorch and TensorFlow (a simplified sketch follows below).
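The deck defers to the paper's appendix for the batched PyTorch/TensorFlow code; below is a minimal, unbatched Python sketch of the two sampling steps. It is an illustration under stated assumptions, not the authors' implementation: the function name, the `skip_ids` parameter for special tokens, and the integer-token representation are all ours.

```python
import math
import random

def sample_switchout(sentence, vocab_size, tau=1.0, skip_ids=(0, 1)):
    """Corrupt `sentence` (a list of token ids) following the two steps above.

    Illustrative sketch only -- not the authors' batched implementation.
    Assumes vocab_size >= 2 so a *different* replacement token always exists.
    """
    length = len(sentence)

    # Step 1: sample the number of swaps n with P(n) proportional to exp(-n / tau).
    weights = [math.exp(-n / tau) for n in range(length + 1)]
    n = random.choices(range(length + 1), weights=weights, k=1)[0]

    # Step 2: swap each position independently with probability n / |s|,
    # replacing it with a uniformly sampled different token.
    corrupted = list(sentence)
    for i, token in enumerate(sentence):
        if token in skip_ids:          # leave special tokens (e.g. BOS/EOS) intact
            continue
        if random.random() < n / length:
            new_token = random.randrange(vocab_size)
            while new_token == token:  # resample until it actually differs
                new_token = random.randrange(vocab_size)
            corrupted[i] = new_token
    return corrupted

# Example: corrupt a toy 6-token sentence over a 100-word vocabulary.
print(sample_switchout([0, 17, 42, 7, 23, 1], vocab_size=100, tau=1.0))
```

In training, this would be applied once per sentence pair, sampling $\tilde{x}$ and $\tilde{y}$ independently as the previous slide describes.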
Experiments

Datasets:
◮ en-vi: IWSLT 2015
◮ de-en: IWSLT 2016
◮ en-de: WMT 2015
Models:
◮ Transformer
◮ Word-based vocabulary, standard preprocessing
Results: RAML and Word Dropout

SwitchOut on the source side > word dropout.
SwitchOut on both source and target > RAML.

BLEU scores:

src          trg    en-de    de-en    en-vi
N/A          N/A    21.73    29.81    27.97
WordDropout  N/A    20.63    29.97    28.56
SwitchOut    N/A    22.78†   29.94    28.67†
N/A          RAML   22.83    30.66    28.88
WordDropout  RAML   20.69    30.79    28.86
SwitchOut    RAML   23.13†   30.98†   29.09
Where Does SwitchOut Help?

More gain for sentences that are more different from the training data.

[Figure: gain in BLEU (y-axis) against the top-K sentences most different from the training data (x-axis). Left: IWSLT 16 de-en. Right: IWSLT 15 en-vi.]
Final Thoughts

SwitchOut sampling is efficient and easy to use.
It works with any NMT architecture.
Our formulation of data augmentation encompasses existing work and suggests future directions.

Thanks a lot for listening! Questions?
References

Norouzi et al. (2016). Reward Augmented Maximum Likelihood for Neural Structured Prediction. In NIPS.
Sennrich et al. (2016a). Edinburgh Neural Machine Translation Systems for WMT 16. In WMT.
Sennrich et al. (2016b). Improving Neural Machine Translation Models with Monolingual Data. In ACL.
Currey et al. (2017). Copied Monolingual Data Improves Low-Resource Neural Machine Translation. In WMT.
Fadaee et al. (2017). Data Augmentation for Low-Resource Neural Machine Translation. In ACL.