On Evaluation of Adversarial Perturbations for Sequence-to-Sequence Models Paul Michel, Xian Li, Graham Neubig, Juan Pino
Adversarial Attacks/Perturbations ● Apply a small (indistinguishable) perturbation to the input that elicits large changes in the output [Figure from Goodfellow et al. (2014)]
Indistinguishable Perturbations ● Small perturbations are well defined in vision ○ Small l2 distance ~= indistinguishable to the human eye ● What about text?
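For the vision case, "small" can be made concrete with a few lines of code. The sketch below computes the l2 distance between an image and a perturbed copy. The toy pixel values and the uniform 0.01 shift are assumptions made for illustration only, not data from the talk.

```python
import math


def l2_distance(x, x_adv):
    """Euclidean (l2) distance between two equally sized, flattened images."""
    if len(x) != len(x_adv):
        raise ValueError("inputs must have the same number of pixels")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_adv)))


# Toy 2x2 "image" flattened to a list, pixel values in [0, 1] (hypothetical).
image = [0.20, 0.40, 0.60, 0.80]
# Hypothetical perturbation: shift every pixel by 0.01.
perturbed = [p + 0.01 for p in image]

print(l2_distance(image, perturbed))
```

No comparably simple distance exists for text, which is the gap the rest of the talk addresses.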
Not all Text Perturbations are Equal
Original: He’s very friendly
● He’s very annoying [Different meaning] ❌
● He’s pretty friendly [Similar meaning] ✔
● He’s She friendly [Nonsensical] ❌
● He’s very freindly [Typo] ✔
⇒ Can’t expect the model to produce the same output!
This paper: Why and how you should evaluate adversarial perturbations
A Framework for Evaluating Adversarial Attacks
Problem Definition
● Reference: They plow it right back into filing more troll lawsuits.
● Original: Ils le réinvestissent directement en engageant plus de procès.
● Base output: They direct it directly by engaging more cases.
● Attack: Original → Adv. src
● Adv. src: Ilss le réinvestissent dierctement en engagaent plus de procès.
● Adv. output: .. de plus.
● Evaluate: base output and adv. output against the reference
● Evaluate too!: adv. src against the original
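Source Side Evaluation
● Evaluate meaning preservation on the source side: $s_{src}(x, \hat{x})$, with $x$ the original source and $\hat{x}$ the adversarial source
● Where $s_{src}$ is a similarity metric such that
○ $s_{src}$(He’s very friendly, He’s pretty friendly) > $s_{src}$(He’s very friendly, He’s very annoying)
○ $s_{src}$(He’s very friendly, He’s pretty friendly) > $s_{src}$(He’s very friendly, He’s She friendly)
○ [...]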
Target Side Evaluation
● Given $s_{tgt}$, a similarity metric on the target side
● Evaluate relative meaning destruction on the target side:
$$d_{tgt}(y, y_M, \hat{y}_M) = \begin{cases} 0 & \text{if } s_{tgt}(y, \hat{y}_M) \geq s_{tgt}(y, y_M) \\ \dfrac{s_{tgt}(y, y_M) - s_{tgt}(y, \hat{y}_M)}{s_{tgt}(y, y_M)} & \text{otherwise} \end{cases}$$
○ $y$: reference, $y_M$: base output, $\hat{y}_M$: adversarial output
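The relative meaning destruction on the target side can be sketched in a few lines. This is a minimal illustration, assuming the similarity scores are already computed and non-negative. The function name and the argument names `s_base` and `s_adv` are hypothetical and chosen here for readability.

```python
def target_relative_decrease(s_base, s_adv):
    """Relative drop in target-side similarity, clipped at 0.

    s_base: similarity between the reference and the base output.
    s_adv:  similarity between the reference and the adversarial output.
    Both scores are assumed to be non-negative.
    """
    if s_base < 0 or s_adv < 0:
        raise ValueError("similarity scores are assumed to be non-negative")
    # If the attack did not lower the score, no meaning was destroyed.
    # This branch also covers s_base == 0, so the division below is safe.
    if s_adv >= s_base:
        return 0.0
    return (s_base - s_adv) / s_base


# Hypothetical scores: the attack drops similarity from 0.5 to 0.1.
print(target_relative_decrease(0.5, 0.1))
```

Dividing by the base score makes the measure relative: an attack is not credited for output that was already poor before the perturbation.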
Successful Adversarial Attacks
● Ensure that: $1 - s_{src}(x, \hat{x}) < d_{tgt}(y, y_M, \hat{y}_M)$
○ Left-hand side: source meaning destruction
○ Right-hand side: target meaning destruction
● Destroy the meaning on the target side more than on the source side
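One way to read the criterion above is as a per-example check that can be averaged into a success rate. The sketch below assumes source similarity `s_src` and target meaning destruction `d_tgt` are both given as numbers in [0, 1]. The function name and the example scores are hypothetical, not results from the paper.

```python
def is_successful(s_src, d_tgt):
    """True if target meaning destruction exceeds source meaning destruction.

    s_src: similarity between original and adversarial source, in [0, 1].
    d_tgt: relative meaning destruction on the target side, in [0, 1].
    """
    source_destruction = 1.0 - s_src
    return source_destruction < d_tgt


# Hypothetical (s_src, d_tgt) pairs for three attacked sentences.
examples = [
    (0.9, 0.8),  # source barely changed, output badly damaged: success
    (0.3, 0.5),  # source meaning changed more than the output: failure
    (1.0, 0.0),  # nothing changed on either side: failure
]

success_rate = sum(is_successful(s, d) for s, d in examples) / len(examples)
print(success_rate)
```

The second example shows why the source side matters: an attack that rewrites the input's meaning does not count as a success just because the output changed.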
Which similarity metric to use?
● Human evaluation
○ 6 point scale, details in paper
○ “How would you rate the similarity between the meaning of these two sentences?”
0. The meaning is completely different or one of the sentences is meaningless
1. The topic is the same but the meaning is different
2. Some key information is different
3. The key information is the same but the details differ
4. Meaning is essentially the same but some expressions are unnatural
5. Meaning is essentially equal and the two sentences are well-formed [Language]
● BLEU [Papineni et al., 2002]
○ Geometric mean of n-gram precision + length penalty
● METEOR [Banerjee and Lavie, 2005]
○ Word matching taking into account stemming, synonyms, paraphrases...
● chrF [Popović, 2015]
○ Character n-gram F-score
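To make "character n-gram F-score" concrete, here is a simplified sketch in the spirit of chrF. It is not the official chrF implementation: whitespace handling is naive, and the defaults `max_n=6` and `beta=2.0` are assumptions chosen for illustration. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Multiset of character n-grams in text (whitespace kept as-is)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_ngram_fscore(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-score, averaged over n = 1..max_n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one string is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


original = "He’s very friendly"
typo_score = char_ngram_fscore("He’s very freindly", original)
annoying_score = char_ngram_fscore("He’s very annoying", original)
print(typo_score > annoying_score)
```

On the talk's running example, the typo variant scores higher than the meaning-changing variant, which is the ordering a source-side similarity metric should produce.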
Experimental Setting
Data and Models ● Data ○ IWSLT 2016 dataset ○ {Czech, German, French} → English ● Models ○ LSTM based model ○ Transformer based model ○ Both word and sub-word based models