Reinforcing Adversarial Robustness using Model Confidence Induced by Adversarial Training


  1. Reinforcing Adversarial Robustness using Model Confidence Induced by Adversarial Training
     Xi Wu (xiwu@cs.wisc.edu)
     Joint work with Uyeong Jang, Jiefeng Chen, Lingjiao Chen, and Somesh Jha
     July 19, 2018

  2. Entirely wrong behavior of confidence
     • Small perturbations can cause highly confident but wrong predictions.
     • An example from (Goodfellow, Shlens, and Szegedy, ICLR 2015), on a naturally trained neural network.
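
The perturbation in that example comes from the fast gradient sign method (FGSM) of the cited paper. A minimal PyTorch sketch of FGSM, where model, x, y, and eps are placeholder names assumed for illustration:

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y, eps):
        # One-step attack of Goodfellow, Shlens, and Szegedy (ICLR 2015):
        # perturb each input coordinate by eps in the direction of the
        # sign of the loss gradient.
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        return (x + eps * x.grad.sign()).detach()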

  3. A better behavior (given natural data manifolds)
     • Low confidence if the model “does not learn/know it.”
     • An intuitively good model for classifying pandas and gibbons (the disks depict the data manifolds).

  4. Main contributions of this work
     • In a precise formal sense, adversarial training by (Madry et al., ICLR 2017) gives better behavior of model confidence for points near the data distribution.
     • This better behavior of model confidence induced by adversarial training can be used to improve adversarial robustness.

  5. Defining good behaviors of confidence (1/2)
     • Intuition: confident predictions of different classes should be well separated.
     • A bad (x, y) ∼ D with poor confidence separation: [figure omitted]

  6. Defining good behaviors of confidence (2/2)
     • D: the data-generating distribution; d(·, ·): a distance metric; p, q ∈ [0, 1], δ ≥ 0.
     • Bad event (the neighborhood contains a p-confident wrong prediction):
       B = {∃ y′ ≠ y and x′ ∈ N(x, δ) such that Fθ(x′)y′ ≥ p},
       where Fθ(x′)y′ is the confidence the model assigns to class y′ at x′.
     • F is said to have (p, q, δ)-separation if Pr(x,y)∼D[B] ≤ q.
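
One rough way to probe this definition empirically is to search N(x, δ) for a p-confident wrong prediction and estimate q as the fraction of samples (x, y) ∼ D on which the search succeeds. A hedged PyTorch sketch, assuming N(x, δ) is an L∞ ball and using simple gradient ascent on a wrong class's confidence (an illustration, not the paper's procedure):

    import torch
    import torch.nn.functional as F

    def bad_event_found(model, x, y, p, delta, steps=40, step_size=0.01):
        # Search N(x, delta) for x' and wrong class y' with confidence
        # F_theta(x')_{y'} >= p. Returning False only means the heuristic
        # search failed, not that no such point exists.
        num_classes = model(x.unsqueeze(0)).shape[1]
        for y_prime in range(num_classes):
            if y_prime == y:
                continue
            x_adv = x.clone().detach().requires_grad_(True)
            for _ in range(steps):
                conf = F.softmax(model(x_adv.unsqueeze(0)), dim=1)[0, y_prime]
                if conf.item() >= p:
                    return True
                conf.backward()
                with torch.no_grad():
                    x_adv += step_size * x_adv.grad.sign()
                    x_adv.clamp_(min=x - delta, max=x + delta)  # stay inside N(x, delta)
                x_adv.grad = None
        return False

Averaging bad_event_found over draws from the test set then gives a heuristic, one-sided estimate of q for the chosen p and δ.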

  7. Adversarial training by Madry et al.
     • Adversarial training formulation of Madry et al.:
       minimize ρ(θ), where ρ(θ) = E(x,y)∼D[ max∆∈S L(θ, x + ∆, y) ].
     • Theorem (informal, this work): for a large family of loss functions L, models trained as above achieve good (p, q, δ)-separation, where q → 0 as p → 1.
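
For concreteness, here is a minimal PyTorch sketch of one step of this min-max training, approximating the inner maximum with projected gradient descent (PGD) over an L∞ ball S of radius eps; the step sizes and iteration counts are illustrative assumptions, not values from the talk:

    import torch
    import torch.nn.functional as F

    def inner_max(model, x, y, eps, step_size, steps):
        # Approximate max over Delta in S of L(theta, x + Delta, y),
        # with S = {Delta : ||Delta||_inf <= eps}.
        delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
        for _ in range(steps):
            loss = F.cross_entropy(model(x + delta), y)
            loss.backward()
            with torch.no_grad():
                delta += step_size * delta.grad.sign()
                delta.clamp_(-eps, eps)  # project back onto S
            delta.grad = None
        return delta.detach()

    def adversarial_training_step(model, optimizer, x, y, eps):
        # Outer minimization of rho(theta) on one batch.
        delta = inner_max(model, x, y, eps, step_size=eps / 4, steps=10)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        optimizer.step()
        return loss.item()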

  8. Empirical results (summary)
     • We generate high-confidence attacks in order to bypass confidence-based defenses (as well as the gradient-masking effect).
     • Finding 1: Confidence of models trained using Madry et al.’s objective behaves much better than that of their naturally trained counterparts.
     • Finding 2: A simple “nearest neighbor search” based on confidence corrects 20%∼25% of targeted adversarial examples that fool the baseline model of Madry et al.
     • Finding 3: For >98% of test instances, the correct label can be found in the two neighbors with highest confidences.
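
To make the idea behind Findings 2 and 3 concrete: around a (possibly adversarial) input, look for nearby points where the model is most confident and read predictions off those neighbors. In the sketch below, a random-sampling search over an L∞ ball is a stand-in assumption for the paper's actual neighbor search:

    import torch
    import torch.nn.functional as F

    def top_two_neighbor_labels(model, x, radius, trials=200):
        # Sample points in an L-inf ball of the given radius around x and
        # return the two distinct labels that received the highest
        # confidence anywhere in the sample (x itself included).
        with torch.no_grad():
            candidates = [x] + [x + torch.empty_like(x).uniform_(-radius, radius)
                                for _ in range(trials)]
            scored = []
            for x_n in candidates:
                probs = F.softmax(model(x_n.unsqueeze(0)), dim=1)[0]
                scored.append((probs.max().item(), probs.argmax().item()))
        scored.sort(reverse=True)
        labels = []
        for _, label in scored:
            if label not in labels:
                labels.append(label)
            if len(labels) == 2:
                break
        return labels  # labels[0] would serve as the corrected prediction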

  9. Questions? Please come to our poster session if you want to know more details!
