Gradient Masking in Machine Learning
Nicolas Papernot, Pennsylvania State University
ARO Workshop on Adversarial Machine Learning, Stanford, September 2017
Thank you to our collaborators: Sandy Huang (Berkeley), Pieter Abbeel (Berkeley), Somesh Jha (U of Wisconsin), Michael Backes (CISPA), Alexey Kurakin (Google), Dan Boneh (Stanford), Praveen Manoharan (CISPA), Z. Berkay Celik (Penn State), Patrick McDaniel (Penn State), Yan Duan (OpenAI), Arunesh Sinha (U of Michigan), Ian Goodfellow (Google), Ananthram Swami (US ARL), Matt Fredrikson (CMU), Florian Tramèr (Stanford), Kathrin Grosse (CISPA), Michael Wellman (U of Michigan).
Gradient Masking
Training: minimize a loss that is small when the prediction is correct on a legitimate input.
Adversarial training: minimize a loss with two terms, one that is small when the prediction is correct on a legitimate input, and one that is small when the prediction is correct on an adversarial input.
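To make the two terms concrete, here is a minimal runnable sketch of adversarial training on a toy logistic-regression model, where the adversarial input is crafted with a single FGSM step. The toy data, the model, and the equal weighting of the two loss terms are illustrative assumptions, not the setup from the slides.

    import numpy as np

    # Toy data and a logistic-regression model (illustrative assumptions).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(128, 10))
    y = (X[:, 0] > 0).astype(float)
    w, b = np.zeros(10), 0.0
    eps, lr = 0.1, 0.5

    def predict(X):
        return 1.0 / (1.0 + np.exp(-(X @ w + b)))

    def cross_entropy(p, y):
        return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

    for step in range(200):
        p = predict(X)
        # FGSM: perturb each input along the sign of the input gradient of the loss.
        grad_x = (p - y)[:, None] * w            # analytic d(loss)/dx for this model
        X_adv = X + eps * np.sign(grad_x)
        p_adv = predict(X_adv)
        # Adversarial training objective: both terms should become small.
        loss = 0.5 * (cross_entropy(p, y) + cross_entropy(p_adv, y))
        # Analytic parameter gradients of the combined objective.
        grad_w = 0.5 * (X.T @ (p - y) + X_adv.T @ (p_adv - y)) / len(y)
        grad_b = 0.5 * (np.sum(p - y) + np.sum(p_adv - y)) / len(y)
        w, b = w - lr * grad_w, b - lr * grad_b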
Gradient masking in adversarially trained models
[Illustration, adapted from slides by Florian Tramèr and built up over three slides: around a data point, the direction of the adversarially trained model's gradient is contrasted with the direction of another model's gradient. The point reached along the adversarially trained model's gradient is a non-adversarial example, while an adversarial example lies along the other model's gradient direction.]
Tramèr et al., Ensemble Adversarial Training: Attacks and Defenses.
Evading gradient masking (1)
Threat model: white-box adversary.
Attack: (1) a random step of norm alpha, then (2) an FGSM step of norm eps - alpha (sketched below).
Tramèr et al., Ensemble Adversarial Training: Attacks and Defenses. Illustration adapted from slides by Florian Tramèr.
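A minimal sketch of this two-step attack is below; it assumes a grad_loss_x callable returning the gradient of the model's loss with respect to its input, inputs in [0, 1], and a sign-of-Gaussian random first step, all of which are assumptions for illustration rather than the exact formulation in the cited work.

    import numpy as np

    def rand_fgsm(x, y_true, grad_loss_x, eps=0.3, alpha=0.15, rng=None):
        rng = rng or np.random.default_rng()
        # (1) Random step of L-infinity norm alpha, to move off the sharply
        #     curved (masked) region of the loss surface around the data point.
        x_prime = x + alpha * np.sign(rng.normal(size=x.shape))
        # (2) FGSM step of norm eps - alpha from the displaced point.
        x_adv = x_prime + (eps - alpha) * np.sign(grad_loss_x(x_prime, y_true))
        # Keep the total perturbation inside the threat model's eps ball and
        # the valid input range (assumed to be [0, 1]).
        x_adv = np.clip(x_adv, x - eps, x + eps)
        return np.clip(x_adv, 0.0, 1.0)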
Evading gradient masking (2)
Threat model: black-box adversary.
Attack: (1) learn a substitute for the defended model, then (2) find an adversarial direction using the substitute.
Papernot et al., Practical Black-Box Attacks against Machine Learning. Papernot et al., Towards the Science of Security and Privacy in Machine Learning.
Attacking black-box models
[Illustration: a local substitute model alongside the black-box ML system; example labels shown include "no truck sign" and "STOP sign".]
(1) The adversary queries the remote ML system with synthetic inputs to learn a local substitute.
Papernot et al., Practical Black-Box Attacks against Machine Learning.
Attacking black-box models (continued)
[Illustration: an adversarial example crafted on the local substitute is labeled "yield sign".]
(2) The adversary uses the local substitute to craft adversarial examples.
Papernot et al., Practical Black-Box Attacks against Machine Learning.
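The two steps can be condensed into the heavily simplified sketch below: label synthetic inputs via the remote model, fit a local substitute, and craft FGSM examples on the substitute. The remote_predict callable, the Gaussian synthetic inputs, and the scikit-learn logistic-regression substitute are illustrative assumptions; the cited attack trains a neural-network substitute with Jacobian-based dataset augmentation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def black_box_attack(remote_predict, X_seed, eps=0.3, rng=None):
        """remote_predict: returns the black-box system's labels (0/1).
        X_seed: seed inputs with values in [0, 1], shape (n, d)."""
        rng = rng or np.random.default_rng(0)
        # (1) Query the remote system on synthetic inputs, fit a substitute.
        X_syn = np.clip(X_seed + rng.normal(scale=0.1, size=X_seed.shape), 0.0, 1.0)
        substitute = LogisticRegression(max_iter=1000).fit(X_syn, remote_predict(X_syn))
        # (2) Craft FGSM examples on the substitute. For binary logistic
        # regression, the input gradient of the cross-entropy loss is (p - y) * w.
        w = substitute.coef_[0]
        p = substitute.predict_proba(X_seed)[:, 1]
        y = remote_predict(X_seed).astype(float)
        grad_x = (p - y)[:, None] * w
        return np.clip(X_seed + eps * np.sign(grad_x), 0.0, 1.0)

The returned inputs are then submitted to the remote system; the transferability discussed on the next slides is what makes them likely to be misclassified there as well.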
Adversarial example transferability
Papernot et al., Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples.
Large adversarial subspaces enable transferability: on average, about 44 orthogonal adversarial directions are found on the source model, and about 25 of them transfer to the target model.
Tramèr et al., The Space of Transferable Adversarial Examples.
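As a sketch of the measurement behind this statistic, the snippet below counts, for a set of orthogonal perturbation directions found on a source model, how many are adversarial on the source and how many of those also fool a target model. The construction of the orthogonal adversarial directions themselves (the method of the cited paper) is not reproduced here; the predict callables and the directions array are assumptions.

    import numpy as np

    def count_transferable_directions(x, y_true, directions, eps,
                                      source_predict, target_predict):
        """directions: array of shape (k, d) of pairwise-orthogonal unit vectors."""
        adversarial_on_source, transfer_to_target = 0, 0
        for r in directions:
            x_adv = x + eps * r                      # step of size eps along r
            if source_predict(x_adv) != y_true:
                adversarial_on_source += 1
                if target_predict(x_adv) != y_true:
                    transfer_to_target += 1
        return adversarial_on_source, transfer_to_target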
Adversarial training (revisited): the loss is small when the prediction is correct on a legitimate input and small when the prediction is correct on an adversarial input, but because of gradient masking, the adversarial input crafted from the model's own gradient is not truly adversarial.
Ensemble Adversarial Training
Ensemble adversarial training
Intuition: present adversarial gradients from multiple models during training.
[Illustration, built up over several slides: with plain adversarial training, the adversarial gradient comes from Model A alone; with ensemble adversarial training, adversarial gradients are drawn from Model A as well as pre-trained models B, C, and D during training, and the trained model is then used at inference.]
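A minimal sketch of this training loop follows: each step's FGSM examples are crafted using the input gradient of a model drawn from a pool containing the trained model and several static pre-trained models. The grad_loss_x_pool and train_step callables, the uniform sampling over the pool, and the 50/50 clean/adversarial mix are assumptions standing in for a real training setup.

    import numpy as np

    def ensemble_adversarial_training(X, y, grad_loss_x_pool, train_step,
                                      epochs=10, eps=0.3, rng=None):
        """grad_loss_x_pool: callables (X, y) -> dLoss/dX, one per model; index 0
        is the model being trained, the others are static pre-trained models.
        train_step: callable that updates the trained model on a labelled batch."""
        rng = rng or np.random.default_rng(0)
        for _ in range(epochs):
            # Pick this step's source of adversarial gradients from the pool.
            grad_loss_x = grad_loss_x_pool[rng.integers(len(grad_loss_x_pool))]
            # FGSM examples crafted on the selected model's gradient.
            X_adv = np.clip(X + eps * np.sign(grad_loss_x(X, y)), 0.0, 1.0)
            # Update the trained model on a mix of clean and adversarial data.
            train_step(np.concatenate([X, X_adv]), np.concatenate([y, y]))

In the cited paper, only the augmented model is deployed at inference; the pre-trained models serve purely as sources of diverse adversarial gradients during training.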
Experimental results on MNIST (attacks from held-out models).
Experimental results on ImageNet (attacks from held-out models).
Reproducible Adversarial ML research with CleverHans
CleverHans library guiding principles:
1. Benchmark reproducibility
2. Can be used with any TensorFlow model (see the sketch below)
3. Always include state-of-the-art attacks and defenses
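As an illustration of principle 2, the sketch below expresses FGSM directly against any TensorFlow 1.x model that exposes its logits. This is plain TensorFlow rather than the library's own API, and the model_logits, x, and y names are assumptions.

    import tensorflow as tf

    def fgsm(x, y, model_logits, eps=0.3):
        """x: input tensor, y: one-hot label tensor,
        model_logits: callable mapping the input tensor to logits."""
        loss = tf.nn.softmax_cross_entropy_with_logits(
            labels=y, logits=model_logits(x))
        grad, = tf.gradients(loss, x)
        x_adv = x + eps * tf.sign(grad)
        # Clip to the valid input range and cut the gradient so a surrounding
        # training graph (e.g. adversarial training) treats the attack as constant.
        return tf.stop_gradient(tf.clip_by_value(x_adv, 0.0, 1.0))

Because the attack only needs the input tensor and a logits function, the same definition can be reused across models, which is what makes reproducible benchmark comparisons possible.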
Growing community: 1.1K+ stars, 290+ forks, 35 contributors.
Adversarial examples represent worst-case distribution drifts.
[DDS04] Dalvi et al., Adversarial Classification (KDD 2004).
Adversarial examples are a tangible instance of hypothetical AI safety problems.
Image source: http://www.nerdist.com/wp-content/uploads/2013/07/Space-Odyssey-4.jpg
Thank you for listening!
nicolas@papernot.fr | www.cleverhans.io | @NicolasPapernot
Get involved at: github.com/tensorflow/cleverhans
This research was funded by: [sponsor logos]