Ensemble Adversarial Training: Attacks and Defenses

International Conference on Learning Representations (ICLR), 2018.

Abstract:

Adversarial examples are perturbed inputs designed to fool machine learning models. Adversarial training injects such examples into training data to increase robustness. To scale this technique to large datasets, perturbations are crafted using fast single-step methods that maximize a linear approximation of the model's loss. We show th...

Introduction
  • Machine learning (ML) models are often vulnerable to adversarial examples, maliciously perturbed inputs designed to mislead a model at test time (Biggio et al, 2013; Szegedy et al, 2013; Goodfellow et al, 2014b; Papernot et al, 2016a).
  • It is natural to ask whether it is possible, at scale, to achieve robustness against the class of black-box adversaries. Towards this goal, Kurakin et al (2017b) adversarially trained an Inception v3 model (Szegedy et al, 2016b) on ImageNet using a “single-step” attack based on a linearization of the model’s loss (Goodfellow et al, 2014b); a sketch of such a single-step attack follows this list.
  • Their trained model is robust to single-step perturbations but remains vulnerable to more costly “multi-step” attacks.
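
A minimal sketch of the single-step attacks referred to above (FGSM and its least-likely-class variant, Step-LL). The helper loss_grad, which returns the gradient of the attacked model's loss with respect to its input, and the [0, 1] pixel range are assumptions for illustration, not the authors' code.

```python
import numpy as np

def fgsm(x, y, loss_grad, eps):
    """Fast Gradient Sign Method: one signed-gradient step of size eps that
    maximizes a linear approximation of the loss at (x, y).
    loss_grad(x, y) is an assumed callable returning dL/dx for the model."""
    g = loss_grad(x, y)
    x_adv = x + eps * np.sign(g)
    return np.clip(x_adv, 0.0, 1.0)  # stay in the valid input range

def step_ll(x, y_least_likely, loss_grad, eps):
    """Step-LL variant: step towards the least-likely class by decreasing
    the loss computed with respect to that target label."""
    g = loss_grad(x, y_least_likely)
    x_adv = x - eps * np.sign(g)
    return np.clip(x_adv, 0.0, 1.0)
```
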
Highlights
  • Machine learning (ML) models are often vulnerable to adversarial examples, maliciously perturbed inputs designed to mislead a model at test time (Biggio et al, 2013; Szegedy et al, 2013; Goodfellow et al, 2014b; Papernot et al, 2016a)
  • We evaluate our Ensemble Adversarial Training strategy described in Section 3.4
  • Previous work on adversarial training at scale has produced encouraging results, showing strong robustness to adversarial examples (Goodfellow et al, 2014b; Kurakin et al, 2017b). These results are misleading, as the adversarially trained models remain vulnerable to simple black-box and white-box attacks
  • Our results, which are generic with respect to the application domain, suggest that adversarial training can be improved by decoupling the generation of adversarial examples from the model being trained (a sketch of this idea follows this list).
  • Our experiments with Ensemble Adversarial Training show that the robustness attained against attacks from some models transfers to attacks from other models.
  • A recent work by Xiao et al (2018) found Ensemble Adversarial Training to be resilient to such attacks on MNIST and CIFAR10, often attaining higher robustness than models that were adversarially trained on iterative attacks.
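
A rough sketch of the decoupling idea highlighted above: during training, part of each batch is replaced by single-step adversarial examples crafted against either the current model or one of several static pre-trained models. The names (ensemble_adv_training_epoch, model_grad, pretrained_grads, train_step) and the 50/50 mixing are illustrative assumptions, not the paper's exact procedure.

```python
import random
import numpy as np

def ensemble_adv_training_epoch(batches, model_grad, pretrained_grads,
                                train_step, eps, adv_fraction=0.5):
    """One epoch of an Ensemble Adversarial Training-style loop (sketch).
    model_grad and each entry of pretrained_grads are assumed callables that
    return the input gradient of the corresponding model's loss; train_step
    is an assumed SGD update on the (partially adversarial) batch."""
    for x, y in batches:
        n_adv = int(adv_fraction * len(x))
        # Pick the gradient source: the model being trained or a static model,
        # so adversarial-example generation is (partly) decoupled from it.
        grad_fn = random.choice([model_grad] + list(pretrained_grads))
        g = grad_fn(x[:n_adv], y[:n_adv])
        x_adv = np.clip(x[:n_adv] + eps * np.sign(g), 0.0, 1.0)
        x_mixed = np.concatenate([x_adv, x[n_adv:]], axis=0)
        train_step(x_mixed, y)  # labels are unchanged, only inputs are perturbed
```
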
Methods
  • The authors show the existence of a degenerate minimum, as described in Section 3.3, for the adversarially trained Inception v3 model of Kurakin et al (2017b)
  • Their model was trained on a Step-LL attack with ε ≤ 16/256.
  • For 1,000 random test points, the authors find that for a standard Inception v3 model, Step-LL gets within 19% of the optimum loss on average.
  • This attack is a good candidate for adversarial training.
  • This effect is not due to the decision surface of v3adv being “too flat” near the data points: the average gradient norm is larger for v3adv (0.17) than for the standard v3 model (0.10). A randomized single-step attack (R+Step-LL, used in Tables 2, 4 and 9) is sketched after this list.
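
A minimal sketch of the randomized single-step attack (R+Step-LL / R+FGSM) evaluated in Tables 2, 4 and 9 below: a small random step of size α is taken first, and the remaining ε − α budget is spent on a single signed-gradient step from that point. As before, loss_grad and the [0, 1] input range are assumptions for illustration.

```python
import numpy as np

def r_fgsm(x, y, loss_grad, eps, alpha):
    """Random step followed by a single signed-gradient step (sketch).
    The tables in this paper use alpha = eps / 2."""
    x_rand = x + alpha * np.sign(np.random.randn(*x.shape))  # small random step
    g = loss_grad(x_rand, y)
    x_adv = x_rand + (eps - alpha) * np.sign(g)              # spend the remaining budget
    return np.clip(x_adv, 0.0, 1.0)
```
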
Results
  • The authors' model achieved 97.9% accuracy on the clean test data.
Conclusion
  • Previous work on adversarial training at scale has produced encouraging results, showing strong robustness to adversarial examples (Goodfellow et al, 2014b; Kurakin et al, 2017b).
  • These results are misleading, as the adversarially trained models remain vulnerable to simple black-box and white-box attacks.
  • A recent work by Xiao et al (2018) found Ensemble Adversarial Training to be resilient to such attacks on MNIST and CIFAR10, often attaining higher robustness than models that were adversarially trained on iterative attacks.
Tables
  • Table1: Error rates (in %) of adversarial examples transferred between models. We use Step-LL with ε = 16/256 for 10,000 random test inputs. Diagonal elements represent a white-box attack. The best attack for each target appears in bold. Similar results for MNIST models appear in Table 7
  • Table2: Error rates (in %) for Step-LL, R+Step-LL and a two-step Iter-LL on ImageNet. We use ε = 16/256 and α = ε/2 on 10,000 random test inputs. R+FGSM results on MNIST are in Table 7
  • Table3: Models used for Ensemble Adversarial Training on ImageNet. The ResNets (He et al, 2016) use either 50 or 101 layers. IncRes stands for Inception-ResNet (Szegedy et al, 2016a)
  • Table4: Error rates (in %) for Ensemble Adversarial Training on ImageNet. Error rates on clean data are computed over the full test set. For 10,000 random test set inputs and ε = 16/256, we report error rates on white-box Step-LL and the worst-case error over a series of black-box attacks (Step-LL, R+Step-LL, FGSM, I-FGSM, PGD) transferred from the holdout models in Table 3. For both architectures, we mark methods tied for best in bold (based on 95% confidence)
  • Table5: Neural network architectures used in this work for the MNIST dataset. Conv: convolutional layer, FC: fully connected layer
  • Table6: Approximation ratio between optimal loss and loss induced by single-step attack on MNIST. Architecture B’ is the same as B without the input dropout layer
  • Table7: White-box and black-box attacks against standard and adversarially trained models. For each model, the strongest single-step white-box and black-box attacks are marked in bold
  • Table8: Ensemble Adversarial Training on MNIST. For black-box robustness, we report the maximum and average error rate over a suite of 12 attacks, comprising the FGSM, I-FGSM and PGD (Madry et al, 2017) attacks applied to models A, B, C and D (the iterative attacks are sketched after this list). We use ε = 16 in all cases. For each model architecture, we mark the models tied for best (at a 95% confidence level) in bold
  • Table9: Error rates (in %) of randomized single-step attacks transferred between models on ImageNet. We use R+Step-LL with ε = 16/256 and α = ε/2 for 10,000 random test set samples. The white-box attack always outperforms black-box attacks
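
For reference, a sketch of the iterative attacks named in the captions above (I-FGSM; with a random starting point this becomes PGD as in Madry et al, 2017): small signed-gradient steps are repeated and the iterate is projected back into the ε-ball around the original input after each step. loss_grad and the [0, 1] input range are again assumptions.

```python
import numpy as np

def iterative_fgsm(x, y, loss_grad, eps, alpha, steps):
    """Iterative FGSM / PGD-style attack (sketch)."""
    x_adv = x.copy()
    for _ in range(steps):
        g = loss_grad(x_adv, y)
        x_adv = x_adv + alpha * np.sign(g)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the L-infinity eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid input range
    return x_adv
```
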
Related work
  • Various defensive techniques against adversarial examples in deep neural networks have been proposed (Gu & Rigazio, 2014; Luo et al, 2015; Papernot et al, 2016c; Nayebi & Ganguli, 2017; Cisse et al, 2017), but many remain vulnerable to adaptive attackers (Carlini & Wagner, 2017a;b; Baluja & Fischer, 2017). Adversarial training (Szegedy et al, 2013; Goodfellow et al, 2014b; Kurakin et al, 2017b; Madry et al, 2017) appears to hold the greatest promise for learning robust models.

    Madry et al (2017) show that adversarial training on MNIST yields models that are robust to white-box attacks, if the adversarial examples used in training closely maximize the model’s loss. Moreover, recent works by Sinha et al (2018), Raghunathan et al (2018) and Kolter & Wong (2017) even succeed in providing certifiable robustness for small perturbations on MNIST. As we argue in Appendix C, the MNIST dataset is peculiar in that there exists a simple “closed-form” denoising procedure (namely feature binarization; a one-line sketch follows) which leads to similarly robust models without adversarial training. This may explain why robustness to white-box attacks is hard to scale to tasks such as ImageNet (Kurakin et al, 2017b). We believe that the existence of a simple robust baseline for MNIST can be useful for understanding some limitations of adversarial training techniques.
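
A one-line sketch of the feature binarization baseline mentioned above: rounding each pixel to 0 or 1 undoes small perturbations of MNIST's mostly black-or-white pixels before the image reaches the classifier. The 0.5 threshold and the [0, 1] input scaling are assumptions.

```python
import numpy as np

def binarize_features(x, threshold=0.5):
    """Round pixels to {0, 1}; a 'closed-form' denoising step for MNIST."""
    return (x >= threshold).astype(np.float32)
```
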
Funding
  • Nicolas Papernot is supported by a Google PhD Fellowship in Security
  • Research was supported in part by the Army Research Laboratory, under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA), and the Army Research Office under grant W911NF-13-1-0421
Reference
  • Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
  • Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mane. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
  • Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
  • Shumeet Baluja and Ian Fischer. Adversarial transformation networks: Learning to generate adversarial examples. arXiv preprint arXiv:1703.09387, 2017.
  • Wieland Brendel and Matthias Bethge. Comment on “Biologically inspired protection of deep networks from adversarial attacks”. arXiv preprint arXiv:1704.01547, 2017.
  • Jacob Buckman, Aurko Roy, Colin Raffel, and Ian Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S18Su--CW.
  • Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 2017a.
  • Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. arXiv preprint arXiv:1705.07263, 2017b.
  • Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26. ACM, 2017.
  • Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. arXiv preprint arXiv:1704.08847, 2017.
  • Charles J Colbourn. CRC handbook of combinatorial designs. CRC press, 2010.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • Logan Engstrom, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. A rotation and a translation suffice: Fooling cnns with simple transformations. arXiv preprint arXiv:1712.02779, 2017.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014a.
  • Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014b.
  • Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • J Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2017.
  • Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In ICLR, 2017a.
  • Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In ICLR, 2017b.
  • NIPS 2017: Defense against adversarial attack, 2017c. URL https://www.kaggle.com/c/
  • Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. In ICLR, 2017.
  • Yan Luo, Xavier Boix, Gemma Roig, Tomaso Poggio, and Qi Zhao. Foveation-based mechanisms alleviate adversarial examples. arXiv preprint arXiv:1511.06292, 2015.
  • Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
  • Dmytro Mishkin, Nikolay Sergievskiy, and Jiri Matas. Systematic evaluation of convolution neural network advances on the imagenet. Computer Vision and Image Understanding, 2017.
  • Aran Nayebi and Surya Ganguli. Biologically inspired protection of deep networks from adversarial attacks. arXiv preprint arXiv:1703.09202, 2017.
  • Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pp. 372–387. IEEE, 2016a.
  • Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman. Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814, 2016b.
  • Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pp. 582–597. IEEE, 2016c.
  • Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Asia Conference on Computer and Communications Security (ASIACCS), pp. 506–519. ACM, 2017.
  • Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Bys4ob-Rb.
  • Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk6kPgZA-.
  • Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958, 2014.
  • Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016a.
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pp. 2818–2826, 2016b.
  • Florian Tramer, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In USENIX Security, 2016.
  • Florian Tramer, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.
  • Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks, 2018. URL https://openreview.net/forum?id=HknbyQbC-.
  • Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. Mitigating adversarial effects through randomization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Sk9yuql0Z.
  • Chao Zhang, Lei Zhang, and Jieping Ye. Generalization bounds for domain adaptation. In Advances in neural information processing systems, pp. 3320–3328, 2012.
  • Attacks based on transferability (Szegedy et al., 2013) fall in this category, wherein the adversary selects a training procedure and a model architecture H, trains a local model h over D, and computes adversarial examples on its local model h using white-box attack strategies.
  • We provide A with the target’s training procedure to capture knowledge of defensive strategies applied at training time, e.g., adversarial training (Szegedy et al., 2013; Goodfellow et al., 2014b) or Ensemble Adversarial Training (see Section 4.2). For Ensemble Adversarial Training, A also knows the architectures of all pre-trained models. In this work, we always mount black-box attacks that train a local model with a different architecture than the target model. We actually find that black-box attacks on adversarially trained models are stronger in this case (see Table 1).
  • The main focus of our paper is on non-interactive black-box adversaries as defined above. For completeness, we also formalize a stronger notion of interactive black-box adversaries that additionally issue prediction queries to the target model (Papernot et al., 2017). We note that in cases where ML models are deployed as part of a larger system (e.g., a self driving car), an adversary may not have direct access to the model’s query interface.
  • Papernot et al. (2017) show that such attacks are possible even if the adversary only gets access to a small number of samples from D. Note that if the target model’s prediction interface additionally returns class scores h(x), interactive black-box adversaries could use queries to the target model to estimate the model’s gradient, e.g., using finite differences (Chen et al., 2017), and then apply the attacks in Section 3.2 (a sketch of such a finite-difference estimate follows this list). We further discuss interactive black-box attack strategies in Section 5.
  • We further define the average discrepancy distance discH(Atrain, A∗) with respect to a hypothesis space H, following Mansour et al. (2009).
  • Finally, let RN(H) be the average Rademacher complexity of the distributions A1, ..., Ak (Zhang et al., 2012). Note that RN(H) → 0 as N → ∞. The following theorem is a corollary of Zhang et al. (2012, Theorem 5.2). Theorem 5: Assume that H is a function class consisting of bounded functions. Then, with probability at least 1 − δ, sup_{h ∈ H} |R(h, Atrain) − R(h, A∗)| ≤ discH(Atrain, A∗) + 2RN(H) + O(√(ln(1/δ)/N)).
  • We re-iterate our ImageNet experiments on MNIST. For this simpler task, Madry et al. (2017) show that training on iterative attacks conveys robustness to white-box attacks with bounded ∞ norm. Our goal is not to attain similarly strong white-box robustness on MNIST, but to show that our observations on limitations of single-step adversarial training, extend to other datasets than ImageNet.
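
To make the finite-difference idea above concrete, here is a minimal coordinate-wise estimator of a target model's input gradient from score queries, in the spirit of Chen et al. (2017). The query function f (returning a scalar score or loss for an input) and the step size h are assumptions, and a practical attack would batch or subsample coordinates rather than query every pixel.

```python
import numpy as np

def finite_difference_grad(f, x, h=1e-4):
    """Estimate the gradient of the scalar query function f at x with central
    finite differences, one coordinate per pair of queries (sketch)."""
    flat = x.reshape(-1).astype(np.float64)
    grad = np.zeros_like(flat)
    for i in range(flat.size):
        e = np.zeros_like(flat)
        e[i] = h
        grad[i] = (f((flat + e).reshape(x.shape)) -
                   f((flat - e).reshape(x.shape))) / (2 * h)
    return grad.reshape(x.shape)
```
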