# Provably Minimally-Distorted Adversarial Examples

arXiv: Learning, 2018.

Abstract:

The ability to deploy neural networks in real-world, safety-critical systems is severely limited by the presence of adversarial examples: slightly perturbed inputs that are misclassified by the network. In recent years, several techniques have been proposed for increasing robustness to adversarial examples, and yet most of these have b...

Introduction

- While machine learning, and neural networks in particular, have seen significant success, recent work (Szegedy et al, 2014) has shown that an adversary can cause unintended behavior by performing slight modifications to the input at test-time.
- In neural networks used as classifiers, these adversarial examples are produced by taking some normal instance that is classified correctly, and applying a slight perturbation to cause it to be misclassified as any target desired by the adversary.
- This phenomenon, which has been shown to affect most state-of-the-art networks, poses a significant hindrance to deploying neural networks in safety-critical settings.
- Input images with width W and height H are represented as points in the space $[0, 1]^{W \cdot H}$.
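
To make the last point concrete, the following is a minimal sketch of how an image can be treated as a point in that space and how the size of a perturbation can be measured. The helper names (`flatten`, `linf_distance`, `l1_distance`) and the 2x2 pixel values are hypothetical and chosen only for illustration. L∞ is the metric named in the table captions; L1 is included as a second common choice.

```python
from typing import List, Sequence


def flatten(image: Sequence[Sequence[float]]) -> List[float]:
    """Flatten an H x W grid of pixel intensities into a point in [0, 1]^(W*H)."""
    flat = [pixel for row in image for pixel in row]
    if any(p < 0.0 or p > 1.0 for p in flat):
        raise ValueError("pixel intensities must lie in [0, 1]")
    return flat


def linf_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Largest change applied to any single pixel."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return max(abs(a - b) for a, b in zip(x, y))


def l1_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Total absolute change summed over all pixels."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical 2x2 "image" and a perturbed copy of it.
x = flatten([[0.0, 0.5], [1.0, 0.25]])
x_adv = flatten([[0.0, 0.75], [1.0, 0.0]])

print(len(x))                   # dimension W * H
print(linf_distance(x, x_adv))  # distortion under L-infinity
print(l1_distance(x, x_adv))    # distortion under L1
```

A "slight perturbation" in the sense used above is one for which such a distance is small.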

Highlights

- While machine learning, and neural networks in particular, have seen significant success, recent work (Szegedy et al, 2014) has shown that an adversary can cause unintended behavior by performing slight modifications to the input at test-time
- In neural networks used as classifiers, these adversarial examples are produced by taking some normal instance that is classified correctly, and applying a slight perturbation to cause it to be misclassified as any target desired by the adversary
- We considered two neural networks: the one described in Section 3, denoted N, and a version of N that has been trained with adversarial training as described in (Madry et al, 2017), denoted N̄
- Neural networks hold great potential to be used in safety-critical systems, but their susceptibility to adversarial examples poses a significant hindrance
- We introduce provably minimally distorted adversarial examples and show how to construct them with formal verification approaches
- We evaluate one recent attack (Carlini and Wagner, 2017) and find it often produces adversarial examples whose distance is within 6.6% to 13% of optimal, and one defense (Madry et al, 2017), and find that it increases distortion to the nearest adversarial example by an average of 423% on the MNIST dataset for our tested networks

Results

- For evaluation purposes the authors arbitrarily selected 10 source images with known labels from the MNIST test set.
- For every combination of neural network, distance metric and labeled source image x, the authors considered each of the 9 other possible labels for x.
- For each of these the authors used the CW attack to produce an initial targeted adversarial example, and used Reluplex to search for a provably minimally distorted example.
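
The search described above can be sketched as a refinement loop: an attack supplies an upper bound on the minimal distortion, and a verifier is queried repeatedly to tighten it. The code below is an illustration under stated assumptions, not the authors' implementation. `find_adversarial_within` is a hypothetical stand-in for a verifier query (the role Reluplex plays in the paper), the bisection schedule is one possible way to pick the next bound, and `toy_oracle` hard-codes a one-dimensional problem whose true minimal distortion is 0.2.

```python
from typing import Callable, Optional, Tuple


def minimal_distortion(
    initial_distance: float,
    find_adversarial_within: Callable[[float], Optional[float]],
    tolerance: float = 1e-3,
) -> Tuple[float, float]:
    """Bracket the minimal adversarial distortion between two bounds.

    `initial_distance` is the distortion of an example found by an attack.
    `find_adversarial_within(delta)` must return the distortion of some
    adversarial example within distance `delta`, or None if it can prove
    that none exists.
    """
    lower = 0.0               # no adversarial example is known to exist this close
    upper = initial_distance  # an adversarial example is known at this distance
    while upper - lower > tolerance:
        mid = (lower + upper) / 2
        found = find_adversarial_within(mid)
        if found is None:
            lower = mid       # the oracle proved the ball of radius `mid` is safe
        else:
            upper = found     # a closer adversarial example was found
    return lower, upper


TRUE_MINIMUM = 0.2  # hypothetical ground truth for the toy problem


def toy_oracle(delta: float) -> Optional[float]:
    """Hypothetical verifier stand-in; a real one would analyze the network."""
    if delta <= TRUE_MINIMUM:
        return None
    # Report an adversarial example strictly closer than `delta`.
    return TRUE_MINIMUM + (delta - TRUE_MINIMUM) / 2


# Suppose the initial attack found an example at distance 0.35.
lower, upper = minimal_distortion(0.35, toy_oracle)
print(lower, upper)
```

In the evaluation described above, a search of this kind would be run once per combination of network, distance metric, source image and target label.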

Conclusion

- Neural networks hold great potential to be used in safety-critical systems, but their susceptibility to adversarial examples poses a significant hindrance.
- The burgeoning field of neural network verification can mitigate this problem by allowing researchers to obtain an absolute measurement of the usefulness of a defense, regardless of the attack to be used against it.
- The authors evaluate one recent attack (Carlini and Wagner, 2017) and find it often produces adversarial examples whose distance is within 6.6% to 13% of optimal, and one defense (Madry et al, 2017), and find that it increases distortion to the nearest adversarial example by an average of 423% on the MNIST dataset for the tested networks.
- To the best of the authors' knowledge, this is the first proof of robustness increase for a defense that was not designed to be proven secure.
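
Both percentages quoted above are relative quantities: the first compares an attack's distortion with the proven minimum, and the second compares the minimal distortion with and without the defense. A minimal sketch of the arithmetic follows. The function names and all numeric inputs are hypothetical and are not values from the paper.

```python
def gap_to_optimal_percent(attack_distance: float, minimal_distance: float) -> float:
    """How far, in percent, an attack's distortion lies above the proven minimum."""
    if minimal_distance <= 0:
        raise ValueError("minimal_distance must be positive")
    return (attack_distance - minimal_distance) / minimal_distance * 100


def percent_increase(baseline: float, defended: float) -> float:
    """Percentage by which a defense raises the minimal distortion."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (defended - baseline) / baseline * 100


# Hypothetical distances for a single source image and target label.
attack_gap = gap_to_optimal_percent(attack_distance=0.055, minimal_distance=0.05)
defense_gain = percent_increase(baseline=0.05, defended=0.15)

print(round(attack_gap, 2))    # attack is this many percent above the minimum
print(round(defense_gain, 2))  # defense raises the minimum by this many percent
```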

- Table 1: Evaluating our technique on the MNIST dataset
- Table 2: Comparing the 35 instances on which Reluplex terminated for both N, L∞ and N̄, L∞

Contributions

- Proposes formal verification techniques to address the difficulty that proposed defenses are often quickly shown to be vulnerable to later attacks
- Shows how to construct provably minimally distorted adversarial examples: given an arbitrary neural network and input sample, the authors can construct adversarial examples that they prove are of minimal distortion
- Demonstrates that one of the recent ICLR defense proposals, adversarial retraining, provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2
- Proposes a method for using formal verification to assess the effectiveness of adversarial example attacks and defenses
- Evaluates the robustness of adversarial training as performed by Madry et al. at defending against adversarial examples on the MNIST dataset

Reference

- A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
- M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba. End to end learning for self-driving cars, 2016. Technical Report. http://arxiv.org/abs/1604.07316.
- N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. IEEE Symposium on Security and Privacy, 2017.
- R. Ehlers. Formal verification of piece-wise linear feed-forward neural networks. In Proc. 15th Int. Symp. on Automated Technology for Verification and Analysis (ATVA), 2017.
- I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- J. Hendrik Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. In International Conference on Learning Representations, 2017. arXiv preprint arXiv:1702.04267.
- D. Hendrycks and K. Gimpel. Early methods for detecting adversarial images. In International Conference on Learning Representations (Workshop Track), 2017.
- R. Huang, B. Xu, D. Schuurmans, and C. Szepesvari. Learning with a strong adversary. CoRR, abs/1511.03034, 2015.
- X. Huang, M. Kwiatkowska, S. Wang, and M. Wu. Safety verification of deep neural networks, 2016. Technical Report. http://arxiv.org/abs/1610.06940.
- K. Julian, J. Lopez, J. Brush, M. Owen, and M. Kochenderfer. Policy compression for aircraft collision avoidance systems. In Proc. 35th Digital Avionics Systems Conf. (DASC), pages 1–10, 2016.
- G. Katz, C. Barrett, D. Dill, K. Julian, and M. Kochenderfer. Towards Proving the Adversarial Robustness of Deep Neural Networks. In Proc. 1st. Workshop on Formal Verification of Autonomous Vehicles (FVAV), pages 19–26, 2017a.
- G. Katz, C. Barrett, D. Dill, K. Julian, and M. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In Proc. 29th Int. Conf. on Computer Aided Verification (CAV), pages 97–117, 2017b.
- G. Katz, C. Barrett, D. Dill, K. Julian, and M. Kochenderfer. Reluplex, 2017c. https://github.com/guykatzz/ReluplexCav2017.
- A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. In International Conference on Learning Representations (Workshop Track), 2016.
- A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep neural networks. arXiv preprint arXiv:1511.04599, 2015.
- L. Pulina and A. Tacchella. An abstraction-refinement approach to verification of artificial neural networks. In Proc. 22nd Int. Conf. on Computer Aided Verification (CAV), pages 243–257, 2010.
- L. Pulina and A. Tacchella. Challenging SMT solvers to verify neural networks. AI Communications, 25(2):117–135, 2012.
- A. Rozsa, E. M. Rudd, and T. E. Boult. Adversarial diversity and hard positive generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 25–32, 2016.
- C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. 2014.
- F. Tramer, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkZvSe-RZ. Accepted as poster.
- C. Xiao, J.-Y. Zhu, B. Li, W. He, M. Liu, and D. Song. Spatially transformed adversarial examples. International Conference on Learning Representations, 2018.
- S. Zheng, Y. Song, T. Leung, and I. Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4480–4488, 2016.
- Z. Zhao, D. Dua, and S. Singh. Generating natural adversarial examples. International Conference on Learning Representations, 2018.
