# Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness

NeurIPS 2020

Abstract

Adversarial data augmentation has shown promise for training robust deep neural networks against unforeseen data shifts or corruptions. However, it is difficult to define heuristics to generate effective fictitious target distributions containing "hard" adversarial perturbations that are largely different from the source distribution. […]

Introduction

- Deep neural networks can achieve good performance on the condition that the training and testing data are drawn from the same distribution.
- This condition might not hold true in practice.
- Information Bottleneck Principle.
- The Information Bottleneck (IB) [61] is a principled way to seek a latent representation Z that captures the information an input variable X contains about an output Y.
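For reference, the IB Lagrangian that the Methods section adapts (the paper's Eq. (1)) is presumably the standard trade-off between compressing X and preserving information about Y; written in the convention that lines up with the modified objective quoted later, a sketch is:

```latex
\min_{p(z \mid x)} \;\; \beta\, I(X;Z) \;-\; I(Y;Z)
```

Replacing the −I(Y; Z) term with the empirical risk L_CE yields the supervised form L_IB = L_CE + β I(X; Z) used in the Methods section.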

Highlights

- Deep neural networks can achieve good performance on the condition that the training and testing data are drawn from the same distribution
- We develop an efficient maximum-entropy regularizer to achieve the same goal by making the following contributions: (i) to the best of our knowledge, ours is the first work to investigate adversarial data augmentation from an information-theoretic perspective, and to address the previously unstudied problem of generating "hard" adversarial perturbations from the Information Bottleneck (IB) principle; (ii) we theoretically show that the IB principle can be bounded by a maximum-entropy regularization term in the maximization phase of adversarial data augmentation, which results in a notable improvement over [68]; (iii) we show that our formulation holds in an approximate sense under certain non-deterministic conditions
- After engaging the Bayesian Neural Networks (BNNs), our performance is further improved. We believe this is because the BNN provides a better estimation of the predictive uncertainty in the maximization phase
- We introduced a maximum-entropy technique that regularizes adversarial data augmentation
- Experimental results on three standard benchmarks demonstrate that our method consistently outperforms the existing state of the art by a statistically significant margin
- It encourages the model to learn with fictitious target distributions by producing “hard” adversarial perturbations that enlarge predictive uncertainty of the current model
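The notion of "enlarging predictive uncertainty" can be made concrete with the Shannon entropy of the model's softmax output: a perturbation is "hard" when it pushes the predictive distribution toward uniform. A minimal, dependency-free sketch (the function names are ours, not the paper's):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predictive_entropy(probs):
    """Shannon entropy H(p) = -sum p log p of a predictive distribution.
    Larger H(p) means the model is more uncertain about the input."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A confident prediction has low entropy; a uniform one attains the
# maximum log(K) for K classes.
confident = softmax([8.0, 0.0, 0.0])
uniform = softmax([1.0, 1.0, 1.0])
print(predictive_entropy(confident))  # close to 0
print(predictive_entropy(uniform))    # log(3) ≈ 1.0986
```

The maximization phase then seeks perturbations that increase this quantity; the paper's BNN variant replaces the plain softmax with a better-calibrated uncertainty estimate.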

Methods

- The authors' main idea is to incorporate the IB principle into adversarial data augmentation so as to improve model robustness to large domain shifts.
- The authors start by adapting the IB Lagrangian (1) to supervised-learning scenarios so that the latent representation Z can be leveraged for classification purposes
- To this end, the authors modify the IB Lagrangian (1) following [1, 2, 5] to L_IB(θ; X, Y) := L_CE(θ; X, Y) + β I(X; Z), where the constraint on I(Y; Z) is replaced with the risk associated with the prediction according to the loss function L_CE.
- The network parameters are updated by the loss function LIB evaluated on the adversarial examples generated from the maximization phase
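The min-max structure described above can be illustrated on a toy 1-D logistic model: an inner loop perturbs each input by gradient ascent on the task loss plus an entropy bonus (standing in for the maximum-entropy regularizer), and an outer loop updates the weight on the resulting adversarial samples. This is an illustrative sketch with hand-derived gradients and made-up hyperparameters, not the paper's algorithm (which uses deep networks, the full L_IB, and BNN uncertainty); the β I(X; Z) term is omitted here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def maximize_phase(w, x, y, alpha=0.5, eta=0.05, steps=5):
    """Inner loop: perturb x to increase cross-entropy plus alpha times the
    predictive entropy H(p), yielding a 'hard' fictitious sample."""
    for _ in range(steps):
        p = sigmoid(w * x)
        d_ce = w * (p - y)                              # d/dx cross-entropy
        d_h = w * p * (1 - p) * math.log((1 - p) / p)   # d/dx entropy H(p)
        x += eta * (d_ce + alpha * d_h)                 # gradient *ascent*
    return x

def train(data, w=0.5, lr=0.1, epochs=20):
    """Outer loop: gradient descent on the loss evaluated at the
    adversarial samples produced by the maximization phase."""
    for _ in range(epochs):
        for x, y in data:
            x_adv = maximize_phase(w, x, y)
            p = sigmoid(w * x_adv)
            w -= lr * x_adv * (p - y)                   # d/dw cross-entropy
    return w

# Toy dataset: one positive and one negative point.
w = train([(1.0, 1), (-1.0, 0)])
```

The inner loop pulls each sample toward the decision boundary (raising both the loss and the model's uncertainty), while the outer loop still learns a separating weight from the perturbed data.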

Results

- The authors' method enjoys the best performance and improves the previous state of the art by a large margin (5% accuracy on CIFAR-10-C and 4% on CIFAR-100-C)
- These gains are achieved across different architectures and on both datasets.
- From the Fourier perspective [74], the performance gains from the adversarial perturbations lie primarily in the high-frequency domain, where many commonly occurring image corruptions concentrate
- These results demonstrate that the maximum-entropy term can regularize networks to be more robust to common image corruptions

Conclusion

- The authors introduced a maximum-entropy technique that regularizes adversarial data augmentation.
- It encourages the model to learn with fictitious target distributions by producing “hard” adversarial perturbations that enlarge predictive uncertainty of the current model.
- The authors demonstrate that the technique obtains state-of-the-art performance on MNIST, PACS, and CIFAR-10/100-C, and is extremely simple to implement.
- One major limitation of the method is that it cannot be directly applied to regression problems since the maximum-entropy lower bound is still difficult to compute in this case.
- The authors' future work might consider alternative measurements of information [49, 63] that are more suited for general machine learning applications

Objectives

- The authors aim to regularize adversarial data augmentation by maximizing the IB function.

- Table 1: Average classification accuracy (%) and standard deviation of models trained on MNIST [40] and evaluated on SVHN [48], MNIST-M [22], SYN [22] and USPS [15]. The results are averaged over ten runs. Best performances are highlighted in bold. The results of PAR are obtained from [73]
- Table 2: Classification accuracy (%) of our approach on the PACS dataset [41] in comparison with the previously reported state-of-the-art results. Bold numbers indicate the best performance (two sets, one for each scenario engaging or forgoing domain identifications, respectively)
- Table 3: Average classification accuracy (%). Across several architectures, our approach obtains CIFAR-10-C and CIFAR-100-C corruption robustness that exceeds the previous state of the art by a large margin. Best performances are highlighted in bold
- Table 4: The settings of different target domains on PACS
- Table 5: The settings of different network architectures on CIFAR-10-C and CIFAR-100-C

- We note that our method achieves the best performance among techniques forgoing domain identifications

Study subjects and analysis

datasets: 4

Other digit datasets, including SVHN [48], MNIST-M [22], SYN [22] and USPS [15], are used to evaluate model performance. These four datasets exhibit large domain shifts from MNIST in terms of backgrounds, shapes and textures. PACS [41] is a recent dataset with different object-style depictions and a more challenging domain shift than in the MNIST experiments

samples: 10000

We follow the setup of [68] in experimenting with the MNIST dataset. We use 10,000 samples from MNIST for training and evaluate prediction accuracy on the respective test sets of SVHN, MNIST-M, SYN and USPS. To work with comparable datasets, we resize all the images to 32 × 32 and treat images from MNIST and USPS as RGB images
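The resizing and channel handling described above can be sketched in a minimal, dependency-free form (a nearest-neighbour resize operating on nested lists; the authors presumably use a standard image library, so this is illustrative only):

```python
def resize_nearest(img, new_h=32, new_w=32):
    """Nearest-neighbour resize of a 2-D grayscale image (list of rows)."""
    h, w = len(img), len(img[0])
    return [[img[r * h // new_h][c * w // new_w] for c in range(new_w)]
            for r in range(new_h)]

def gray_to_rgb(img):
    """Replicate the single channel three times so grayscale digits
    (MNIST, USPS) match the RGB inputs of SVHN/MNIST-M/SYN."""
    return [[(v, v, v) for v in row] for row in img]

# A 28x28 MNIST-sized digit becomes a 32x32 three-channel image.
digit = [[0] * 28 for _ in range(28)]
rgb = gray_to_rgb(resize_nearest(digit))
```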

Reference

- Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(1):1947–1980, 2018.
- Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(12):2897–2905, 2018.
- Alexander A. Alemi, Ian Fischer, and Joshua V. Dillon. Uncertainty in the variational information bottleneck. In Proceedings of the Conference on Uncertainty in Artificial Intelligence Workshops, 2018.
- Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- Rana Ali Amjad and Bernhard Claus Geiger. Learning representations for neural network-based classification using the information bottleneck principle. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2019.
- András Antos and Ioannis Kontoyiannis. Convergence properties of functional estimates for discrete distributions. Random Structures & Algorithms, 19(3-4):163–193, 2001.
- Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. Metareg: Towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems (NeurIPS), pages 998–1008, 2018.
- Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In Proceedings of the International Conference on Machine Learning (ICML), pages 531–540, 2018.
- Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of the International Conference on Machine Learning (ICML), pages 1613–1622, 2015.
- Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 343–351, 2016.
- Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
- Hao Cheng, Dongze Lian, Shenghua Gao, and Yanlin Geng. Utilizing information bottleneck to evaluate the capability of deep neural networks for image classification. Entropy, 21(5):456, 2019.
- Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
- Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 113–123, 2019.
- John S. Denker, W. R. Gardner, Hans Peter Graf, Donnie Henderson, Richard E. Howard, W. Hubbard, Lawrence D. Jackel, Henry S. Baird, and Isabelle Guyon. Neural network recognizer for hand-written zip code digits. In Advances in Neural Information Processing Systems (NeurIPS), pages 323–331, 1989.
- Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552, 2017.
- Sayna Ebrahimi, Mohamed Elhoseiny, Trevor Darrell, and Marcus Rohrbach. Uncertainty-guided continual learning with bayesian neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
- Adar Elad, Doron Haviv, Yochai Blau, and Tomer Michaeli. Direct validation of the information bottleneck principle for deep nets. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
- Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
- Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 1050–1059, 2016.
- Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning (ICML), page 1180–1189, 2015.
- Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
- Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
- Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
- Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. AugMix: A simple data processing method to improve robustness and uncertainty. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
- Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700–4708, 2017.
- Daniel Kang, Yi Sun, Dan Hendrycks, Tom Brown, and Jacob Steinhardt. Testing robustness against unforeseen adversaries. arXiv preprint arXiv:1908.08016, 2019.
- Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems (NeurIPS), pages 5574–5584, 2017.
- Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
- Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
- Artemy Kolchinsky, Brendan D. Tracey, and Steven Van Kuyk. Caveats for information bottleneck in deterministic scenarios. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
- Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.
- Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems (NeurIPS), pages 950–957, 1992.
- Solomon Kullback and Richard A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951.
- Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), pages 6402–6413, 2017.
- Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
- Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5542–5550, 2017.
- Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018.
- Da Li, Jianshu Zhang, Yongxin Yang, Cong Liu, Yi-Zhe Song, and Timothy M. Hospedales. Episodic training for domain generalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1446–1455, 2019.
- Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
- Massimiliano Mancini, Samuel Rota Bulò, Barbara Caputo, and Elisa Ricci. Best sources forward: domain generalization through source-specific nets. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pages 1353–1357, 2018.
- Colin McDiarmid. On the method of bounded differences, page 148–188. London Mathematical Society Lecture Note Series. Cambridge University Press, 1989.
- Radford M. Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
- Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
- Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron Van den Oord, Sergey Levine, and Pierre Sermanet. Wasserstein dependency measure for representation learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 15578–15588, 2019.
- Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.
- Fengchun Qiao, Long Zhao, and Xi Peng. Learning to learn single domain generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 12556–12565, 2020.
- Tim Salimans and Durk P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 901–909, 2016.
- Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29-30):2696–2711, 2010.
- Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
- Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D. Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems (NeurIPS), pages 13969–13980, 2019.
- Jiaming Song and Stefano Ermon. Understanding the limitations of variational mutual information estimators. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
- Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In Proceedings of the International Conference on Learning Representations Workshops, 2014.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- DJ Strouse and David J. Schwab. The deterministic information bottleneck. Neural Computation, 29(6):1611–1630, 2017.
- Charlie Tang and Russ R. Salakhutdinov. Learning stochastic feedforward neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 530–538, 2013.
- Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proceedings of the Annual Allerton Conference on Communication, Control, and Computing, pages 368–377, 1999.
- Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In Proceedings of the IEEE Information Theory Workshop (ITW), pages 1–5, 2015.
- Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
- Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7167–7176, 2017.
- Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log (n)-sample estimator for entropy and support size, shown optimal via new clts. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC), pages 685–694, 2011.
- Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.
- Matias Vera, Pablo Piantanida, and Leonardo Rey Vega. The role of the information bottleneck in representation learning. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), pages 1580–1584, 2018.
- Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems (NeurIPS), page 5339–5349, 2018.
- Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P. Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems (NeurIPS), pages 10506–10518, 2019.
- Haohan Wang, Zexue He, Zachary C. Lipton, and Eric P. Xing. Learning robust representations by projecting superficial statistics out. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
- Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.
- Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1492–1500, 2017.
- Zhenlin Xu, Deyi Liu, Junlin Yang, and Marc Niethammer. Robust and generalizable visual representation learning via random convolutions. arXiv preprint arXiv:2007.13003, 2020.
- Dong Yin, Raphael Gontijo Lopes, Jon Shlens, Ekin Dogus Cubuk, and Justin Gilmer. A fourier perspective on model robustness in computer vision. In Advances in Neural Information Processing Systems (NeurIPS), pages 13255–13265, 2019.
- Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6023–6032, 2019.
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), 2016.
- Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
- Long Zhao, Xi Peng, Yuxiao Chen, Mubbasir Kapadia, and Dimitris N Metaxas. Knowledge as priors: Cross-modal knowledge generalization for datasets without superior knowledge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6528–6537, 2020.
- Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, and Dimitris N Metaxas. Semantic graph convolutional networks for 3D human pose regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3425–3435, 2019.
