## AI Insight

An AI-generated summary of this paper.

# Deep Variational Information Bottleneck

International Conference on Learning Representations (ICLR), 2019

Abstract

We present a variational approximation to the information bottleneck of Tishby et al. (1999). This variational approach allows us to parameterize the information bottleneck model using a neural network and leverage the reparameterization trick for efficient training. We call this method "Deep Variational Information Bottleneck", or Deep VIB.

Introduction

- The authors adopt an information theoretic view of deep networks. They regard the internal representation of some intermediate layer as a stochastic encoding Z of the input source X, defined by a parametric encoder p(z|x; θ). Their goal is to learn an encoding that is maximally informative about the target Y, measured by the mutual information between the encoding and the target, I(Z, Y; θ), where

I(Z, Y; θ) = ∫ dz dy p(z, y|θ) log [ p(z, y|θ) / ( p(z|θ) p(y|θ) ) ]   (1)

Given the data processing inequality and the invariance of mutual information to reparameterizations, if this were the only objective the authors could always ensure a maximally informative representation by taking the identity encoding of the data (Z = X); but this is not a useful representation of the data.
- As a baseline, a deterministic network is fit by optimizing an objective that combines the usual cross entropy loss with an extra term that penalizes models for having low-entropy predictive distributions (the confidence penalty of Pereyra et al.).
- The authors present various experimental results, comparing the behavior of standard deterministic networks to stochastic neural networks trained by optimizing the VIB objective.
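Equation 1 can be checked numerically for a small discrete joint distribution (a sketch only; the paper works with continuous densities, so the integral becomes a sum here):

```python
import math

def mutual_information(joint):
    """I(Z;Y) = sum over z, y of p(z,y) * log[ p(z,y) / (p(z) p(y)) ], in nats.

    `joint[z][y]` holds p(z, y) for a discrete joint distribution.
    """
    pz = [sum(row) for row in joint]           # marginal p(z)
    py = [sum(col) for col in zip(*joint)]     # marginal p(y)
    mi = 0.0
    for z, row in enumerate(joint):
        for y, pzy in enumerate(row):
            if pzy > 0:
                mi += pzy * math.log(pzy / (pz[z] * py[y]))
    return mi

# Independent variables carry zero information about each other.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # → 0.0
# A deterministic copy (Z = Y) carries log 2 ≈ 0.693 nats.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # ≈ 0.693
```

The second case illustrates the identity-encoding pathology described above: copying the input maximizes mutual information but compresses nothing.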

Highlights

- We adopt an information theoretic view of deep networks
- A natural and useful constraint to apply is on the mutual information between our encoding and the original data, I(X, Z) ≤ Ic, where Ic is the information constraint
- We propose to use variational inference to construct a lower bound on the information bottleneck (IB) objective in Equation 3
- A series of experiments shows that stochastic neural networks, fit using our VIB method, are robust to overfitting, since VIB finds a representation Z which ignores as many details of the input X as possible
- We recently discovered Chalk et al (2016), who independently developed the same variational lower bound on the IB objective as us
- We investigate if VIB offers similar advantages for ImageNet, a more challenging natural image classification task
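The variational bound from the highlights can be sketched as a single-sample Monte Carlo estimate of the VIB objective, E[−log q(y|z)] + β·KL(p(z|x) ‖ r(z)), for a Gaussian encoder and fixed prior r(z) = N(0, I). This is a minimal illustration, not the paper's implementation; the classifier callable `log_q_y_given_z` is a stand-in for a real decoder network.

```python
import math
import random

def kl_gauss_to_std_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return sum(0.5 * (m * m + s * s - 1.0 - 2.0 * math.log(s))
               for m, s in zip(mu, sigma))

def vib_loss_single_sample(mu, sigma, log_q_y_given_z, beta, rng=random):
    """One-sample estimate of the VIB upper bound:
        -log q(y|z) + beta * KL(p(z|x) || r(z)),
    drawing z via the reparameterization z = mu + sigma * eps, eps ~ N(0, I),
    so gradients flow through mu and sigma."""
    z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    return -log_q_y_given_z(z) + beta * kl_gauss_to_std_normal(mu, sigma)

# Toy usage: a stand-in log-classifier that prefers z near the origin.
log_q = lambda z: -sum(v * v for v in z)
loss = vib_loss_single_sample([0.0, 0.0], [1.0, 1.0], log_q, beta=0.01)
```

As β → 0 the KL term vanishes and the objective reduces to plain maximum likelihood; larger β forces the encoding toward the prior, discarding input details.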

Results

- In the limit where the encoding noise goes to 0, the stochastic network reduces to a deterministic model; the authors therefore also train a deterministic model that has the same form as the stochastic encoder, with the Gaussian layer dropped.
- To demonstrate that the VIB method can achieve competitive classification results, the authors compared against a deterministic MLP trained with various forms of regularization.
- In Figure 1(d) the authors plot the second term in the objective, the upper bound on the mutual information between the images X and the stochastic encoding Z, which in this case is the relative entropy between the encoding and the fixed isotropic unit Gaussian prior.
- The authors will show how training with the VIB objective makes models significantly more robust to such adversarial examples.
- On MNIST, Goodfellow et al (2014) reported that FGS could generate adversarial examples that fooled a maxout network approximately 90% of the time with ε = 0.25, where ε is the magnitude of the perturbation at each pixel.
- For the VIB models, the authors use 12 posterior samples of Z to compute the class label distribution p(y|x).
- Figure 5 plots the accuracy on FGS adversarial examples of the first 1000 images from the MNIST test set as a function of β.
- Figure 6 plots the accuracy on L2 optimization adversarial examples of the first 1000 images from the MNIST test set as a function of β.
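The FGS attack referenced above perturbs each pixel by ε in the direction of the loss gradient's sign. A minimal sketch on a toy logistic model (not the paper's networks) where the input gradient is available in closed form:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fgs_attack(x, w, eps):
    """Fast Gradient Sign (Goodfellow et al.) against a toy logistic model
    p(y=1|x) = sigmoid(w . x), attacking the true label y = 1.
    The input gradient of the loss -log p is -(1 - p) * w, so the attack
    adds eps * sign(gradient) to every coordinate of x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    grad = [-(1.0 - p) * wi for wi in w]
    return [xi + eps * (1 if g > 0 else -1 if g < 0 else 0)
            for xi, g in zip(x, grad)]

x = [1.0, -0.5]
w = [2.0, 1.0]
x_adv = fgs_attack(x, w, eps=0.25)  # ε = 0.25, as in the MNIST experiments
# The perturbation lowers the model's confidence in the true label.
```

Because every pixel moves by exactly ε, FGS is an L∞-bounded attack, which is why the text also reports the separate L2 optimization attack.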

Conclusion

- (Section 4.2.5, ImageNet results and discussion:) VIB improved classification accuracy and adversarial robustness for toy datasets like MNIST.
- The VIB network is more robust to the targeted L2 optimization attack in both magnitude of perturbation and frequency of successful attack.
- There are many possible directions for future work, including: putting the VIB objective at multiple or every layer of a network; testing on real images; using richer parametric marginal approximations, rather than assuming r(z) = N (0, I); exploring the connections to differential privacy (see e.g., Wang et al (2016a); Cuff & Yu (2016)); and investigating open universe classification problems (see e.g., Bendale & Boult (2015)).

- Table 1: Test set misclassification rate on permutation-invariant MNIST using K = 256. We compare our method (VIB) to an equivalent deterministic model using various forms of regularization. The discrepancy between our results for confidence penalty and label smoothing and the numbers reported in (Pereyra et al, 2016) is due to slightly different hyperparameters
- Table 2: Quantitative results showing how the different Inception Resnet V2-based architectures (described in Section 4.2.5) respond to targeted L2 adversarial examples. Determ is the deterministic architecture, IRv2 is the unmodified Inception Resnet V2 architecture, and VIB(0.01) is the VIB architecture with β = 0.01. Successful target is the fraction of adversarial examples that caused the architecture to classify as the target class (soccer ball). Lower is better. L2 and L∞ are the average L2 and L∞ distances between the original images and the adversarial examples. Larger values mean the adversary had to make a larger perturbation to change the class

Related work

- The idea of using information theoretic objectives for deep neural networks was pointed out in Tishby & Zaslavsky (2015b). However, they did not include any experimental results, since their approach for optimizing the IB objective relied on the iterative Blahut Arimoto algorithm, which is infeasible to apply to deep neural networks.

Variational inference is a natural way to approximate the problem. Variational bounds on mutual information have previously been explored in Agakov (2004), though not in conjunction with the information bottleneck objective. Mohamed & Rezende (2015) also explore variational bounds on mutual information, and apply them to deep neural networks, but in the context of reinforcement learning. We recently discovered Chalk et al (2016), who independently developed the same variational lower bound on the IB objective as us. However, they apply it to sparse coding problems, and use the kernel trick to achieve nonlinear mappings, whereas we apply it to deep neural networks, which are computationally more efficient. In addition, we are able to handle large datasets by using stochastic gradient descent, whereas they use batch variational EM.

Experiments

- In order to pick a model with some "headroom" to improve, we decided to use the same architecture as in the (Pereyra et al, 2016) paper, namely an MLP with fully connected layers of the form 784 - 1024 - 1024 - 10, and ReLU activations. (Since we are not exploiting spatial information, this corresponds to the "permutation invariant" version of MNIST.) The performance of this baseline is 1.38% error. (Pereyra et al, 2016) were able to improve this to 1.17% using their regularization technique
- We were able to improve this to 1.13% using our technique, as we explain below
- All models tested achieve over 98.4% accuracy on the unperturbed MNIST test set, so there is no appreciable measurement distortion due to underlying model accuracy
- We see that for a reasonably broad range of β values, the VIB models have significantly better accuracy on the adversarial examples than the deterministic models, which have an accuracy of 0% (the L2 optimization attack is very effective on traditional model architectures)
- The higher layers of the network continue to classify this representation with 80.4% accuracy; conditioned on this extraction, the classification …

Study subjects and analysis

posterior samples: 12

The architectures included a deterministic (base) model trained by MLE; a deterministic model trained with dropout (the dropout rate was chosen on the validation set); and a stochastic model trained with VIB for various values of β. For the VIB models, we use 12 posterior samples of Z to compute the class label distribution p(y|x). This helps ensure that the adversaries can get a consistent gradient when constructing the perturbation, and that they can get a consistent evaluation when checking if the perturbation was successful

samples: 12

We also ran the VIB models in "mean mode", where the σs are forced to be 0. This had no noticeable impact on the results, so all reported results are for stochastic evaluation with 12 samples.
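The 12-sample posterior averaging described above can be sketched as follows; `logits_fn` is a stand-in for the classifier head, and setting all σs to 0 recovers the "mean mode" evaluation:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predictive(mu, sigma, logits_fn, num_samples=12, rng=random):
    """Estimate p(y|x) by averaging the class distribution over
    `num_samples` posterior samples z = mu + sigma * eps (12 in the
    paper's evaluation). With sigma = 0 every sample equals mu, which
    reproduces the deterministic "mean mode"."""
    probs = None
    for _ in range(num_samples):
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        p = softmax(logits_fn(z))
        probs = p if probs is None else [a + b for a, b in zip(probs, p)]
    return [p / num_samples for p in probs]
```

Averaging probabilities (rather than logits) gives the adversary a consistent gradient target and a consistent success check, as the text notes.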

Reference

- Martın Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
- David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. In NIPS, volume 16, 2004.
- Shumeet Baluja, Michele Covell, and Rahul Sukthankar. The virtues of peer pressure: A simple method for discovering high-value mistakes. In Intl. Conf. Computer Analysis of Images and Patterns, 2015.
- Abhijit Bendale and Terrance Boult. Towards open world recognition. In CVPR, 2015.
- William Bialek, Ilya Nemenman, and Naftali Tishby. Predictability, complexity, and learning. Neural computation, 13(11):2409–2463, 2001.
- Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In ICML, 2015.
- Ryan P. Browne and Paul D. McNicholas. Multivariate sharp quadratic bounds via Σ-strong convexity and the fenchel connection. Electronic Journal of Statistics, 9, 2015.
- Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. Arxiv, 2016.
- Matthew Chalk, Olivier Marre, and Gasper Tkacik. Relevant sparse codes with variational information bottleneck. In NIPS, 2016.
- G. Chechik, A. Globerson, N. Tishby, and Y. Weiss. Information bottleneck for Gaussian variables. J. of Machine Learning Research, 6:165–188, 2005.
- Paul Cuff and Lanqing Yu. Differential privacy as a mutual information constraint. In ACM Conference on Computer and Communications Security (CCS), 2016.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.
- Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. In NIPS, 2016.
- Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AI/Statistics, volume 9, pp. 249–256, 2010.
- Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
- Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017. URL https://openreview.net/pdf?id=Sy2fzU9gl.
- Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba Szepesvari. Learning with a strong adversary. CoRR, abs/1511.03034, 2015.
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
- Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In ICLR Workshop, 2017. URL https://openreview.net/pdf?id=S1OufnIlx.
- Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. In ICLR, 2016. URL http://arxiv.org/abs/1511.00830.
- David JC MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003.
- Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In NIPS, pp. 2125–2133, 2015.
- Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. Arxiv, 2016.
- Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In CVPR, 2016.
- Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR, 2015. URL http://arxiv.org/abs/1412.1897.
- Stephanie E Palmer, Olivier Marre, Michael J Berry, and William Bialek. Predictive information in a sensory population. PNAS, 112(22):6908–6913, 2015.
- Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Proceedings of the 1st IEEE European Symposium on Security and Privacy, 2015.
- Gabriel Pereyra, George Tuckery, Jan Chorowski, and Lukasz Kaiser. Regularizing neural networks by penalizing confident output predictions. In ICLR Workshop, 2017. URL https://openreview.net/pdf?id=HyhbYrGYe.
- Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
- Leigh Robinson and Benjamin Graham. Confusing deep convolution networks by relabelling. arXiv preprint 1510.06925, 2015.
- Sara Sabour, Yanshuai Cao, Fartash Faghri, and David J Fleet. Adversarial manipulation of deep representations. In ICLR, 2016.
- Noam Slonim, Gurinder Singh Atwal, Gasper Tkacik, and William Bialek. Information-based clustering. PNAS, 102(51):18297–18302, 2005.
- Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014. URL http://arxiv.org/abs/1312.6199.
- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.
- N Tishby and N Zaslavsky. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop, pp. 1–5, April 2015a.
- N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In The 37th annual Allerton Conf. on Communication, Control, and Computing, pp. 368–377, 1999.
- Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In Information Theory Workshop (ITW), 2015 IEEE, pp. 1–5. IEEE, 2015b.
- Weina Wang, Lei Ying, and Junshan Zhang. On the relation between identifiability, differential privacy and Mutual-Information privacy. IEEE Trans. Inf. Theory, 62:5018–5029, 2016a.
- Weiran Wang, Honglak Lee, and Karen Livescu. Deep variational canonical correlation analysis. arXiv [cs.LG], 11 October 2016b. URL https://arxiv.org/abs/1610.03454.
- This is similar to the information theoretic objective for clustering introduced in Slonim et al. (2005).
- This objective takes the form of a variational autoencoder (Kingma & Welling, 2014), except with the second KL divergence term having an arbitrary weight β.
- This precise setup, albeit with a different motivation, was recently explored in Higgins et al. (2017), who demonstrated that by changing the weight of the variational autoencoder's regularization term, they were able to achieve latent representations that were more capable when it came to zero-shot learning and understanding "objectness". In that work, they motivated their choice to change the relative weightings of the terms in the objective by appealing to notions in neuroscience. Here we demonstrate that appealing to the information bottleneck objective gives a principled motivation and could open the door to better understanding the optimal choice of β and more tools for assessing the importance and tradeoff of both terms.
- This setup (which is identical to our experiments) induces a classifier which is bounded by a quadratic function, which is interesting because the theoretical framework of Fawzi et al. (2016) proves that quadratic classifiers have greater capacity for adversarial robustness than linear functions.
- We now derive an approximate bound using a second-order Taylor series expansion (TSE) of the log-sum-exp (lse) term. The bound can be made proper via Browne & McNicholas (2015); however, the TSE is sufficient to sketch the derivation.
- As indicated, rather than approximating the lse via the TSE, we can construct a sharp quadratic upper bound via Browne & McNicholas (2015). However, this merely changes the S(W μx) scaling in the exponential; the result is still log-quadratic.
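For reference, the second-order Taylor expansion of the log-sum-exp function around a point μ can be written out using the standard identities that its gradient is the softmax vector s(μ) and its Hessian is diag(s) − ss^T (stated here as background, not quoted from the paper):

```latex
\operatorname{lse}(x) \approx \operatorname{lse}(\mu)
  + s(\mu)^\top (x - \mu)
  + \tfrac{1}{2}\, (x - \mu)^\top
    \bigl[\operatorname{diag}(s(\mu)) - s(\mu)\, s(\mu)^\top\bigr]
    (x - \mu),
\qquad
s(\mu)_i = \frac{e^{\mu_i}}{\sum_j e^{\mu_j}}.
```

The Hessian diag(s) − ss^T is positive semi-definite, which is why the quadratic term yields the log-quadratic form mentioned above.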
