
# Phase Transitions for the Information Bottleneck in Representation Learning

ICLR (2020)


Abstract

In the Information Bottleneck (IB), when tuning the relative strength between compression and prediction terms, how do the two terms behave, and what's their relationship with the dataset and the learned representation? In this paper, we set out to answer these questions by studying multiple phase transitions in the IB objective: IB_β[p(z...

Introduction

- The Information Bottleneck (IB) objective (Tishby et al, 2000):

  IB_β[p(z|x)] := I(X; Z) − β I(Y; Z)    (1)

  explicitly trades off model compression, measured by I(X; Z) (with I(·; ·) denoting mutual information), against predictive performance, measured by I(Y; Z), using the Lagrange multiplier β. Here X and Y are observed random variables, and Z is a learned representation of X.
- From Eq (1) the authors see that when β → 0 the objective encourages I(X; Z) = 0, which leads to a trivial representation Z that is independent of X, while when β → +∞ it reduces to a maximum likelihood objective that does not constrain the information flow.
- Between these two extremes, how will the IB objective behave?
- In Wu et al (2019), the authors observe and study the learnability transition, i.e. the β value such that the IB objective transitions from a trivial global minimum to learning a nontrivial representation.
- To answer the full question, the authors need to consider the full range of β
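To make Eq (1) concrete, here is a minimal sketch that evaluates the objective for discrete variables. The joint table `P_XY` and the two encoders `TRIVIAL` and `HARD` are hypothetical toy values chosen for illustration, not data from the paper, and mutual information is measured in nats.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a 2-D joint probability table."""
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    mi = 0.0
    for i, r in enumerate(joint):
        for j, p in enumerate(r):
            if p > 0.0:
                mi += p * math.log(p / (row[i] * col[j]))
    return mi


def ib_objective(p_xy, enc, beta):
    """Return (IB_beta, I(X;Z), I(Y;Z)) for joint p_xy[x][y] and encoder enc[x][z]."""
    n_x, n_y, n_z = len(p_xy), len(p_xy[0]), len(enc[0])
    p_x = [sum(r) for r in p_xy]
    # p(x, z) = p(x) p(z|x)
    p_xz = [[p_x[x] * enc[x][z] for z in range(n_z)] for x in range(n_x)]
    # p(y, z) = sum_x p(x, y) p(z|x), since Z depends on Y only through X
    p_yz = [[sum(p_xy[x][y] * enc[x][z] for x in range(n_x))
             for z in range(n_z)] for y in range(n_y)]
    i_xz = mutual_information(p_xz)
    i_yz = mutual_information(p_yz)
    return i_xz - beta * i_yz, i_xz, i_yz


# Hypothetical toy problem: 4 inputs, 2 labels, 2 representation values.
P_XY = [[0.225, 0.025], [0.2, 0.05], [0.05, 0.2], [0.025, 0.225]]
TRIVIAL = [[0.5, 0.5] for _ in range(4)]                       # Z independent of X
HARD = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]        # Z groups inputs in pairs

results = {}
for beta in (0.5, 10.0):
    loss_trivial, _, _ = ib_objective(P_XY, TRIVIAL, beta)
    loss_hard, i_xz, i_yz = ib_objective(P_XY, HARD, beta)
    results[beta] = (loss_trivial, loss_hard, i_xz, i_yz)
    print(f"beta={beta}: trivial={loss_trivial:.4f}, hard={loss_hard:.4f}")
```

Because Z depends on Y only through X, the data processing inequality gives I(Y; Z) ≤ I(X; Z). For β ≤ 1 no encoder can therefore score below the trivial encoder's value of 0, and a nontrivial encoder can only win once β is large enough.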

Highlights

- Based on the definition of IB phase transitions, we introduce a quantity G[p(z|x)] and use it to prove a theorem giving a practical condition for Information Bottleneck phase transitions
- We show that our theory and algorithm give tight matches with the observed phase transitions in categorical datasets, predict the onset of learning new classes and class difficulty in MNIST, and predict prominent transitions in CIFAR10 experiments (Section 6)
- We introduce the definition for Information Bottleneck phase transitions, and based on it derive a formula that gives a practical condition for Information Bottleneck phase transitions
- We reveal the close interplay between the Information Bottleneck objective, the dataset and the learned representation, as each phase transition is learning a nonlinear maximum correlation component in the orthogonal space of the learned representation
- We present an algorithm for finding the phase transitions, and show that it gives tight matches with observed phase transitions in categorical datasets, predicts onset of learning new classes and class difficulty in MNIST, and predicts prominent transitions in CIFAR10 experiments
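The "maximum correlation" mentioned in the highlights can be illustrated for discrete variables, assuming it is meant in the Hirschfeld-Gebelein-Rényi sense discussed in the cited Anantharam et al (2013). The sketch below computes the squared maximal correlation between X and Y as the second-largest eigenvalue of QᵀQ, where Q[x][y] = p(x, y) / sqrt(p(x) p(y)). The function name and the toy table are hypothetical, and the sketch does not attempt the paper's restriction to the space orthogonal to a learned representation.

```python
import math


def maximal_correlation_sq(p_xy, n_iter=500):
    """Squared HGR maximal correlation of a joint table p_xy[x][y].

    Assumes every marginal probability p(x) and p(y) is strictly positive.
    """
    n_x, n_y = len(p_xy), len(p_xy[0])
    p_x = [sum(row) for row in p_xy]
    p_y = [sum(p_xy[x][y] for x in range(n_x)) for y in range(n_y)]
    # Q[x][y] = p(x, y) / sqrt(p(x) p(y)); its largest singular value is 1.
    q = [[p_xy[x][y] / math.sqrt(p_x[x] * p_y[y]) for y in range(n_y)]
         for x in range(n_x)]
    # M = Q^T Q with the top eigenvector sqrt(p_y) (eigenvalue 1) deflated away.
    m = [[sum(q[x][a] * q[x][b] for x in range(n_x)) - math.sqrt(p_y[a] * p_y[b])
          for b in range(n_y)] for a in range(n_y)]
    # Power iteration for the largest remaining eigenvalue, which is rho^2.
    v = [float(i + 1) for i in range(n_y)]
    rho_sq = 0.0
    for _ in range(n_iter):
        w = [sum(m[a][b] * v[b] for b in range(n_y)) for a in range(n_y)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm == 0.0:  # nothing left after deflation, e.g. X and Y independent
            return 0.0
        v = [wi / norm for wi in w]
        rho_sq = sum(v[a] * sum(m[a][b] * v[b] for b in range(n_y))
                     for a in range(n_y))
    return rho_sq


# Hypothetical toy joint distribution (4 inputs, 2 labels).
P_XY = [[0.225, 0.025], [0.2, 0.05], [0.05, 0.2], [0.025, 0.225]]
INDEPENDENT = [[0.25, 0.25], [0.25, 0.25]]

rho_sq_toy = maximal_correlation_sq(P_XY)
rho_sq_indep = maximal_correlation_sq(INDEPENDENT)
print(rho_sq_toy, rho_sq_indep)
```

Power iteration is used only to keep the sketch dependency-free; a singular value decomposition of Q would give the same quantity directly.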

Conclusion

- The authors observe and study the phase transitions in IB as β is varied.
- The authors further interpret the formula for the phase transition condition via Jensen’s inequality and representational maximum correlation.
- The authors reveal the close interplay between the IB objective, the dataset and the learned representation, as each phase transition is learning a nonlinear maximum correlation component in the orthogonal space of the learned representation.
- The authors present an algorithm for finding the phase transitions, and show that it gives tight matches with observed phase transitions in categorical datasets, predicts onset of learning new classes and class difficulty in MNIST, and predicts prominent transitions in CIFAR10 experiments.
- The authors believe the approach will be applicable to other “trade-off” objectives, like β-VAE (Higgins et al, 2017) and InfoDropout (Achille & Soatto, 2018a), where the model’s ability to predict is balanced against a measure of complexity.

- Table 1: Class confusion matrix used in CIFAR10 experiments, reproduced from (Wu et al, 2019). The value in row i, column j is the probability that an example of class i is labeled as class j. The mean confusion across the classes is 20%.
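As a reading aid for the caption above, the sketch below shows how a row-stochastic confusion matrix of this kind can be summarized and used to corrupt labels. The 3-class matrix `CONFUSION` is a hypothetical stand-in chosen so that its mean confusion is 20%; it is not the CIFAR10 matrix from Wu et al (2019), whose entries are not given here.

```python
import random


def mean_confusion(matrix):
    """Average probability, over classes, of assigning a label other than the true one."""
    n = len(matrix)
    return sum(1.0 - matrix[i][i] for i in range(n)) / n


def corrupt_labels(labels, matrix, rng):
    """Resample each true label i from row i of the confusion matrix."""
    classes = list(range(len(matrix)))
    return [rng.choices(classes, weights=matrix[y])[0] for y in labels]


# Hypothetical matrix: row i, column j = probability of labeling class i as class j.
CONFUSION = [
    [0.90, 0.07, 0.03],
    [0.10, 0.80, 0.10],
    [0.05, 0.25, 0.70],
]

rng = random.Random(0)                      # fixed seed for reproducibility
true_labels = [0, 1, 2] * 1000
noisy_labels = corrupt_labels(true_labels, CONFUSION, rng)
flip_rate = sum(t != n for t, n in zip(true_labels, noisy_labels)) / len(true_labels)
print(mean_confusion(CONFUSION), flip_rate)
```

With balanced classes, the empirical fraction of flipped labels should land close to the mean confusion of the matrix.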

Related work

- The Information Bottleneck Method (Tishby et al, 2000) provides a tabular method based on the Blahut-Arimoto (BA) Algorithm (Blahut, 1972) to numerically solve the IB functional for the optimal encoder distribution P(Z|X), given the trade-off parameter β and the cardinality of the representation variable Z. This work has been extended in a variety of directions, including to the case where all three variables X, Y, Z are multivariate Gaussians (Chechik et al, 2005), cases of variational bounds on the IB and related functionals for amortized learning (Alemi et al, 2016; Achille & Soatto, 2018a; Fischer, 2018), and a more generalized interpretation of the constraint on model complexity as a Kolmogorov Structure Function (Achille et al, 2018). Previous theoretical analyses of IB include Rey & Roth (2012), which looks at IB through the lens of copula functions, and Shamir et al (2010), which starts to tackle the question of how to bound generalization with IB. We will make practical use of the original IB algorithm, as well as the amortized bounds of the Variational Information Bottleneck (Alemi et al, 2016) and the Conditional Entropy Bottleneck (Fischer, 2018).
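The tabular method mentioned above can be sketched with the standard self-consistent IB updates of Tishby et al (2000), in which the encoder is repeatedly set to p(z|x) ∝ p(z) exp(−β KL(p(y|x) ‖ p(y|z))). This is a minimal illustration of that classical iteration, not the phase-transition algorithm of the paper. The toy distribution, the initial encoder `INIT` and the helper names are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def ib_iterate(p_x, p_y_given_x, enc, beta, n_iter=200):
    """Run the self-consistent IB updates starting from encoder enc[x][z].

    Assumes every representation value z keeps nonzero probability mass.
    """
    n_x, n_y, n_z = len(p_x), len(p_y_given_x[0]), len(enc[0])
    for _ in range(n_iter):
        # p(z) = sum_x p(x) p(z|x)
        p_z = [sum(p_x[x] * enc[x][z] for x in range(n_x)) for z in range(n_z)]
        # p(y|z) = sum_x p(y|x) p(z|x) p(x) / p(z)
        p_y_given_z = [
            [sum(p_y_given_x[x][y] * enc[x][z] * p_x[x] for x in range(n_x)) / p_z[z]
             for y in range(n_y)]
            for z in range(n_z)
        ]
        # p(z|x) proportional to p(z) exp(-beta * KL(p(y|x) || p(y|z)))
        new_enc = []
        for x in range(n_x):
            w = [p_z[z] * math.exp(-beta * kl(p_y_given_x[x], p_y_given_z[z]))
                 for z in range(n_z)]
            total = sum(w)
            new_enc.append([wi / total for wi in w])
        enc = new_enc
    return enc


def info_xz(p_x, enc):
    """I(X;Z) in nats for marginal p_x and encoder enc[x][z]."""
    n_x, n_z = len(p_x), len(enc[0])
    p_z = [sum(p_x[x] * enc[x][z] for x in range(n_x)) for z in range(n_z)]
    return sum(p_x[x] * enc[x][z] * math.log(enc[x][z] / p_z[z])
               for x in range(n_x) for z in range(n_z) if enc[x][z] > 0.0)


# Hypothetical toy problem: 4 equally likely inputs, 2 labels, |Z| = 2.
P_X = [0.25, 0.25, 0.25, 0.25]
P_Y_GIVEN_X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
INIT = [[0.6, 0.4], [0.55, 0.45], [0.45, 0.55], [0.4, 0.6]]  # slightly non-uniform start

i_xz_small_beta = info_xz(P_X, ib_iterate(P_X, P_Y_GIVEN_X, INIT, beta=0.5))
i_xz_large_beta = info_xz(P_X, ib_iterate(P_X, P_Y_GIVEN_X, INIT, beta=20.0))
print(i_xz_small_beta, i_xz_large_beta)
```

In this toy run the small-β encoder collapses to a representation that carries no information about X, while the large-β encoder separates the inputs into two groups. A qualitative change of this kind between the two regimes is what the phase-transition analysis aims to locate.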

Phase transitions, where key quantities change discontinuously with varying relative strength in the two-term trade-off, have been observed in many different learning domains, for multiple learning objectives. In Rezende & Viola (2018), the authors observe phase transitions in the latent representation of β-VAE for varying β. Strouse & Schwab (2017b) utilize the kink angle of the phase transitions in the Deterministic Information Bottleneck (DIB) (Strouse & Schwab, 2017a) to determine the optimal number of clusters for geometric clustering. Tegmark & Wu (2019) explicitly consider critical points in binary classification tasks using a discrete information bottleneck with a non-convex

Contributions

- Introduces a definition for IB phase transitions as a qualitative change of the IB loss landscape, and shows that the transitions correspond to the onset of learning new classes
- Provides two perspectives to understand the formula, revealing that each IB phase transition is finding a component of maximum correlation between X and Y orthogonal to the learned representation, in close analogy with canonical-correlation analysis in linear settings
- Identifies a qualitative change of the IB loss landscape w.r.t. p(z|x) for varying β as IB phase transitions
- Introduces a quantity G and uses it to prove a theorem giving a practical condition for IB phase transitions

Reference

- Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018a.
- Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018b.
- Alessandro Achille, Glen Mbeng, and Stefano Soatto. The dynamics of differential learning I: Information-dynamics and task reachability. arXiv preprint arXiv:1810.02440, 2018.
- Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
- Venkat Anantharam, Amin Gohari, Sudeep Kamath, and Chandra Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv preprint arXiv:1304.6133, 2013.
- Richard Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972.
- Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6(Jan):165–188, 2005.
- Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
- Ian Fischer. The conditional entropy bottleneck, 2018. URL openreview.net/forum?id=rkVOXhAqY7.
- Anirudh Goyal, Riashat Islam, Daniel Strouse, Zafarali Ahmed, Matthew Botvinick, Hugo Larochelle, Sergey Levine, and Yoshua Bengio. Infobot: Transfer and exploration via the information bottleneck. arXiv preprint arXiv:1901.10902, 2019.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Sy2fzU9gl.
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. URL https://arxiv.org/abs/1412.6980.
- Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, CIFAR, 2009.
- Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow. arXiv preprint arXiv:1810.00821, 2018.
- Mélanie Rey and Volker Roth. Meta-gaussian information bottleneck. In Advances in Neural Information Processing Systems, pp. 1916–1924, 2012.
- Danilo Jimenez Rezende and Fabio Viola. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.
- Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29-30):2696–2711, 2010.
- Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019.
- DJ Strouse and David J Schwab. The deterministic information bottleneck. Neural Computation, 29(6):1611–1630, 2017a.
- DJ Strouse and David J Schwab. The information bottleneck and geometric clustering. arXiv preprint arXiv:1712.09657, 2017b.
- Max Tegmark and Tailin Wu. Pareto-optimal data compression for binary classification tasks. arXiv preprint arXiv:1908.08961, 2019.
- https://www.perimeterinstitute.ca/videos/information-theory-deep-neural-networks-statistical-physics-aspects/, 2018.
- Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
- Tailin Wu, Ian Fischer, Isaac Chuang, and Max Tegmark. Learnability for the information bottleneck. arXiv preprint arXiv:1907.07331, 2019.
- S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- Pablo Zegers. Fisher information properties. Entropy, 17(7):4918–4939, 2015.
