# Learning deep representations by mutual information estimation and maximization

arXiv: Machine Learning, Volume abs/1808.06670, 2018.

Abstract:

In this work, we perform unsupervised learning of representations by maximizing mutual information between an input and the output of a deep neural network encoder. Importantly, we show that structure matters: incorporating knowledge about locality of the input to the objective can greatly influence a representation's suitability for ...

Introduction

- One core objective of deep learning is to discover useful representations, and the simple idea explored here is to train a representation-learning function, i.e. an encoder, to maximize the mutual information (MI) between its inputs and outputs.
- The authors combine MI maximization with prior matching in a manner similar to adversarial autoencoders (AAE, Makhzani et al, 2015) to constrain representations according to desired statistical properties.
- This approach is closely related to the infomax optimization principle (Linsker, 1988; Bell & Sejnowski, 1995), so the authors call the method Deep InfoMax (DIM).

Highlights

- One core objective of deep learning is to discover useful representations, and the simple idea explored here is to train a representation-learning function, i.e. an encoder, to maximize the mutual information (MI) between its inputs and outputs
- We introduce two new measures of representation quality, one based on Mutual Information Neural Estimation (MINE, Belghazi et al, 2018) and a neural dependency measure (NDM) based on the work by Brakel & Bengio (2017), and we use these to bolster our comparison of Deep InfoMax (DIM) to different unsupervised methods
- Deep InfoMax (DIM) follows Mutual Information Neural Estimation (MINE) in using a neural network to estimate MI, though we find that the generator is unnecessary
- We investigate Jensen-Shannon divergence (JSD) and infoNCE in our experiments, and find that using infoNCE often outperforms JSD on downstream tasks, though this effect diminishes with more challenging data
- We found that DIM with the JSD loss is insensitive to the number of negative samples, and outperforms infoNCE as the number of negative samples becomes smaller
- Our results show that infoNCE tends to perform best, but differences between infoNCE and JSD diminish with larger datasets
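
To make the JSD versus infoNCE comparison above concrete, here is a minimal NumPy sketch of the two objectives computed from a matrix of critic scores. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy score matrices, and the convention that `scores[i, j]` pairs the representation of sample `i` with the features of sample `j` (diagonal entries positive, off-diagonal entries negative) are all hypothetical.

```python
import numpy as np


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)


def jsd_objective(scores):
    """JSD-style MI objective from an (N, N) critic score matrix.

    Assumption: diagonal entries score positive (paired) samples and
    off-diagonal entries score negative (mismatched) samples.
    """
    n = scores.shape[0]
    pos = np.diag(scores)
    neg = scores[~np.eye(n, dtype=bool)]
    return np.mean(-softplus(-pos)) - np.mean(softplus(neg))


def infonce_objective(scores):
    """infoNCE objective: mean log-softmax of the positive within each row."""
    row_max = scores.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    return np.mean(np.diag(scores) - log_norm)


# Toy critics: one that cannot tell pairs apart, one that scores positives higher.
uninformative = np.zeros((4, 4))
informative = 5.0 * np.eye(4)

for name, s in [("uninformative", uninformative), ("informative", informative)]:
    print(name, "JSD:", jsd_objective(s), "infoNCE:", infonce_objective(s))
```

Both objectives increase when the critic separates positives from negatives. Note that the infoNCE normalizer sums over every entry in a row, so its value depends directly on how many negatives are available, whereas the JSD form averages positives and negatives separately.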

Methods

- Note that the authors take CPC to mean ordered autoregression using summary features to predict “future” local features, independent of the contrastive loss used to evaluate the predictions (JSD, infoNCE, or DV).
- See App. (A.2) for details of the neural net architectures used in the experiments.

Results

- The authors chose a crop size of 25% of the image size in width and depth with a stride of 12.5% of the image size (e.g., 8 × 8 crops with 4 × 4 strides for CIFAR10, 16 × 16 crops with 8 × 8 strides for STL-10), so that there were a total of 7 × 7 local features.
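
The 7 × 7 grid follows from simple sliding-window arithmetic. The short sketch below checks it; the helper name is hypothetical, and it assumes no padding and a 64 × 64 input for STL-10 (the random crop size mentioned for that dataset).

```python
def crop_grid(image_size, crop_frac=0.25, stride_frac=0.125):
    """Return (crop, stride, positions per side) for a strided-crop layout.

    Assumption: crops are taken without padding, so the number of positions
    along one side is (image_size - crop) // stride + 1.
    """
    crop = int(image_size * crop_frac)
    stride = int(image_size * stride_frac)
    positions = (image_size - crop) // stride + 1
    return crop, stride, positions


for name, size in [("CIFAR10", 32), ("STL-10", 64)]:
    crop, stride, positions = crop_grid(size)
    print(f"{name}: {crop}x{crop} crops, stride {stride}, "
          f"{positions}x{positions} local features")
```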

Conclusion

- The authors introduced Deep InfoMax (DIM), a new method for learning unsupervised representations by maximizing mutual information, allowing for representations that contain locally-consistent information across structural “locations”.
- This provides a straightforward and flexible way to learn representations that perform well on a variety of tasks.
- The authors believe that this is an important direction in learning higher-level representations

Summary

## Introduction:

- One core objective of deep learning is to discover useful representations, and the simple idea explored here is to train a representation-learning function, i.e. an encoder, to maximize the mutual information (MI) between its inputs and outputs.
- The authors combine MI maximization with prior matching in a manner similar to adversarial autoencoders (AAE, Makhzani et al, 2015) to constrain representations according to desired statistical properties.
- This approach is closely related to the infomax optimization principle (Linsker, 1988; Bell & Sejnowski, 1995), so the authors call the method Deep InfoMax (DIM).

## Objectives:

The authors' goal is to train this network such that useful information about the input is extracted from the high-level features.

## Methods:

- Note that the authors take CPC to mean ordered autoregression using summary features to predict “future” local features, independent of the contrastive loss used to evaluate the predictions (JSD, infoNCE, or DV).
- See App. (A.2) for details of the neural net architectures used in the experiments.

## Results:

The authors chose a crop size of 25% of the image size in width and depth with a stride of 12.5% of the image size (e.g., 8 × 8 crops with 4 × 4 strides for CIFAR10, 16 × 16 crops with 8 × 8 strides for STL-10), so that there were a total of 7 × 7 local features.

## Conclusion:

- The authors introduced Deep InfoMax (DIM), a new method for learning unsupervised representations by maximizing mutual information, allowing for representations that contain locally-consistent information across structural “locations”.
- This provides a straightforward and flexible way to learn representations that perform well on a variety of tasks.
- The authors believe that this is an important direction in learning higher-level representations

- Table 1: Classification accuracy (top 1) results on CIFAR10 and CIFAR100. DIM(L) (i.e., with the local-only objective) outperforms all other unsupervised methods presented by a wide margin. In addition, DIM(L) approaches or even surpasses a fully-supervised classifier with similar architecture
- Table 2: Classification accuracy (top 1) results on Tiny ImageNet and STL-10. For Tiny ImageNet,
- Table 3: Comparisons of DIM with Contrastive Predictive Coding (CPC, Oord et al, 2018). These experiments used a strided-crop architecture similar to the one used in Oord et al (2018). For
- Table 4: Extended comparisons on CIFAR10. Linear classification results using SVM are over five runs. MS-SSIM is estimated by training a separate decoder using the fixed representation as input and minimizing the L2 loss with the original input. Mutual information estimates were done using MINE
- Table 5: Augmenting infoNCE DIM with additional structural information – adding coordinate prediction tasks or occluding input patches when computing the global feature vector in DIM can improve the classification accuracy, particularly with the highly-compressed global features
- Table 6: Global DIM network architecture
- Table 7: Local DIM concat-and-convolve network architecture
- Table 8: Local DIM encoder-and-dot architecture for global feature
- Table 9: Local DIM encoder-and-dot architecture for local features
- Table 10: Prior matching network architecture
- Table 11: Generation scores on the Tiny Imagenet dataset for non-saturating GAN with contractive penalty (NS-GAN-CP), Wasserstein GAN with gradient penalty (WGAN-GP) and our method. Our encoder was penalized using CP

Related work

- There are many popular methods for learning representations. Classic methods, such as independent component analysis (ICA, Bell & Sejnowski, 1995) and self-organizing maps (Kohonen, 1998), generally lack the representational capacity of deep neural networks. More recent approaches include deep volume-preserving maps (Dinh et al, 2014; 2016), deep clustering (Xie et al, 2016; Chang et al, 2017), noise as targets (NAT, Bojanowski & Joulin, 2017), and self-supervised or co-learning (Doersch & Zisserman, 2017; Dosovitskiy et al, 2016; Sajjadi et al, 2016).

Generative models are also commonly used for building representations (Vincent et al, 2010; Kingma et al, 2014; Salimans et al, 2016; Rezende et al, 2016; Donahue et al, 2016), and mutual information (MI) plays an important role in the quality of the representations they learn. In generative models that rely on reconstruction (e.g., denoising, variational, and adversarial autoencoders, Vincent et al, 2008; Rifai et al, 2012; Kingma & Welling, 2013; Makhzani et al, 2015), the reconstruction error can be related to the MI as follows:

$$I_e(X, Y) = H_e(X) - H_e(X \mid Y) \geq H_e(X) - R_{e,d}(X \mid Y), \tag{1}$$

where $X$ and $Y$ denote the input and output of an encoder which is applied to inputs sampled from some source distribution. $R_{e,d}(X \mid Y)$ denotes the expected reconstruction error of $X$ given the codes $Y$. $H_e(X)$ and $H_e(X \mid Y)$ denote the marginal and conditional entropy of $X$ in the distribution formed by applying the encoder to inputs sampled from the source distribution. Thus, in typical settings, models with reconstruction-type objectives provide some guarantees on the amount of information encoded in their intermediate representations. Similar guarantees exist for bi-directional adversarial models (Dumoulin et al, 2016; Donahue et al, 2016), which adversarially train an encoder / decoder to match their respective joint distributions or to minimize the reconstruction error (Chen et al, 2016).
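
The bound in Eq. (1) can be checked numerically on a small discrete example. The sketch below is only an illustration: the joint distribution is made up, and it assumes the reconstruction error is the expected negative log-likelihood of the input under a decoder q(x|y), which is one common reading of a reconstruction-type objective. Under that assumption, the error can never fall below the conditional entropy, which gives the inequality.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index inputs x, columns index codes y.
p_xy = np.array([[0.30, 0.05],
                 [0.10, 0.15],
                 [0.05, 0.35]])
p_x = p_xy.sum(axis=1)       # marginal over inputs
p_y = p_xy.sum(axis=0)       # marginal over codes
p_x_given_y = p_xy / p_y     # true posterior; each column sums to 1

# Entropies in nats.
H_x = -np.sum(p_x * np.log(p_x))
H_x_given_y = -np.sum(p_xy * np.log(p_x_given_y))
mi = H_x - H_x_given_y

# An imperfect decoder q(x | y): the true posterior blurred toward uniform.
q_x_given_y = 0.7 * p_x_given_y + 0.3 / p_xy.shape[0]

# Assumed reconstruction error: expected negative log-likelihood under q.
recon_error = -np.sum(p_xy * np.log(q_x_given_y))
lower_bound = H_x - recon_error

print("I(X, Y)            =", mi)
print("H(X) - recon error =", lower_bound)
```

A better decoder tightens the bound; with the exact posterior as decoder, the two printed values coincide.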

Funding

- RDH received partial support from IVADO, NIH grants 2R01EB005846, P20GM103472, P30GM122734, and R01EB020407, and NSF grant 1539067
- AF received partial support from NIH grants R01EB020407, R01EB006841, P20GM103472, P30GM122734

Study subjects and analysis

imaging datasets: 4

DIM opens new avenues for unsupervised learning of representations and is an important step towards flexible formulations of representation learning objectives for specific end-goals. We test Deep InfoMax (DIM) on four imaging datasets to evaluate its representational properties:

• CIFAR10 and CIFAR100 (Krizhevsky & Hinton, 2009): two small-scale labeled datasets composed of 32 × 32 images with 10 and 100 classes respectively.

• Tiny ImageNet: A reduced version of ImageNet (Krizhevsky & Hinton, 2009) images scaled down to 64 × 64 with a total of 200 classes.

• STL-10 (Coates et al, 2011): a dataset derived from ImageNet composed of 96 × 96 images with a mixture of 100000 unlabeled training examples and 500 labeled examples per class. We use data augmentation with this dataset, taking random 64 × 64 crops and flipping horizontally during unsupervised learning.

• CelebA (Yang et al, 2015, Appendix A.5 only): An image dataset composed of faces labeled with 40 binary attributes

The authors show in App. (A.8) that this prior matching can be used alone to train a generator of image data.

Reference

- Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
- Luís B Almeida. Linear and nonlinear ICA based on mutual information. The Journal of Machine Learning Research, 4:1297–1318, 2003.
- Martin Arjovsky and Leon Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017.
- Suzanna Becker. An information-theoretic unsupervised learning algorithm for neural networks. University of Toronto, 1992.
- Suzanna Becker. Mutual information maximization: models of cortical self-organization. Network: Computation in neural systems, 7(1):7–31, 1996.
- Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062, ICML’2018, 2018.
- Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129–1159, 1995.
- Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 35(8):1798–1828, 2013.
- Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. arXiv preprint arXiv:1704.05310, 2017.
- Philemon Brakel and Yoshua Bengio. Learning independent features with adversarial nets for non-linear ica. arXiv preprint arXiv:1710.05050, 2017.
- Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep adaptive image clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5879–5887, 2017.
- Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
- Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180, 2016.
- Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223, 2011.
- Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
- Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
- Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In The IEEE International Conference on Computer Vision (ICCV), 2017.
- Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
- Jeff Donahue, Philipp Krahenbuhl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
- M.D Donsker and S.R.S Varadhan. Asymptotic evaluation of certain markov process expectations for large time, iv. Communications on Pure and Applied Mathematics, 36(2):183–212, 1983.
- Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence, 38(9):1734–1747, 2016.
- Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
- Abel Gonzalez-Garcia, Joost van de Weijer, and Yoshua Bengio. Image-to-image translation for cross-domain disentanglement. arXiv preprint arXiv:1805.09730, 2018.
- Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Scholkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
- Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.
- Michael Gutmann and Aapo Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.
- Michael U Gutmann and Aapo Hyvarinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13 (Feb):307–361, 2012.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. Openreview, 2016.
- Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
- R Devon Hjelm, Athul Paul Jacob, Tong Che, Adam Trischler, Kyunghyun Cho, and Yoshua Bengio. Boundary-seeking generative adversarial networks. In International Conference on Learning Representations, 2018.
- Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning discrete representations via information maximizing self-augmented training. arXiv preprint arXiv:1702.08720, 2017.
- Aapo Hyvarinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4):411–430, 2000.
- Aapo Hyvarinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653, 2018.
- Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
- Diederik Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
- Teuvo Kohonen. The self-organizing map. Neurocomputing, 21(1-3):1–6, 1998.
- Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
- Ralph Linsker. Self-organization in a perceptual network. IEEE Computer, 21(3):105–117, 1988. doi: 10.1109/2.36. URL https://doi.org/10.1109/2.36.
- Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
- Zhuang Ma and Michael Collins. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. arXiv preprint arXiv:1809.01812, 2018.
- Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
- Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In International Conference on Machine Learning, pp. 3478–3487, 2018.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.
- Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in neural information processing systems, pp. 2265–2273, 2013.
- Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.
- Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
- Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. One-shot generalization in deep generative models. arXiv preprint arXiv:1603.05106, 2016.
- Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive autoencoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pp. 833–840. Omnipress, 2011.
- Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 1163–1171, 2016.
- Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.
- Jurgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.
- Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. Independently controllable features. arXiv preprint arXiv:1708.01289, 2017.
- Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pp. 1096–1103. ACM, 2008.
- Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408, 2010.
- Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, volume 2, pp. 1398–1402.
- Laurenz Wiskott and Terrence J Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural computation, 14(4):715–770, 2002.
- Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pp. 478–487, 2016.
- Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. From facial parts responses to face detection: A deep learning approach. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3676–3684, 2015.
- Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
