# A Bayesian Nonparametrics View into Deep Representations

NeurIPS 2020

Abstract

We investigate neural network representations from a probabilistic perspective. Specifically, we leverage Bayesian nonparametrics to construct models of neural activations in Convolutional Neural Networks (CNNs) and latent representations in Variational Autoencoders (VAEs). This allows us to formulate a tractable complexity measure for di...

Introduction

- Neural networks that differ only in initial parameter values converge to different minima of the cost function.
- The authors focus on two goals: characterizing the sets of representations that are effectively reachable by convolutional networks, and uncovering structure in the latent spaces learned by variational autoencoders.
- To construct such characterizations, the authors adopt Dirichlet Process Gaussian Mixture Models (DP-GMMs) as density models for deep representations.
- The authors' main contributions are: (1) they propose probabilistic models for neural representations and use them to characterize sets of learned representations; (2) they show that memorizing nets learn vastly more complex representations than networks trained on real data; (3) they demonstrate markedly different effects of two common forms of regularization on the complexity of learned representations; and (4) they characterize latent spaces learned by β-VAEs and MMD-VAEs, demonstrating marked differences in the representational capacity of their aggregated posteriors.
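
For readers unfamiliar with DP-GMMs, the generative model can be written in its standard stick-breaking form. This is a textbook formulation offered as orientation only; the concentration parameter \alpha and the base measure G_0 are generic symbols, and this summary does not state which priors the authors actually use.

```latex
\begin{aligned}
\beta_k &\sim \mathrm{Beta}(1, \alpha), & \pi_k &= \beta_k \prod_{j<k} (1 - \beta_j), \\
(\mu_k, \Sigma_k) &\sim G_0, & c_i \mid \pi &\sim \mathrm{Categorical}(\pi), \\
x_i \mid c_i &\sim \mathcal{N}(\mu_{c_i}, \Sigma_{c_i}).
\end{aligned}
```

Because the number of components with non-negligible weight \pi_k is not fixed in advance, the fitted mixture can grow as complex as the data (here, a set of neural representations) requires.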

Highlights

- Neural networks that differ only in initial parameter values converge to different minima of the cost function
- We focus on two goals: characterizing sets of representations that are effectively reachable by convolutional networks and uncovering structure in latent spaces learned by variational autoencoders
- Our main contributions are: (1) we propose probabilistic models for neural representations and use them to characterize sets of learned representations, (2) we show that memorizing nets learn vastly more complex representations than networks trained on real data, (3) we demonstrate markedly different effects of two common forms of regularization on the complexity of learned representations and (4) we characterize latent spaces learned by β-Variational Autoencoders (VAEs) and Maximum Mean Discrepancy (MMD)-VAEs, demonstrating marked differences in the representational capacity of their aggregated posteriors
- We presented a Bayesian Nonparametrics framework for investigating neural representations
- The main strength of this probabilistic approach is that it allows us to investigate representations that are effectively reachable by gradient-based training, rather than quantifying only the theoretical model complexity. We used it to compare complexity of representations learned by Convolutional Neural Networks (CNNs) and to explore structure of latent spaces learned by VAEs

Methods

- The authors employ DP-GMMs to investigate representational complexity in CNNs that can exploit patterns in data and in networks that are forced to memorize random labels.
- For a trained VAE, fitting a DP-GMM to its latent codes Dz gives them a collapsed Gibbs sampling (CGS) trace from which they can recover the posterior predictive p(z∗ | Dz) over the latent space learned by this particular VAE.
- The authors use these inferred distributions as proxies to investigate aggregated posteriors.
- For notational simplicity the authors drop the conditioning on Dz in the analysis below and write p(z∗) for the DP-GMM posterior predictive.
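
The inference step in the list above can be illustrated with a toy collapsed Gibbs sampler. This is a minimal sketch, not the authors' implementation: it assumes one-dimensional data, a known component variance and a conjugate normal prior on component means, whereas the paper models multivariate representations. All function names and hyperparameter values (`cluster_predictive`, `gibbs_sweep`, `posterior_predictive`, `alpha`, `mu0`, `tau0_sq`, `sigma_sq`) are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def cluster_predictive(x, n, total, mu0=0.0, tau0_sq=4.0, sigma_sq=0.25):
    """Predictive density of x under a cluster holding n points that sum to total.

    Assumed toy model: component means ~ N(mu0, tau0_sq), known variance sigma_sq.
    With n == 0 this is the prior predictive used when opening a new cluster.
    """
    post_var = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
    post_mean = post_var * (mu0 / tau0_sq + total / sigma_sq)
    return normal_pdf(x, post_mean, post_var + sigma_sq)


def gibbs_sweep(data, assign, counts, sums, alpha, rng):
    """One collapsed Gibbs sweep: resample every point's cluster label in place."""
    for i, x in enumerate(data):
        k = assign[i]
        counts[k] -= 1  # remove point i from its current cluster
        sums[k] -= x
        if counts[k] == 0:
            del counts[k], sums[k]
        labels = list(counts)
        # Existing clusters are weighted by size, a new cluster by alpha.
        weights = [counts[c] * cluster_predictive(x, counts[c], sums[c]) for c in labels]
        weights.append(alpha * cluster_predictive(x, 0, 0.0))
        pick = rng.choices(range(len(weights)), weights=weights)[0]
        k_new = labels[pick] if pick < len(labels) else max(labels, default=-1) + 1
        assign[i] = k_new
        counts[k_new] = counts.get(k_new, 0) + 1
        sums[k_new] = sums.get(k_new, 0.0) + x


def posterior_predictive(x, counts, sums, alpha):
    """p(z*) given one Gibbs state: existing clusters plus a potential new one."""
    n = sum(counts.values())
    dens = alpha * cluster_predictive(x, 0, 0.0)
    for c, n_c in counts.items():
        dens += n_c * cluster_predictive(x, n_c, sums[c])
    return dens / (n + alpha)


rng = random.Random(0)
data = [rng.gauss(-2.0, 0.5) for _ in range(20)] + [rng.gauss(2.0, 0.5) for _ in range(20)]
assign = [0] * len(data)  # start with every point in a single cluster
counts, sums = {0: len(data)}, {0: sum(data)}
trace = []
for _ in range(30):
    gibbs_sweep(data, assign, counts, sums, alpha=1.0, rng=rng)
    trace.append((dict(counts), dict(sums)))  # one entry of the Gibbs trace
print("clusters in last state:", len(counts))
```

Each stored state in `trace` defines its own mixture density, so `posterior_predictive` can be evaluated per Gibbs step or averaged over steps to approximate p(z∗).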

Conclusion

- The authors presented a Bayesian Nonparametrics framework for investigating neural representations.
- The main strength of this probabilistic approach is that it allows them to investigate representations that are effectively reachable by gradient-based training, rather than quantifying only the theoretical model complexity.
- The authors used it to compare the complexity of representations learned by CNNs and to explore the structure of latent spaces learned by VAEs. Their results show marked differences between memorizing networks and networks that learn on true data, as well as between two forms of regularization, namely dropout and image augmentation.
- The authors uncover cases, such as dropout nets, where learned representations are sensitive to network initialization, raising doubts about whether capturing the semantics of network units is useful in these settings.

Summary

## Objectives:

- In this article the authors aim to go beyond pairwise similarities and characterize neural representations from a probabilistic perspective.
- Rather than focusing mainly on network similarity, the goal is to compare networks with respect to the complexity of effectively learnable representations or the structure of the learned latent space.

Related work

- Several recent works explored the similarity of representations learned by neural networks. Raghu et al. [2017] construct neurons' representations as vectors of their responses over a fixed set of inputs. This differs from the typical notion of a neural representation, understood as a vector of activations in a network layer given a single input example. They show that representations learned by networks trained from different initializations exhibit similarity in canonical directions. A follow-up work by Morcos et al. [2018] proposes an alternative way to subsume correlation in canonical directions. They study the similarity of neural representations in memorizing and learning networks, compare the similarity of representations in wide and narrow networks, and investigate training dynamics in RNNs. More recently, Kornblith et al. [2019] proposed a kernel-based similarity index that more reliably captures correspondence between network representations. This allowed them, among other things, to pinpoint depth-related pathologies in convolutional networks. The main difference between these works and our approach is that we do not seek to construct a similarity score for pairs of layer representations. We instead investigate distributions of neural representations learned across many trained networks and study aggregated posteriors in deep generative models. Rather than focusing mainly on network similarity, our goal is to compare networks with respect to the complexity of effectively learnable representations or the structure of the learned latent space. This requires a more flexible tool than a similarity score, which in our case is a nonparametric mixture model. A work more akin to ours was presented by Montavon et al. [2011], whose aim was to verify whether successive network layers construct representations that are increasingly good at solving the underlying task. Still, their analysis sheds no light on the complexity of the set of representations that can be effectively reached by a specific network architecture and training regime.

Funding

- Research presented in this work was supported by funds assigned to AGH University of Science and Technology by the Polish Ministry of Science and Higher Education
- This research was supported in part by PL-Grid Infrastructure

Study subjects and analysis

samples: 105

We then calculate the mean, minimum and maximum KL divergence across the remaining Gibbs steps. In each step we take 105 samples from the posterior predictive.
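
The procedure described here can be sketched as a Monte Carlo estimate of KL(p || q) = E_p[log p(z) - log q(z)], computed once per retained Gibbs step and then summarized. This is an illustrative sketch under stated assumptions: the posterior predictive at each step is stood in for by a one-dimensional Gaussian mixture, and the reference distribution q is taken to be a standard normal, which this excerpt does not specify. The mixture parameters in `steps`, the sample count and all function names are hypothetical.

```python
import math
import random
from statistics import mean


def log_normal_pdf(z, mu, var):
    """Log-density of a univariate normal."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (z - mu) ** 2 / var)


def mixture_log_pdf(z, weights, means, variances):
    """Log-density of a Gaussian mixture, computed with log-sum-exp for stability."""
    terms = [math.log(w) + log_normal_pdf(z, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def sample_mixture(weights, means, variances, rng):
    """Draw one sample: pick a component by weight, then sample from it."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], math.sqrt(variances[k]))


def mc_kl(weights, means, variances, log_q, n_samples, rng):
    """Monte Carlo estimate of KL(p || q), with p the mixture and samples drawn from p."""
    total = 0.0
    for _ in range(n_samples):
        z = sample_mixture(weights, means, variances, rng)
        total += mixture_log_pdf(z, weights, means, variances) - log_q(z)
    return total / n_samples


def summarize(kl_per_step):
    """Mean, minimum and maximum KL across Gibbs steps."""
    return mean(kl_per_step), min(kl_per_step), max(kl_per_step)


def log_prior(z):
    """Assumed reference distribution q: a standard normal."""
    return log_normal_pdf(z, 0.0, 1.0)


rng = random.Random(0)
# Hypothetical mixture parameters standing in for three retained Gibbs steps.
steps = [
    ([0.5, 0.5], [-1.0, 1.0], [0.3, 0.3]),
    ([0.6, 0.4], [-0.9, 1.1], [0.3, 0.4]),
    ([0.5, 0.5], [-1.1, 0.9], [0.4, 0.3]),
]
kls = [mc_kl(w, m, v, log_prior, 2000, rng) for w, m, v in steps]
print("mean/min/max KL:", summarize(kls))
```

The spread between the minimum and maximum gives a rough sense of how much the estimate varies along the Gibbs trace; a larger `n_samples` reduces the Monte Carlo noise within each step.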

Reference

- Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron C. Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 233–242, 2017.
- Hassan Ashtiani, Shai Ben-David, Nicholas J. A. Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Nearly tight sample complexity bounds for learning mixtures of Gaussians via sample compression schemes. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 3-8 December 2018, Montréal, Canada, pages 3416–3425, 2018.
- Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
- Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
- Shuyang Gao, Rob Brekelmans, Greg Ver Steeg, and Aram Galstyan. Auto-encoding total correlation explanation. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 1157–1166, 2019.
- Subhashis Ghosal and Aad Van der Vaart. Fundamentals of nonparametric Bayesian inference. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2017.
- L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 80–89, 2018.
- Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012.
- Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
- Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
- Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey E. Hinton. Similarity of neural network representations revisited. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 3519–3529, 2019.
- Jun S. Liu, Wing Hung Wong, and Augustine Kong. Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika, 81(1):27–40, 1994.
- Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 3730–3738. IEEE Computer Society, 2015.
- Jeffrey W. Miller and Matthew T. Harrison. A simple example of Dirichlet process mixture inconsistency for the number of components. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 199–206, 2013.
- Jeffrey W. Miller and Matthew T. Harrison. Inconsistency of Pitman-Yor process mixtures for the number of components. Journal of Machine Learning Research, 15(96):3333–3370, 2014.
- Jeffrey W. Miller and Matthew T. Harrison. Mixture models with a prior on the number of components. Journal of the American Statistical Association, 113(521):340–356, 2018.
- Grégoire Montavon, Mikio L. Braun, and Klaus-Robert Müller. Kernel analysis of deep networks. Journal of Machine Learning Research, 12(78):2563–2581, 2011.
- Ari S. Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 3-8 December 2018, Montréal, Canada, pages 5732–5741, 2018.
- Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
- Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6076–6085, 2017.
- Greg Ver Steeg and Aram Galstyan. Discovering structure in high-dimensional data through correlation explanation. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 577–585, 2014.
- Ilya O. Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schölkopf. Wasserstein autoencoders. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
- Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791, 2020.
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
- Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.
- Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Balancing learning and inference in variational autoencoders. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 5885–5892, 2019.
