Wasserstein Dependency Measure for Representation Learning

Corey Lynch
Aäron van den Oord

Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 15578-15588, 2019.


Abstract:

Mutual information maximization has emerged as a powerful learning objective for unsupervised representation learning, obtaining state-of-the-art performance in applications such as object recognition, speech recognition, and reinforcement learning. However, such approaches are fundamentally limited since a tight lower bound on mutual information […]

Introduction
Highlights
  • Recent success in supervised learning can arguably be attributed to the paradigm shift from engineering representations to learning representations [LeCun et al., 2015]
  • We propose a new objective, Wasserstein predictive coding (WPC), which is a lower bound on both contrastive predictive coding and the dual Wasserstein dependency measure, obtained by keeping both the Lipschitz class of functions and the log-exp term (a sketch of the objective follows this list)
  • Our first main experimental contribution is to show the effect of dataset size on the performance of mutual information-based representation learning, in particular of contrastive predictive coding (CPC)
  • We proposed a new representation learning objective as an alternative to mutual information
  • We explore the fundamental limitations of prior mutual information-based estimators, present several problem settings where these limitations manifest themselves and degrade representation learning performance, and show that WPC mitigates these issues to a large extent
  • We choose to keep the log-exp term, since it reduces the variance of sample-based gradient estimates, which we found to improve performance in practice
  • Our results indicate that Lipschitz continuity is highly beneficial for representation learning, and an exciting direction for future work is to develop better techniques for enforcing Lipschitz continuity
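As a hedged sketch of the WPC objective referenced above (notation is ours: the critic is factored as f(x)^T g(y) in the usual InfoNCE form, and the paper's exact indexing of negatives may differ), keeping both the log-exp term of CPC and the restriction of the encoders to a Lipschitz class F_L gives

\[
I_{\mathrm{WPC}}(X;Y) \;=\; \sup_{f,\, g \,\in\, \mathcal{F}_L}\; \mathbb{E}\!\left[ \frac{1}{n} \sum_{i=1}^{n} \log \frac{e^{f(x_i)^{\top} g(y_i)}}{\frac{1}{n} \sum_{j=1}^{n} e^{f(x_i)^{\top} g(y_j)}} \right],
\]

where the expectation is over batches of n pairs (x_i, y_i) drawn from p(x, y) and the negatives y_j are the other samples in the batch. Dropping the Lipschitz restriction recovers CPC, while dropping the log-exp term recovers the dual form of the Wasserstein dependency measure.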
Methods
  • The goal of the representation learning task is to learn representation encoders f ∈ F and g ∈ F such that the representations f(x) and g(y) capture the underlying generative factors of variation represented by the latent variable z.
  • The authors measure the quality of the representations by training linear classifiers to predict the underlying latent variables z (a minimal probe sketch follows this list).
  • This methodology is standard in the self-supervised representation learning literature.
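A minimal sketch of such a linear-probe evaluation, assuming embeddings from the frozen encoder f are precomputed as arrays and one latent factor z is evaluated at a time (function and variable names are illustrative, not from the paper's code):

    from sklearn.linear_model import LogisticRegression

    def linear_probe_accuracy(train_feats, train_z, test_feats, test_z):
        # Fit a linear classifier on frozen embeddings f(x) and report test
        # accuracy for one ground-truth latent factor z (e.g. one character class).
        probe = LogisticRegression(max_iter=1000)
        probe.fit(train_feats, train_z)
        return probe.score(test_feats, test_z)

Representation quality can then be summarized, for example, as the probe accuracy averaged over the latent factors.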
Results
  • The authors kept the training dataset size fixed at 50,000 samples
  • For one or two characters, the exponential of the mutual information (55, and 55 × 52 = 2860) is smaller than the dataset size, but for three characters it is 55 × 52 × 48 = 137,280, which exceeds the dataset size (the arithmetic is spelled out after this list)
  • In that regime CPC is no longer a good lower-bound estimator of the mutual information, and representation learning performance drops significantly
  • This confirms the hypothesis that mutual information-based representation learning suffers when the mutual information is large
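For concreteness, assuming each character is uniform over its alphabet and the paired image y is a deterministic function of x, the mutual information is the sum of the per-character entropies:

\[
I(x;y) \;=\; \sum_{k=1}^{m} \log |A_k|, \qquad e^{I(x;y)} \;=\; \prod_{k=1}^{m} |A_k|.
\]

With alphabet sizes |A_1| = 55, |A_2| = 52, and |A_3| = 48, this gives e^I = 55, 55 × 52 = 2860, and 55 × 52 × 48 = 137,280 for m = 1, 2, 3 characters respectively; only the last exceeds the 50,000-sample training set, which is where the CPC bound becomes loose.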
Conclusion
  • The authors proposed a new representation learning objective as an alternative to mutual information.
  • This objective, which the authors refer to as the Wasserstein dependency measure, uses the Wasserstein distance in place of the KL divergence in mutual information (spelled out after this list).
  • As better regularization methods are developed, the authors expect the quality of representations learned via the Wasserstein dependency measure to improve.
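Spelled out (paraphrasing the definition; notation is ours), mutual information is the KL divergence between the joint and the product of marginals, and the Wasserstein dependency measure replaces it with the Wasserstein distance, whose Kantorovich-Rubinstein dual restricts the critic to 1-Lipschitz functions F_L:

\[
I(X;Y) \;=\; D_{\mathrm{KL}}\big(p(x,y) \,\|\, p(x)\,p(y)\big), \qquad
I_{\mathcal{W}}(X;Y) \;=\; \mathcal{W}\big(p(x,y),\, p(x)\,p(y)\big) \;=\; \sup_{f \in \mathcal{F}_L}\; \mathbb{E}_{p(x,y)}[f(x,y)] \;-\; \mathbb{E}_{p(x)\,p(y)}[f(x,y)].
\]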
Summary
  • Introduction:

    Recent success in supervised learning can arguably be attributed to the paradigm shift from engineering representations to learning representations [LeCun et al., 2015].
  • Representations can be learned via implicit generative methods [Goodfellow et al., 2014, Dumoulin et al., 2016, Donahue et al., 2016, Odena et al., 2017], via explicit generative models [Kingma and Welling, 2013, Rezende and Mohamed, 2015, Dinh et al., 2016, Kingma et al., 2016], and via self-supervised learning [Becker and Hinton, 1992, Doersch et al., 2015, Zhang et al., 2016, Doersch and Zisserman, van den Oord et al., 2018, Wei et al., 2018, Hjelm et al., 2019].
  • Self-supervised learning techniques have demonstrated state-of-the-art performance in speech and image understanding [van den Oord et al., 2018, Hjelm et al., 2019], reinforcement learning [Jaderberg et al., 2016, Dwibedi et al., 2018, Kim et al., 2018], imitation learning [Sermanet et al., 2017, Aytar et al., 2018], and natural language processing [Devlin et al., 2018, Radford et al.].
  • Objectives:

    For SpatialMultiOmniglot, the authors aim to learn f(x) which captures the class of each of the characters in the image.
Tables
  • Table 1: WPC outperforms CPC on the SplitCelebA dataset
Study subjects and analysis
  • Training dataset: 50,000 samples

Datasets: The SpatialMultiOmniglot dataset consists of pairs of images (x, y), each comprising multiple Omniglot characters in a grid, where the characters in y are the next characters in the alphabet of the corresponding characters in x. The Shapes3D dataset is a collection of colored images of an object in a room; each image corresponds to a unique value of the underlying latent variables: object color, wall color, floor color, object shape, object size, and viewing angle. The SplitCelebA dataset consists of pairs of images (x, y) where x and y are the left and right halves of the same CelebA image, respectively.

Performance of CPC and WPC on SpatialMultiOmniglot and StackedMultiOmniglot (fully-connected and convolutional networks): WPC consistently performs better than CPC over different dataset sizes, especially when using fully-connected networks. WPC is also more robust to minibatch size, while CPC's performance drops rapidly when the minibatch size is reduced. As the mutual information is increased, WPC's drop in performance is gradual, while CPC's drop is drastic once the mutual information passes the log of the dataset size.

We were able to control the mutual information in the data by controlling the number of characters in the images, keeping the training dataset size fixed at 50,000 samples. For a small number of characters (1 and 2), CPC achieves near-perfect representation learning; the exponential of the mutual information in these cases is 55 and 55 × 52 = 2860 (the product of the alphabet class sizes), which is smaller than the dataset size. However, when the number of characters is 3, the exponential of the mutual information is 55 × 52 × 48 = 137,280, which is larger than the dataset size. This is the case where CPC is no longer a good lower-bound estimator of the mutual information, and representation learning performance drops significantly, confirming our hypothesis that mutual information-based representation learning indeed suffers when the mutual information is large.

On MultiviewShapes3D, using both fully-connected and convolutional networks, WPC performs consistently better than CPC across multiple dataset and minibatch sizes. A minimal sketch of how the CPC and WPC losses differ in practice follows.
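To make concrete what the comparison varies, here is a minimal PyTorch sketch of the two losses, assuming the critic is factored as f(x)^T g(y) and WPC's Lipschitz constraint is approximated with a WGAN-GP-style gradient penalty; the penalty form, the coefficient lam, and all names are illustrative assumptions, not the paper's implementation:

    import torch
    import torch.nn.functional as F

    def cpc_loss(fx, gy):
        # InfoNCE / CPC loss with the log-exp term kept: matched pairs sit on
        # the diagonal of the [n, n] score matrix f(x_i)^T g(y_j).
        scores = fx @ gy.t()
        labels = torch.arange(fx.size(0), device=fx.device)
        return F.cross_entropy(scores, labels)

    def gradient_penalty(encoder, x):
        # WGAN-GP-style penalty pushing the encoder's input-gradient norm
        # toward 1, used here as an approximate Lipschitz constraint.
        x = x.detach().clone().requires_grad_(True)
        grads = torch.autograd.grad(encoder(x).sum(), x, create_graph=True)[0]
        return ((grads.flatten(1).norm(dim=1) - 1.0) ** 2).mean()

    def wpc_loss(f, g, x, y, lam=10.0):
        # WPC keeps the same contrastive term and adds Lipschitz regularization
        # of both encoders f and g.
        return cpc_loss(f(x), g(y)) + lam * (gradient_penalty(f, x) + gradient_penalty(g, y))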

References
  • Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
  • Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
  • Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 2642–2651. JMLR.org, 2017. URL http://dl.acm.org/citation.cfm?id=3305890.3305954.
  • Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
  • Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
  • Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pages 4743–4751, 2016.
  • Suzanna Becker and Geoffrey E Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356):161, 1992.
  • Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
  • Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
  • Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Donglai Wei, Joseph J. Lim, Andrew Zisserman, and William T. Freeman. Learning and using the arrow of time. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations (ICLR), 2019.
  • Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
  • Debidatta Dwibedi, Jonathan Tompson, Corey Lynch, and Pierre Sermanet. Learning actionable representations from visual observations. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1577–1584. IEEE, 2018.
  • Hyoungseok Kim, Jaekyeom Kim, Yeonwoo Jeong, Sergey Levine, and Hyun Oh Song. Emi: Exploration with mutual information maximizing state and action embeddings. arXiv preprint arXiv:1810.01176, 2018.
  • Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017.
  • Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, and Nando de Freitas. Playing hard exploration games by watching youtube. arXiv preprint arXiv:1805.11592, 2018.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
  • Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 531–540, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/belghazi18a.html.
  • Ben Poole, Sherjil Ozair, Aäron Van den Oord, Alexander A Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, 2019.
  • David McAllester and Karl Statos. Formal limitations on the measurement of mutual information. arXiv preprint arXiv:1811.04251, 2018.
  • Ralph Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.
  • Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129–1159, 1995.
  • Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical review E, 69(6):066138, 2004.
  • Ilya Nemenman, William Bialek, and Rob de Ruyter van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5):056111, 2004.
  • XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
  • Bishnu S Atal and Manfred R Schroeder. Adaptive predictive coding of speech signals. Bell System Technical Journal, 49(8):1973–1986, 1970.
  • Peter Elias. Predictive coding–i. IRE Transactions on Information Theory, 1(1):16–24, 1955.
  • Andy Clark. Whatever next? predictive brains, situated agents, and the future of cognitive science. Behavioral and brain sciences, 36(3):181–204, 2013.
  • Stephanie E Palmer, Olivier Marre, Michael J Berry, and William Bialek. Predictive information in a sensory population. Proceedings of the National Academy of Sciences, 112(22):6908–6913, 2015.
  • Gašper Tkacik and William Bialek. Information processing in living systems. Annual Review of Condensed Matter Physics, 7:89–117, 2016.
  • Karl Friston and Stefan Kiebel. Predictive coding under the free-energy principle. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 364(1521):1211–1221, 2009.
  • Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.