Representation Learning with Contrastive Predictive Coding

arXiv preprint arXiv:1807.03748, 2018.


Abstract:

While supervised learning has enabled great progress in many applications, unsupervised learning has not seen such widespread adoption, and remains an important and challenging endeavor for artificial intelligence. In this work, we propose a universal unsupervised learning approach to extract useful representations from high-dimensional data [...]

Introduction
  • Learning high-level representations from labeled data with layered differentiable models in an end-to-end fashion is one of the biggest successes in artificial intelligence so far.
  • These techniques made manually specified features largely redundant and have greatly improved the state of the art in several real-world applications [1, 2, 3].
  • Unsupervised learning is an important stepping stone towards robust and generic representation learning.
Highlights
  • Learning high-level representations from labeled data with layered differentiable models in an end-to-end fashion is one of the biggest successes in artificial intelligence so far
  • Features that are useful to transcribe human speech may be less suited for speaker identification, or music genre prediction
  • Instead we model a density ratio which preserves the mutual information between $x_{t+k}$ and $c_t$ (Equation 1) as follows: $f_k(x_{t+k}, c_t) \propto \frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})}$
  • In this paper we presented Contrastive Predictive Coding (CPC), a framework for extracting compact latent representations to encode predictions over future observations
  • CPC combines autoregressive modeling and noise-contrastive estimation with intuitions from predictive coding to learn abstract representations in an unsupervised fashion. We tested these representations in a wide variety of domains: audio, images, natural language and reinforcement learning, and achieve strong or state-of-the-art performance when used as stand-alone features
  • The simplicity and low computational requirements of training the model, together with the encouraging results in challenging reinforcement learning domains when used in conjunction with the main loss, are exciting developments towards useful unsupervised learning that applies universally to many more data modalities
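The density ratio above is learned with a contrastive classification loss: the model scores one true future sample against negatives drawn from the batch and is trained to pick out the positive. The paper's log-bilinear score $f_k = \exp(z^T W_k c)$ plugged into this loss can be sketched in numpy; the function and argument names below are illustrative, not from the authors' code:

```python
import numpy as np

def info_nce_loss(z_pos, z_neg, c, W):
    """Contrastive loss sketch: z_pos is the true future embedding,
    z_neg holds (N-1, d) negatives from the batch, c is the context
    vector, and W a learned bilinear map for prediction step k."""
    # log-bilinear score f_k(x, c) = exp(z^T W c), kept in log space
    scores = np.concatenate([[z_pos @ W @ c], z_neg @ W @ c])  # (N,)
    scores -= scores.max()                     # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum())
    return -log_softmax[0]                     # positive sits at index 0
```

Minimizing this loss pushes the score of the positive pair above the scores of the negatives, which is what makes the learned $f_k$ behave like the density ratio.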
Results
  • The authors found that more advanced sentence encoders did not significantly improve the results, which may be due to the simplicity of the transfer tasks, and the fact that bag-of-words models usually perform well on many NLP tasks [48].
Conclusion
  • In this paper the authors presented Contrastive Predictive Coding (CPC), a framework for extracting compact latent representations to encode predictions over future observations.
  • CPC combines autoregressive modeling and noise-contrastive estimation with intuitions from predictive coding to learn abstract representations in an unsupervised fashion.
  • The authors tested these representations in a wide variety of domains: audio, images, natural language and reinforcement learning, and achieve strong or state-of-the-art performance when used as stand-alone features.
  • The simplicity and low computational requirements of training the model, together with the encouraging results in challenging reinforcement learning domains when used in conjunction with the main loss, are exciting developments towards useful unsupervised learning that applies universally to many more data modalities
Tables
  • Table1: LibriSpeech phone and speaker classification results. For phone classification there are 41 possible classes and for speaker classification 251. All models used the same architecture and the same audio input sizes
  • Table2: LibriSpeech phone classification ablation experiments. More details can be found in Section 3.1
  • Table3: ImageNet top-1 unsupervised classification results. *Jigsaw is not directly comparable to the other AlexNet results because of architectural differences
  • Table4: ImageNet top-5 unsupervised classification results. Previous results with MS, Ex, RP and Col were taken from [36] and are the best reported results on this task
  • Table5: Classification accuracy on five common NLP benchmarks. We follow the same transfer learning setup from Skip-thought vectors [26] and use the BookCorpus dataset as source. [40] is an unsupervised approach to learning sentence-level representations. [26] is an alternative unsupervised learning approach. [41] is the same skip-thought model with layer normalization trained for 1M iterations
Related work
  • CPC is a new method that combines predicting future observations (predictive coding) with a probabilistic contrastive loss (Equation 4). This allows us to extract slow features, which maximize the mutual information of observations over long time horizons. Contrastive losses and predictive coding have individually been used in different ways before, which we will now discuss.

    Contrastive loss functions have been used by many authors in the past. For example, the techniques proposed by [21, 22, 23] were based on triplet losses using a max-margin approach to separate positive from negative examples. More recent work includes Time Contrastive Networks [24], which proposes to minimize distances between embeddings from multiple viewpoints of the same scene while maximizing distances between embeddings extracted from different timesteps. In Time Contrastive Learning [25] a contrastive loss is used to predict the segment-ID of multivariate time-series as a way to extract features and perform nonlinear ICA.
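For contrast with CPC's probabilistic loss, the max-margin triplet losses of [21, 22, 23] can be sketched as follows (a minimal numpy illustration of the general idea, not code from those papers):

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Max-margin triplet loss: penalize unless the positive is at
    least `margin` closer to the anchor than the negative is."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(0.0, d_pos - d_neg + margin)    # hinge: zero once satisfied
```

The loss is zero once the margin constraint holds, whereas CPC's loss is a softmax over scores, so it keeps shaping the embedding space rather than separating a single positive/negative pair.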
Funding
  • We found that more advanced sentence encoders did not significantly improve the results, which may be due to the simplicity of the transfer tasks (e.g., in MPQA most datapoints consist of one or a few words), and the fact that bag-of-words models usually perform well on many NLP tasks [48]
Study subjects and analysis
datasets: 4
The authors evaluate with 10-fold cross-validation for MR, CR, Subj, MPQA and use the train/test split for TREC. An L2 regularization weight was chosen via cross-validation (therefore nested cross-validation for the first 4 datasets). The model consists of a simple sentence encoder genc (a 1D-convolution + ReLU + mean-pooling) that embeds a whole sentence into a 2400-dimensional vector z, followed by a GRU (2400 hidden units) which predicts up to 3 future sentence embeddings with the contrastive loss to form c.
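The genc described above (1D-convolution + ReLU + mean-pooling) can be sketched in numpy. The weight shapes and names here are illustrative assumptions; the actual model uses learned weights and a 2400-dimensional output:

```python
import numpy as np

def sentence_encoder(word_embs, conv_w, conv_b):
    """Sketch of a g_enc-style sentence encoder: 1D convolution over
    word embeddings, ReLU, then mean-pooling over time into one
    sentence vector z. Shapes: word_embs (T, d_in),
    conv_w (k, d_in, d_out), conv_b (d_out,)."""
    T, d_in = word_embs.shape
    k = conv_w.shape[0]
    # valid 1D convolution along the time axis
    out = np.stack([
        np.tensordot(word_embs[t:t + k], conv_w, axes=([0, 1], [0, 1]))
        for t in range(T - k + 1)
    ]) + conv_b                               # (T-k+1, d_out)
    out = np.maximum(out, 0.0)                # ReLU
    return out.mean(axis=0)                   # mean-pool -> z, (d_out,)
```

In the full pipeline this z per sentence would feed the GRU, whose context vector c is then scored against future sentence embeddings with the contrastive loss.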

References
  • [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • [2] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • [3] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
  • [4] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3156–3164. IEEE, 2015.
  • [5] Peter Elias. Predictive coding–I. IRE Transactions on Information Theory, 1(1):16–24, 1955.
  • [6] Bishnu S Atal and Manfred R Schroeder. Adaptive predictive coding of speech signals. The Bell System Technical Journal, 49(8):1973–1986, 1970.
  • [7] Rajesh P N Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79, 1999.
  • [8] Karl Friston. A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1456):815–836, 2005.
  • [9] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • [10] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
  • [11] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [12] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
  • [13] Laurenz Wiskott and Terrence J Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
  • [14] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.
  • [15] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
  • [16] Yoshua Bengio and Jean-Sébastien Senécal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19, 2008.
  • [17] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [18] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • [19] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. CoRR, abs/1606.05328, 2016.
  • [20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
  • [21] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539–546. IEEE, 2005.
  • [22] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.
  • [23] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
  • [24] Pierre Sermanet, Corey Lynch, Jasmine Hsu, and Sergey Levine. Time-contrastive networks: Self-supervised learning from multi-view observation. arXiv preprint arXiv:1704.06888, 2017.
  • [25] Aapo Hyvärinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems 29, pages 3765–3773. Curran Associates, Inc., 2016.
  • [26] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302, 2015.
  • [27] Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.
  • [28] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. arXiv preprint arXiv:1505.00687, 2015.
  • [29] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
  • [30] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: an ASR corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5206–5210. IEEE, 2015.
  • [31] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
  • [32] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [33] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • [34] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [35] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
  • [36] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2051–2060, 2017.
  • [37] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
  • [38] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
  • [39] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. CoRR, abs/1601.06759, 2016.
  • [40] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.
  • [41] Lei Jimmy Ba, Ryan Kiros, and Geoffrey E Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
  • [42] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27, 2015.
  • [43] Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 115–124. Association for Computational Linguistics, 2005.
  • [44] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177. ACM, 2004.
  • [45] Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, page 271. Association for Computational Linguistics, 2004.
  • [46] Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210, 2005.
  • [47] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, pages 1–7. Association for Computational Linguistics, 2002.
  • [48] Sida Wang and Christopher D. Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL '12, pages 90–94, 2012.
  • [49] Han Zhao, Zhengdong Lu, and Pascal Poupart. Self-adaptive hierarchical sentence model. In IJCAI, pages 4069–4076, 2015.
  • [50] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
  • [51] Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.
  • [52] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • [53] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent.
  • [54] Ishmael Belghazi, Sai Rajeswar, Aristide Baratin, R Devon Hjelm, and Aaron Courville. MINE: Mutual information neural estimation. 2018.