Skip-Thought Vectors

Annual Conference on Neural Information Processing Systems, 2015.


Abstract:

We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice.
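To make the training setup concrete, below is a minimal sketch (not the authors' released code) of the skip-thought objective in PyTorch: a GRU encoder maps the current sentence to a vector, and two GRU decoders are trained to reconstruct the previous and next sentences from that vector. The tiny dimensions, the toy data, and conditioning the decoders only through their initial hidden state (the paper uses conditional GRUs that see the encoder state at every step) are assumptions for illustration.

```python
# Minimal, illustrative sketch of the skip-thought objective (not the paper's code).
# The paper uses 620-d word embeddings, a 2400-d GRU encoder and conditional GRU
# decoders; the small dimensions and random token ids here are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipThought(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hid_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_prev = nn.GRU(emb_dim, hid_dim, batch_first=True)  # decodes the previous sentence
        self.dec_next = nn.GRU(emb_dim, hid_dim, batch_first=True)  # decodes the next sentence
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, cur, prev, nxt):
        # Encode the middle sentence; its final hidden state is the skip-thought vector.
        _, h = self.encoder(self.embed(cur))
        # Teacher-forced decoding of the surrounding sentences, conditioned on h
        # (here only as the decoders' initial state, a simplification).
        dp, _ = self.dec_prev(self.embed(prev[:, :-1]), h)
        dn, _ = self.dec_next(self.embed(nxt[:, :-1]), h)
        loss = (
            F.cross_entropy(self.out(dp).reshape(-1, self.out.out_features), prev[:, 1:].reshape(-1))
            + F.cross_entropy(self.out(dn).reshape(-1, self.out.out_features), nxt[:, 1:].reshape(-1))
        )
        return h.squeeze(0), loss

# Toy usage: random token ids stand in for a (previous, current, next) sentence triple.
model = SkipThought(vocab_size=100)
prev, cur, nxt = (torch.randint(0, 100, (2, 7)) for _ in range(3))
vec, loss = model(cur, prev, nxt)
loss.backward()
```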

Introduction
  • Developing learning algorithms for distributed compositional semantics of words has been a longstanding open problem at the intersection of language understanding and machine learning.
  • Several approaches have been developed for learning composition operators that map word vectors to sentence vectors, including recursive networks [1], recurrent networks [2], convolutional networks [3, 4] and recursive-convolutional methods [5, 6], among others.
  • All of these methods produce sentence representations that are passed to a supervised task and depend on a class label in order to backpropagate through the composition weights.
  • The downside is that, at test time, inference needs to be performed to compute a new vector.
Highlights
  • Developing learning algorithms for distributed compositional semantics of words has been a longstanding open problem at the intersection of language understanding and machine learning
  • Several approaches have been developed for learning composition operators that map word vectors to sentence vectors, including recursive networks [1], recurrent networks [2], convolutional networks [3, 4] and recursive-convolutional methods [5, 6], among others.
  • All of these methods produce sentence representations that are passed to a supervised task and depend on a class label in order to backpropagate through the composition weights.
  • For our final quantitative experiments, we report results on several classification benchmarks which are commonly used for evaluating sentence representation learning methods
  • We evaluated the effectiveness of skip-thought vectors as an off-the-shelf sentence representation with linear classifiers across 8 tasks (a sketch of this evaluation protocol follows this list).
  • We believe our model for learning skip-thought vectors only scratches the surface of possible objectives
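As referenced above, the following is a minimal sketch of what "off-the-shelf with linear classifiers" means in practice: the encoder stays frozen and only a linear model is fit on the extracted sentence vectors. The random features and labels are placeholders standing in for real skip-thought vectors and one of the 8 task datasets; scikit-learn is an assumed choice of tooling, not the authors'.

```python
# Sketch of evaluating frozen sentence vectors with a linear classifier.
# X would normally hold 4800-d combine-skip vectors extracted by the trained
# encoder for a task such as MR, CR, SUBJ, MPQA or TREC; here it is random.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4800))    # placeholder "skip-thought" features
y = rng.integers(0, 2, size=200)    # placeholder binary task labels

clf = LogisticRegression(max_iter=1000)            # linear classifier, no encoder fine-tuning
print(cross_val_score(clf, X, y, cv=5).mean())     # cross-validated accuracy
```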
Methods
  • The SICK semantic relatedness comparison includes the SemEval 2014 submissions Illinois-LH [18], UNAL-NLP [19], The Meaning Factory [20] and ECNU [21].
  • Mean vectors [22], DT-RNN [23], SDT-RNN [23], LSTM [22], bidirectional LSTM [22] and the Dependency Tree-LSTM [22] are also compared.
  • The skip-thought variants evaluated are bow, uni-skip, bi-skip, combine-skip and combine-skip+COCO, with Pearson's r scores of 0.7823, 0.8477, 0.8405, 0.8584 and 0.8655 respectively; the corresponding Spearman's ρ and MSE values appear in Table 3 (left). These metrics are computed as in the sketch after this list.
  • The remaining accuracy/F1 pairs (75.0/82.7, 76.1/82.7, 75.6/83.0, 77.4/84.1, 80.4/86.0) come from the Microsoft Paraphrase Corpus results in Table 3 (right).
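The sketch below, referenced in the list above, shows one way to compute the reported relatedness metrics from pairs of sentence vectors. Following the paper, each pair (u, v) is represented by the component-wise product u * v and absolute difference |u - v|; the ridge regressor and the random vectors and scores are stand-ins for the paper's actual regressor and data.

```python
# Hedged sketch of the SICK relatedness evaluation: pairwise features from two
# sentence vectors, a linear regressor, and the three reported metrics
# (Pearson's r, Spearman's rho, MSE). All data below is synthetic.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
u = rng.normal(size=(500, 300))            # vector of sentence A (placeholder dimensionality)
v = rng.normal(size=(500, 300))            # vector of sentence B
gold = rng.uniform(1, 5, size=500)         # gold relatedness scores in [1, 5]

feats = np.hstack([np.abs(u - v), u * v])  # |u - v| and u * v, as in the paper
model = Ridge(alpha=1.0).fit(feats[:400], gold[:400])
pred = model.predict(feats[400:])

r, _ = pearsonr(gold[400:], pred)
rho, _ = spearmanr(gold[400:], pred)
mse = np.mean((gold[400:] - pred) ** 2)
print(f"Pearson r={r:.3f}  Spearman rho={rho:.3f}  MSE={mse:.3f}")
```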
Conclusion
  • The authors evaluated the effectiveness of skip-thought vectors as an off-the-shelf sentence representation with linear classifiers across 8 tasks.
  • Many of the methods the authors compare against were evaluated on only one task.
  • Many variations have yet to be explored, including (a) deep encoders and decoders, (b) larger context windows, (c) encoding and decoding paragraphs, and (d) other encoders, such as convnets.
  • It is likely that further exploration of this space will result in even higher-quality representations.
Tables
  • Table 1: Summary statistics of the BookCorpus dataset [9]. We use this corpus to train our model.
  • Table 2: In each example, the first sentence is a query while the second sentence is its nearest neighbour. Nearest neighbours were scored by cosine similarity from a random sample of 500,000 sentences from our corpus.
  • Table 3: Left: Test set results on the SICK semantic relatedness subtask. The evaluation metrics are Pearson's r, Spearman's ρ, and mean squared error. The first group of results are SemEval 2014 submissions, while the second group are results reported by [22]. Right: Test set results on the Microsoft Paraphrase Corpus. The evaluation metrics are classification accuracy and F1 score. Top: recursive autoencoder variants. Middle: the best published results on this dataset. Table 3 (left) presents our results. First, we observe that our models are able to outperform all previous systems from the SemEval 2014 competition. This highlights that skip-thought vectors learn representations that are well suited for semantic relatedness. Our results are comparable to LSTMs whose representations are trained from scratch on this task. Only the dependency tree-LSTM of [22] performs better than our results. We note that the dependency tree-LSTM relies on parsers whose training data is very expensive to collect and does not exist for all languages. We also observe that using features learned from an image-sentence embedding model on COCO gives an additional performance boost, resulting in a model that performs on par with the dependency tree-LSTM. To get a feel for the model outputs, Table 4 shows example cases of test set pairs. Our model is able to accurately predict relatedness on many challenging cases. On some examples, it fails to pick up on small distinctions that drastically change a sentence's meaning, such as tricks on a motorcycle versus tricking a person on a motorcycle. Table 3 (right) presents our results, from which we can observe the following: (1) skip-thoughts alone outperform recursive nets with dynamic pooling when no hand-crafted features are used, (2) when other features are used, recursive nets with dynamic pooling work better, and (3) when skip-thoughts are combined with basic pairwise statistics, they become competitive with the state of the art, which incorporates much more complicated features and hand-engineering. This is a promising result, as many of the sentence pairs have very fine-grained details that signal whether they are paraphrases.
  • Table 4: Example predictions from the SICK test set. GT is the ground truth relatedness, scored between 1 and 5. The last few results show examples where slight changes in sentence structure result in large changes in relatedness which our model was unable to score correctly.
  • Table 5: COCO test-set results for image-sentence retrieval experiments. R@K is Recall@K (high is good). Med r is the median rank (low is good). A sketch of how these retrieval metrics are computed follows this list.
  • Table 6: Classification accuracies on several standard benchmarks. Results are grouped as follows: (a) bag-of-words models; (b) supervised compositional models; (c) Paragraph Vector (unsupervised learning of sentence representations); (d) ours. Best results overall are bold while best results outside of group (b) are underlined.
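As noted under Table 5, the retrieval metrics are Recall@K and the median rank of the ground-truth item. A minimal sketch of how they can be computed from a score matrix is shown below; the random scores are placeholders, and placing the correct match on the diagonal is an assumption for illustration.

```python
# Sketch of Recall@K and median rank ("Med r") for image-sentence retrieval.
# scores[i, j] is the similarity between query i and candidate j; the correct
# match for query i is assumed to sit at index i (the diagonal).
import numpy as np

def retrieval_metrics(scores, ks=(1, 5, 10)):
    order = np.argsort(-scores, axis=1)                       # candidates sorted by score, best first
    ranks = np.array([np.where(order[i] == i)[0][0] + 1       # 1-indexed rank of the true match
                      for i in range(len(scores))])
    recalls = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    return recalls, float(np.median(ranks))                   # high R@K is good, low Med r is good

scores = np.random.default_rng(0).normal(size=(100, 100))     # placeholder similarity matrix
print(retrieval_metrics(scores))
```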
Funding
  • This work was supported by NSERC, Samsung, CIFAR, Google and ONR Grant N00014-14-1-0232
Reference
  • Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
  • Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735– 1780, 1997.
  • Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. ACL, 2014.
  • Yoon Kim. Convolutional neural networks for sentence classification. EMNLP, 2014.
  • Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. SSST-8, 2014.
  • Han Zhao, Zhengdong Lu, and Pascal Poupart. Self-adaptive hierarchical sentence model. IJCAI, 2015.
  • Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. ICML, 2014.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR, 2013.
  • Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015.
  • Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In EMNLP, pages 1700– 1709, 2013.
  • Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP, 2014.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.
  • Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS Deep Learning Workshop, 2014.
  • Tomas Mikolov, Quoc V Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013.
  • Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR, 2014.
  • Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
  • Alice Lai and Julia Hockenmaier. Illinois-lh: A denotational and distributional approach to semantics. SemEval 2014, 2014.
  • Sergio Jimenez, George Duenas, Julia Baquero, Alexander Gelbukh, Av Juan Dios Bátiz, and Av Mendizábal. Unal-nlp: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. SemEval 2014, 2014.
  • Johannes Bjerva, Johan Bos, Rob van der Goot, and Malvina Nissim. The meaning factory: Formal semantics for recognizing textual entailment and determining semantic similarity. SemEval 2014, page 642, 2014.
  • Jiang Zhao, Tian Tian Zhu, and Man Lan. Ecnu: One stone two birds: Ensemble of heterogenous measures for semantic relatedness and textual entailment. SemEval 2014, 2014.
  • Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. ACL, 2015.
  • Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. Grounded compositional semantics for finding and describing images with sentences. TACL, 2014.
  • Richard Socher, Eric H Huang, Jeffrey Pennin, Christopher D Manning, and Andrew Y Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS, 2011.
  • Andrew Finch, Young-Sook Hwang, and Eiichiro Sumita. Using machine translation evaluation techniques to determine sentence-level semantic equivalence. In IWP, 2005.
  • Dipanjan Das and Noah A Smith. Paraphrase identification as probabilistic quasi-synchronous recognition. In ACL, 2009.
  • Stephen Wan, Mark Dras, Robert Dale, and Cécile Paris. Using dependency-based features to take the "para-farce" out of paraphrase. In Proceedings of the Australasian Language Technology Workshop, 2006.
  • Nitin Madnani, Joel Tetreault, and Martin Chodorow. Re-examining machine translation metrics for paraphrase identification. In NAACL, 2012.
  • Yangfeng Ji and Jacob Eisenstein. Discriminative improvements to distributional sentence similarity. In EMNLP, pages 891–896, 2013.
  • Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. SemEval-2014, 2014.
  • Bill Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th international conference on Computational Linguistics, 2004.
  • A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
  • Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. Associating neural word embeddings with deep image representations using fisher vectors. In CVPR, 2015.
  • Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). ICLR, 2015.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755. 2014.
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
  • Sida Wang and Christopher D Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In ACL, 2012.
  • Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.