What You Can Cram Into A Single amp;!#* Vector: Probing Sentence Embeddings For Linguistic Properties
PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, (2018): 2126-2136
Although much effort has recently been devoted to training high-quality sentence embeddings, we still have a poor understanding of what they are capturing. "Downstream" tasks, often based on sentence classification, are commonly used to evaluate the quality of sentence representations. The complexity of the tasks makes it however difficul...More
PPT (Upload PPT)
- Despite Ray Mooney’s quip that you cannot cram the meaning of a whole %&!$# sentence into a single !#* vector, sentence embedding methods have achieved impressive results in tasks ranging from machine translation (Sutskever et al, 2014; Cho et al, 2014) to entailment detection (Williams et al, 2018), spurring the quest for “universal embeddings” trained once and used in a variety of applications (e.g., Kiros et al, 2015; Conneau et al, 2017; Subramanian et al, 2018).
- The authors grouped Tense, SubjNum and ObjNum with the semantic tasks, since, at least for models that treat words as unanalyzed input units, they must rely on what a sentence denotes, rather than on structural/syntactic information.
- Words are to some extent informative for most tasks, leading to relatively high performance in Tense, SubjNum and ObjNum. Recall that the words containing the probed features are disjoint between train and test partitions, so the authors are not observing a confound here, but rather the effect of the redundancies one expects in natural language data.
- Despite Ray Mooney’s quip that you cannot cram the meaning of a whole %&!$# sentence into a single !#* vector, sentence embedding methods have achieved impressive results in tasks ranging from machine translation (Sutskever et al, 2014; Cho et al, 2014) to entailment detection (Williams et al, 2018), spurring the quest for “universal embeddings” trained once and used in a variety of applications (e.g., Kiros et al, 2015; Conneau et al, 2017; Subramanian et al, 2018)
- An interesting observation in Table 2 is that different encoder architectures trained with the same objective, and achieving similar performance on the training task,4 can lead to linguistically different embeddings, as indicated by the probing tasks
- For the challenging semantic odd man out (SOMO) task, the curves are mostly flat, suggesting that what BiLSTM-max is able to capture about this task is already encoded in its architecture, and further training doesn’t help much
- Adi et al (2017) introduced SentLen, word content (WC) and a word order test, focusing on a bag-of-vectors baseline, an autoencoder and skip-thought. We recast their tasks so that they only require a sentence embedding as input, we extend the evaluation to more tasks, encoders and training objectives, and we relate performance on the probing tasks with that on downstream tasks
- We introduced a set of tasks probing the linguistic knowledge of sentence embedding methods
- We showed that different encoder architectures trained with the same objective with similar performance can result in different embeddings, pointing out the importance of the architecture prior for sentence embeddings
- An interesting observation in Table 2 is that different encoder architectures trained with the same objective, and achieving similar performance on the training task,4 can lead to linguistically different embeddings, as indicated by the probing tasks.
- NMT training leads to encoders that are more linguistically aware than those trained on the NLI data set, despite the fact that the authors confirm the finding of Conneau and colleagues that NLI is best for downstream tasks (Appendix).
- Probing task comparison A good encoder, such as NMT-trained BiLSTM-max, shows generally good performance across probing tasks.
- Performance is still far from the human bounds on TreeDepth, BShift, SOMO and CoordInv. The last 3 tasks ask if a sentence is syntactically or semantically anomalous.
- For the challenging SOMO task, the curves are mostly flat, suggesting that what BiLSTM-max is able to capture about this task is already encoded in its architecture, and further training doesn’t help much.
- The authors recast their tasks so that they only require a sentence embedding as input, the authors extend the evaluation to more tasks, encoders and training objectives, and the authors relate performance on the probing tasks with that on downstream tasks.
- The authors introduced a set of tasks probing the linguistic knowledge of sentence embedding methods.
- The authors further uncovered interesting patterns of correlation between the probing tasks and more complex “downstream” tasks, and presented a set of intriguing findings about the linguistic properties of various embedding methods.
- The authors found that BiLSTM-max embeddings are already capturing interesting linguistic knowledge before training, and that, after training, they detect semantic acceptability without having been exposed to anomalous sentences before.
- The authors hope that the publicly available probing task set will become a standard benchmarking tool of the linguistic properties of new encoders, and that it will stir research towards a better understanding of what they learn.
- Table1: Source and target examples for seq2seq training tasks
- Table2: Probing task accuracies. Classification performed by a MLP with sigmoid nonlinearity, taking pre-learned sentence embeddings as input (see Appendix for details and logistic regression results)
- Adi et al (2017) introduced SentLen, WC and a word order test, focusing on a bag-of-vectors baseline, an autoencoder and skip-thought (all trained on the same data used for the probing tasks). We recast their tasks so that they only require a sentence embedding as input (two of their tasks also require word embeddings, polluting sentencelevel evaluation), we extend the evaluation to more tasks, encoders and training objectives, and we relate performance on the probing tasks with that on downstream tasks. Shi et al (2016) also use 3 probing tasks, including Tense and TopConst. It is not clear that they controlled for the same factors we considered (in particular, lexical overlap and sentence length), and they use much smaller training sets, limiting classifier-based evaluation to logistic regression. Moreover, they test a smaller set of models, focusing on machine translation.
Belinkov et al (2017a), Belinkov et al (2017b) and Dalvi et al (2017) are also interested in understanding the type of linguistic knowledge encoded in sentence and word embeddings, but their focus is on word-level morphosyntax and lexical semantics, and specifically on NMT encoders and decoders. Sennrich (2017) also focuses on NMT systems, and proposes a contrastive test to assess how they handle various linguistic phenomena. Other work explores the linguistic behaviour of recurrent networks and related models by using visualization, input/hidden representation deletion techniques or by looking at the word-by-word behaviour of the network (e.g., Nagamine et al, 2015; Hupkes et al, 2017; Li et al, 2016; Linzen et al, 2016; Kàdàr et al, 2017; Li et al, 2017). These methods, complementary to ours, are not agnostic to encoder architecture, and cannot be used for general-purpose cross-model evaluation.
Study subjects and analysis
language pairs: 3
They consist of an encoder that encodes a source sentence into a fixed-size representation, and a decoder which acts as a conditional language model and that generates the target sentence. We train Neural Machine Translation systems on three language pairs using about 2M sentences from the Europarl corpora (Koehn, 2005). We pick English-French, which involves two similar languages, English-German, involving larger syntactic differences, and English-Finnish, a distant pair
- Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In Proceedings of ICLR Conference Track. Toulon, France. Published online: https://openreview.net/group?id=ICLR.cc/2017/conference.
- Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of ICLR Conference Track. Toulon, France. Published online: https://openreview.net/group?id=ICLR.cc/2017/conference.
- Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. Advances in neural information processing systems (NIPS).
- Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017a. What do neural machine translation models learn about morphology? In Proceedings of ACL. Vancouver, Canada, pages 861–872.
- Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2017b. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of IJCNLP. Taipei, Taiwan, pages 1–10.
- Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
- Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. Proceedings of EMNLP.
- Ronan Collobert and Jason Weston. 200A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine learning. ACM, pages 160–167.
- Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of EMNLP. Copenhagen, Denmark, pages 670–680.
- Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, and Stephan Vogel. 2017. Understanding and improving morphological learning in the neural machine translation decoder. In Proceedings of IJCNLP. Taipei, Taiwan, pages 142–151.
- Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. Proceedings of the 34th International Conference on Machine Learning.
- Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of ICML. Sydney, Australia, pages 1243–1252.
- Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2017. Visualisation and diagnostic classifiers reveal how recurrent and recursive neural networks process hierarchical structure. http://arxiv.org/abs/1711.10203.
- Allan Jabri, Armand Joulin, and Laurens van der Maaten. 2016. Revisiting visual question answering baselines. In Proceedings of ECCV. Amsterdam, the Netherlands, pages 727–739.
- Àkos Kàdàr, Grzegorz Chrupała, and Afra Alishahi. 2017. Representation of linguistic form and function in recurrent neural networks. Computational Linguistics 43(4):761–780.
- Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems. pages 3294–3302.
- Dan Klein and Christopher Manning. 2003. Accurate unlexicalized parsing. In Proceedings of ACL. Sapporo, Japan, pages 423–430.
- Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT summit. volume 5, pages 79–86.
- Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, pages 177– 180.
- Alice Lai and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. In Proceedings of SemEval. Dublin, Ireland, pages 329–334.
- Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2016. Visualizing and understanding neural models in NLP. In Proceedings of NAACL. San Diego, CA, pages 681–691.
- Jiwei Li, Monroe Will, and Dan Jurafsky. 2017. Efficient estimation of word representations in vector space. https://arxiv.org/abs/1612.08220.
- Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4:521– 535.
- Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of LREC. Rekjavik, Iceland, pages 216–223.
- Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of LREC. Miyazaki, Japan.
- Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of NAACL. Atlanta, Georgia, pages 746–751.
- Tasha Nagamine, Michael L. Seltzer, and Nima Mesgarani. 2015. Exploring how deep neural networks form phonemic categories. In Proceedings of INTERSPEECH. Dresden, Germany, pages 1912– 1916.
- Matthew Nelson, Imen El Karoui, Kristof Giber, Xiaofang Yang, Laurent Cohen, Hilda Koopman, Sydney Cash, Lionel Naccache, John Hale, Christophe Pallier, and Stanislas Dehaene. 2017. Neurophysiological dynamics of phrase-structure building during sentence processing. Proceedings of the National Academy of Sciences 114(18):E3669–E3678.
- Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of ACL. Barcelona, Spain, pages 271–278.
- Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of ACL. Berlin, Germany, pages 1525–1534.
- Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP. Doha, Qatar, pages 1532–1543.
- Nghia The Pham, Germán Kruszewski, Angeliki Lazaridou, and Marco Baroni. 2015. Jointly optimizing word representations for lexical and sentential tasks with the C-PHRASE model. In Proceedings of ACL. Beijing, China, pages 971–981.
- Rico Sennrich. 2017. How grammatical is characterlevel neural machine translation? assessing MT quality with contrastive translation pairs. In Proceedings of EACL (Short Papers). Valencia, Spain, pages 376–382.
- Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of EMNLP. Austin, Texas, pages 1526– 1534.
- Richard Socher, Eric Huang, Jeffrey Pennin, Andrew Ng, and Christopher Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of NIPS. Granada, Spain, pages 801–809.
- Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. In International Conference on Learning Representations.
- Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in neural information processing systems. pages 2440–2448.
- Ilya Sutskever, Oriol Vinyals, and Quoc Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS. Montreal, Canada, pages 3104–3112.
- Shuai Tang, Hailin Jin, Chen Fang, Zhaowen Wang, and Virginia R de Sa. 2017. Trimming and improving skip-thought vectors. Proceedings of the 2nd Workshop on Representation Learning for NLP.
- Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. 2017. Deep image prior. https://arxiv.org/abs/1711.10925.
- Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems. pages 2773–2781.
- Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL.
- Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. arXiv preprint arXiv:1606.04199.
- Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of ICCV. Santiago, Chile, pages 19–27.