Linguistic Knowledge and Transferability of Contextual Representations.
North American Chapter of the Association for Computational Linguistics (2019): 1073–1094
Contextual word representations derived from large-scale neural language models are successful across a diverse set of NLP tasks, suggesting that they encode useful and transferable features of language. To shed light on the linguistic knowledge they capture, we study the representations produced by several recent pretrained contextualizers (variants of ELMo, the OpenAI transformer language model, and BERT) with a suite of sixteen diverse probing tasks.
- Pretrained word representations (Mikolov et al., 2013; Pennington et al., 2014) are a key component of state-of-the-art neural NLP models.
- These word vectors are static—a single vector is assigned to each word.
- The broad success of contextual word representations (CWRs) indicates that they encode useful, transferable features of language.
- Their linguistic knowledge and transferability are not yet well understood.
- We see that the OpenAI transformer significantly underperforms the ELMo models and BERT. Given that it is the only model trained in a unidirectional fashion, this reaffirms that bidirectionality is a crucial component for the highest-quality contextualizers (Devlin et al., 2018).
- We study the linguistic knowledge and transferability of contextualized word representations with a suite of sixteen diverse probing tasks.
- For tasks that require specific information not captured by the contextual word representation, we show that learning task-specific contextual features helps to encode the requisite knowledge.
- Our analysis of patterns in the transferability of contextualizer layers shows that the lowest layer of LSTMs encodes the most transferable features, while transformers' middle layers are most transferable.
- We find that higher layers in LSTMs are more task-specific, while transformer layers do not exhibit this same monotonic increase in task-specificity.
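The layer-transferability analysis above can be sketched as follows: score every (layer, task) pair with a probe, then average each layer's accuracy across tasks. The accuracy numbers below are invented for illustration, not taken from the paper.

```python
# Sketch of the layer-transferability comparison: the most transferable
# layer is the one with the highest probing accuracy averaged across tasks.
# All accuracy values here are made-up placeholders.
import numpy as np

# rows: contextualizer layers (0 = lowest); cols: probing tasks
probe_acc = np.array([
    [0.90, 0.85, 0.80],  # layer 0
    [0.88, 0.87, 0.84],  # layer 1
    [0.82, 0.80, 0.86],  # layer 2 (more task-specific)
])

mean_acc = probe_acc.mean(axis=1)      # average accuracy per layer
best_layer = int(mean_acc.argmax())    # most transferable layer
print(best_layer)  # -> 1
```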
- The authors' probing models are trained on the representations produced by the individual layers of each contextualizer.
- The authors take the pretrained representations for each layer and relearn the language model softmax classifiers used to predict the next and previous token.
- All of the contextualizers use the ELMo architecture, and the training data from each of the pretraining tasks is taken from the PTB.
- The authors compare to (1) a noncontextual baseline (GloVe) to assess the effect of contextualization, (2) a randomly-initialized, untrained ELMo baseline to measure the effect of pretraining, and (3) the ELMo model pretrained on the Billion Word Benchmark to examine the effect of training the bidirectional language model on more data
- Each of the models sees the same tokens, but the supervision signal differs.
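The probing setup described above can be sketched in a few lines: freeze the contextualizer, take its per-layer token representations, and fit one linear classifier per layer. The data, shapes, and labels below are synthetic placeholders, not the paper's actual tasks or toolkit.

```python
# Minimal sketch of layer-wise linear probing on frozen representations.
# Features and labels are random placeholders; a real setup would use a
# pretrained contextualizer's outputs and gold annotations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_tokens, n_layers, dim = 200, 3, 16

# Frozen contextualizer output: one vector per token per layer.
layer_reps = rng.normal(size=(n_layers, n_tokens, dim))
labels = rng.integers(0, 5, size=n_tokens)  # e.g. coarse tag IDs

# Train one linear probe per layer; the contextualizer itself
# receives no gradient updates.
for layer in range(n_layers):
    probe = LogisticRegression(max_iter=1000)
    probe.fit(layer_reps[layer], labels)
    acc = probe.score(layer_reps[layer], labels)
    print(f"layer {layer}: train accuracy {acc:.2f}")
```

High probe accuracy on a frozen layer is then read as evidence that the layer already encodes the task-relevant information.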
- Results and Discussion
Table 1 compares each contextualizer's best-performing probing model with the GloVe baseline and the previous state of the art for the task.
With just a linear model, the authors can readily extract much of the information needed for high performance on various NLP tasks.
- Comparing the ELMo-based contextualizers, the authors see that ELMo (original) and ELMo (4-layer) are essentially even, though both recurrent models outperform ELMo (transformer). The authors see that the OpenAI transformer significantly underperforms the ELMo models and BERT.
- The representations that are better-suited for language modeling are those that exhibit worse probing task performance (Figure 3), indicating that contextualizer layers trade off between encoding general and task-specific features.
- This indicates that the transferability of pretrained CWRs relies on pretraining on large corpora, emphasizing the utility and importance of self-supervised pretraining.
- The authors study the linguistic knowledge and transferability of contextualized word representations with a suite of sixteen diverse probing tasks.
- The features generated by pretrained contextualizers are sufficient for high performance on a broad set of tasks.
- For tasks that require specific information not captured by the contextual word representation, the authors show that learning task-specific contextual features helps to encode the requisite knowledge.
- It seems likely that certain high-level semantic phenomena are incidentally useful for the contextualizer’s pretraining task, leading to their presence in higher layers.
- The authors find that bidirectional language model pretraining yields representations that are more transferable in general than eleven other candidate pretraining tasks.
- Table 1: Performance of the best layerwise linear probing model for each contextualizer compared against a GloVe-based linear probing baseline and the previous state of the art. The best contextualizer for each task is bolded. Results for all layers on all tasks, and papers describing the prior state of the art, are given in Appendix D.
- Table 2: Comparison of different probing models trained on ELMo (original); the best-performing probing model is bolded. Results for each probing model are from the highest-performing contextualizer layer. Enabling probing models to learn task-specific contextual features (with LSTMs) yields outsized benefits in tasks requiring highly specific information.
- Table 3: Performance (averaged across target tasks) of contextualizers pretrained on a variety of tasks.
- Table 4: Performance of prior state of the art models (without pretraining) for each task.
- Table 5: Token labeling task performance of a linear probing model trained on top of the ELMo and OpenAI contextualizers, compared against a GloVe-based probing baseline and the previous state of the art.
- Table 6: Token labeling task performance of a linear probing model trained on top of the BERT contextualizers.
- Table 7: Segmentation task performance of a linear probing model trained on top of the ELMo and OpenAI contextualizers, compared against a GloVe-based probing baseline and the previous state of the art.
- Table 8: Segmentation task performance of a linear probing model trained on top of the BERT contextualizers.
- Table 9: Pairwise relation task performance of a linear probing model trained on top of the ELMo and OpenAI contextualizers, compared against a GloVe-based probing baseline.
- Table 10: Pairwise relation task performance of a linear probing model trained on top of the BERT contextualizers.
- Table 11: Target token labeling task performance of contextualizers pretrained on a variety of different tasks. The probing model used is linear, and the contextualizer architecture is ELMo (original).
- Table 12: Target segmentation task performance of contextualizers pretrained on a variety of different tasks. The probing model used is linear, and the contextualizer architecture is ELMo (original).
- Table 13: Target pairwise prediction task performance of contextualizers pretrained on a variety of different tasks. The probing model used is linear, and the contextualizer architecture is ELMo (original).
- Methodologically, our work is most similar to Shi et al. (2016b), Adi et al. (2017), and Hupkes et al. (2018), who use the internal representations of neural models to predict properties of interest. Conneau et al. (2018) construct probing tasks to study the linguistic properties of sentence embedding methods. We focus on contextual word representations, which have achieved state-of-the-art results on a variety of tasks, and examine a broader range of linguistic knowledge.
In contemporaneous work, Tenney et al. (2019) evaluate CoVe (McCann et al., 2017), ELMo (Peters et al., 2018a), the OpenAI Transformer (Radford et al., 2018), and BERT (Devlin et al., 2018) on a variety of sub-sentence linguistic analysis tasks. Their results also suggest that these pretrained models for contextualized word representation encode stronger notions of syntax than higher-level semantics. They also find that using a scalar mix of output layers is particularly effective in deep transformer-based models, which aligns with our own probing results and our observation that transformers tend to encode transferable features in their intermediate layers. Furthermore, they find that ELMo's performance cannot be explained by a model with access to only local context, indicating that ELMo encodes linguistic features from distant tokens.
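The scalar mix mentioned above can be sketched as an ELMo-style softmax-weighted sum over layer outputs, with one learned scalar per layer plus a global scale. The array shapes, names, and values below are illustrative assumptions, not the exact parameterization from either paper.

```python
# Sketch of a learned scalar mix of contextualizer layers (ELMo-style):
# softmax-normalize one scalar per layer, then take the weighted sum.
# Shapes and values here are illustrative.
import numpy as np

def scalar_mix(layer_reps, scalars, gamma=1.0):
    """Combine per-layer representations into one representation.

    layer_reps: (n_layers, seq_len, dim) array of frozen layer outputs
    scalars:    (n_layers,) learnable mixing weights (pre-softmax)
    gamma:      learnable global scale
    """
    weights = np.exp(scalars - scalars.max())
    weights /= weights.sum()
    # Weighted sum over the layer axis -> (seq_len, dim)
    return gamma * np.tensordot(weights, layer_reps, axes=1)

layers = np.random.default_rng(0).normal(size=(4, 10, 8))
mixed = scalar_mix(layers, scalars=np.zeros(4))  # uniform weights
print(mixed.shape)  # (10, 8)
```

With all scalars equal, the mix reduces to a plain average of the layers; training the scalars lets the downstream model emphasize whichever layers carry the most useful features.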
- NL is supported by a Washington Research Foundation Fellowship and a Barry M. Goldwater Scholarship.
- YB is supported by the Harvard Mind, Brain, and Behavior Initiative.
- Lasha Abzianidze, Johannes Bjerva, Kilian Evang, Hessel Haagsma, Rik van Noord, Pierre Ludmann, Duc-Duy Nguyen, and Johan Bos. 2017. The parallel meaning bank: Towards a multilingual corpus of translations annotated with compositional meaning representations. In Proc. of EACL.
- Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In Proc. of ICLR.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.
- Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.
- Yonatan Belinkov. 2018. On Internal Language Representations in Deep Learning: An Analysis of Machine Translation and Speech Recognition. Ph.D. thesis, Massachusetts Institute of Technology.
- Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James R. Glass. 2017a. What do neural machine translation models learn about morphology? In Proc. of ACL.
- Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics.
- Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James R. Glass. 2017b. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proc. of IJCNLP.
- Johannes Bjerva, Barbara Plank, and Johan Bos. 2016. Semantic tagging with deep residual networks. In Proc. of COLING.
- Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. In Proc. of ACL.
- Bernd Bohnet, Ryan T. McDonald, Gonçalo Simões, Daniel Andor, Emily Pitler, and Joshua Maynez. 2018. Morphosyntactic tagging with a meta-BiLSTM model over context sensitive token encodings. In Proc. of ACL.
- Antoine Bosselut, Omer Levy, Ari Holtzman, Corin Ennis, Dieter Fox, and Yejin Choi. 2017. Simulating action dynamics with neural process networks. In Proc. of ICLR.
- Samuel R. Bowman, Ellie Pavlick, Edouard Grave, Benjamin Van Durme, Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, and Berlin Chen. 2018. Looking for ELMo's friends: Sentence-level pretraining beyond language modeling. ArXiv:1812.10860.
- Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Proc. of INTERSPEECH.
- Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proc. of ACL.
- Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proc. of EMNLP.
- Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. In Proc. of ACL.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL.
- Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. In Proc. of ICLR.
- Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing. In Proc. of ACL.
- Jessica Ficler and Yoav Goldberg. 2016. Coordination annotation extension in the Penn Treebank. In Proc. of ACL.
- Richard Futrell and Roger P. Levy. 2019. Do RNNs learn human-like abstract word order preferences? In Proc. of SCiL.
- David Gaddy, Mitchell Stern, and Dan Klein. 2018. What’s going on in neural constituency parsers? An analysis. In Proc. of NAACL.
- Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E. Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proc. of NLP-OSS.
- Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In Proc. of EMNLP.
- Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396.
- Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proc. of ACL.
- Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and ‘diagnostic classifiers’ reveal how recurrent and recursive neural networks process hierarchical structure. In Proc. of IJCAI.
- Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A. Smith. 2017. Dynamic entity representations in neural language models. In Proc. of EMNLP.
- Jaap Jumelet and Dieuwke Hupkes. 2018. Do language models understand anything? On the ability of LSTMs to understand negative polarity items. In Proc. of BlackboxNLP.
- Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2015. Visualizing and understanding recurrent networks. In Proc. of ICLR (Workshop).
- Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In Proc. of ACL.
- Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR.
- Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. What do recurrent neural network grammars learn about syntax? In Proc. of EACL.
- John D. Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML.
- Mike Lewis, Kenton Lee, and Luke Zettlemoyer. 2016. LSTM CCG parsing. In Proc. of NAACL.
- Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2015. Visualizing and understanding neural models in NLP. In Proc. of NAACL.
- Tal Linzen. 2018. What can linguistics and deep learning contribute to each other? Language.
- Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.
- Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
- Marie-Catherine de Marneffe, Christopher D. Manning, and Christopher Potts. 2012. Did it happen? The pragmatic complexity of veridicality assessment. Computational Linguistics, 38:301–333.
- Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Proc. of NeurIPS.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proc. of NeurIPS.
- Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinkova, Dan Flickinger, Jan Hajic, and Zdenka Uresova. 2015. SemEval 2015 task 18: Broad-coverage semantic dependency parsing. In Proc. of SemEval 2015.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP.
- Matthew Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. ArXiv:1903.05987.
- Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proc. of NAACL.
- Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proc. of EMNLP.
- Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Proc. of CoNLL.
- Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.
- Marek Rei and Anders Søgaard. 2019. Jointly learning to label sentences and tokens. In Proc. of AAAI.
- Marek Rei and Helen Yannakoudakis. 2016. Compositional sequence labeling models for error detection in learner writing. In Proc. of ACL.
- Rachel Rudinger, Aaron Steven White, and Benjamin Van Durme. 2018. Neural models of factuality. In Proc. of NAACL.
- Roser Saurí and James Pustejovsky. 2009. FactBank: a corpus annotated with event factuality. Language Resources and Evaluation, 43:227–268.
- Roser Saurí and James Pustejovsky. 2012. Are you sure that this happened? Assessing the factuality degree of events in text. Computational Linguistics, 38:261–299.
- Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Jakob Prange, Austin Blodgett, Sarah R. Moeller, Aviram Stern, Adi Bitan, and Omri Abend. 2018. Comprehensive supersense disambiguation of English prepositions and possessives. In Proc. of ACL.
- Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zeroshot dependency parsing. In Proc. of NAACL.
- Xing Shi, Kevin Knight, and Deniz Yuret. 2016a. Why neural translations are the right length. In Proc. of EMNLP.
- Xing Shi, Inkit Padhi, and Kevin Knight. 2016b. Does string-based neural MT learn source syntax? In Proc. of EMNLP.
- Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proc. of ACL.
- Michihiro Yasunaga, Jungo Kasai, and Dragomir R. Radev. 2018. Robust multilingual part-of-speech tagging via adversarial training. In Proc. of NAACL.
- Kelly W. Zhang and Samuel R. Bowman. 2018. Language modeling teaches you more syntax than translation does: Lessons learned through auxiliary task analysis. In Proc. of BlackboxNLP.
- Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proc. of ICCV.
- Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proc. of LREC.
- Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proc. of EMNLP.
- Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proc. of ICLR.
- Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proc. of LLL and CoNLL.
- Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. of CoNLL.
- Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation of word vector representations by subspace alignment. In Proc. of EMNLP.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NeurIPS.
- Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler-gap dependencies? In Proc. of BlackboxNLP.
- Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. 2017. Reference-aware language models. In Proc. of EMNLP.