Linguistic Knowledge and Transferability of Contextual Representations.

North American Chapter of the Association for Computational Linguistics, (2019): 1073-1094

Cited by 401 | Viewed 366 | EI

Abstract

Contextual word representations derived from large-scale neural language models are successful across a diverse set of NLP tasks, suggesting that they encode useful and transferable features of language. To shed light on the linguistic knowledge they capture, we study the representations produced by several recent pretrained contextualizers with a suite of sixteen diverse probing tasks.

Introduction
  • Pretrained word representations (Mikolov et al., 2013; Pennington et al., 2014) are a key component of state-of-the-art neural NLP models.
  • These word vectors are static—a single vector is assigned to each word.
  • The broad success of contextual word representations (CWRs), which instead vary with the context in which each word appears, indicates that they encode useful, transferable features of language.
  • Their linguistic knowledge and transferability are not yet well understood.
Highlights
  • Pretrained word representations (Mikolov et al., 2013; Pennington et al., 2014) are a key component of state-of-the-art neural NLP models
  • We see that the OpenAI transformer significantly underperforms the ELMo models and BERT. Given that it is the only model trained in a unidirectional fashion, this reaffirms that bidirectionality is a crucial component for the highest-quality contextualizers (Devlin et al., 2018)
  • We study the linguistic knowledge and transferability of contextualized word representations with a suite of sixteen diverse probing tasks
  • For tasks that require specific information not captured by the contextual word representation, we show that learning task-specific contextual features helps to encode the requisite knowledge
  • Our analysis of patterns in the transferability of contextualizer layers shows that the lowest layer of LSTMs encodes the most transferable features, while transformers’ middle layers are the most transferable
  • We find that higher LSTM layers are more task-specific, while transformer layers do not exhibit the same monotonic increase in task specificity
Methods
  • The authors' probing models are trained on the representations produced by the individual layers of each contextualizer.
  • The authors take the pretrained representations for each layer and relearn the language model softmax classifiers used to predict the next and previous tokens.
  • All of the contextualizers use the ELMo architecture, and the training data from each of the pretraining tasks is taken from the PTB.
  • The authors compare to (1) a noncontextual baseline (GloVe) to assess the effect of contextualization, (2) a randomly-initialized, untrained ELMo baseline to measure the effect of pretraining, and (3) the ELMo model pretrained on the Billion Word Benchmark to examine the effect of training the bidirectional language model on more data
  • Each of the models sees the same tokens, but the supervision signal differs (a minimal sketch of the layer-wise probing setup follows this list).
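
    The probing setup above trains a lightweight classifier on frozen representations from a single contextualizer layer. Below is a minimal sketch of such a layer-wise linear probe in PyTorch; the class and function names, tensor shapes, and training details are illustrative assumptions, not the authors' released code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LinearProbe(nn.Module):
        """Linear classifier trained on top of one frozen contextualizer layer."""
        def __init__(self, rep_dim: int, num_labels: int):
            super().__init__()
            self.classifier = nn.Linear(rep_dim, num_labels)

        def forward(self, frozen_reps: torch.Tensor) -> torch.Tensor:
            # frozen_reps: (batch, seq_len, rep_dim), taken from a single pretrained layer.
            return self.classifier(frozen_reps)

    def probe_step(probe, optimizer, frozen_reps, labels):
        """One training step; the representations are detached, so the contextualizer is never updated."""
        optimizer.zero_grad()
        logits = probe(frozen_reps.detach())
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
        loss.backward()
        optimizer.step()
        return loss.item()
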
Results
  • Results and Discussion

    Table 1 compares each contextualizer’s best-performing probing model with the GloVe baseline and the previous state of the art for the task; the best-performing layer is selected per task (a sketch of this per-layer selection appears after this list).

    With just a linear model, the authors can readily extract much of the information needed for high performance on various NLP tasks.
  • Comparing the ELMo-based contextualizers, the authors see that ELMo (original) and ELMo (4-layer) are essentially even, though both recurrent models outperform ELMo (transformer). They also see that the OpenAI transformer significantly underperforms the ELMo models and BERT.
  • The representations that are better-suited for language modeling are those that exhibit worse probing task performance (Figure 3), indicating that contextualizer layers trade off between encoding general and task-specific features.
  • This indicates that the transferability of pretrained CWRs relies on pretraining on large corpora, emphasizing the utility and importance of self-supervised pretraining
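
    The per-task "best layer" selection described above can be pictured with the following hypothetical helper (train_and_eval is an assumed callable that trains a probe and returns a single task metric): one probe is trained per contextualizer layer, and the highest-scoring layer is reported.

    def best_layer(layer_reps, labels, train_and_eval):
        """layer_reps: {layer_index: (train_reps, eval_reps)}.
        Returns the best layer index, its score, and the full per-layer scores."""
        scores = {
            layer: train_and_eval(train_x, labels["train"], eval_x, labels["eval"])
            for layer, (train_x, eval_x) in layer_reps.items()
        }
        best = max(scores, key=scores.get)
        return best, scores[best], scores
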
Conclusion
  • The authors study the linguistic knowledge and transferability of contextualized word representations with a suite of sixteen diverse probing tasks.
  • The features generated by pretrained contextualizers are sufficient for high performance on a broad set of tasks.
  • For tasks that require specific information not captured by the contextual word representation, the authors show that learning task-specific contextual features (for example, with a small LSTM trained on top of the frozen representations; see the sketch after this list) helps to encode the requisite knowledge.
  • It seems likely that certain high-level semantic phenomena are incidentally useful for the contextualizer’s pretraining task, leading to their presence in higher layers.
  • The authors find that bidirectional language model pretraining yields representations that are more transferable in general than eleven other candidate pretraining tasks
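
    The sketch below illustrates a probing model that can learn task-specific contextual features, as referenced above: a small bidirectional LSTM, trained from scratch on top of the frozen representations, followed by a linear classifier. This is an assumed PyTorch rendering, not the paper's exact probe configuration.

    import torch
    import torch.nn as nn

    class BiLSTMProbe(nn.Module):
        """Probe with its own contextualization: a trainable BiLSTM over frozen representations."""
        def __init__(self, rep_dim: int, hidden_dim: int, num_labels: int):
            super().__init__()
            self.lstm = nn.LSTM(rep_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * hidden_dim, num_labels)

        def forward(self, frozen_reps: torch.Tensor) -> torch.Tensor:
            # frozen_reps: (batch, seq_len, rep_dim); only the BiLSTM and the classifier are trained.
            contextualized, _ = self.lstm(frozen_reps)
            return self.classifier(contextualized)
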
Tables
  • Table1: Performance of the best layerwise linear probing model for each contextualizer compared against a GloVe-based linear probing baseline and the previous state of the art. The best contextualizer for each task is bolded. Results for all layers on all tasks, and papers describing the prior state of the art, are given in Appendix D
  • Table2: Comparison of different probing models trained on ELMo (original); best-performing probing model is bolded. Results for each probing model are from the highest-performing contextualizer layer. Enabling probing models to learn task-specific contextual features (with LSTMs) yields outsized benefits in tasks requiring highly specific information
  • Table3: Performance (averaged across target tasks) of contextualizers pretrained on a variety of tasks
  • Table4: Performance of prior state of the art models (without pretraining) for each task
  • Table5: Token labeling task performance of a linear probing model trained on top of the ELMo and OpenAI contextualizers, compared against a GloVe-based probing baseline and the previous state of the art
  • Table6: Token labeling task performance of a linear probing model trained on top of the BERT contextualizers
  • Table7: Segmentation task performance of a linear probing model trained on top of the ELMo and OpenAI contextualizers, compared against a GloVe-based probing baseline and the previous state of the art
  • Table8: Segmentation task performance of a linear probing model trained on top of the BERT contextualizers
  • Table9: Pairwise relation task performance of a linear probing model trained on top of the ELMo and OpenAI contextualizers, compared against a GloVe-based probing baseline
  • Table10: Pairwise relation task performance of a linear probing model trained on top of the BERT contextualizers
  • Table11: Target token labeling task performance of contextualizers pretrained on a variety of different tasks. The probing model used is linear, and the contextualizer architecture is ELMo (original)
  • Table12: Target segmentation task performance of contextualizers pretrained on a variety of different tasks. The probing model used is linear, and the contextualizer architecture is ELMo (original)
  • Table13: Target pairwise prediction task performance of contextualizers pretrained on a variety of different tasks. The probing model used is linear, and the contextualizer architecture is ELMo (original)
Related Work
  • Methodologically, our work is most similar to Shi et al. (2016b), Adi et al. (2017), and Hupkes et al. (2018), who use the internal representations of neural models to predict properties of interest. Conneau et al. (2018) construct probing tasks to study the linguistic properties of sentence embedding methods. We focus on contextual word representations, which have achieved state-of-the-art results on a variety of tasks, and examine a broader range of linguistic knowledge.

    In contemporaneous work, Tenney et al. (2019) evaluate CoVe (McCann et al., 2017), ELMo (Peters et al., 2018a), the OpenAI Transformer (Radford et al., 2018), and BERT (Devlin et al., 2018) on a variety of sub-sentence linguistic analysis tasks. Their results also suggest that the aforementioned pretrained models for contextualized word representation encode stronger notions of syntax than higher-level semantics. They also find that using a scalar mix of output layers is particularly effective in deep transformer-based models (a sketch of such a scalar mix follows), which aligns with our own probing results and our observation that transformers tend to encode transferable features in their intermediate layers. Furthermore, they find that ELMo’s performance cannot be explained by a model with access to only local context, indicating that ELMo encodes linguistic features from distant tokens.
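
    A minimal sketch of such a learned scalar mix over layer outputs, in the spirit of ELMo's layer weighting (softmax-normalized per-layer weights plus a global scale); the shapes and names here are assumptions rather than any specific released implementation.

    import torch
    import torch.nn as nn

    class ScalarMix(nn.Module):
        """Combine all contextualizer layer outputs with learned softmax-normalized weights."""
        def __init__(self, num_layers: int):
            super().__init__()
            self.layer_weights = nn.Parameter(torch.zeros(num_layers))
            self.gamma = nn.Parameter(torch.ones(1))

        def forward(self, layer_outputs):
            # layer_outputs: list of num_layers tensors, each of shape (batch, seq_len, dim).
            weights = torch.softmax(self.layer_weights, dim=0)
            mixed = sum(w * h for w, h in zip(weights, layer_outputs))
            return self.gamma * mixed
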
Funding
  • NL is supported by a Washington Research Foundation Fellowship and a Barry M. Goldwater Scholarship.
  • YB is supported by the Harvard Mind, Brain, and Behavior Initiative.
References
  • Lasha Abzianidze, Johannes Bjerva, Kilian Evang, Hessel Haagsma, Rik van Noord, Pierre Ludmann, Duc-Duy Nguyen, and Johan Bos. 2017. The parallel meaning bank: Towards a multilingual corpus of translations annotated with compositional meaning representations. In Proc. of EACL.
  • Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In Proc. of ICLR.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.
  • Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.
  • Yonatan Belinkov. 2018. On Internal Language Representations in Deep Learning: An Analysis of Machine Translation and Speech Recognition. Ph.D. thesis, Massachusetts Institute of Technology.
  • Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James R. Glass. 2017a. What do neural machine translation models learn about morphology? In Proc. of ACL.
  • Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics.
  • Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James R. Glass. 2017b. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proc. of IJCNLP.
  • Johannes Bjerva, Barbara Plank, and Johan Bos. 2016. Semantic tagging with deep residual networks. In Proc. of COLING.
  • Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. In Proc. of ACL.
  • Bernd Bohnet, Ryan T. McDonald, Gonçalo Simões, Daniel Andor, Emily Pitler, and Joshua Maynez. 2018. Morphosyntactic tagging with a meta-BiLSTM model over context sensitive token encodings. In Proc. of ACL.
  • Antoine Bosselut, Omer Levy, Ari Holtzman, Corin Ennis, Dieter Fox, and Yejin Choi. 2017. Simulating action dynamics with neural process networks. In Proc. of ICLR.
  • Samuel R. Bowman, Ellie Pavlick, Edouard Grave, Benjamin Van Durme, Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, and Berlin Chen. 2018. Looking for ELMo’s friends: Sentence-level pretraining beyond language modeling. ArXiv:1812.10860.
  • Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Proc. of INTERSPEECH.
  • Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proc. of ACL.
  • Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proc. of EMNLP.
  • Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. In Proc. of ACL.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL.
  • Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. In Proc. of ICLR.
  • Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing. In Proc. of ACL.
  • Jessica Ficler and Yoav Goldberg. 2016. Coordination annotation extension in the Penn Treebank. In Proc. of ACL.
  • Richard Futrell and Roger P. Levy. 2019. Do RNNs learn human-like abstract word order preferences? In Proc. of SCiL.
  • David Gaddy, Mitchell Stern, and Dan Klein. 2018. What’s going on in neural constituency parsers? An analysis. In Proc. of NAACL.
  • Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E. Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proc. of NLP-OSS.
  • Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In Proc. of EMNLP.
  • Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396.
  • Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proc. of ACL.
  • Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and ‘diagnostic classifiers’ reveal how recurrent and recursive neural networks process hierarchical structure. In Proc. of IJCAI.
  • Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A. Smith. 2017. Dynamic entity representations in neural language models. In Proc. of EMNLP.
  • Jaap Jumelet and Dieuwke Hupkes. 2018. Do language models understand anything? On the ability of LSTMs to understand negative polarity items. In Proc. of BlackboxNLP.
  • Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2015. Visualizing and understanding recurrent networks. In Proc. of ICLR (Workshop).
  • Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In Proc. of ACL.
  • Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR.
  • Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A Smith. 2017. What do recurrent neural network grammars learn about syntax? In Proc. of EACL.
  • John D. Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML.
  • Mike Lewis, Kenton Lee, and Luke Zettlemoyer. 2016. LSTM CCG parsing. In Proc. of NAACL.
  • Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2015. Visualizing and understanding neural models in nlp. In Proc. of NAACL.
  • Tal Linzen. 2018. What can linguistics and deep learning contribute to each other? Language.
  • Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521– 535.
  • Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn treebank. Computational linguistics, 19(2):313–330.
  • Marie-Catherine de Marneffe, Christopher D. Manning, and Christopher Potts. 2012. Did it happen? The pragmatic complexity of veridicality assessment. Computational Linguistics, 38:301–333.
  • Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Proc. of NeurIPS.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proc. of NeurIPS.
  • Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajič, and Zdeňka Urešová. 2015. SemEval 2015 task 18: Broad-coverage semantic dependency parsing. In Proc. of SemEval 2015.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP.
  • Matthew Peters, Sebastian Ruder, and Noah A Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. ArXiv:1903.05987.
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proc. of NAACL.
  • Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proc. of EMNLP.
  • Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Proc. of CoNLL.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.
  • Marek Rei and Anders Søgaard. 2019. Jointly learning to label sentences and tokens. In Proc. of AAAI.
  • Marek Rei and Helen Yannakoudakis. 2016. Compositional sequence labeling models for error detection in learner writing. In Proc. of ACL.
  • Rachel Rudinger, Aaron Steven White, and Benjamin Van Durme. 2018. Neural models of factuality. In Proc. of NAACL.
  • Roser Saurí and James Pustejovsky. 2009. FactBank: a corpus annotated with event factuality. Language Resources and Evaluation, 43:227–268.
  • Roser Saurí and James Pustejovsky. 2012. Are you sure that this happened? Assessing the factuality degree of events in text. Computational Linguistics, 38:261–299.
  • Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Jakob Prange, Austin Blodgett, Sarah R. Moeller, Aviram Stern, Adi Bitan, and Omri Abend. 2018. Comprehensive supersense disambiguation of English prepositions and possessives. In Proc. of ACL.
  • Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zeroshot dependency parsing. In Proc. of NAACL.
  • Xing Shi, Kevin Knight, and Deniz Yuret. 2016a. Why neural translations are the right length. In Proc. of EMNLP.
  • Xing Shi, Inkit Padhi, and Kevin Knight. 2016b. Does string-based neural MT learn source syntax? In Proc. of EMNLP.
  • Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proc. of LREC.
  • Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proc. of EMNLP.
  • Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proc. of ICLR.
  • Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proc. of LLL and CoNLL.
  • Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. of CoNLL.
  • Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation of word vector representations by subspace alignment. In Proc. of EMNLP.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NeurIPS.
  • Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler-gap dependencies? In Proc. of BlackboxNLP.
  • Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. 2017. Reference-aware language models. In Proc. of EMNLP.
  • Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proc. of ACL.
  • Michihiro Yasunaga, Jungo Kasai, and Dragomir R. Radev. 2018. Robust multilingual part-of-speech tagging via adversarial training. In Proc. of NAACL.
  • Kelly W. Zhang and Samuel R. Bowman. 2018. Language modeling teaches you more syntax than translation does: Lessons learned through auxiliary task analysis. In Proc. of BlackboxNLP.
  • Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proc. of ICCV.