Probing Linguistic Features of Sentence-Level Representations in Neural Relation Extraction

arXiv, 2020.


Abstract:

Despite the recent progress, little is known about the features captured by state-of-the-art neural relation extraction (RE) models. Common methods encode the source sentence, conditioned on the entity mentions, before classifying the relation. However, the complexity of the task makes it difficult to understand how encoder architecture…

Introduction
  • Relation extraction (RE) is concerned with extracting relationships between entities mentioned in text, where relations correspond to semantic categories such as org:founded by, person:spouse, or org:subsidiaries (Figure 1).
  • Common approaches use a neural architecture to learn a fixed-size representation of the input, e.g. a sentence, which is passed to a classification layer to predict the target relation label
  • These good results suggest that the learned representations capture linguistic and semantic properties of the input that are relevant to the downstream RE task, an intuition that was previously discussed for a variety of other NLP tasks by Conneau et al. (2018).
  • The authors' aim is to pinpoint the information a given RE model is relying on, in order to improve model performance as well as to diagnose errors
Highlights
  • Relation extraction (RE) is concerned with extracting relationships between entities mentioned in text, where relations correspond to semantic categories such as org:founded by, person:spouse, or org:subsidiaries (Figure 1)
  • These good results suggest that the learned representations capture linguistic and semantic properties of the input that are relevant to the downstream relation extraction task, an intuition that was previously discussed for a variety of other NLP tasks by Conneau et al. (2018)
  • We did not include the ArgOrd (argument ordering) and EntExists tasks in the SemEval evaluation: SemEval relation arguments always appear in the sentence in the order indicated by the relation type, and entity types recognizable by standard tools such as Stanford CoreNLP that might occur between head and tail are not relevant to the dataset’s entity types and relations
  • We introduced a set of probing tasks to study the linguistic features captured in sentence encoder representations trained on relation extraction
  • We found self-attentive encoders to be well suited for relation extraction on sentences of different complexity, though they consistently score lower on probing tasks, hinting that these architectures capture “deeper” linguistic features
  • We showed that the bias induced by different architectures clearly affects the learned properties, as suggested by probing task performance, e.g. for distance- and dependency-related probing tasks
Results
  • Table 2 and Table 3 report the accuracy scores of the probing task experiments for models trained on the TACRED and SemEval datasets.
  • The authors did not include the ArgOrd and EntExists task in the SemEval evaluation, since SemEval relation arguments are always ordered in the sentence as indicated by the relation type, and entity types recognizable by standard tools such as Stanford CoreNLP that might occur between head and tail are not relevant to the dataset’s entity types and relations.
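The probing setup used throughout (Tables 2 and 3) trains a logistic regression on frozen sentence representations. The sketch below illustrates the idea with a hypothetical stand-in for the encoder output — the data, dimensions, and training loop are illustrative, not the paper's actual pipeline:

```python
import numpy as np

def train_probe(X, y, lr=0.1, epochs=200):
    """Fit a binary logistic-regression probe on frozen
    sentence representations X (n_samples x dim)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
        w -= lr * (X.T @ (p - y)) / n           # log-loss gradient step
        b -= lr * float(np.mean(p - y))
    return w, b

def probe_accuracy(w, b, X, y):
    return float(np.mean(((X @ w + b) > 0) == y))

# Toy stand-in for encoder outputs: two linearly separable clusters,
# as if the probed property were linearly encoded in the representation.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.3, size=(50, 8)),
               rng.normal(+1.0, 0.3, size=(50, 8))])
y = np.array([0] * 50 + [1] * 50)
w, b = train_probe(X, y)
print(probe_accuracy(w, b, X, y))  # near 1.0: the property is linearly decodable
```

High probe accuracy is read as evidence that the property is (linearly) present in the representation; a weak probe on the same features suggests it is not.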
Conclusion
  • The authors introduced a set of probing tasks to study the linguistic features captured in sentence encoder representations trained on relation extraction.
  • The authors found self-attentive encoders to be well suited for RE on sentences of different complexity, though they consistently score lower on probing tasks, hinting that these architectures capture “deeper” linguistic features.
  • The authors showed that the bias induced by different architectures clearly affects the learned properties, as suggested by probing task performance, e.g. for distance- and dependency-related probing tasks.
  • The authors want to extend the probing tasks to cover specific linguistic patterns such as appositions, and to investigate a model’s ability to generalize to specific entity types, e.g. company and person names
Tables
  • Table1: Comparison of datasets used for evaluation
  • Table2: TACRED probing task accuracies and model F1 scores on the test set. ↑ and ↓ indicate the cased and uncased version of BERT, ⊗ models with entity masking. Probing task classification is performed by a logistic regression on the representations sj of all sentences in the dataset
  • Table3: SemEval probing task accuracies and model F1 scores on the test set. ↑ and ↓ indicate the cased and uncased version of BERT. Probing task classification is performed by a logistic regression on the representations sj of all sentences in the dataset
  • Table4: Relation extraction test set performance on SemEval. ↑ and ↓ indicate the cased and uncased version of BERT. Due to the small dataset size, we report the mean across 5 randomly initialized runs
  • Table5: Relation extraction test set performance on TACRED. ↑ and ↓ indicate the cased and uncased version of BERT, ⊗ models with entity masking
Related work
  • Shi et al. (2016) introduced probing tasks to probe syntactic properties captured in encoders trained on neural machine translation. Adi et al. (2017) extended this concept of “auxiliary prediction tasks”, proposing SentLen, word count and word order tasks to probe general sentence encoders, such as bag-of-vectors, auto-encoder and skip-thought. Conneau et al. (2018) considered 10 probing tasks, including SentLen and TreeDepth, and an extended set of encoders, such as Seq2Tree.
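Probing tasks like SentLen reduce to labeling each sentence with a class derived from a surface property. A minimal, hypothetical construction (the bin edges below are illustrative, not those used by Conneau et al.):

```python
import bisect

# Illustrative length-bin edges (in tokens); a sentence is labeled
# with the index of the bin its token count falls into.
SENTLEN_BINS = [5, 10, 15, 20, 25]

def sentlen_label(sentence: str) -> int:
    """Map a whitespace-tokenized sentence to a length class."""
    return bisect.bisect_right(SENTLEN_BINS, len(sentence.split()))

print(sentlen_label("Acme Corp was founded by Jane Doe in 1999 ."))  # → 2
```

A probing classifier is then trained to predict these labels from the frozen sentence representations alone.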
Funding
  • This work has been supported by the German Federal Ministry of Education and Research as part of the projects DEEPLEE (01IW17001) and BBDC2 (01IS18025E), and by the German Federal Ministry of Economics Affairs and Energy as part of the project PLASS (01MD19003E)
Reference
  • Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In Proceedings of ICLR Conference Track, Toulon, France.
  • Christoph Alt, Marc Hubner, and Leonhard Hennig. 2019a. Fine-tuning pre-trained transformer language models to distantly supervised relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1388–1398, Florence, Italy. Association for Computational Linguistics.
  • Christoph Alt, Marc Hubner, and Leonhard Hennig. 2019b. Improving relation extraction by pre-trained language representations. In Proceedings of AKBC 2019, pages 1–18, Amherst, Massachusetts.
  • Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada. Association for Computational Linguistics.
  • Razvan C. Bunescu and Raymond J. Mooney. 2005. A Shortest Path Dependency Kernel for Relation Extraction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, pages 724–731, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Rui Cai, Xiaodong Zhang, and Houfeng Wang. 2016. Bidirectional Recurrent Convolutional Neural Network for Relation Classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 756–765, Berlin, Germany. Association for Computational Linguistics.
  • Alexis Conneau and Douwe Kiela. 2018. Senteval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018). European Language Resource Association.
  • Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136. Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. Computing Research Repository (CoRR), abs/1810.04805.
  • Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E. Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. Allennlp: A deep semantic natural language processing platform. Computing Research Repository (CoRR), abs/1803.07640.
  • Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 240–248, Brussels, Belgium. Association for Computational Linguistics.
  • Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38. Association for Computational Linguistics.
  • Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke S. Zettlemoyer, and Omer Levy. 2019. Spanbert: Improving pre-training by representing and predicting spans. ArXiv, abs/1907.10529.
  • Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. Computing Research Repository (CoRR), abs/1609.02907.
  • Sebastian Krause, Hong Li, Hans Uszkoreit, and Feiyu Xu. 2012. Large-Scale Learning of Relation Extraction Rules with Distant Supervision from the Web. In Philippe Cudré-Mauroux, Jeff Heflin, Evren Sirin, Tania Tudorache, Jérôme Euzenat, Manfred Hauswirth, Josiane Xavier Parreira, Jim Hendler, Guus Schreiber, Abraham Bernstein, and Eva Blomqvist, editors, The Semantic Web – ISWC 2012, number 7649 in Lecture Notes in Computer Science, pages 263–278. Springer Berlin Heidelberg.
  • Joohong Lee, Sangwoo Seo, and Yong Suk Choi. 2019. Semantic Relation Classification via Bidirectional LSTM Networks with Entity-aware Attention using Latent Entity Typing. arXiv:1901.08163 [cs]. ArXiv: 1901.08163.
  • Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural Relation Extraction with Selective Attention over Instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133, Berlin, Germany. Association for Computational Linguistics.
  • Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics.
  • Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant Supervision for Relation Extraction Without Labeled Data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, ACL ’09, pages 1003–1011, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 39–48. Association for Computational Linguistics.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227– 2237. Association for Computational Linguistics.
  • Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), pages 43–54, Hong Kong, China. Association for Computational Linguistics.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying Relations by Ranking with Convolutional Neural Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 626–634, Beijing, China. Association for Computational Linguistics.
  • Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does String-Based Neural MT Learn Source Syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534. Association for Computational Linguistics.
  • Mihai Surdeanu, David McClosky, Mason R. Smith, Andrey Gusev, and Christopher D. Manning. 2011. Customizing an information extraction system to a new domain. In Proceedings of the ACL 2011 Workshop on Relational Models of Semantics, RELMS ’11, pages 2–10, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Shikhar Vashishth, Rishabh Joshi, Sai Suman Prayaga, Chiranjib Bhattacharyya, and Partha Talukdar. 2018. RESIDE: Improving Distantly-Supervised Neural Relation Extraction using Side Information. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1257– 1266, Brussels, Belgium. Association for Computational Linguistics.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Linlin Wang, Zhu Cao, Gerard de Melo, and Zhiyuan Liu. 2016. Relation Classification via Multi-Level Attention CNNs. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1298– 1307, Berlin, Germany. Association for Computational Linguistics.
  • Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2335–2344. Dublin City University and Association for Computational Linguistics.
  • Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. Computing Research Repository (CoRR), abs/1508.01006.
  • Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205–2215. Association for Computational Linguistics.
  • Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Positionaware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45. Association for Computational Linguistics.
  • GuoDong Zhou, Jian Su, Jie Zhang, and Min Zhang. 2005. Exploring various knowledge in relation extraction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 427–434. Association for Computational Linguistics.
  • For vanilla models we use 300-dimensional pretrained GloVe embeddings (Pennington et al., 2014) as input. Variants with ELMo use the contextualized word representations in combination with GloVe embeddings, and models with BERT use only the computed representations. For models trained on TACRED we use 30-dimensional positional offset embeddings for head and tail (50-dimensional embeddings for SemEval). Similarly, for the batch size we use 50 on TACRED and 30 on SemEval. If not mentioned otherwise, we use the same hyperparameters for models with and without entity masking.
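The positional offset embeddings mentioned above encode, for every token, its relative distance to the head and tail entity spans; the offsets index a learned embedding table. A minimal sketch — the clipping range and the non-negative shift are common conventions, assumed here rather than taken from the paper:

```python
def position_offsets(seq_len, ent_start, ent_end, max_dist=30):
    """Relative offset of each token to an entity span [ent_start, ent_end],
    clipped to [-max_dist, max_dist] and shifted to be non-negative
    so it can index an embedding table of size 2 * max_dist + 1."""
    offsets = []
    for i in range(seq_len):
        if i < ent_start:
            d = i - ent_start      # before the entity: negative
        elif i > ent_end:
            d = i - ent_end        # after the entity: positive
        else:
            d = 0                  # inside the entity span
        d = max(-max_dist, min(max_dist, d))
        offsets.append(d + max_dist)
    return offsets

# Head entity at tokens 2..3 of a 7-token sentence:
print(position_offsets(7, 2, 3))  # → [28, 29, 30, 30, 31, 32, 33]
```

One such offset sequence is computed per entity (head and tail), giving each token two position features in addition to its word embedding.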
  • CNN For training on TACRED we use the hyperparameters of Zhang et al. (2017). We employ Adagrad as an optimizer, with an initial learning rate of 0.1, and run training for 50 epochs. Starting from the 15th epoch, we gradually decrease the learning rate by a factor of 0.9. For the CNN we use 500 filters of sizes [2, 3, 4, 5] and apply l2 regularization with a coefficient of 10^-3 to all filter weights. We use tanh as activation and apply dropout on the encoder output with a probability of 0.5. We use the same hyperparameters for variants with ELMo. For variants with BERT, we use an initial learning rate of 0.01 and decrease the learning rate by a factor of 0.9 every time the validation F1 score plateaus. We also use 200 filters of sizes [2, 3, 4, 5].
  • On SemEval, we use the hyperparameters of Nguyen and Grishman (2015). We employ Adadelta with an initial learning rate of 1.0 and run it for 50 epochs. We apply l2 regularization with a coefficient of 10^-5 to all filter weights. We use embedding and encoder dropout of 0.5, word dropout of 0.04, and 150 filters of sizes [2, 3, 4, 5]. For variants using BERT, we decrease the learning rate by a factor of 0.9 every time the validation F1 score plateaus.
  • BiLSTM For training on TACRED we use the hyperparameters of Zhang et al. (2017). We employ Adagrad with an initial learning rate of 0.01, train for 30 epochs, and gradually decrease the learning rate by a factor of 0.9, starting from the 15th epoch. We use word dropout of 0.04 and recurrent dropout of 0.5. The BiLSTM consists of two layers of hidden dimension 500 for each direction. For training with ELMo and BERT we decrease the learning rate by a factor of 0.9 every time the validation F1 score plateaus.
  • GCN On TACRED and SemEval we reuse the hyperparameters of Zhang et al. (2018). We employ SGD as optimizer with an initial learning rate of 0.3, which is reduced by a factor of 0.9 every time the validation F1 score plateaus. We use dropout of 0.5 between all but the last GCN layer, word dropout of 0.04, and embedding and encoder dropout of 0.5. Following the authors, we use path-centric pruning with K=1. On TACRED we use two 200-dimensional GCN layers and likewise two 200-dimensional feedforward layers with ReLU activation, whereas on SemEval we instead use a single 200-dimensional GCN layer.
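Several of the configurations above decay the learning rate by a factor of 0.9 whenever the validation F1 score plateaus. Framework-free, that schedule can be sketched as follows; the `patience` of one epoch is an assumption, not a value the paper reports:

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` whenever the validation
    F1 score fails to improve for `patience` consecutive epochs."""
    def __init__(self, lr, factor=0.9, patience=1):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.stale = float("-inf"), 0

    def step(self, val_f1):
        if val_f1 > self.best:
            self.best, self.stale = val_f1, 0   # new best: reset the counter
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor          # plateau: decay the learning rate
                self.stale = 0
        return self.lr

sched = PlateauDecay(lr=0.3)          # GCN initial learning rate
for f1 in [0.50, 0.55, 0.55, 0.56, 0.56]:
    sched.step(f1)
print(round(sched.lr, 4))             # → 0.243 (two decays: 0.3 * 0.9 * 0.9)
```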