TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task

ACL, pp. 1558-1569, 2020.


Abstract:

TACRED (Zhang et al., 2017) is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE). But, even with recent advances in unsupervised pre-training and knowledge-enhanced neural RE, models still show a high error rate. In this paper, we investigate the questions: Have we reached a performance ceiling, or is there still room for improvement, and which underlying factors contribute to this error rate?
Introduction
  • Relation Extraction (RE) is the task of extracting relationships between concepts and entities from text, where relations correspond to semantic categories such as per:spouse, org:founded_by, or org:subsidiaries (Figure 1); a sketch of such an instance is shown at the end of this list.
  • The methods performing best on the dataset use some form of pre-training to improve RE performance: fine-tuning pre-trained language representations (Alt et al., 2019; Shi and Lin, 2019; Joshi et al., 2019) or integrating external knowledge during pre-training, e.g. via joint language modelling and linking on entity-linked text (Zhang et al., 2019; Peters et al., 2019; Baldini Soares et al., 2019). The last two methods achieve a state-of-the-art performance of 71.5 F1.
  • While this performance is impressive, the error rate of almost 30% is still high.
  • The question the authors ask in this work is: is there still room for improvement, and can the underlying factors that contribute to this error rate be identified? They analyse this question from two separate viewpoints: (1) to what extent does the quality of crowd-based annotations contribute to the error rate, and (2) what can be attributed to the dataset and models? Answers to these questions can provide insights for improving crowdsourced annotation in RE and suggest directions for future research.
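For concreteness, the sketch below shows roughly what a single TACRED-style instance looks like: a tokenized sentence, subject and object spans with their NER types, and a relation label. The field names and the example sentence are illustrative assumptions modelled on the commonly distributed TACRED JSON layout; they are not quoted from the paper.

```python
# Illustrative TACRED-style instance (field names assumed, not taken from the paper):
# a tokenized sentence, subject/object spans with NER types, and a relation label.
example = {
    "token": ["Douglas", "Flint", "will", "become", "chairman", ",",
              "succeeding", "Stephen", "Green", "."],
    "subj_start": 0, "subj_end": 1, "subj_type": "PERSON",   # "Douglas Flint"
    "obj_start": 4, "obj_end": 4, "obj_type": "TITLE",       # "chairman"
    "relation": "per:title",   # about 79.5% of examples are labeled no_relation
}
```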
Highlights
  • Relation Extraction (RE) is the task of extracting relationships between concepts and entities from text, where relations correspond to semantic categories such as per:spouse, org:founded_by, or org:subsidiaries (Figure 1)
  • The methods performing best on the dataset use some form of pre-training to improve Relation Extraction performance: fine-tuning pre-trained language representations (Alt et al., 2019; Shi and Lin, 2019; Joshi et al., 2019) or integrating external knowledge during pre-training, e.g. via joint language modelling and linking on entity-linked text (Zhang et al., 2019; Peters et al., 2019; Baldini Soares et al., 2019). The last two methods achieve a state-of-the-art performance of 71.5 F1
  • The question we ask in this work is: is there still room for improvement, and can we identify the underlying factors that contribute to this error rate? We analyse this question from two separate viewpoints: (1) to what extent does the quality of crowd-based annotations contribute to the error rate, and (2) what can be attributed to the dataset and models? Answers to these questions can provide insights for improving crowdsourced annotation in Relation Extraction, and suggest directions for future research
  • Even though our selection strategy was biased towards examples challenging for models, the large proportion of changed labels suggests that these examples were difficult to label for crowd workers as well
  • To improve the evaluation accuracy and reliability of future Relation Extraction methods, we provide a revised, extensively relabeled TACRED
  • We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled
  • We showed that models adopt heuristics when entities are unmasked and proposed that evaluation metrics should consider an instance’s difficulty
Results
  • The authors find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
  • The average model F1 score rises to 70.1%, a major improvement of 8% absolute over the 62.1% average F1 on the original test split, corresponding to a 21.1% error reduction (see the worked calculation after this list).
  • KnowBERT achieves the second-highest score of 58.7, 3% less than SpanBERT.
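The 21.1% figure is the relative error reduction implied by the two F1 scores, treating 100 − F1 as the error rate (the same framing behind the "error rate of almost 30%" above); a quick check:

```python
# Relative error reduction implied by the reported average F1 scores.
f1_original, f1_revised = 62.1, 70.1
error_before = 100 - f1_original          # 37.9
error_after = 100 - f1_revised            # 29.9
reduction = (error_before - error_after) / error_before
print(f"{reduction:.1%}")                 # -> 21.1%
```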
Conclusion
  • The low quality of crowd-generated labels in the Challenging group may be due to the complexity of these examples, or to other reasons such as a lack of detailed annotation guidelines or a lack of annotator training.
  • This suggests that, at least for the Dev and Test splits, crowdsourcing, even with the crowd worker quality checks used by Zhang et al. (2017), may not be sufficient to produce high-quality evaluation data.
  • The authors showed that models adopt heuristics when entities are unmasked and proposed that evaluation metrics should consider an instance’s difficulty
Summary
  • Introduction:

    Relation Extraction (RE) is the task of extracting relationships between concepts and entities from text, where relations correspond to semantic categories such as per:spouse, org:founded_by, or org:subsidiaries (Figure 1).
  • The methods performing best on the dataset use some form of pre-training to improve RE performance: fine-tuning pre-trained language representations (Alt et al., 2019; Shi and Lin, 2019; Joshi et al., 2019) or integrating external knowledge during pre-training, e.g. via joint language modelling and linking on entity-linked text (Zhang et al., 2019; Peters et al., 2019; Baldini Soares et al., 2019). The last two methods achieve a state-of-the-art performance of 71.5 F1.
  • While this performance is impressive, the error rate of almost 30% is still high.
  • The question the authors ask in this work is: is there still room for improvement, and can the underlying factors that contribute to this error rate be identified? They analyse this question from two separate viewpoints: (1) to what extent does the quality of crowd-based annotations contribute to the error rate, and (2) what can be attributed to the dataset and models? Answers to these questions can provide insights for improving crowdsourced annotation in RE and suggest directions for future research.
  • Results:

    The authors find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
  • The average model F1 score rises to 70.1%, a major improvement of 8% over the 62.1% average F1 on the original test split, corresponding to a 21.1% error reduction.
  • KnowBERT achieves the second-highest score of 58.7, 3% less than SpanBERT.
  • Conclusion:

    The low quality of crowd-generated labels in the Challenging group may be due to the complexity of these examples, or to other reasons such as a lack of detailed annotation guidelines or a lack of annotator training.
  • This suggests that, at least for the Dev and Test splits, crowdsourcing, even with the crowd worker quality checks used by Zhang et al. (2017), may not be sufficient to produce high-quality evaluation data.
  • The authors showed that models adopt heuristics when entities are unmasked and proposed that evaluation metrics should consider an instance’s difficulty
Tables
  • Table 1: TACRED statistics per split. About 79.5% of the examples are labeled as no_relation
  • Table 2: Re-annotation statistics for the TACRED Dev and Test splits
  • Table 3: Inter-annotator Kappa agreement for the relation validation task on the TACRED Dev and Test splits (H1, H2 = human re-annotators, H = revised labels, C = original TACRED crowd-generated labels)
  • Table 4: Misclassification types along with sentence examples, relevant false predictions, and error frequency. The problematic sentence parts are underlined (examples may be abbreviated due to space constraints)
  • Table 5: Test set F1 score on TACRED, our revised version, and weighted by difficulty (on the revised version). The weight per instance is determined by the number of incorrect predictions in our set of 49 RE models. The result suggests that SpanBERT generalizes better to more challenging examples, e.g. complex sentential context
  • Table 6: Test set performance on TACRED and the revised version for all 49 models we used to select the most challenging instances. We use the same entity masking strategy as Zhang et al. (2017), replacing each entity in the original sentence with a special <NER>-{SUBJ, OBJ} token, where <NER> is the corresponding NER tag. For models w/ POS/NER we concatenate part-of-speech and named entity tag embeddings to each input token embedding
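Table 6 refers to the entity masking strategy of Zhang et al. (2017). The sketch below is a minimal, assumed illustration of that idea (not the authors' code), reusing the hypothetical field names from the instance example in the Introduction: every token inside the subject or object span is replaced by a single special token that combines the entity's NER tag with its role, so that a model cannot rely on the surface form of the entity names.

```python
# Minimal sketch of entity masking (assumed field names, not the authors' code):
# tokens inside the subject/object span are replaced by a <NER>-{SUBJ, OBJ}
# token, e.g. PERSON-SUBJ or ORGANIZATION-OBJ.
def mask_entities(example: dict) -> list[str]:
    tokens = list(example["token"])
    for role in ("subj", "obj"):
        start, end = example[f"{role}_start"], example[f"{role}_end"]
        ner = example[f"{role}_type"]               # e.g. PERSON, TITLE
        special = f"{ner}-{role.upper()}"           # e.g. PERSON-SUBJ
        tokens[start:end + 1] = [special] * (end - start + 1)
    return tokens
```

The sketch keeps the original sentence length when masking a multi-token entity; whether the span is collapsed to a single token instead is a preprocessing detail assumed here, not stated in the summary above.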
Related work
  • Relation Extraction on TACRED: Recent RE approaches include PA-LSTM (Zhang et al., 2017) and GCN (Zhang et al., 2018), with the former combining recurrence and attention, and the latter leveraging graph convolutional neural networks.

    Many current approaches use unsupervised or semi-supervised pre-training: fine-tuning of language representations pre-trained at the token level (Alt et al., 2019; Shi and Lin, 2019) or span level (Joshi et al., 2019), fine-tuning of knowledge-enhanced word representations that are pre-trained on entity-linked text (Zhang et al., 2019; Peters et al., 2019), and “matching the blanks” pre-training (Baldini Soares et al., 2019).

    Dataset Evaluation: Chen et al. (2016) and Barnes et al. (2019) also use model results to assess dataset difficulty, for reading comprehension and sentiment analysis respectively. Other work explores bias in datasets and the adoption of shallow heuristics on biased datasets, in natural language inference (McCoy et al., 2019) and argument reasoning comprehension (Niven and Kao, 2019).

    Analyzing Trained Models: Explanation methods include occlusion- or gradient-based methods, which measure the relevance of input features to the output (Zintgraf et al., 2017; Harbecke et al., 2018), and probing tasks (Conneau et al., 2018; Kim et al., 2019) that test for the presence of specific features, e.g. in intermediate layers. More similar to our approach is the rewriting of instances (Jia and Liang, 2017; Ribeiro et al., 2018), but instead of evaluating model robustness we use rewriting to test explicit error hypotheses, similar to Wu et al. (2019).
Funding
  • This work has been supported by the German Federal Ministry of Education and Research as part of the projects DEEPLEE (01IW17001) and BBDC2 (01IS18025E), and by the German Federal Ministry for Economic Affairs and Energy as part of the project PLASS (01MD19003E)
Reference
  • Christoph Alt, Marc Hübner, and Leonhard Hennig. 2019. Improving relation extraction by pre-trained language representations. In Proceedings of the 2019 Conference on Automated Knowledge Base Construction, Amherst, Massachusetts.
  • Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.
  • Jeremy Barnes, Lilja Øvrelid, and Erik Velldal. 2019. Sentiment analysis is not solved! assessing and probing sentiment classification. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 12–23, Florence, Italy. Association for Computational Linguistics.
  • Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/daily mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2358–2367, Berlin, Germany. Association for Computational Linguistics.
  • Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single &!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia. Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • David Harbecke, Robert Schwarzenberg, and Christoph Alt. 2018. Learning explanations from language data. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 316–318, Brussels, Belgium. Association for Computational Linguistics.
  • Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38, Uppsala, Sweden. Association for Computational Linguistics.
  • Heng Ji and Ralph Grishman. 2011. Knowledge base population: Successful approaches and challenges. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 1148–1158, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.
  • Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke S. Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. ArXiv, abs/1907.10529.
  • Najoung Kim, Roma Patel, Adam Poliak, Patrick Xia, Alex Wang, Tom McCoy, Ian Tenney, Alexis Ross, Tal Linzen, Benjamin Van Durme, Samuel R. Bowman, and Ellie Pavlick. 2019. Probing what different NLP tasks teach machines about function word comprehension. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 235–249, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.
  • Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 39–48, Denver, Colorado. Association for Computational Linguistics.
  • Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664, Florence, Italy. Association for Computational Linguistics.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 43–54, Hong Kong, China. Association for Computational Linguistics.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Available as a preprint.
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865, Melbourne, Australia. Association for Computational Linguistics.
  • Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling Relations and Their Mentions without Labeled Text. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD ’10).
  • Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. ArXiv, abs/1904.05255.
  • Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel Weld. 2019. Errudite: Scalable, reproducible, and testable error analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 747–763, Florence, Italy. Association for Computational Linguistics.
  • Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on freebase via relation extraction and textual evidence. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2326–2336. Association for Computational Linguistics.
  • Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2335–2344, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.
  • Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205–2215, Brussels, Belgium. Association for Computational Linguistics.
  • Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45, Copenhagen, Denmark. Association for Computational Linguistics.
  • Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, Florence, Italy. Association for Computational Linguistics.
  • Luisa M. Zintgraf, Taco S. Cohen, Tameem Adel, and Max Welling. 2017. Visualizing deep neural network decisions: Prediction difference analysis. International Conference on Learning Representations.
  • CNN: For training we use the hyperparameters of Zhang et al. (2017). We employ Adagrad as the optimizer, with an initial learning rate of 0.1, and run training for 50 epochs. Starting from the 15th epoch, we gradually decrease the learning rate by a factor of 0.9. For the CNN we use 500 filters of sizes [2, 3, 4, 5] and apply l2 regularization with a coefficient of 10^-3 to all filter weights. We use tanh as the activation and apply dropout on the encoder output with a probability of 0.5. We use the same hyperparameters for variants with ELMo. For variants with BERT, we use an initial learning rate of 0.01 and decrease the learning rate by a factor of 0.9 every time the validation F1 score plateaus. We also use 200 filters of sizes [2, 3, 4, 5].
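    As a rough illustration of the epoch-based schedule described above (a sketch, not the authors' code), the PyTorch snippet below pairs Adagrad at lr 0.1 with a per-epoch decay of 0.9 starting at the 15th epoch; the Linear module is only a placeholder for the actual CNN encoder and classifier.

```python
import torch

# Placeholder model: stands in for the CNN encoder + classifier
# (42 output classes = 41 TACRED relation types + no_relation).
model = torch.nn.Linear(360, 42)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.1)

# Factor 1.0 for the first 14 epochs, then multiply by a further 0.9 each epoch,
# mirroring "decrease the learning rate by a factor of 0.9 from the 15th epoch".
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 0.9 ** max(0, epoch - 13))

for epoch in range(50):
    # ... one training epoch here: dropout 0.5 on the encoder output,
    # l2 regularization (1e-3) on the filter weights ...
    scheduler.step()
```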
  • LSTM/Bi-LSTM: For training we use the hyperparameters of Zhang et al. (2017). We employ Adagrad with an initial learning rate of 0.01, train for 30 epochs, and gradually decrease the learning rate by a factor of 0.9, starting from the 15th epoch. We use word dropout of 0.04 and recurrent dropout of 0.5. The Bi-LSTM consists of two layers with a hidden dimension of 500 for each direction. For training with ELMo and BERT we decrease the learning rate by a factor of 0.9 every time the validation F1 score plateaus.
  • GCN: We reuse the hyperparameters of Zhang et al. (2018). We employ SGD as the optimizer with an initial learning rate of 0.3, which is reduced by a factor of 0.9 every time the validation F1 score plateaus. We use dropout of 0.5 between all but the last GCN layer, word dropout of 0.04, and embedding and encoder dropout of 0.5. Like the authors, we use path-centric pruning with K=1. We use two 200-dimensional GCN layers and, similarly, two 200-dimensional feedforward layers with ReLU activation.
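    The "reduce on plateau" rule used here (and for the ELMo/BERT variants above) maps naturally onto PyTorch's ReduceLROnPlateau in max mode, since the validation F1 is to be maximized. The sketch below is an assumed illustration, not the authors' code; the placeholder model, epoch budget, and hard-coded dev F1 stand in for the actual GCN and evaluation loop.

```python
import torch

# Placeholder model: stands in for the GCN encoder + classifier.
model = torch.nn.Linear(360, 42)
optimizer = torch.optim.SGD(model.parameters(), lr=0.3)

# Multiply the learning rate by 0.9 whenever the monitored dev F1 stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.9, patience=1)   # "max": higher F1 is better

num_epochs = 100                # placeholder epoch budget
for epoch in range(num_epochs):
    # ... one training epoch over path-centric pruned trees (K=1),
    # dropout 0.5 between all but the last GCN layer ...
    dev_f1 = 0.0                # placeholder: compute micro-F1 on the dev split here
    scheduler.step(dev_f1)      # decays the lr when dev_f1 plateaus
```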