Evaluating NLP Models via Contrast Sets
Abstract:
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm that helps to close these systematic gaps: after constructing a dataset, the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets that characterize the local decision boundary around the test instances.
Introduction
- Progress in natural language processing (NLP) has long been measured with standard benchmark datasets (e.g., Marcus et al., 1993).
- These benchmarks help to provide a uniform evaluation of new modeling developments.
- Three similarly-colored and similarly-posed chow dogs are face to face in one image.
- Example Image Perturbation: Two similarly-colored and similarly-posed chow dogs are face to face in one image
Highlights
- Progress in natural language processing (NLP) has long been measured with standard benchmark datasets (e.g., Marcus et al., 1993)
- We propose that dataset authors manually perturb instances from their test set, creating contrast sets which characterize the local decision boundary around the test instances (Section 2)
- We show that using about a person-week of work can yield high-quality perturbed test sets of approximately 1000 instances for many commonly studied NLP benchmarks, though the amount of work depends on the nature of the task (Section 3)
- Dataset-Specific Instantiations: The process for creating contrast sets is dataset-specific. While we present general guidelines that hold across many tasks, experts must still characterize the type of phenomena each individual dataset is intended to capture
- We presented a new annotation paradigm for constructing more rigorous test sets for NLP
- Our procedure maintains most of the established processes for dataset creation but fills in the systematic gaps that are typically present in datasets
Methods
- Post-hoc Construction of Contrast Sets: Improving the evaluation for existing datasets well after their release is usually too late: new models have been designed, research papers have been published, and the community has absorbed potentially incorrect insights.
- Post-hoc contrast sets may be biased by existing models.
- The authors instead recommend that new datasets include contrast sets upon release, so that the authors can characterize beforehand when they will be satisfied that a model has acquired the dataset’s intended capabilities.
- The effort to create contrast sets is a small fraction of the effort required to produce a new dataset in the first place
Results
- The parser that the authors use achieves 95.7% unlabeled attachment score on the English Penn Treebank (Dozat and Manning, 2017).
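For readers unfamiliar with the metric, unlabeled attachment score (UAS) is the fraction of tokens whose predicted syntactic head matches the gold head. A minimal sketch of that computation, assuming head indices are stored as flat lists (illustrative only, not the paper's evaluation code):

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Fraction of tokens whose predicted head index matches the gold head.

    Both arguments are lists of head indices, one per token (0 conventionally
    denotes the root). Illustrative sketch, not the paper's evaluation script.
    """
    assert len(gold_heads) == len(predicted_heads)
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)

# Toy example: 4 of 5 predicted heads match the gold heads -> UAS = 0.8.
print(unlabeled_attachment_score([2, 0, 2, 5, 3], [2, 0, 2, 5, 2]))
```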
Conclusion
- The authors presented a new annotation paradigm for constructing more rigorous test sets for NLP.
- The authors created contrast sets for 10 NLP datasets and released this data as new evaluation benchmarks.
- The authors recommend that future data collection efforts create contrast sets to provide more comprehensive evaluations for both existing and new NLP datasets.
- While the authors have created thousands of new test examples across a wide variety of datasets, the authors have only taken small steps towards the rigorous evaluations the authors would like to see in NLP.
- The last several years have brought dramatic modeling advancements; evaluation methodologies and datasets need to see similar improvements
Tables
- Table1: We create contrast sets for 10 datasets and show instances from seven of them here
- Table2: Models struggle on the contrast sets compared to the original test sets. For each dataset, we use a model that is at or near state-of-the-art performance and evaluate it on the “# Examples” examples in the contrast sets (not including the original example). We report percentage accuracy for NLVR2, IMDb, PERSPECTRUM, MATRES, and BoolQ; F1 scores for DROP and QUOREF; Exact Match (EM) scores for ROPES and MC-TACO; and unlabeled attachment score on modified attachments for the UD English dataset. We also report contrast consistency: the percentage of the “# Sets” contrast sets for which a model’s predictions are correct for all examples in the set (including the original example). More details on datasets, models, and evaluation metrics can be found in Appendix A and Appendix B. A minimal sketch of the consistency computation appears after this list of tables
- Table3: Humans achieve similar performance on the contrast sets and the original test sets. The metrics here are the same as those in Table 2
- Table4: Accuracy breakdown of the perturbation types for MATRES
- Table5: Accuracy breakdown of the perturbation types for DROP
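To make the contrast consistency metric in Table 2 concrete, here is a minimal sketch of how perturbed-example accuracy and contrast consistency could be computed for a classification-style task. The data layout (each contrast set is a list of examples, original first, each with a gold label and a model prediction) is an assumption for illustration; this is not the paper's released evaluation code.

```python
from typing import Dict, List

def contrast_set_metrics(contrast_sets: List[List[Dict]]) -> Dict[str, float]:
    """Compute perturbed-example accuracy and contrast consistency.

    Each contrast set is a list of examples; by convention here the first
    example is the original test instance and the rest are perturbations.
    Accuracy is reported over perturbed examples only, while consistency
    requires every example in the set (original included) to be correct,
    mirroring the Table 2 caption. Illustrative sketch with an assumed layout.
    """
    perturbed_correct, perturbed_total, consistent_sets = 0, 0, 0
    for contrast_set in contrast_sets:
        perturbed = contrast_set[1:]
        perturbed_correct += sum(ex["prediction"] == ex["label"] for ex in perturbed)
        perturbed_total += len(perturbed)
        if all(ex["prediction"] == ex["label"] for ex in contrast_set):
            consistent_sets += 1
    return {
        "perturbed_accuracy": perturbed_correct / perturbed_total,
        "contrast_consistency": consistent_sets / len(contrast_sets),
    }

sets = [
    [  # one contrast set: an NLVR2-style original plus two perturbations
        {"label": True,  "prediction": True},   # original: correct
        {"label": False, "prediction": True},   # label-flipping perturbation: wrong
        {"label": True,  "prediction": True},   # perturbation: correct
    ],
]
print(contrast_set_metrics(sets))
# {'perturbed_accuracy': 0.5, 'contrast_consistency': 0.0}
```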
Related work
- Here, we present methods related to contrast sets. Section 2.1 discusses other related work such as adversarial examples and input perturbations.
- Counterfactually-augmented data: Kaushik et al. (2019) have crowd workers minimally edit examples to change their labels; their focus is on augmenting training data, whereas ours is on creating expert-crafted contrast sets that evaluate local decision boundaries. On sentiment analysis, the task studied by both us and Kaushik et al. (2019), the evaluation results were very similar. This suggests that contrast sets may be feasible to crowdsource for tasks that are easily explainable to crowd workers.
- Generalization to new data distributions: The MRQA shared task (Fisch et al., 2019) evaluates generalization to held-out datasets which require different types of reasoning (e.g., numerical reasoning, compositional questions) and come from different domains (e.g., biomedical, newswire, Wikipedia). We instead perturb in-domain examples to fill in gaps in the original data distribution.
Study subjects and analysis
diverse NLP datasets: 10
Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets, up to 25% in some cases
existing NLP datasets: 10
We show that using about a person-week of work can yield high-quality perturbed test sets of approximately 1000 instances for many commonly studied NLP benchmarks, though the amount of work depends on the nature of the task (Section 3). We apply this annotation paradigm to a diverse set of 10 existing NLP datasets—including visual reasoning, reading comprehension, sentiment analysis, and syntactic parsing—to demonstrate its wide applicability and efficacy (Section 4). Although contrast sets are not intentionally adversarial, state-of-the-art models perform dramatically worse on our contrast sets than on the original test sets, especially when evaluating consistency
datasets: 3
How to Create Contrast Sets (Section 3): Here, we walk through our process for creating contrast sets for three datasets (DROP, NLVR2, and UD Parsing). Examples are shown in Figure 1 and Table 1
dogs: 4
For example, we might change The left image contains twice the number of dogs as the right image to The left image contains three times the number of dogs as the right image. Similarly, given an image pair with four dogs in the left and two dogs in the right, we can replace individual images with photos of variably-sized groups of dogs. The textual perturbations were often changes in quantifiers (e.g., at least one to exactly one), entities (e.g., dogs to cats), or properties thereof (e.g., orange glass to green glass)
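In the paper these perturbations are written by hand by experts; purely to illustrate the kinds of edits described above (quantifier, quantity, entity, and property changes), the toy sketch below applies such substitutions to a sentence. The substitution table and function name are hypothetical, and each perturbed sentence would still need its gold label assigned by the annotator.

```python
# Hypothetical substitution table illustrating the edit types described above
# (quantifiers, quantities, entities, properties). Real contrast sets are
# written by experts, and every perturbed example is re-labeled by hand.
SUBSTITUTIONS = {
    "at least one": "exactly one",                  # quantifier change
    "twice the number": "three times the number",   # quantity change
    "dogs": "cats",                                  # entity change
    "orange glass": "green glass",                   # property change
}

def candidate_perturbations(sentence: str) -> list:
    """Return one candidate perturbed sentence per applicable substitution."""
    candidates = []
    for old, new in SUBSTITUTIONS.items():
        if old in sentence:
            candidates.append(sentence.replace(old, new))
    return candidates

print(candidate_perturbations(
    "The left image contains twice the number of dogs as the right image."))
```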
NLP datasets: 10
Original Datasets (Section 4.1): We create contrast sets for 10 NLP datasets (full descriptions are provided in Appendix A), including NLVR2 (Suhr et al., 2019)
datasets: 4
We (the authors) test ourselves on these examples. Human performance is comparable across the original test and contrast set examples for the four datasets (IMDb, PERSPECTRUM, QUOREF, and ROPES; Table 3)
NLP datasets: 10
By shifting evaluations from accuracy on i.i.d. test sets to consistency on contrast sets, we can better examine whether models have learned the desired capabilities or simply captured the idiosyncrasies of a dataset. We created contrast sets for 10 NLP datasets and released this data as new evaluation benchmarks. We recommend that future data collection efforts create contrast sets to provide more comprehensive evaluations for both existing and new NLP datasets
datasets: 10
We create contrast sets for 10 datasets and show instances from seven of them here (Table 1). Models struggle on the contrast sets compared to the original test sets (Table 2); the metrics and the contrast consistency definition are described in the Tables section above, with more details on datasets, models, and evaluation metrics in Appendix A and Appendix B
Reference
- Lars Ahrenberg. 2007. LinES: an English-Swedish parallel treebank. In NODALIDA.
- Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains. Machine Learning.
- Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
- Pradeep Dasigi, Nelson F Liu, Ana Marasovic, Noah A Smith, and Matt Gardner. 2019. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In EMNLP.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
- Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In ACM AIES.
- Timothy Dozat and Christopher D Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.
- Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL.
- Shi Feng, Eric Wallace, and Jordan Boyd-Graber. 2019. Misleading failures of partial-input baselines. In ACL.
- Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In EMNLP.
- Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In EMNLP MRQA Workshop.
- Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In EMNLP.
- Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In ACL.
- Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In ACL.
- Omer Goldman, Veronica Latcinnik, Ehud Nave, Amir Globerson, and Jonathan Berant. 2018. Weakly supervised semantic parsing with abstract examples. In ACL.
- Sihao Chen, Daniel Khashabi, Wenpeng Yin, Chris Callison-Burch, and Dan Roth. 2019. Seeing things from a different angle: Discovering diverse perspectives about claims. In NAACL.
- Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.
- Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural Yes/No questions. In NAACL.
- Minghao Hu, Yuxing Peng, Zhen Huang, and Dongsheng Li. 2019. A multi-type multi-span network for reading comprehension that requires discrete reasoning. In EMNLP.
- Michael Collins and James Brooks. 1995. Prepositional phrase attachment through a backed-off model. In Third Workshop on Very Large Corpora.
- Pierre Isabelle, Colin Cherry, and George Foster. 2017. A challenge set approach to evaluating machine translation. In EMNLP.
- Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.
- Qiang Ning, Sanjay Subramanian, and Dan Roth. 2019. An improved neural baseline for temporal relation extraction. In EMNLP.
- Divyansh Kaushik, Eduard Hovy, and Zachary C Lipton. 2019. Learning the difference that makes a difference with counterfactually-augmented data. In ICLR.
- Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In EMNLP.
- Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In EMNLP.
- Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. 2019. Reasoning over paragraph effects in situations. In EMNLP MRQA Workshop.
- Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. 2018. Detecting and correcting for label shift with black box predictors. In ICML.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Qiang Ning, Hao Wu, and Dan Roth. 2018. A MultiAxis Annotation Scheme for Event Temporal Relations. In ACL.
- Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal dependencies v1: A multilingual treebank collection. In LREC.
- Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In *SEM.
- Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. Do ImageNet classifiers generalize to ImageNet? In ICML.
- Marco Tulio Ribeiro, Carlos Guestrin, and Sameer Singh. 2019. Are red roses red? Evaluating consistency of question-answering models. In ACL.
- Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In ACL.
- Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. 2018. Gender bias in neural natural language processing. arXiv preprint arXiv:1807.11714.
- Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In ACL.
- Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. In Computational Linguistics.
- Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In EMNLP.
- Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. In ACL.
- Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In COLING.
- Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.
- Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In NAACL.
- Manuela Sanguinetti and Cristina Bosco. 2015. ParTUT: The Turin University Parallel Treebank. In Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project.
- Rico Sennrich. 2017. How grammatical is character-level neural machine translation? Assessing MT quality with contrastive translation pairs. In EACL.
- Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurelie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. 2017. Foil it! Find One mismatch between image and language caption. In ACL.
- Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the loglikelihood function. In Journal of Statistical Planning and Inference.
- Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Chris Manning. 2014. A gold standard dependency corpus for English. In LREC.
- Alane Suhr and Yoav Artzi. 2019. NLVR2 visual bias analysis. arXiv preprint arXiv:1909.10411.
- Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In ACL.
- Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In ICLR.
- Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP.
- Naushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating time expressions, events, and temporal relations. In *SEM.
- Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019a. Universal adversarial triggers for attacking and analyzing NLP. In EMNLP.
- Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. 2019b. Trick me if you can: Human-in-the-loop generation of adversarial question answering examples. In TACL.
- Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2019. BLiMP: A benchmark of linguistic minimal pairs for English. arXiv preprint arXiv:1912.00582.
- Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel Weld. 2019. Errudite: Scalable, reproducible, and testable error analysis. In ACL.
- Chhavi Yadav and Leon Bottou. 2019. Cold case: The lost MNIST digits. In NeurIPS.
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS.
- Amir Zeldes. 2017. The GUM corpus: Creating multilayer resources in the classroom. In LREC.
- Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In EMNLP.
- Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In ACL.
- Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In NAACL.
- Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. “Going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In EMNLP.
- Natural Language Visual Reasoning 2 (NLVR2) Given a natural language sentence about two photographs, the task is to determine if the sentence is true (Suhr et al., 2019). The dataset has highly compositional language, e.g., The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing. To succeed at NLVR2, a model is supposed to be able to detect and count objects, recognize spatial relationships, and understand the natural language that describes these phenomena.
- Internet Movie Database (IMDb) The task is to predict the sentiment (positive or negative) of a movie review (Maas et al., 2011). We use the same set of reviews from Kaushik et al. (2019) in order to analyze the differences between crowd-edited reviews and expert-edited reviews.
- Temporal relation extraction (MATRES) The task is to determine what temporal relationship exists between two events, i.e., whether some event happened before or after another event (Ning et al., 2018). MATRES has events and temporal relations labeled for approximately 300 news articles. The event annotations are taken from the data provided in the TempEval3 workshop (UzZaman et al., 2013) and the temporal relations are re-annotated based on a multi-axis formalism. We assume that the events are given and only need to classify the relation label between them.
- Reasoning about perspectives (PERSPECTRUM) Given a debate-worthy natural language claim, the task is to identify the set of relevant argumentative sentences that represent perspectives for/against the claim (Chen et al., 2019). We focus on the stance prediction sub-task: a binary prediction of whether a relevant perspective is for/against the given claim.
- Discrete Reasoning Over Paragraphs (DROP) A reading comprehension dataset that requires numerical reasoning, e.g., adding, sorting, and counting numbers in paragraphs (Dua et al., 2019). In order to compute the consistency metric for the span answers of DROP, we report the percentage of contrast sets in which F1 for all instances is above 0.8 (a minimal sketch of this thresholded consistency appears after these dataset and model descriptions).
- QUOREF A reading comprehension task with span selection questions that require coreference resolution (Dasigi et al., 2019). In this dataset, most questions can be localized to a single event in the passage, and reference an argument in that event that is typically a pronoun or other anaphoric reference. Correctly answering the question requires resolving the pronoun. We use the same definition for consistency for QUOREF as we did for DROP.
- Reasoning Over Paragraph Effects in Situations (ROPES) A reading comprehension dataset that requires applying knowledge from a background passage to new situations (Lin et al., 2019). This task has background paragraphs drawn mostly from science texts that describe causes and effects (e.g., that brightly colored flowers attract insects), and situations written by crowd workers that instantiate either the cause (e.g., bright colors) or the effect (e.g., attracting insects). Questions are written that query the application of the statements in the background paragraphs to the instantiated situation. Correctly answering the questions is intended to require understanding how free-form causal language can be understood and applied. We use the same consistency metric for ROPES as we did for DROP and QUOREF.
- BoolQ A dataset of reading comprehension instances with Boolean (yes or no) answers (Clark et al., 2019). These questions were obtained from organic Google search queries and paired with paragraphs from Wikipedia pages that are labeled as sufficient to deduce the answer. As the questions are drawn from a distribution of what people search for on the internet, there is no clear set of “intended phenomena” in this data; it is an eclectic mix of different kinds of questions.
- MC-TACO A dataset of reading comprehension questions about multiple temporal common-sense phenomena (Zhou et al., 2019). Given a short paragraph (often a single sentence), a question, and a collection of candidate answers, the task is to determine which of the candidate answers are plausible. For example, the paragraph might describe a storm and the question might ask how long the storm lasted, with candidate answers ranging from seconds to weeks. This dataset is intended to test a system’s knowledge of typical event durations, orderings, and frequency. As the paragraph does not contain the information necessary to answer the question, this dataset is largely a test of background (common sense) knowledge.
- Model: We use LXMERT (Tan and Bansal, 2019) trained on the NLVR2 training set.
- Model: We use the same BERT model setup and training data as Kaushik et al. (2019), which allows us to fairly compare the crowd and expert revisions.
- Contrast Set Statistics: We use 100 reviews from the validation set and 488 from the test set of Kaushik et al. (2019). Three annotators spent approximately 70 hours constructing and validating the dataset.
- Model: We use CogCompTime 2.0 (Ning et al., 2019).
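As described for DROP above, consistency for span-answer datasets counts a contrast set only when every instance in it exceeds a token-level F1 of 0.8. Below is a minimal sketch of that thresholded variant, using a simplified bag-of-tokens F1 in place of the official DROP scorer (which additionally normalizes numbers and handles multi-span answers); the data layout is assumed for illustration.

```python
from collections import Counter
from typing import List

def token_f1(prediction: str, gold: str) -> float:
    """Simplified bag-of-tokens F1; an illustrative stand-in for the
    official DROP scorer, not a reimplementation of it."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def span_consistency(contrast_sets: List[List[dict]], threshold: float = 0.8) -> float:
    """Fraction of contrast sets in which every instance (original included)
    scores F1 above the threshold, mirroring the definition used for
    DROP, QUOREF, and ROPES."""
    consistent = sum(
        all(token_f1(ex["prediction"], ex["gold"]) > threshold for ex in contrast_set)
        for contrast_set in contrast_sets
    )
    return consistent / len(contrast_sets)

example_sets = [[
    {"gold": "14 yards", "prediction": "14 yards"},   # original: F1 = 1.0
    {"gold": "7 yards",  "prediction": "14 yards"},   # perturbed: F1 = 0.5
]]
print(span_consistency(example_sets))  # 0.0: the perturbed instance falls below 0.8
```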