Evaluating NLP Models via Contrast Sets

Basmova Victoria
Chen Sihao
Dua Dheeru
Gottumukkala Ananth
Hajishirzi Hanna

Abstract:

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data: after a dataset is constructed, the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets, up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar protocols.

Introduction
  • Progress in natural language processing (NLP) has long been measured with standard benchmark datasets (e.g., Marcus et al, 1993).
  • These benchmarks help to provide a uniform evaluation of new modeling developments.
  • Example NLVR2 perturbations (from Figure 1): a textual perturbation edits the sentence about two chow dogs face to face in one image (e.g., changing "Two" to "Three"), and an image perturbation keeps the sentence but replaces one of the photographs.
Highlights
  • Progress in natural language processing (NLP) has long been measured with standard benchmark datasets (e.g., Marcus et al, 1993)
  • We propose that dataset authors manually perturb instances from their test set, creating contrast sets which characterize the local decision boundary around the test instances (Section 2)
  • We show that using about a person-week of work can yield high-quality perturbed test sets of approximately 1000 instances for many commonly studied NLP benchmarks, though the amount of work depends on the nature of the task (Section 3)
  • Dataset-Specific Instantiations The process for creating contrast sets is dataset-specific: we present general guidelines that hold across many tasks, but experts must still characterize the types of phenomena each individual dataset is intended to capture.
  • We presented a new annotation paradigm for constructing more rigorous test sets for NLP
  • Our procedure maintains most of the established processes for dataset creation but fills in the systematic gaps that are typically present in datasets
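To make the contrast-set idea concrete, the following is a minimal sketch (in Python) of how a single contrast set can be represented: an original test instance bundled with expert-written perturbations that minimally edit it. The class and field names are illustrative rather than the paper's released data format; the sentences and labels follow the NLVR2 example discussed later in this summary (an image pair with four dogs on the left and two on the right).

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Example:
        sentence: str   # NLVR2-style statement about a pair of images
        label: bool     # True if the statement holds for the image pair

    @dataclass
    class ContrastSet:
        original: Example
        perturbations: List[Example] = field(default_factory=list)

    # Image pair: four dogs in the left image, two dogs in the right image.
    contrast_set = ContrastSet(
        original=Example(
            "The left image contains twice the number of dogs as the right image.",
            label=True,
        ),
        perturbations=[
            # Quantity perturbation ("twice" -> "three times") flips the gold label.
            Example(
                "The left image contains three times the number of dogs as the right image.",
                label=False,
            ),
        ],
    )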
Methods
  • Post-hoc Construction of Contrast Sets Improving the evaluation for existing datasets well after their release is usually too late: new models have been designed, research papers have been published, and the community has absorbed potentially incorrect insights.
  • Post-hoc contrast sets may be biased by existing models.
  • The authors instead recommend that new datasets include contrast sets upon release, so that dataset authors can characterize beforehand when they will be satisfied that a model has acquired the dataset’s intended capabilities.
  • The effort to create contrast sets is a small fraction of the effort required to produce a new dataset in the first place
Results
  • The parser that the authors use achieves 95.7% unlabeled attachment score on the English Penn Treebank (Dozat and Manning, 2017).
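For context, unlabeled attachment score (UAS), the dependency-parsing metric reported here and in Table 2, is the percentage of words whose predicted head matches the gold head; for the contrast sets, Table 2 restricts this score to the attachments modified by the perturbation. A minimal sketch with made-up head indices (not the authors' evaluation code):

    from typing import Sequence

    def unlabeled_attachment_score(pred_heads: Sequence[int], gold_heads: Sequence[int]) -> float:
        """Percentage of tokens whose predicted head index equals the gold head index."""
        assert len(pred_heads) == len(gold_heads)
        correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
        return 100.0 * correct / len(gold_heads)

    # Hypothetical 5-token sentence (0 = root); one of five attachments is wrong.
    print(unlabeled_attachment_score([2, 0, 2, 5, 3], [2, 0, 2, 5, 2]))  # 80.0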
Conclusion
  • The authors presented a new annotation paradigm for constructing more rigorous test sets for NLP.
  • The authors created contrast sets for 10 NLP datasets and released this data as new evaluation benchmarks.
  • The authors recommend that future data collection efforts create contrast sets to provide more comprehensive evaluations for both existing and new NLP datasets.
  • While the authors have created thousands of new test examples across a wide variety of datasets, the authors have only taken small steps towards the rigorous evaluations the authors would like to see in NLP.
  • The last several years have seen dramatic modeling advancements; evaluation methodologies and datasets need to see similar improvements.
Tables
  • Table 1: We create contrast sets for 10 datasets and show instances from seven of them here
  • Table 2: Models struggle on the contrast sets compared to the original test sets. For each dataset, we use a model that is at or near state-of-the-art performance and evaluate it on the “# Examples” examples in the contrast sets (not including the original example). We report percentage accuracy for NLVR2, IMDb, PERSPECTRUM, MATRES, and BoolQ; F1 scores for DROP and QUOREF; Exact Match (EM) scores for ROPES and MC-TACO; and unlabeled attachment score on modified attachments for the UD English dataset. We also report contrast consistency: the percentage of the “# Sets” contrast sets for which a model’s predictions are correct for all examples in the set (including the original example). More details on datasets, models, and evaluation metrics can be found in Appendix A and Appendix B
  • Table 3: Humans achieve similar performance on the contrast sets and the original test sets. The metrics here are the same as those in Table 2
  • Table 4: Accuracy breakdown of the perturbation types for MATRES
  • Table 5: Accuracy breakdown of the perturbation types for DROP
Related work
  • Here, we present methods related to contrast sets; Section 2.1 discusses other related work, such as adversarial examples and input perturbations.

    Counterfactually-augmented data Kaushik et al. (2019) crowdsource minimal edits to existing examples that change their labels, whereas our work focuses on creating expert-crafted contrast sets that evaluate local decision boundaries. On sentiment analysis, the task studied by both us and Kaushik et al. (2019), the evaluation results were very similar. This suggests that contrast sets may be feasible to crowdsource for tasks that are easily explainable to crowd workers.

    Generalization to new data distributions The MRQA shared task (Fisch et al, 2019) evaluates generalization to held-out datasets which require different types of reasoning (e.g., numerical reasoning, compositional questions) and come from different domains (e.g., biomedical, newswire, Wikipedia). We instead perturb in-domain examples to fill in gaps in the original data distribution.
Study subjects and analysis
diverse NLP datasets: 10
Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets, up to 25% in some cases.

existing NLP datasets: 10
We show that using about a person-week of work can yield high-quality perturbed test sets of approximately 1000 instances for many commonly studied NLP benchmarks, though the amount of work depends on the nature of the task (Section 3). We apply this annotation paradigm to a diverse set of 10 existing NLP datasets—including visual reasoning, reading comprehension, sentiment analysis, and syntactic parsing—to demonstrate its wide applicability and efficacy (Section 4). Although contrast sets are not intentionally adversarial, state-of-the-art models perform dramatically worse on our contrast sets than on the original test sets, especially when evaluating consistency

datasets: 3
3 How to Create Contrast Sets. Here, we walk through our process for creating contrast sets for three datasets (DROP, NLVR2, and UD Parsing). Examples are shown in Figure 1 and Table 1

dogs: 4
For example, we might change The left image contains twice the number of dogs as the right image to The left image contains three times the number of dogs as the right image. Similarly, given an image pair with four dogs in the left and two dogs in the right, we can replace individual images with photos of variably-sized groups of dogs. The textual perturbations were often changes in quantifiers (e.g., at least one to exactly one), entities (e.g., dogs to cats), or properties thereof (e.g., orange glass to green glass)

NLP datasets: 10
4.1 Original Datasets. We create contrast sets for 10 NLP datasets (full descriptions are provided in Appendix A), including NLVR2 (Suhr et al., 2019).

datasets: 4
We (the authors) test ourselves on these examples. Human performance is comparable across the original test and contrast set examples for the four datasets: IMDb, PERSPECTRUM, QUOREF, and ROPES (Table 3).

NLP datasets: 10
By shifting evaluations from accuracy on i.i.d. test sets to consistency on contrast sets, we can better examine whether models have learned the desired capabilities or simply captured the idiosyncrasies of a dataset. We created contrast sets for 10 NLP datasets and released this data as new evaluation benchmarks. We recommend that future data collection efforts create contrast sets to provide more comprehensive evaluations for both existing and new NLP datasets

datasets: 10
We create contrast sets for 10 datasets and show instances from seven of them here. Models struggle on the contrast sets compared to the original test sets. For each dataset, we use a model that is at or near state-of-the-art performance and evaluate it on the “# Examples” examples in the contrast sets (not including the original example). We report percentage accuracy for NLVR2, IMDb, PERSPECTRUM, MATRES, and BoolQ; F1 scores for DROP and QUOREF; Exact Match (EM) scores for ROPES and MC-TACO; and unlabeled attachment score on modified attachments for the UD English dataset. We also report contrast consistency: the percentage of the “# Sets” contrast sets for which a model’s predictions are correct for all examples in the set (including the original example). More details on datasets, models, and evaluation metrics can be found in Appendix A and Appendix B
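As a rough illustration of the contrast consistency metric described above, the sketch below counts a contrast set as consistent only when the model is correct on every example in the set, including the original. The predict and is_correct functions and the dictionary keys are assumptions made for illustration; this is not the authors' released evaluation code.

    from typing import Callable, Dict, Sequence

    def contrast_consistency(
        contrast_sets: Sequence[Dict],  # each: {"original": ex, "perturbations": [ex, ...]}
        predict: Callable,              # model prediction function (assumed)
        is_correct: Callable,           # per-example correctness check (assumed)
    ) -> float:
        """Percentage of contrast sets answered correctly on every example, original included."""
        consistent = 0
        for cset in contrast_sets:
            examples = [cset["original"]] + list(cset["perturbations"])
            if all(is_correct(predict(ex), ex["answer"]) for ex in examples):
                consistent += 1
        return 100.0 * consistent / len(contrast_sets)

The per-example check depends on the dataset's metric (accuracy, exact match, F1, or attachment score); for the span answers in DROP, the dataset notes below count a prediction as correct within a set when its token-level F1 exceeds 0.8.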

Reference
  • Lars Ahrenberg. 2007. LinES: an English-Swedish parallel treebank. In NODALIDA.
  • Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains. Machine Learning.
  • Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
  • Pradeep Dasigi, Nelson F Liu, Ana Marasovic, Noah A Smith, and Matt Gardner. 2019. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In EMNLP.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In ACM AIES.
  • Timothy Dozat and Christopher D Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.
  • Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL.
  • Shi Feng, Eric Wallace, and Jordan Boyd-Graber. 2019. Misleading failures of partial-input baselines. In ACL.
  • Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In EMNLP.
  • Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In EMNLP MRQA Workshop.
  • Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In EMNLP.
  • Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In ACL.
  • Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In ACL.
  • Omer Goldman, Veronica Latcinnik, Ehud Nave, Amir Globerson, and Jonathan Berant. 2018. Weakly supervised semantic parsing with abstract examples. In ACL.
  • Sihao Chen, Daniel Khashabi, Wenpeng Yin, Chris Callison-Burch, and Dan Roth. 2019. Seeing things from a different angle: Discovering diverse perspectives about claims. In NAACL.
  • Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.
  • Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural Yes/No questions. In NAACL.
  • Minghao Hu, Yuxing Peng, Zhen Huang, and Dongsheng Li. 2019. A multi-type multi-span network for reading comprehension that requires discrete reasoning. In EMNLP.
  • Michael Collins and James Brooks. 1995. Prepositional phrase attachment through a backed-off model. In Third Workshop on Very Large Corpora.
  • Pierre Isabelle, Colin Cherry, and George Foster. 2017. A challenge set approach to evaluating machine translation. In EMNLP.
  • Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.
  • Qiang Ning, Sanjay Subramanian, and Dan Roth. 2019. An improved neural baseline for temporal relation extraction. In EMNLP.
  • Divyansh Kaushik, Eduard Hovy, and Zachary C Lipton. 2019. Learning the difference that makes a difference with counterfactually-augmented data. In ICLR.
  • Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In EMNLP.
  • Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In EMNLP.
  • Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. 2019. Reasoning over paragraph effects in situations. In EMNLP MRQA Workshop.
  • Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. 2018. Detecting and correcting for label shift with black box predictors. In ICML.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Qiang Ning, Hao Wu, and Dan Roth. 2018. A MultiAxis Annotation Scheme for Event Temporal Relations. In ACL.
  • Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal dependencies v1: A multilingual treebank collection. In LREC.
  • Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In *SEM.
  • Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. Do ImageNet classifiers generalize to ImageNet? In ICML.
  • Marco Tulio Ribeiro, Carlos Guestrin, and Sameer Singh. 2019. Are red roses red? Evaluating consistency of question-answering models. In ACL.
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018a. Semantically equivalent adversarial rules for debugging NLP models. In ACL.
  • Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. 2018. Gender bias in neural natural language processing. arXiv preprint arXiv:1807.11714.
  • Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In ACL.
  • Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics.
  • Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In EMNLP.
  • Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. In ACL.
  • Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In COLING.
  • Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.
  • Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In NAACL.
  • Manuela Sanguinetti and Cristina Bosco. 2015. PartTUT: The Turin university parallel treebank. In Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project.
  • Rico Sennrich. 2017. How grammatical is character-level neural machine translation? Assessing MT quality with contrastive translation pairs. In EACL.
  • Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurelie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. 2017. Foil it! Find One mismatch between image and language caption. In ACL.
  • Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference.
  • Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Chris Manning. 2014. A gold standard dependency corpus for English. In LREC.
  • Alane Suhr and Yoav Artzi. 2019. NLVR2 visual bias analysis. arXiv preprint arXiv:1909.10411.
  • Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In ACL.
  • Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In ICLR.
  • Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP.
  • Naushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating time expressions, events, and temporal relations. In *SEM.
  • Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019a. Universal adversarial triggers for attacking and analyzing NLP. In EMNLP.
  • Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. 2019b. Trick me if you can: Human-in-the-loop generation of adversarial question answering examples. In TACL.
  • Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2019. BLiMP: A benchmark of linguistic minimal pairs for English. arXiv preprint arXiv:1912.00582.
  • Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel Weld. 2019. Errudite: Scalable, reproducible, and testable error analysis. In ACL.
  • Chhavi Yadav and Leon Bottou. 2019. Cold case: The lost MNIST digits. In NeurIPS.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS.
  • Amir Zeldes. 2017. The GUM corpus: Creating multilayer resources in the classroom. In LREC.
  • Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In EMNLP.
  • Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In ACL.
  • Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In NAACL.
  • Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. “Going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In EMNLP.
  • Natural Language Visual Reasoning 2 (NLVR2) Given a natural language sentence about two photographs, the task is to determine if the sentence is true (Suhr et al., 2019). The dataset has highly compositional language, e.g., The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing. To succeed at NLVR2, a model is supposed to be able to detect and count objects, recognize spatial relationships, and understand the natural language that describes these phenomena.
  • Internet Movie Database (IMDb) The task is to predict the sentiment (positive or negative) of a movie review (Maas et al., 2011). We use the same set of reviews from Kaushik et al. (2019) in order to analyze the differences between crowd-edited reviews and expert-edited reviews.
  • Temporal relation extraction (MATRES) The task is to determine what temporal relationship exists between two events, i.e., whether some event happened before or after another event (Ning et al., 2018). MATRES has events and temporal relations labeled for approximately 300 news articles. The event annotations are taken from the data provided in the TempEval3 workshop (UzZaman et al., 2013) and the temporal relations are re-annotated based on a multi-axis formalism. We assume that the events are given and only need to classify the relation label between them.
  • Reasoning about perspectives (PERSPECTRUM) Given a debate-worthy natural language claim, the task is to identify the set of relevant argumentative sentences that represent perspectives for/against the claim (Chen et al., 2019). We focus on the stance prediction sub-task: a binary prediction of whether a relevant perspective is for/against the given claim.
  • Discrete Reasoning Over Paragraphs (DROP) A reading comprehension dataset that requires numerical reasoning, e.g., adding, sorting, and counting numbers in paragraphs (Dua et al., 2019). In order to compute the consistency metric for the span answers of DROP, we report the average number of contrast sets in which F1 for all instances is above 0.8 (a minimal sketch of this token-level F1 appears after these dataset and model notes).
  • QUOREF A reading comprehension task with span selection questions that require coreference resolution (Dasigi et al., 2019). In this dataset, most questions can be localized to a single event in the passage, and reference an argument in that event that is typically a pronoun or other anaphoric reference. Correctly answering the question requires resolving the pronoun. We use the same definition of consistency for QUOREF as we did for DROP.
  • Reasoning Over Paragraph Effects in Situations (ROPES) A reading comprehension dataset that requires applying knowledge from a background passage to new situations (Lin et al., 2019). This task has background paragraphs drawn mostly from science texts that describe causes and effects (e.g., that brightly colored flowers attract insects), and situations written by crowd workers that instantiate either the cause (e.g., bright colors) or the effect (e.g., attracting insects). Questions are written that query the application of the statements in the background paragraphs to the instantiated situation. Correctly answering the questions is intended to require understanding how free-form causal language can be understood and applied. We use the same consistency metric for ROPES as we did for DROP and QUOREF.
  • BoolQ A dataset of reading comprehension instances with Boolean (yes or no) answers (Clark et al., 2019). These questions were obtained from organic Google search queries and paired with paragraphs from Wikipedia pages that are labeled as sufficient to deduce the answer. As the questions are drawn from a distribution of what people search for on the internet, there is no clear set of “intended phenomena” in this data; it is an eclectic mix of different kinds of questions.
  • MC-TACO A dataset of reading comprehension questions about multiple temporal common-sense phenomena (Zhou et al., 2019). Given a short paragraph (often a single sentence), a question, and a collection of candidate answers, the task is to determine which of the candidate answers are plausible. For example, the paragraph might describe a storm and the question might ask how long the storm lasted, with candidate answers ranging from seconds to weeks. This dataset is intended to test a system’s knowledge of typical event durations, orderings, and frequency. As the paragraph does not contain the information necessary to answer the question, this dataset is largely a test of background (common sense) knowledge.
  • Model We use LXMERT (Tan and Bansal, 2019) trained on the NLVR2 training dataset.
  • Model We use the same BERT model setup and training data as Kaushik et al. (2019), which allows us to fairly compare the crowd and expert revisions.
  • Contrast Set Statistics We use 100 reviews from the validation set and 488 from the test set of Kaushik et al. (2019). Three annotators used approximately 70 hours to construct and validate the dataset.
  • Model We use CogCompTime 2.0 (Ning et al., 2019).
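As referenced in the DROP note above, below is a minimal SQuAD-style sketch of token-level F1 between a predicted and a gold answer string; this is an assumed formulation for illustration, not necessarily the exact script the authors used. Under the consistency definition above, a contrast set of span answers counts as consistent only when this score exceeds 0.8 for every example in the set.

    from collections import Counter

    def token_f1(prediction: str, gold: str) -> float:
        """Bag-of-words F1 between a predicted and a gold answer string."""
        pred_tokens = prediction.lower().split()
        gold_tokens = gold.lower().split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    # Hypothetical prediction vs. gold: only "three" matches.
    print(round(token_f1("three touchdowns", "three touchdown passes"), 2))  # 0.4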