TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions

EMNLP 2020, pp. 1158–1172.

DOI: https://doi.org/10.18653/V1/2020.EMNLP-MAIN.88
This paper presents TORQUE, a new English reading comprehension dataset of temporal ordering questions on 3.2k news snippets

Abstract:

A critical part of reading is being able to understand the temporal relationships between events described in a passage of text, even when those relationships are not explicitly stated. However, current machine reading comprehension benchmarks have practically no questions that test temporal phenomena, so systems trained on these benchmarks have no capacity to answer such questions. This paper introduces TORQUE, a new English reading comprehension dataset of 21k temporal ordering questions on 3.2k news snippets; even a state-of-the-art language model, RoBERTa-large, falls behind human performance on it by a large margin.

Introduction
  • Time is important for understanding events and stories described in natural language text such as news articles, social media, financial reports, and electronic health records (Verhagen et al., 2007, 2010; UzZaman et al., 2013; Minard et al., 2015; Bethard et al., 2016, 2017; Laparra et al., 2018).
  • Other datasets largely require only an understanding of predicate-argument structure, and would ask questions like “what was a woman trapped in?” A temporal relation question, by contrast, would be “what started before a woman was trapped?” To answer it, a system needs to identify events (e.g., LANDSLIDE is an event and “body” is not), reason about when those events happen (e.g., LANDSLIDE is a correct answer, while SAID is not, because of when the two events occur), and look at the entire passage rather than the local predicate-argument structure within a sentence (e.g., SNOW and RAINFALL are correct answers drawn from the sentence before “a woman trapped”); the sketch below illustrates this question format.
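To picture the question format concretely, here is a minimal sketch of what one TORQUE-style instance might look like. The field names and the paraphrased passage are illustrative assumptions, not the dataset's actual schema or text:

```python
# A minimal sketch of one TORQUE-style instance. The field names and the
# paraphrased passage are illustrative assumptions, not the dataset's schema.
example = {
    "passage": (
        "Heavy snow and rainfall triggered a landslide; rescuers said "
        "a woman was trapped under the debris."
    ),
    # Events are annotated words in the passage ("body" would not qualify).
    "events": ["snow", "rainfall", "triggered", "landslide", "said", "trapped"],
    "question": "What started before a woman was trapped?",
    # Answers are a subset of the annotated events: SAID is excluded because
    # it happened after the trapping, not before.
    "answers": ["snow", "rainfall", "landslide"],
}

# A system's prediction is therefore a set of events, scored by overlap:
predicted = {"landslide", "said"}
gold = set(example["answers"])
print(predicted & gold, predicted - gold)  # {'landslide'} {'said'}
```

Because answers are subsets of the annotated events rather than free-form spans, predictions can be scored as set overlap, which is what the F1 metric in Table 4 measures.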
Highlights
  • Many relations still cannot be expressed, because the assumption that every event has a single time interval is inaccurate: the time scope of an event may be fuzzy, an event can have a non-factual modality, or events can be repetitive and invoke multiple intervals. To better handle these phenomena, we move away from the fixed set of relations used in prior work and instead use natural language to annotate the relationships between events; a hypothetical sketch of this contrast follows this list.
  • In a random sample of 200 questions from the test set of TORQUE, we found 94 questions querying relations that cannot be directly represented by the previous single-interval-based labels
  • Understanding the temporal ordering of events is critical in reading comprehension, but existing work has studied it very little
  • This paper presents TORQUE, a new English reading comprehension dataset of temporal ordering questions on 3.2k news snippets
  • We argue that studying temporal relations as a reading comprehension task allows for more convenient representation of these temporal phenomena than is possible in conventional formalisms
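To make the contrast with fixed label sets concrete, here is a hypothetical sketch; the templates are invented for illustration and are not the authors' actual annotation prompts:

```python
# A hypothetical contrast between a fixed interval-relation label set and
# free-form natural-language questions (templates invented for illustration).
FIXED_LABELS = {"before", "after", "simultaneous", "vague"}

QUESTION_TEMPLATES = [
    "What happened before {event}?",           # expressible as a fixed label
    "What probably happened before {event}?",  # hedged: no fixed label fits
    "What didn't happen after {event}?",       # negated event mode
    "What usually happens before {event}?",    # repetitive, multi-interval event
]

print("fixed label set:", sorted(FIXED_LABELS))
for template in QUESTION_TEMPLATES:
    print(template.format(event="the landslide"))
```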
Results
  • There are many events in a typical passage of text, so temporal relation questions typically query more than one relationship at the same time.
  • The authors trained crowd workers to label events in text, and to write and answer questions that query temporal relationships between these events.
  • TORQUE has 25k events and 21k user-generated and fully answered temporal relation questions.
  • Both the events and question-answer (QA) pairs from 20% of TORQUE were further validated by additional crowd workers, which the authors use for evaluation.
  • To better handle these phenomena, the authors move away from the fixed set of relations used in prior work and instead use natural language to annotate the relationships between events.
  • Motivated by recent work (He et al., 2015; Michael et al., 2017; Levy et al., 2017; Gardner et al., 2019b), the authors propose using natural language question answering as an annotation format for temporal relations.
  • Figure 7 shows examples of questions with hedged relations such as “not before,” “probably before,” “before under some conditions,” and “often before,” as well as questions that query events in different modes (e.g., negated: “What didn’t the lion do after a large meal?”).
  • Qualification: the authors designed a separate qualification task where crowd workers were trained and tested on three individual capabilities: labeling events, asking temporal relation questions, and answering temporal relation questions.
Conclusion
  • The approach that prior work took to handle the aforementioned temporal phenomena was to define formalisms such as different modes of events (Fig. 3), different time axes for events (Ning et al., 2018b), and specific rules to follow when there is confusion.
  • Results show that even a state-of-the-art language model, RoBERTa-large, falls behind human performance on TORQUE by a large margin, necessitating more investigation into improving reading comprehension of temporal relationships in the future.
Tables
  • Table 1: System output of some example questions on temporal relations; the training dataset is in parentheses
  • Table 2: Temporal phenomena in TORQUE. “Standard” are those that can be directly captured by the previous single-interval-based label set, while other types cannot. Percentages are based on manual inspection of a random sample of 200 questions from TORQUE; some questions can have multiple types
  • Table 3: Columns from left to right: questions, questions per passage, answers, and answers per question. “Modified” is a subset of questions created by slightly modifying an original question
  • Table 4: Human/system performance on the test set of TORQUE. System performance is averaged over 3 runs; all std. dev. were ≤ 4%, and those in [1%, 4%] are underlined. C (consistency) is the percentage of contrast groups for which a model’s predictions have F1 ≥ 80% for all questions in a group (Gardner et al., 2020); see the sketch after this list
  • Table 5: Inter-annotator agreement (IAA) of the event annotations in TORQUE. Above: compares the aggregated event list with either all the annotators or the initial annotator. Below: how many candidates in each category were successfully added to the aggregated event list
  • Table 6: IAA of the answer annotations in TORQUE
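The consistency metric C in Table 4 can be made precise. Below is a minimal sketch, assuming each question is scored by set-overlap F1 between predicted and gold answer events; the function names are ours, not from the paper's code:

```python
from typing import List, Set

def answer_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Set-overlap F1 between predicted and gold answer events for one question."""
    if not predicted and not gold:
        return 1.0  # both empty: a correctly answered "no answer" question
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def consistency(group_scores: List[List[float]], threshold: float = 0.8) -> float:
    """Fraction of contrast groups in which every question reaches F1 >= threshold."""
    good = sum(1 for scores in group_scores if all(s >= threshold for s in scores))
    return good / len(group_scores)

# Usage: one contrast group fully above the threshold, one not.
g1 = [answer_f1({"snow", "rainfall"}, {"snow", "rainfall"}), 1.0]
g2 = [answer_f1({"said"}, {"snow"}), 1.0]
print(consistency([g1, g2]))  # 0.5
```

Consistency is stricter than average F1: a model must get every question in a contrast group right (F1 ≥ 80%) for the group to count.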
Study subjects and analysis
  • Workers: 3. A crowd worker was considered level-1 qualified if they could pass the test within three attempts; in practice, roughly 1 out of 3 workers passed our qualification test. Pilot: we then asked level-1 crowd workers to do a small amount of the real task.
  • Humans: 5. We also intentionally added noise to the original event list so that the validators had to carefully identify wrong events. The final event list was determined by aggregating all 5 humans using majority vote (see the sketch below). Second, we validated the answers in the same portion of the data.
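The majority-vote aggregation described above is straightforward to sketch. The function name and the per-event voting scheme below are our assumptions about the mechanics, grounded only in the stated use of a 5-way majority vote:

```python
from collections import Counter
from typing import List, Set

def aggregate_events(annotations: List[Set[str]]) -> Set[str]:
    """Keep a candidate event iff a strict majority of validators marked it."""
    votes = Counter(event for ann in annotations for event in ann)
    majority = len(annotations) // 2 + 1  # 3 of 5
    return {event for event, n in votes.items() if n >= majority}

# Five validators judge a candidate list that includes injected noise ("body").
validators = [
    {"landslide", "snow", "said"},
    {"landslide", "snow"},
    {"landslide", "snow", "body"},  # "body" is a deliberately wrong candidate
    {"landslide", "snow", "said"},
    {"landslide", "snow"},
]
print(aggregate_events(validators))  # {'landslide', 'snow'}; "said" has only 2 votes
```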

References
  • 2005. The ACE 2005 (ACE 05) Evaluation Plan. Technical report.
  • James F. Allen. 1984. Towards a general theory of action and time. Artificial Intelligence, 23(2):123–154.
  • Steven Bethard, James H. Martin, and Sara Klingenstein. 2007. Timelines from text: Identification of syntactic temporal relations. In IEEE International Conference on Semantic Computing (ICSC), pages 11–18.
  • Steven Bethard, Guergana Savova, Wei-Te Chen, Leon Derczynski, James Pustejovsky, and Marc Verhagen. 2016. SemEval-2016 Task 12: Clinical TempEval. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1052–1062, San Diego, California. Association for Computational Linguistics.
  • Steven Bethard, Guergana Savova, Martha Palmer, and James Pustejovsky. 2017. SemEval-2017 Task 12: Clinical TempEval. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 565–572. Association for Computational Linguistics.
  • Taylor Cassidy, Bill McDowell, Nathanael Chambers, and Steven Bethard. 2014. An annotation framework for dense event ordering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 501–506.
  • Nathanael Chambers, Taylor Cassidy, Bill McDowell, and Steven Bethard. 2014. Dense event ordering with a multi-pass architecture. Transactions of the Association for Computational Linguistics (TACL), 2:273–284.
  • Fei Cheng and Yusuke Miyao. 2017. Classifying temporal relations by bidirectional LSTM over dependency paths. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), volume 2, pages 1–6.
  • Pradeep Dasigi, Nelson F. Liu, Ana Marasovic, Noah A. Smith, and Matt Gardner. 2019. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5925–5932.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
  • Dmitriy Dligach, Timothy Miller, Chen Lin, Steven Bethard, and Guergana Savova. 2017. Neural temporal relation extraction. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), volume 2, pages 746–751.
  • Quang Do, Wei Lu, and Dan Roth. 2012. Joint inference for event timeline construction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
  • Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, et al. 2020. Evaluating NLP models via contrast sets. arXiv preprint arXiv:2004.02709.
  • Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. 2019a. On making reading comprehension more comprehensive.
  • Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. 2019b. Question answering is a format; when is it useful? arXiv preprint arXiv:1909.11291.
  • Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Luheng He, Mike Lewis, and Luke Zettlemoyer. 2015. Question-answer driven semantic role labeling: Using natural language to annotate natural language. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 643–653.
  • Egoitz Laparra, Dongfang Xu, Ahmed Elsayed, Steven Bethard, and Martha Palmer. 2018. SemEval 2018 Task 6: Parsing time normalizations. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 88–96.
  • Artuur Leeuwenberg and Marie-Francine Moens. 2018. Temporal information extraction by predicting relative time-lines. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Tuur Leeuwenberg and Marie-Francine Moens. 2017. Structured learning for temporal relation extraction from clinical records. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics.
  • Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the SIGNLL Conference on Computational Natural Language Learning (CoNLL), pages 333–342.
  • Chen Lin, Timothy Miller, Dmitriy Dligach, Steven Bethard, and Guergana Savova. 2017. Representations of time expressions for temporal relation extraction with convolutional neural networks. In BioNLP 2017, pages 322–327.
  • Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. 2019. Reasoning over paragraph effects in situations. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 58–62.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Hector Llorens, Nathanael Chambers, Naushad UzZaman, Nasrin Mostafazadeh, James Allen, and James Pustejovsky. 2015. SemEval-2015 Task 5: QA TempEval - Evaluating temporal information understanding with question answering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 792–800.
  • Yuanliang Meng and Anna Rumshisky. 2018. Context-aware neural model for temporal information extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages 527–536.
  • Julian Michael, Gabriel Stanovsky, Luheng He, Ido Dagan, and Luke Zettlemoyer. 2017. Crowdsourcing question-answer meaning representations. arXiv preprint arXiv:1711.05885.
  • Anne-Lyse Minard, Manuela Speranza, Eneko Agirre, Itziar Aldabe, Marieke van Erp, Bernardo Magnini, German Rigau, and Ruben Urizar. 2015. SemEval-2015 Task 4: TimeLine: Cross-document event ordering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 778–786.
  • T. Mitamura, Y. Yamakawa, S. Holm, Z. Song, A. Bies, S. Kulick, and S. Strassel. 2015. Event nugget annotation: Processes and issues. In Proceedings of the Workshop on Events at NAACL-HLT.
  • Nasrin Mostafazadeh, Alyson Grealish, Nathanael Chambers, James Allen, and Lucy Vanderwende. 2016. CaTeRS: Causal and temporal relation scheme for semantic annotation of event structures. In Proceedings of the 4th Workshop on Events: Definition, Detection, Coreference, and Representation, pages 51–61.
  • Qiang Ning, Zhili Feng, and Dan Roth. 2017. A structured learning approach to temporal relation extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1038–1048, Copenhagen, Denmark. Association for Computational Linguistics.
  • Qiang Ning, Zhili Feng, Hao Wu, and Dan Roth. 2018a. Joint reasoning for temporal and causal relations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 2278–2288. Association for Computational Linguistics.
  • Qiang Ning, Sanjay Subramanian, and Dan Roth. 2019. An improved neural baseline for temporal relation extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Qiang Ning, Hao Wu, and Dan Roth. 2018b. A multi-axis annotation scheme for event temporal relations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 1318–1328. Association for Computational Linguistics.
  • Tim O’Gorman, Kristin Wright-Bettner, and Martha Palmer. 2016. Richer Event Description: Integrating event coreference with temporal, causal and bridging annotation. In Proceedings of the 2nd Workshop on Computing News Storylines (CNS 2016), pages 47–56, Austin, Texas. Association for Computational Linguistics.
  • James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, et al. 2003. The TIMEBANK corpus. In Corpus Linguistics, page 40.
  • Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 784–789.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392.
  • Jessica Semega, Melissa Kollar, John Creamer, and Abinash Mohanty. 2019. Income and Poverty in the United States: 2018. U.S. Department of Commerce.
  • William F. Styler IV et al. 2014. Temporal annotation in the clinical domain. Transactions of the Association for Computational Linguistics (TACL), 2:143–154.
  • Julien Tourille, Olivier Ferret, Aurelie Neveol, and Xavier Tannier. 2017. Neural architecture for temporal relation extraction: A Bi-LSTM approach for detecting narrative containers. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), volume 2, pages 224–230.
  • Naushad UzZaman, Hector Llorens, James Allen, Leon Derczynski, Marc Verhagen, and James Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating time expressions, events, and temporal relations. In Proceedings of the Joint Conference on Lexical and Computational Semantics (*SEM), 2:1–9.
  • Alakananda Vempala, Eduardo Blanco, and Alexis Palmer. 2018. Determining event durations: Models and error analysis. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), volume 2, pages 164–168.
  • Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky. 2007. SemEval-2007 Task 15: TempEval temporal relation identification. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 75–80. Association for Computational Linguistics.
  • Marc Verhagen, Roser Sauri, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 57–62. Association for Computational Linguistics.
  • Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. “Going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Ben Zhou, Qiang Ning, Daniel Khashabi, and Dan Roth. 2020. Temporal common sense acquisition with minimal supervision. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).