A Dataset for Tracking Entities in Open Domain Procedural Text

EMNLP 2020, pp. 6408-6417.


Abstract:

We present the first dataset for tracking state changes in procedural text from arbitrary domains by using an unrestricted (open) vocabulary. For example, in a text describing fog removal using potatoes, a car window may transition between being foggy, sticky, opaque, and clear. Previous formulations of this task provide the text and enti…

Introduction
  • Only about 12% of what the authors understand from text is expressed explicitly (Graesser, 1981).
  • When a potato is rubbed on a car window, the unstated effects of this action are state changes such as: the window becomes sticky and opaque, and the potato becomes dirty.
  • These changes can be tracked across the paragraph.
  • An exemplary use case of text with actions is procedural text, where modeling such state changes helps in various reasoning-based end tasks, e.g., automatic execution of biology experiments (Mysore et al., 2019), cooking recipes (Bollini et al., 2012), and everyday activities (Yang and Nyberg, 2015).
Highlights
  • By one estimate, only about 12% of what we understand from text is expressed explicitly (Graesser, 1981)
  • As mentioned in Section 4.2, OPENPI consists of two kinds of annotations: with-image and without-image
  • GPT-2 gets to see only text as input but the state changes it has to predict are different depending on the setting
  • The GPT-2 model struggles to predict the right set of state changes indicating that the task is hard
  • Wrong rel(y_pre) (17%): We find that relational phrases are currently very hard for the model; 184 out of 210 relational state changes predicted by the model have an incorrect relational phrase
  • We presented the first dataset to track entities in open domain procedural text
Methods
  • 6.1 Metrics

    To measure performance on OPENPI, the authors compare the predicted set y against the gold set y* for every point x.
  • Precision for a data point x is computed from the best-matching gold state change for each predicted state change, i.e.,
  • P(x) = (1/|y|) Σ_{yi ∈ y} max_{y* ∈ y*} ½ [ O(y*_pre, yi_pre) + O(y*_post, yi_post) ], where O(·, ·) is a text-overlap score between two phrases.
  • The authors did not perform a facet-based evaluation of the templated output, for two reasons.
  • First, when computing the overlap of gold and predicted state changes as two long strings, BLEU or ROUGE may accidentally report an overlap where there is none.
  • Second, it is unclear how to compute F1 over individual facets, because that requires the best match to be based on all facets jointly, as a tuple.
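The set-level precision above can be sketched in a few lines, assuming a simple token-overlap F1 as the scorer O (the paper's actual overlap metric may differ, and the function names here are illustrative):

```python
def token_f1(gold: str, pred: str) -> float:
    """One plausible choice for the overlap score O: token-level F1."""
    g, p = gold.split(), pred.split()
    if not g or not p:
        return 0.0
    # Count how many predicted tokens also appear in the gold phrase
    # (multiset intersection).
    remaining = {}
    for t in g:
        remaining[t] = remaining.get(t, 0) + 1
    common = 0
    for t in p:
        if remaining.get(t, 0) > 0:
            remaining[t] -= 1
            common += 1
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

def precision(predicted, gold):
    """P(x): average, over predicted (pre, post) pairs, of the score of the
    best-matching gold pair, combining pre- and post-condition overlap."""
    if not predicted:
        return 0.0
    return sum(
        max(0.5 * (token_f1(g_pre, pre) + token_f1(g_post, post))
            for g_pre, g_post in gold)
        for pre, post in predicted
    ) / len(predicted)
```

Recall is computed symmetrically, matching each gold pair against its best prediction, and the two are combined into an F1 per data point before averaging over points.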
Results
  • The authors evaluate the state-of-the-art generation model GPT-2 on the OPENPI dataset.
  • As mentioned in Section 4.2, OPENPI consists of two kinds of annotations: with-image and without-image.
  • GPT-2 sees only text as input, but the state changes it has to predict differ depending on the setting.
  • Table 4 reports P, R, and F1 when the GPT-2 model is tested on different subsets.
  • The GPT-2 model struggles to predict the right set of state changes, indicating that the task is hard.
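Because GPT-2 is used as a text generator, each (entity, attribute, before, after) state change must be serialized as a sentence it can emit. A minimal sketch of that serialization, assuming a template of the form shown in the paper's examples; the helper names and the exact prompt wording are assumptions:

```python
def to_template(entity: str, attribute: str, before: str, after: str) -> str:
    # Serialize one state change as a natural-language target sentence.
    return f"{attribute} of {entity} was {before} before and {after} afterwards."

def to_query(step: str) -> str:
    # Prompt the generator with the current step plus an eliciting question.
    return f"{step} Now, what happens?"

# Example: one state change from the fog-removal paragraph.
target = to_template("car window", "opacity", "foggy", "clear")
```

The model's generated sentences can then be parsed back into (pre, post) tuples and scored with the overlap-based metrics of Section 6.1.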
Conclusion
  • The authors presented the first dataset to track entities in open domain procedural text.
  • To this end, the authors crowdsourced a large, high-quality dataset with examples for this task.
  • The authors established a strong generation baseline highlighting the difficulty of this task.
  • The authors will explore more sophisticated models that can address the highlighted shortcomings of the current model.
  • An exciting direction is to leverage visuals of each step to deal with unmentioned entities and indirect effects.
Tables
  • Table1: Examples of the task based on our dataset. The input x comprises a query xq and a context xc (past sentences before this step in the paragraph– not shown due to limited space). The output is a set y of pre and postconditions. The paragraphs in this table are: above (how to clean oven) and below (cooking recipe)
  • Table2: Comparison of our dataset to existing datasets. The open setting also entails zero-shot learning: during inference on a previously unseen domain, there are previously unseen attributes, entities, and state change types. This makes the problem very challenging and places this task in a novel setting (see §3.1)
  • Table3: Basic statistics of the OPENPI dataset: the articles’ WikiHow category, the number of WikiHow articles (i.e., paragraphs) in each category, and the number of state changes |y| in total and under the with-image and without-image settings
  • Table4: GPT-2 on OpenPI, and its sub-categories
  • Table5: GPT-2 on topics seen, unseen during training
  • Table6: Error types in 1,811 dev predictions. One state change prediction can have multiple error types
Related work
  • Tracking state changes: Procedural text understanding addresses the task of tracking entity states throughout the text (Bosselut et al., 2018; Henaff et al., 2017). This ability is an important part of text understanding. While syntactic parsing methods such as AMR (abstract meaning representation) (Banarescu et al., 2013) represent “who did what to whom” by uncovering stated facts, tracking entity states uncovers unstated facts, such as how ingredients change during a recipe.

    Datasets with closed state changes: The bAbI dataset (Weston et al., 2015) includes questions about objects moved throughout a paragraph, using machine-generated language over a deterministic domain with a small lexicon. The SCoNE dataset (Long et al., 2016) contains paragraphs describing a changing world state in three synthetic, deterministic domains. However, approaches developed using synthetic data often fail to handle the inherent complexity of language when applied to organic, real-world data (Hermann et al., 2015; Winograd, 1972). The ProPara dataset (Dalvi et al., 2018) covers three state changes (create, destroy, move) for natural text describing scientific procedures. Other domain-specific datasets cover the recipe domain (Bosselut et al., 2018) and biology experiments (Mysore et al., 2019). These datasets contain a small, closed set of state change types that are relevant to a specific domain. Our dataset is general domain, and to accommodate this generality we have an open vocabulary of state changes.
Funding
  • All annotators met the following minimum qualifications: (1) at least 5K previously approved HITs, (2) a 99% or higher approval rate, and (3) location in the US, UK, CA, AU, or NZ
  • Without any context (e.g., for the first step), the model gets a low accuracy of 8.3%
Study subjects and analysis
crowd workers: 3
See Figure 2 for an example of the annotation procedure. After collecting the data, we cleaned up the state changes by asking three crowd workers whether each state change is valid, under the same annotation setting as data collection (e.g., with or without visual illustration). We discarded state changes that did not get agreement from the majority (2 or more workers)
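The majority-vote filtering described above (keep a state change only when at least 2 of the 3 workers mark it valid) can be sketched as follows; the function name and data layout are illustrative:

```python
def majority_filter(judgments: dict, threshold: int = 2) -> list:
    """Keep state changes whose 'valid' votes reach the majority threshold.

    judgments maps each candidate state change to one boolean per worker
    (three workers in the setup described above).
    """
    return [change for change, votes in judgments.items()
            if sum(votes) >= threshold]

votes = {
    "window becomes sticky": [True, True, False],    # 2/3 valid -> kept
    "potato becomes glowing": [False, False, True],  # 1/3 valid -> discarded
}
kept = majority_filter(votes)  # ["window becomes sticky"]
```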

WikiHow articles: 810
4.2 Dataset statistics: The resulting OPENPI dataset comprises 29,928 state changes over 4,050 sentences from 810 WikiHow articles. Of these, 15,445 state changes (4.3 per step) were obtained in the with-images setting and 14,483 (3.8 per step) without images, indicating that the additional visual modality helped workers come up with more state changes (e.g., the color of a cut potato turns gray)

Reference
  • Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In LAW@ACL.
  • Mario Bollini, Stefanie Tellex, Tyler Thompson, Nicholas Roy, and Daniela Rus. 2012. Interpreting and executing recipes with a cooking robot. In ISER.
  • Antoine Bosselut, Omer Levy, Ari Holtzman, Corin Ennis, Dieter Fox, and Yejin Choi. 2018. Simulating action dynamics with neural process networks. In ICLR.
  • Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In ACL.
  • Bhavana Dalvi, Lifu Huang, Niket Tandon, Wen-tau Yih, and Peter Clark. 2018. Tracking state changes in procedural text: A challenge dataset and models for process comprehension. In NAACL.
  • Jun Gao, Wei Bi, Xiaojiang Liu, Junhui Li, and Shuming Shi. 2019. Generating multiple diverse responses for short-text conversation. In AAAI.
  • Arthur C. Graesser. 1981. Prose Comprehension Beyond the Word. Springer.
  • Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. 2017. Tracking the world state with recurrent entity networks. In ICLR.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In NeurIPS, pages 1693–1701.
  • Phillip Isola, Joseph J. Lim, and Edward H. Adelson. 2015. Discovering states and transformations in image collections. In CVPR.
  • Mainak Jas and Devi Parikh. 2015. Image specificity. In CVPR, pages 2727–2736.
  • Reginald Long, Panupong Pasupat, and Percy Liang. 2016. Simpler context-dependent logical forms via model projections. In ACL.
  • Sheshera Mysore, Zach Jensen, Edward Kim, Kevin Huang, Haw-Shiuan Chang, Emma Strubell, Jeffrey Flanigan, Andrew McCallum, and Elsa Olivetti. 2019. The materials science procedural text corpus: Annotating materials synthesis procedures with shallow semantic structures. arXiv preprint arXiv:1905.06939.
  • Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. 2018. VirtualHome: Simulating household activities via programs. In CVPR, pages 8494–8502.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
  • Hannah Rashkin, Antoine Bosselut, Maarten Sap, Kevin Knight, and Yejin Choi. 2018. Modeling naive psychology of characters in simple commonsense stories. In ACL.
  • Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. ATOMIC: An atlas of machine commonsense for if-then reasoning. In AAAI.
  • Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. 2019. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. arXiv preprint arXiv:1912.01734.
  • Robyn Speer and Catherine Havasi. 2013. ConceptNet 5: A large semantic network for relational knowledge. In The People’s Web Meets NLP.
  • Niket Tandon, Bhavana Dalvi Mishra, Joel Grus, Wen-tau Yih, Antoine Bosselut, and Peter Clark. 2018. Reasoning about actions and state changes by injecting commonsense knowledge. In EMNLP.
  • Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2018. Diverse beam search: Decoding diverse solutions from neural sequence models. In AAAI.
  • Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merrienboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698.
  • Terry Winograd. 1972. Understanding natural language. Cognitive Psychology, 3(1):1–191.
  • Zi Yang and Eric Nyberg. 2015. Leveraging procedural knowledge for task-oriented search. In SIGIR.