ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning

Maarten Sap
Ronan Le Bras
Emily Allaway
Chandra Bhagavatula
Nicholas Lourie
Hannah Rashkin
Brendan Roof
Noah A. Smith
Yejin Choi

AAAI 2019. arXiv:1811.00146.

TL;DR: We present ATOMIC, an atlas of everyday commonsense inferential knowledge about events described in natural language and associated with typed if-then relations.

Abstract:

We present ATOMIC, an atlas of everyday commonsense reasoning, organized through 877k textual descriptions of inferential knowledge. Compared to existing resources that center around taxonomic knowledge, ATOMIC focuses on inferential knowledge organized as typed if-then relations with variables (e.g., "if X pays Y a compliment, then Y will likely return the compliment"). We propose nine if-then relation types to distinguish causes vs. effects, agents vs. themes, voluntary vs. involuntary events, and actions vs. mental states. By generatively training on the rich inferential knowledge described in ATOMIC, we show that neural models can acquire simple commonsense capabilities and reason about previously unseen events. Experimental results demonstrate that multitask models that incorporate the hierarchical structure of if-then relation types lead to more accurate inference compared to models trained in isolation, as measured by both automatic and human evaluation.

Introduction
  • Given a snapshot observation of an event, people can anticipate and reason about unobserved causes and effects in relation to the observed event: what might have happened just before, what might happen as a result, and how different events are chained through causes and effects.
  • Observing an event such as “X repels Y’s attack” (Figure 1), one can immediately infer various plausible facts surrounding that event.
  • In terms of the plausible motivations behind the event, X probably wants to protect herself.
  • As for the plausible pre-conditions prior to the event, X may have been trained in self-defense to successfully fend off Y’s attack.
  • As a result of the event, X probably feels angry and might want to file a police report.
  • Y, on the other hand, might feel scared of getting caught and want to run away.
Highlights
  • Given a snapshot observation of an event, people can anticipate and reason about unobserved causes and effects in relation to the observed event: what might have happened just before, what might happen as a result, and how different events are chained through causes and effects.
  • We investigate neural network models that can acquire simple commonsense capabilities and reason about previously unseen events by embedding the rich inferential knowledge described in ATOMIC.
  • We evaluate models on their ability to reason about previously unseen events.
  • We present ATOMIC, an atlas of everyday commonsense inferential knowledge about events described in natural language and associated with typed if-then relations.
  • Human evaluation indicates that 86.2% of the descriptions are valid, showcasing the quality of the commonsense knowledge contained in ATOMIC.
  • We present neural network models that can learn to reason about previously unseen events to generate their likely causes and effects in natural language.
Methods
  • The authors' goal is to investigate whether models can learn to perform if-then commonsense inference given a previously unseen event.
  • The event sequence is compressed into a hidden representation h through an encoding function f_enc : R^(i×h_enc) → R^h, where i is the number of tokens in the event.
  • The authors use 300-dimensional static GloVe pretrained embeddings (Pennington, Socher, and Manning 2014) as the base word vectors, augmented with 1024-dimensional pretrained ELMo embeddings (Peters et al. 2018).
  • The encoding function is a bidirectional GRU (Cho et al. 2014) of hidden size h_enc; a minimal code sketch follows this list.
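As a concrete illustration, here is a minimal PyTorch sketch of such an encoder. This is not the authors' released implementation: the class and variable names are ours, and only the stated dimensions (300-d GloVe, 1024-d ELMo, bidirectional GRU of hidden size h_enc) follow the description above.

```python
# Minimal sketch of the event encoder described above (not the paper's
# released code): precomputed GloVe (300-d) and ELMo (1024-d) vectors are
# concatenated per token and fed through a bidirectional GRU.
import torch
import torch.nn as nn

class EventEncoder(nn.Module):
    def __init__(self, glove_dim=300, elmo_dim=1024, h_enc=256):
        super().__init__()
        # Bidirectional GRU over the concatenated word vectors.
        self.gru = nn.GRU(glove_dim + elmo_dim, h_enc,
                          batch_first=True, bidirectional=True)

    def forward(self, glove, elmo):
        # glove: (batch, i, 300), elmo: (batch, i, 1024) for i tokens.
        x = torch.cat([glove, elmo], dim=-1)   # (batch, i, 1324)
        _, h_n = self.gru(x)                   # h_n: (2, batch, h_enc)
        # Concatenate final forward/backward states into h.
        return torch.cat([h_n[0], h_n[1]], dim=-1)

# Usage with random stand-in embeddings for a 5-token event:
enc = EventEncoder()
h = enc(torch.randn(1, 5, 300), torch.randn(1, 5, 1024))  # (1, 512)
```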
Results
  • Models generate natural language expressions for each of the nine dimensions of if-then inferences (a minimal decoding sketch follows this list).
  • The authors report performance using automatic scores and a human evaluation of the generated inferences.
  • The authors automatically evaluate the sequence generation for each model and each inference dimension using BLEU scores.
  • Given the event “PersonX bakes bread”, the model can correctly infer that X probably needs to “go to the store”.
  • The authors' model correctly predicts that the likely effect of this event would be that X will “get dirty” or “eat food”.
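For intuition, below is a minimal greedy-decoding sketch in PyTorch in the spirit of the paper's encoder-decoder setup. It is a simplification under our own assumptions: the paper trains per-dimension decoders and uses beam search to produce the top 10 generations, while this sketch decodes greedily, and all names and dimensions here are ours.

```python
# Greedy GRU decoder conditioned on an event encoding h (a sketch, not the
# authors' implementation; they use beam search for top-10 generation).
import torch
import torch.nn as nn

class InferenceDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, h_dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRUCell(emb_dim, h_dim)
        self.out = nn.Linear(h_dim, vocab_size)

    def generate(self, h, bos_id, eos_id, max_len=10):
        # h: (1, h_dim) event encoding used as the initial decoder state.
        token, state, result = torch.tensor([bos_id]), h, []
        for _ in range(max_len):
            state = self.gru(self.emb(token), state)
            token = self.out(state).argmax(dim=-1)
            if token.item() == eos_id:
                break
            result.append(token.item())
        return result

# Usage with a random encoding and made-up special-token ids:
dec = InferenceDecoder(vocab_size=1000)
tokens = dec.generate(torch.randn(1, 512), bos_id=1, eos_id=2)
```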
Conclusion
  • The authors present ATOMIC, an atlas of everyday commonsense inferential knowledge about events described in natural language and associated with typed if-then relations.
  • ATOMIC consists of over 300k events associated with 877k inferential relations, making it the largest knowledge graph of its kind.
  • The authors' crowdsourcing framework gathers annotations as free-form textual responses to simple questions, enabling large-scale, high-quality collection of commonsense knowledge about events.
  • The authors present neural network models that can learn to reason about previously unseen events to generate their likely causes and effects in natural language.
Objectives
  • The goal of the study is to create a knowledge repository that meets three requirements: scale, coverage, and quality.
  • The authors investigate whether models can learn to perform if-then commonsense inference given a previously unseen event.
Tables
  • Table 1: Examples of If-Event-Then-X commonsense knowledge present in ATOMIC. For inference dimensions, “x” and “o” pertain to PersonX and others, respectively (e.g., “xAttr”: attribute of PersonX, “oEffect”: effect on others).
  • Table 2: Statistics of ATOMIC. Triples represent distinct <event, relation, event> entries; #words represents the average number of words per node (see the code sketch after this list for the triple format).
  • Table 3: Average BLEU score (reported as percentages) for the top 10 generations for each inference dimension, comparing multitask models to the single-task model. Note that BLEU scores are known to be brittle to generations worded differently from the references (Liu et al. 2016). The best-performing model for each dimension is shown in bold.
  • Table 4: Precision at 10 (%) of generated inferences as selected by human judges for four models, averaged and broken down by dimension. The best-performing model for each dimension is shown in bold. EVENT2(IN)VOLUNTARY outperforms all other models significantly (p < 0.05). For comparison, we show the precision of gold ATOMIC annotations. Note that there is a varying number of gold annotations per event/dimension, while all models were constrained to make 10 predictions.
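To make the <event, relation, event> triple format from Table 2 concrete, below is a small illustrative sketch. The nine relation names are the paper's if-then dimensions; the NamedTuple container and the example triple are our own illustration (the event/inference pair is adapted from the abstract's example).

```python
# One ATOMIC triple <event, relation, inference>, using the paper's nine
# if-then relation types; the NamedTuple wrapper is our own convention.
from typing import NamedTuple

DIMENSIONS = {"xIntent", "xNeed", "xAttr", "xEffect", "xReact",
              "xWant", "oEffect", "oReact", "oWant"}

class Triple(NamedTuple):
    event: str      # base event with PersonX/PersonY variables
    relation: str   # one of the nine dimensions above
    inference: str  # free-form crowdsourced annotation

# Illustrative entry, not a verbatim row from the released data:
t = Triple("PersonX pays PersonY a compliment", "oWant",
           "to return the compliment")
assert t.relation in DIMENSIONS
```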
Related work
  • Descriptive Knowledge from Crowdsourcing: Knowledge acquisition and representation have been extensively studied in prior research (Espinosa and Lieberman 2005; Speer and Havasi 2012; Lenat 1995). However, most prior efforts focused on taxonomic or encyclopedic knowledge (Davis and Marcus 2015), which, in terms of epistemology, corresponds to knowledge of “what”; relatively less progress has been made on knowledge of “how” and “why”. For example, OpenCyc 4.0 is a large commonsense knowledge base consisting of 239,000 concepts and 2,039,000 facts in LISP-style logic (Lenat 1995), known to be mostly taxonomic (Davis and Marcus 2015). In fact, only 0.42% of ATOMIC events appear in OpenCyc, and we found that 99.8% of OpenCyc's relations are taxonomic (isA), string-formatting, or definitional.
Funding
  • This work was supported in part by NSF GRFP DGE-1256082, NSF IIS-1714566, IIS-1524371, IIS-1703166, a Samsung AI Grant, the DARPA CwC program through ARO (W911NF-15-1-0543), and the IARPA DIVA program through D17PC00343.
Study subjects and analysis
workers: 3
Disambiguating the participants is important, since it can drastically change the meaning of the event (e.g., “PersonX breaks PersonX’s arm” vs. “PersonX breaks PersonY’s arm” have very different implications). Three workers selected whether each “Person” mention in an event refers to PersonX, PersonY, or PersonZ, and we keep base events with combinations that at least two workers selected as valid (pairwise agreement, ppa = 77%). To ensure scalability, we implement a free-form text annotation setup that asks workers to write answers to questions about a specific event.

workers: 3
We create four tasks on Amazon Mechanical Turk (MTurk) (sample task in Figure 3) for gathering commonsense annotations. For each dimension, up to three workers are asked to provide as many as four likely annotations for an event, covering multiple possible situations (e.g., if “PersonX drinks coffee”, then “PersonX needed to brew coffee” or “PersonX needed to buy coffee”; both are distinct but likely). Note that some events are not caused by PersonX, and some do not affect other people, so annotations for certain dimensions (specifically xIntent, xNeed, oReact, oEffect, and oWant) are not necessary for all events.

data: 2
We automatically evaluate the sequence generation for each model and each inference dimension using BLEU scores. Specifically, we compute the average BLEU score (n = 2, Smoothing1; Chen and Cherry, 2014) between each sequence in the top 10 predictions and the corresponding set of MTurk annotations. As an event may not involve all nine inference dimensions (e.g., “PersonX sees PersonX’s house” has no implications for anybody other than “PersonX”), annotators may decide to leave an inference dimension empty
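A minimal sketch of this scoring recipe using NLTK's sentence-level BLEU with bigram-capped weights (BLEU-2) and Smoothing1 (Chen and Cherry, 2014). The helper function and the toy data are ours; the authors' exact evaluation script may differ.

```python
# Average BLEU-2 (Smoothing1) of the top-10 generations against the MTurk
# reference set for one event/dimension. Data below is made up.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth1 = SmoothingFunction().method1

def avg_bleu2(predictions, references):
    """predictions: list of 10 token lists; references: list of token lists."""
    scores = [
        sentence_bleu(references, pred, weights=(0.5, 0.5),
                      smoothing_function=smooth1)
        for pred in predictions
    ]
    return sum(scores) / len(scores)

refs = [["go", "to", "the", "store"], ["buy", "flour"]]
preds = [["go", "to", "a", "store"], ["buy", "bread"]]
print(avg_bleu2(preds, refs))
```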

crowdworkers: 5
We randomly select 100 events from the test set and use beam search to generate the 10 most likely inferences per dimension. We present five crowdworkers with the 10 generated inferences, and ask them to select all inferences they think are valid. Table 4 shows each model’s precision at 10, computed as the average number of correct generations per dimension
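A small sketch of the precision-at-10 computation for one event/dimension follows. The majority-vote aggregation over the five judges is our assumption; the paper does not spell out the exact vote threshold used to count a generation as correct.

```python
# Precision at 10: five judges mark each of the 10 generated inferences as
# valid/invalid; a generation counts as correct here by majority vote
# (an assumption -- the paper's aggregation rule may differ).
def precision_at_10(judgments):
    """judgments: list of 10 lists, each with 5 booleans (one per judge)."""
    correct = sum(1 for votes in judgments if sum(votes) >= 3)
    return correct / len(judgments)

# Example with made-up votes: 6 clearly valid, 4 mostly invalid.
judgments = [[True] * 5] * 6 + [[True, False, False, False, False]] * 4
print(precision_at_10(judgments))  # 0.6
```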

References
  • Chambers, N., and Jurafsky, D. 2008. Unsupervised learning of narrative event chains. In ACL.
  • Chen, B., and Cherry, C. 2014. A systematic comparison of smoothing techniques for sentence-level BLEU. In Proceedings of the Ninth Workshop on Statistical Machine Translation, 362–367.
  • Cho, K.; van Merrienboer, B.; Bahdanau, D.; and Bengio, Y. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In SSST@EMNLP.
  • Chu, C. X.; Tandon, N.; and Weikum, G. 2017. Distilling task knowledge from how-to communities. In WWW.
  • Davis, E., and Marcus, G. 2015. Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun. ACM 58:92–103.
  • de Marneffe, M.-C.; Manning, C. D.; and Potts, C. 2012. Did it happen? The pragmatic complexity of veridicality assessment. Comput. Linguist. 38(2):301–333.
  • Espinosa, J. H., and Lieberman, H. 2005. EventNet: Inferring temporal relations between commonsense events. In MICAI.
  • Galarraga, L.; Teflioudi, C.; Hose, K.; and Suchanek, F. M. 2013. AMIE: Association rule mining under incomplete evidence in ontological knowledge bases. In WWW.
  • Gardner, M.; Grus, J.; Neumann, M.; Tafjord, O.; Dasigi, P.; Liu, N. F.; Peters, M.; Schmitz, M.; and Zettlemoyer, L. S. 2017. AllenNLP: A deep semantic natural language processing platform.
  • Goldberg, Y., and Orwant, J. 2013. A dataset of syntactic-ngrams over time from a very large corpus of English books. In *SEM 2013.
  • Gordon, A. S., and Hobbs, J. R. 2017. A Formal Theory of Commonsense Psychology: How People Think People Think. Cambridge University Press.
  • Gordon, A. S., and Swanson, R. 2008. StoryUpgrade: Finding stories in internet weblogs. In ICWSM.
  • Gordon, J., and Van Durme, B. 2013. Reporting bias and knowledge acquisition. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, AKBC ’13, 25–30. New York, NY, USA: ACM.
  • Lake, B. M.; Ullman, T. D.; Tenenbaum, J. B.; and Gershman, S. J. 2017. Building machines that learn and think like people. Behavioral and Brain Sciences 40:e253.
  • Lenat, D. B. 1995. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM 38(11):33–38.
  • Liu, C.-W.; Lowe, R.; Serban, I. V.; Noseworthy, M.; Charlin, L.; and Pineau, J. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP.
  • Marcus, G. 2018. Deep learning: A critical appraisal. CoRR abs/1801.00631.
  • Moore, C. 2013. The Development of Commonsense Psychology. Psychology Press.
  • Mostafazadeh, N.; Chambers, N.; He, X.; Parikh, D.; Batra, D.; Vanderwende, L.; Kohli, P.; and Allen, J. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In NAACL.
  • Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP.
  • Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL.
  • Rashkin, H.; Sap, M.; Allaway, E.; Smith, N. A.; and Choi, Y. 2018. Event2Mind: Commonsense inference on events, intents, and reactions. In ACL.
  • Roemmele, M.; Bejan, C. A.; and Gordon, A. S. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.
  • Schank, R., and Abelson, R. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Lawrence Erlbaum.
  • Schubert, L. 2002. Can we derive general world knowledge from texts? In Proceedings of the Second International Conference on Human Language Technology Research, HLT ’02, 94–97. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
  • Speer, R., and Havasi, C. 2012. Representing general relational knowledge in ConceptNet 5. In LREC.
  • Speer, R.; Chin, J.; and Havasi, C. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI, 4444–4451.
  • Tandon, N.; de Melo, G.; and Weikum, G. 2017. WebChild 2.0: Fine-grained commonsense knowledge distillation. In ACL.
  • Yang, B.; Yih, S. W.-t.; He, X.; Gao, J.; and Deng, L. 2015. Embedding entities and relations for learning and inference in knowledge bases. In ICLR.