Abductive Commonsense Reasoning

Chaitanya Malaviya
Ari Holtzman
Hannah Rashkin

ICLR, 2020.


Abstract:

Abductive reasoning is inference to the most plausible explanation. For example, if Jenny finds her house in a mess when she returns from work, and remembers that she left a window open, she can hypothesize that a thief broke into her house and caused the mess, as the most plausible explanation. While abduction has long been con…

Introduction
  • Abductive reasoning is inference to the most plausible explanation for incomplete observations (Peirce, 1965a).
  • Given two incomplete observations about the world, O1: “Jenny cleaned her house and went to work, leaving the window just a crack open.” and, some time later, O2: “When Jenny returned home, she saw her house was a mess.”, we can hypothesize different potential explanations and reason about which is the most likely.
  • One crucial observation Peirce makes about abductive reasoning is that abduction is “the only logical operation which introduces any new ideas”, which contrasts with other types of inference such as entailment, which focuses on inferring only information that is already provided in the premise.
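The selection problem described above can be sketched in a few lines: given O1 and O2, pick whichever of two candidate hypotheses makes the completed narrative more plausible under some scoring model. The sketch below is illustrative only; the scorer here is a toy word-overlap heuristic standing in for the fine-tuned models the paper actually evaluates, and all function names are hypothetical.

```python
# Minimal sketch of the αNLI selection setup: choose the hypothesis that,
# inserted between O1 and O2, yields the more plausible narrative.
# score_fn is pluggable; in the paper this role is played by fine-tuned
# models such as BERT, not by the toy heuristic used here.

def choose_hypothesis(score_fn, o1, h1, h2, o2):
    """Return whichever hypothesis makes the narrative (O1, H, O2) score higher."""
    return h1 if score_fn(o1, h1, o2) >= score_fn(o1, h2, o2) else h2

def overlap_score(o1, h, o2):
    """Toy plausibility proxy: vocabulary overlap between H and both observations."""
    h_words = set(h.lower().split())
    return len(h_words & set(o1.lower().split())) + len(h_words & set(o2.lower().split()))

o1 = "Jenny cleaned her house and went to work, leaving the window just a crack open."
o2 = "When Jenny returned home, she saw her house was a mess."
h1 = "A thief broke into the house by pulling the window open."
h2 = "At work, Jenny opened a new bank account."

print(choose_hypothesis(overlap_score, o1, h1, h2, o2))
```

As the paper's adversarial filtering results suggest, shallow cues like lexical overlap are exactly what a good benchmark must defeat; the point of the sketch is only the argmax-over-hypotheses structure of the task.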
Highlights
  • The brain is an abduction machine, continuously trying to prove abductively that the observables in its environment constitute a coherent situation.

    – Jerry Hobbs, ACL 2013 Lifetime Achievement Award

    Abductive reasoning is inference to the most plausible explanation for incomplete observations (Peirce, 1965a)
  • We present our evaluation of fine-tuned state-of-the-art pre-trained language models on the ART dataset, along with several other baseline systems, for both αNLI and αNLG
  • We present the first study that investigates the viability of language-based abductive reasoning
  • We conceptualize and introduce Abductive Natural Language Inference – a novel task focused on abductive reasoning in narrative contexts
  • We establish comprehensive baseline performance on this new task based on state-of-the-art NLI and language models; the best reaches 68.9% accuracy, leaving a considerable gap to human performance (91.4%)
  • We hope that ART will serve as a challenging benchmark for future research in language-based abductive reasoning, and that the αNLI and αNLG tasks will encourage representation learning that enables complex reasoning capabilities in AI systems
Results
  • The authors present an evaluation of fine-tuned state-of-the-art pre-trained language models on the ART dataset, along with several other baseline systems, for both αNLI and αNLG.
  • Despite strong performance on several other NLP benchmark datasets, the best baseline model, based on BERT, achieves an accuracy of just 68.9% on ART, compared to human performance of 91.4%.
  • The large gap between human performance and that of the best system leaves significant scope for developing more sophisticated abductive reasoning models.
Conclusion
  • The authors introduce Abductive Natural Language Generation – a novel task that requires machines to generate plausible hypotheses for given observations.
  • To support these tasks, the authors create and introduce a new challenge dataset, ART, which consists of 20,000 commonsense narratives accompanied by over 200,000 explanatory hypotheses.
  • The authors hope that ART will serve as a challenging benchmark for future research in language-based abductive reasoning, and that the αNLI and αNLG tasks will encourage representation learning that enables complex reasoning capabilities in AI systems.
Tables
  • Table1: Performance of baselines and finetuned-LM
  • Table2: Performance of generative models on the test set of ART. All models except GPT2-Fixed are fine-tuned on ART
  • Table3: BERT’s performance and human evaluation on categories for 1,000 instances from the test set, based on commonsense reasoning domains (Numerical, Spatial, Emotional). The number in parenthesis indicates the size of the category
  • Table4: Fraction of dataset for which a particular transition in the story is broken for the negative hypothesis, for 1,000 random instances from the test set
  • Table5: Transfer Learning from ART
  • Table6: Some statistics summarizing the ART dataset. The train set includes all plausible and implausible hypotheses collected via crowdsourcing, while the dev and test sets include the hypotheses selected through the Adversarial Filtering algorithm
  • Table7: Input formats for GPT and BERT fine-tuning
  • Table8: Input format used for training and generated text from various GPT2-based models. cji refers to the COMeT embeddings obtained using a separate transformer model for relation i and observation j. Similarly, Tij is the textual phrase for relation i, observation j. Where appropriate, field-specific start and end tags are added to the sequence of inputs
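Table 7 describes the input formats used for GPT and BERT fine-tuning. A minimal sketch of how an αNLI instance could be packed into a BERT-style sequence pair is below; the exact delimiters and field ordering the paper uses may differ, so treat this formatting as an assumption rather than the paper's specification ([CLS] and [SEP] are the standard BERT special tokens).

```python
# Hypothetical sketch of packing an αNLI instance into a BERT-style
# sequence-pair input, in the spirit of Table 7. The exact format used
# in the paper may differ; [CLS]/[SEP] are standard BERT special tokens.

def bert_anli_input(o1, hypothesis, o2):
    # Segment A: first observation plus the candidate hypothesis;
    # Segment B: second observation. A classifier head over [CLS]
    # then scores how plausible the completed narrative is.
    return f"[CLS] {o1} {hypothesis} [SEP] {o2} [SEP]"

print(bert_anli_input(
    "Jenny left the window open.",
    "A thief broke in.",
    "The house was a mess."))
```

Each candidate hypothesis yields one such sequence, and the model's scores for the two sequences are compared to make the binary αNLI prediction.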
Related work
  • Cloze-Style Task vs. Abductive Reasoning: Since abduction is fundamentally concerned with plausible chains of cause-and-effect, our work draws inspiration from previous works that deal with narratives, such as script learning (Schank & Abelson, 1975) and the narrative cloze test (Chambers & Jurafsky, 2009; Jans et al., 2012; Pichotta & Mooney, 2014; Rudinger et al., 2015). Related cloze-style benchmarks include WSC (Levesque et al., 2011), DPR (Rahman & Ng, 2012), and Hellaswag (Zellers et al., 2019). Rather than learning prototypical scripts or narrative chains, we instead reason about the most plausible events conditioned on observations. We make use of the ROCStories dataset (Mostafazadeh et al., 2016), which was specifically designed for the narrative cloze task. But instead of reasoning about plausible event sequences, our task requires reasoning about plausible explanations for narrative omissions.

  • Entailment vs. Abductive Reasoning: The formulation of αNLI is closely related to entailment NLI, but two critical distinctions make abductive reasoning uniquely challenging. First, abduction requires reasoning about commonsense implications of observations (e.g., if we observe that the “grass is wet”, a likely hypothesis is that “it rained earlier”), which go beyond the linguistic notion of entailment (as also noted by Josephson (2000)). Second, abduction requires non-monotonic reasoning about a set of commonsense implications collectively, to check potential contradictions against multiple observations and to compare the plausibility of different hypotheses. This makes abductive reasoning distinctly challenging compared to other forms of reasoning such as induction and deduction (Shank, 1998). Perhaps more importantly, abduction is closely related to the kind of reasoning humans perform in everyday situations, where information is incomplete and definite inferences cannot be made.
Funding
  • This research was supported in part by NSF (IIS-1524371), the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1256082, DARPA CwC through ARO (W911NF-15-1-0543), the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), and the Allen Institute for AI
  • Computations on beaker.org were supported in part by credits from Google Cloud
Reference
  • Henning Andersen. Abductive and deductive change. Language, pp. 765–793, 1973. URL https://www.jstor.org/stable/pdf/412063.pdf.
  • Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, 2005.
  • Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. COMET: Commonsense transformers for automatic knowledge graph construction. arXiv preprint arXiv:1906.05317, 2019.
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015. URL https://nlp.stanford.edu/pubs/snli_paper.pdf.
  • Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, pp. 9560–9572, 2018. URL https://papers.nips.cc/paper/8163-e-snli-natural-language-inference-with-natural-language-explanations.pdf.
  • Nathanael Chambers and Dan Jurafsky. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 602–610, Suntec, Singapore, August 2009. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P/P09/P09-1068.
  • Eugene Charniak and Solomon Eyal Shimony. Probabilistic semantics for cost based abduction. Brown University, Department of Computer Science, 1990. URL https://www.aaai.org/Papers/AAAI/1990/AAAI90-016.pdf.
  • Qian Chen, Xiao-Dan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. In ACL, 2017. URL https://www.aclweb.org/anthology/P17-1152.
  • Peter E. Clark, Philip Harrison, John A. Thompson, William R. Murray, Jerry R. Hobbs, and Christiane Fellbaum. On the role of lexical and world knowledge in RTE3. In ACL-PASCAL@ACL, 2007. URL https://www.aclweb.org/anthology/W07-1409.
  • Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1070. URL https://www.aclweb.org/anthology/D17-1070.
  • Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. Springer, 2006. URL http://u.cs.biu.ac.il/~dagan/publications/RTEChallenge.pdf.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. URL https://arxiv.org/abs/1810.04805.
  • Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 107–112, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2017. URL https://www.aclweb.org/anthology/N18-2017.
  • Jerry R. Hobbs, Mark Stickel, Paul Martin, and Douglas Edwards. Interpretation as abduction. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, pp. 95–103, Buffalo, New York, USA, June 1988. Association for Computational Linguistics. doi: 10.3115/982023.982035. URL https://www.aclweb.org/anthology/P88-1012.
  • Bram Jans, Steven Bethard, Ivan Vulić, and Marie-Francine Moens. Skip n-grams and ranking functions for predicting script events. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 336–344, Avignon, France, April 2012. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E12-1034.
  • Susan G. Josephson. Abductive inference: Computation, philosophy, technology. 2000. URL https://philpapers.org/rec/JOSAIC.
  • George Lakoff. Linguistics and natural logic. Synthese, 22(1-2):151–271, 1970. URL https://link.springer.com/article/10.1007/BF00413602.
  • Hector J. Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In KR, 2011.
  • Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
  • Peter LoBue and Alexander Yates. Types of common-sense knowledge needed for recognizing textual entailment. In ACL, 2011. URL https://www.aclweb.org/anthology/P11-2057.
  • Bill MacCartney and Christopher D. Manning. Natural logic for textual inference. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pp. 193–200, Prague, June 2007. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W07-1431.
  • Bill MacCartney and Christopher D. Manning. An extended model of natural logic. In Proceedings of the Eighth International Conference on Computational Semantics, pp. 140–156, Tilburg, The Netherlands, January 2009. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W09-3714.
  • Nicole Maslan, Melissa Roemmele, and Andrew S. Gordon. One hundred challenge problems for logical formalizations of commonsense psychology. In AAAI Spring Symposia, 2015. URL http://people.ict.usc.edu/~gordon/publications/AAAI-SPRING15.PDF.
  • Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pp. 839–849. Association for Computational Linguistics, 2016. doi: 10.18653/v1/N16-1098. URL http://aclweb.org/anthology/N16-1098.
  • Peter Norvig. Inference in text understanding. In AAAI, pp. 561–565, 1987. URL http://norvig.com/aaai87.pdf.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
  • Judea Pearl. Reasoning with cause and effect. AI Magazine, 23(1):95, 2002. URL https://ftp.cs.ucla.edu/pub/stat_ser/r265-ai-mag.pdf.
  • Judea Pearl and Dana Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, Inc., New York, NY, USA, 1st edition, 2018. ISBN 046509760X, 9780465097609. URL https://dl.acm.org/citation.cfm?id=3238230.
  • Charles Sanders Peirce. Collected papers of Charles Sanders Peirce, volume 5. Harvard University Press, 1965a. URL http://www.hup.harvard.edu/catalog.php?isbn=9780674138001.
  • Charles Sanders Peirce. Pragmatism and pragmaticism, volume 5. Belknap Press of Harvard University Press, 1965b. URL https://www.jstor.org/stable/224970.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://www.aclweb.org/anthology/D14-1162.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202.
  • Karl Pichotta and Raymond Mooney. Statistical script learning with multi-argument events. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 220–229, Gothenburg, Sweden, April 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E14-1024.
  • Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pp. 180–191, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/S18-2023. URL https://www.aclweb.org/anthology/S18-2023.
  • Alec Radford. Improving language understanding by generative pre-training. 2018.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
  • Altaf Rahman and Vincent Ng. Resolving complex cases of definite pronouns: The Winograd schema challenge. In EMNLP-CoNLL, 2012.
  • Rajat Raina, Andrew Y. Ng, and Christopher D. Manning. Robust textual inference via learning and abductive reasoning. In AAAI, pp. 1099–1105, 2005. URL https://nlp.stanford.edu/~manning/papers/aaai05-learnabduction.pdf.
  • Rachel Rudinger, Pushpendre Rastogi, Francis Ferraro, and Benjamin Van Durme. Script induction as language modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1681–1686, Lisbon, Portugal, September 2015. Association for Computational Linguistics. URL http://aclweb.org/anthology/D15-1195.
  • Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. In AAAI, 2020.
  • Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. ATOMIC: An atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3027–3035, 2019.
  • Roger C. Schank and Robert P. Abelson. Scripts, plans, and knowledge. In Proceedings of the 4th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI'75, pp. 151–157, San Francisco, CA, USA, 1975. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1624626.1624649.
  • Gary Shank. The extraordinary ordinary powers of abductive reasoning. Theory & Psychology, 8(6):841–860, 1998. URL https://journals.sagepub.com/doi/10.1177/0959354398086007.
  • Masatoshi Tsuchiya. Performance impact caused by hidden bias of training data for recognizing textual entailment. CoRR, abs/1804.08117, 2018. URL http://www.lrec-conf.org/proceedings/lrec2018/pdf/786.pdf.
  • Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575, 2015.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355, Brussels, Belgium, November 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W18-5446.
  • Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1101. URL https://www.aclweb.org/anthology/N18-1101.
  • Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018. URL https://aclweb.org/anthology/D18-1009.
  • Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In ACL, 2019.
  • Sheng Zhang, Rachel Rudinger, Kevin Duh, and Benjamin Van Durme. Ordinal common-sense inference. Transactions of the Association for Computational Linguistics, 5:379–395, 2017. doi: 10.1162/tacl_a_00068. URL https://www.aclweb.org/anthology/Q17-1027.
  • Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019.