GLUCOSE: GeneraLized and COntextualized Story Explanations

Nasrin Mostafazadeh
Lori Moon
David Buchanan
Lauren Berkowitz
Or Biran

EMNLP 2020.

Keywords:
mini theory, large-scale dataset, semi-structured, implicit commonsense inference, causal explanation
TL;DR:
We introduced GLUCOSE, a large-scale dataset of implicit commonsense knowledge, encoded as explanatory mini-theories grounded in a narrative context

Abstract:

When humans read or listen, they make implicit commonsense inferences that frame their understanding of what happened and why. As a step toward AI systems that can build similar mental models, we introduce GLUCOSE, a large-scale dataset of implicit commonsense causal knowledge, encoded as causal mini-theories about the world, each grounded in a narrative context. […]

Introduction
  • Humans make countless implicit commonsense inferences about everyday situations. For example, consider the following short story from the ROCStories corpus (Mostafazadeh et al., 2016): Gage was riding his bike. […]
  • When even young children read this story, they construct a coherent representation of what happened and why, combining information from the text with relevant background knowledge (Kintsch and Van Dijk, 1978)
  • They can construct the causal chain that explains how the car’s unexpected turn led to Gage falling, describe how Gage’s emotion and location changed throughout, and even hypothesize that he likely shouted for help after falling
Highlights
  • Humans make countless implicit commonsense inferences about everyday situations
  • We show a strong correlation between human and automatic evaluation metrics, which makes systematic and reliable evaluation of models feasible (a sketch of such a correlation check follows this list)
  • We introduced GLUCOSE, a large-scale dataset of implicit commonsense knowledge, encoded as explanatory mini-theories grounded in a narrative context
  • We presented our multi-stage pipeline for acquiring semi-structured causal explanations at scale from lay workers, resulting in 440K annotations in the context of everyday children’s stories
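
As a rough illustration of the correlation check mentioned above, the sketch below computes Pearson and Spearman correlations between per-example automatic scores and human ratings. All numbers and variable names are hypothetical placeholders, not values from the paper.

```python
# A minimal sketch of checking agreement between an automatic metric and
# human judgments. The scores below are made up for illustration; in the
# paper's setting one would use per-example BLEU (0-100) and human ratings
# (0-3) collected for the same model outputs.
from scipy.stats import pearsonr, spearmanr

bleu_scores   = [72.1, 65.3, 80.4, 55.9, 77.8]  # hypothetical automatic scores
human_ratings = [2.7, 2.3, 2.9, 1.8, 2.8]       # hypothetical human scores

r, p_r = pearsonr(bleu_scores, human_ratings)
rho, p_rho = spearmanr(bleu_scores, human_ratings)
print(f"Pearson r = {r:.2f} (p = {p_r:.3f}); Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```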
Results
  • Table 4 shows the results from the models described in Section 6, evaluated as per Section 5.
  • It shows that Enc-Dec uniformly outperforms all other models, confirming that full visibility into the story context helps an architecture better learn the intricacies of GLUCOSE rules.
  • Enc-Dec performs competitively with humans in many dimensions
  • The strength of this model’s performance in predicting both specific and general rules is a testament to the high quality of the GLUCOSE training data.
  • PT-LM’s poor performance shows that finetuning on our dataset is crucial for this task.
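
Concretely, the evaluation task is sequence-to-sequence: given a story, a highlighted sentence X, and a dimension, generate the specific statement and the general rule. The sketch below shows one plausible way to serialize an input/target pair for an encoder-decoder model; the dimension prefix, the "*" sentence markers, the "**" output separator, and the rule wording are illustrative assumptions, not necessarily the authors' exact format.

```python
# A hedged sketch of serializing one training example for an encoder-decoder
# model on the GLUCOSE task. Markers and separators are assumptions chosen for
# illustration; the story and rules paraphrase the paper's Gage example.
def make_example(story_sentences, x_index, dimension, specific_rule, general_rule):
    # Mark the selected sentence X so the encoder sees the full story context
    # with X highlighted ("full visibility into context").
    marked = [f"*{s}*" if i == x_index else s for i, s in enumerate(story_sentences)]
    source = f"#{dimension}: " + " ".join(marked)
    # Train the decoder to emit both the specific statement and the general rule.
    target = f"{specific_rule} ** {general_rule}"
    return source, target

story = [
    "Gage was riding his bike.",
    "A car turned in front of him.",  # paraphrased from the paper's example
    "Gage turned his bike sharply.",
]
src, tgt = make_example(
    story,
    x_index=2,
    dimension=1,
    # The "Causes/Enables" connective wording here is an assumption.
    specific_rule="A car turns in front of Gage Causes/Enables Gage turns his bike sharply",
    general_rule="SomethingA turns in front of SomeoneA Causes/Enables SomeoneA turns sharply",
)
```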
Conclusion
  • The authors introduced GLUCOSE, a large-scale dataset of implicit commonsense knowledge, encoded as explanatory mini-theories grounded in a narrative context.
  • To evaluate how well AI models can predict GLUCOSE knowledge on novel inputs (the ultimate value of such a dataset), the authors defined a standalone evaluation task: predicting specific and general inference rules given a story/sentence pair and a dimension.
  • The authors show that training on GLUCOSE data significantly improves model performance on unseen stories
Summary
  • Objectives:

    To enable developing models that can build mental models of narratives, the authors aimed to crowdsource a large, high-quality dataset.
Tables
  • Table 1: Entries in the GLUCOSE dataset that explain the Gage story around the sentence X = Gage turned his bike sharply. White and gray rows show specific statements and general rules, respectively. “Sth” is an abbreviation of “Something”. The syntactic slots used for constructing each semi-structured entry are shown underneath it
  • Table 2: Statistics about GLUCOSE dataset collection
  • Table 3: Ceiling overlap between GLUCOSE and other resources. Omitted dimensions had no overlap
  • Table 4: Evaluation results for GLUCOSE models. Human evaluation scores are out of 3; BLEU scores are out of 100. Gray and regular rows show results on general and specific rules, respectively. Human model’s performance was computed by showing judges a randomly selected answer from the three gold references
  • Table 5: Example model generations for the input story: Karen made a pan of lasagna. She brought it to the party. Nobody wanted to eat lasagna. Karen ate it for a week. She became tired of lasagna. (Sentence X is underlined.) Note that all test stories are unseen in the train or validation set
Related work
  • Recently, there has been a renewed interest in commonsense reasoning (Talmor et al., 2019; Tandon et al., 2019; Rashkin et al., 2018a; Zellers et al., 2018), further fostered by the increasing need for explainable AI systems (Yang et al., 2018).

    One well-known type of commonsense knowledge is script knowledge, defined by Schank and Abelson (1977) as structured knowledge about stereotypical event sequences and their participants. However, manual encoding of such knowledge is notoriously unscalable and brittle. A more recent line of work is unsupervised learning of “narrative schemas” (Chambers and Jurafsky, 2008, 2009; Balasubramanian et al., 2013; Sha et al., 2016), where common event sequences are automatically induced from large corpora. While promising, this approach has not produced high-quality knowledge usable for downstream tasks at scale (Mostafazadeh et al., 2016). Furthermore, since commonsense knowledge is often implicit, such corpus-based methods are unlikely to induce implicit commonsense inferences (Gordon and Van Durme, 2013). In contrast, our data collection framework enables us to acquire high-quality and robust commonsense knowledge, including often unstated rules such as “SomeoneA gives SomeoneB SomethingA Results in SomeoneB possesses SomethingA” or “SomeoneA is at SomewhereA Enables SomeoneA puts SomethingA at SomewhereA.”
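
Such semi-structured rules lend themselves to a simple programmatic representation. Below is a minimal sketch of parsing a rule string into its antecedent, connective, and consequent; only "Results in" and "Enables" are attested in the paragraph above, so treating them as an extensible inventory is an assumption for illustration.

```python
# A minimal sketch of splitting a GLUCOSE-style semi-structured rule into
# (antecedent, connective, consequent). The connective inventory beyond the
# two attested above is an assumption.
from dataclasses import dataclass

CONNECTIVES = ("Results in", "Enables")  # extend as needed

@dataclass
class Rule:
    antecedent: str
    connective: str
    consequent: str

def parse_rule(text: str) -> Rule:
    for conn in CONNECTIVES:
        marker = f" {conn} "
        if marker in text:
            left, right = text.split(marker, 1)
            return Rule(left.strip(), conn, right.strip())
    raise ValueError(f"No known connective in rule: {text!r}")

rule = parse_rule("SomeoneA gives SomeoneB SomethingA Results in SomeoneB possesses SomethingA")
# Rule(antecedent='SomeoneA gives SomeoneB SomethingA', connective='Results in',
#      consequent='SomeoneB possesses SomethingA')
```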
Study subjects and analysis
rounds of pilot studies: 6
Ceiling overlap with other resources by dimension (cf. Table 3):

Dimension    1     2     5     6     7     10
ConceptNet   1.2%  0.3%  0%    1.9%  0%    0%
ATOMIC       7.8%  1.2%  2.9%  5.3%  1.8%  4.9%

The crowdsourcing task is the result of more than six rounds of pilot studies, iteratively improving the interaction elements, functionality, dimension definitions, instructions, and examples. See Appendix B for more details on our crowdsourcing pipeline

workers: 3
We computed an estimated target age for each story and sampled from the 5–8 age group. To ensure diverse viewpoints and hypotheses, each S, X pair was assigned to three workers. Data collection statistics are shown in Table 2 and Figure 1
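
The paper cites age-of-acquisition (AoA) ratings (Kuperman et al., 2012), which suggests one plausible way to estimate a story's target age. The sketch below is a hypothetical implementation: the mean-over-content-words aggregation and the toy ratings are assumptions for illustration, not the authors' stated method.

```python
# A hypothetical sketch of estimating a story's target age from word-level
# age-of-acquisition (AoA) ratings such as Kuperman et al. (2012). Whether
# the authors aggregate by mean, max, or something else is not stated here,
# so the mean over known words is an illustrative assumption.
import re

def estimated_target_age(story: str, aoa: dict[str, float]) -> float:
    words = re.findall(r"[a-z']+", story.lower())
    ages = [aoa[w] for w in words if w in aoa]
    return sum(ages) / len(ages) if ages else float("nan")

# Toy AoA table (ratings are made up for illustration):
aoa_ratings = {"gage": 6.0, "was": 3.0, "riding": 4.5, "his": 3.2, "bike": 4.1}
age = estimated_target_age("Gage was riding his bike.", aoa_ratings)
```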

workers with the highest quality rating: 3
Test Set Curation: For a test set on commonsense reasoning to offer accurate and reliable evaluation, it should contain unambiguous examples with clear gold answers. This led to a curation process that identifies examples on which humans have high agreement, as follows: we sampled S, X pairs annotated by any three workers with the highest quality rating. A dimension d for S, X was allowed into the test set if 1) d was annotated by all three workers, and 2) the three specific statements had a round-robin average sentence-level BLEU (Lin and Och, 2004) score above 0.75

In a footnote, the authors add that they also evaluated GLUCOSE’s specific statements against ConceptNet, with nearly identical results to those in Table 3
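
The agreement filter above is straightforward to implement. Below is a minimal sketch, assuming "round-robin average sentence-level BLEU" means averaging smoothed sentence-level BLEU over all ordered pairs of the three workers' statements; the smoothing method is an assumption.

```python
# A minimal sketch of the test-set agreement filter, assuming "round-robin
# average sentence-level BLEU" means averaging over all ordered pairs of the
# three workers' specific statements; the smoothing choice is an assumption.
from itertools import permutations
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def round_robin_bleu(statements: list[str]) -> float:
    smooth = SmoothingFunction().method1
    scores = [
        sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
        for hyp, ref in permutations(statements, 2)
    ]
    return sum(scores) / len(scores)

def admit_dimension(statements: list[str], threshold: float = 0.75) -> bool:
    # Condition 1: all three workers annotated the dimension.
    # Condition 2: their specific statements agree closely enough.
    return len(statements) == 3 and round_robin_bleu(statements) > threshold
```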

References
  • Niranjan Balasubramanian, Stephen Soderland, Mausam, and Oren Etzioni. 2013. Generating coherent event schemas at scale. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1721–1731, Seattle, Washington, USA. Association for Computational Linguistics.
  • Yonatan Belinkov, Adam Poliak, Stuart Shieber, Benjamin Van Durme, and Alexander Rush. 2019. On adversarial removal of hypothesis-only bias in natural language inference. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 256–262, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics.
  • Justin T. A. Busch, Aiyana K. Willard, and Cristine H. Legare. 2018. Explanation scaffolds causal learning and problem solving in childhood. In Active Learning from Infancy to Childhood, pages 113–127. Springer.
  • Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797, Columbus, Ohio. Association for Computational Linguistics.
  • Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 602–610, Suntec, Singapore. Association for Computational Linguistics.
  • Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.
  • Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.
  • Jonathan Gordon and Benjamin Van Durme. 2013. Reporting bias and knowledge acquisition. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, AKBC ’13, pages 25–30, New York, NY, USA. ACM.
  • Ilaria Grazzani, Veronica Ornaghi, Elisabetta Conte, Alessandro Pepe, and Claudia Caprin. 2018. The relation between emotion understanding and theory of mind in children aged 3 to 8: The key role of language. Frontiers in Psychology, 9:724.
  • Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1689–1701, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Walter Kintsch and Teun A. Van Dijk. 1978. Toward a model of text comprehension and production. Psychological Review, 85(5):363.
  • Victor Kuperman, Hans Stadthagen-Gonzalez, and Marc Brysbaert. 2012. Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4):978–990.
  • Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 605–612, Barcelona, Spain.
  • Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas. Association for Computational Linguistics.
  • Tania Lombrozo. 2006. The structure and function of explanations. Trends in Cognitive Sciences, 10(10):464–470.
  • Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.
  • Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849, San Diego, California. Association for Computational Linguistics.
  • Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252, Copenhagen, Denmark. Association for Computational Linguistics.
  • Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.
  • Hannah Rashkin, Antoine Bosselut, Maarten Sap, Kevin Knight, and Yejin Choi. 2018a. Modeling naive psychology of characters in simple commonsense stories. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2289–2299, Melbourne, Australia. Association for Computational Linguistics.
  • Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A. Smith, and Yejin Choi. 2018b. Event2Mind: Commonsense inference on events, intents, and reactions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 463–473, Melbourne, Australia. Association for Computational Linguistics.
  • Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah Smith, and Yejin Choi. 2019. ATOMIC: An atlas of machine commonsense for if-then reasoning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence.
  • Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Lawrence Erlbaum Associates.
  • Lei Sha, Sujian Li, Baobao Chang, and Zhifang Sui. 2016. Joint learning templates and slots for event schema induction. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 428–434, San Diego, California. Association for Computational Linguistics.
  • Rishi Sharma, James Allen, Omid Bakhshandeh, and Nasrin Mostafazadeh. 2018. Tackling the story ending biases in the story cloze test. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 752–757, Melbourne, Australia. Association for Computational Linguistics.
  • Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
  • Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Niket Tandon, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, and Antoine Bosselut. 2019. WIQA: A dataset for “what if...” reasoning over procedural text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6076–6085, Hong Kong, China. Association for Computational Linguistics.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Shaohua Yang, Qiaozi Gao, Sari Sadiya, and Joyce Chai. 2018. Commonsense justification for action explanation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2627–2637, Brussels, Belgium. Association for Computational Linguistics.
  • Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.
  • Yilun Zhou, Steven Schockaert, and Julie Shah. 2019. Predicting ConceptNet path quality using crowdsourced assessments of naturalness. In The World Wide Web Conference, WWW ’19, pages 2460–2471, New York, NY, USA. ACM.
  • Rolf A. Zwaan, Mark C. Langston, and Arthur C. Graesser. 1995. The construction of situation models in narrative comprehension: An event-indexing model. Psychological Science, 6(5):292–297.
  • Rolf A. Zwaan and Gabriel A. Radvansky. 1998. Situation models in language comprehension and memory. Psychological Bulletin, 123(2):162.