Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning

EMNLP 2020

Abstract

Abductive and counterfactual reasoning, core abilities of everyday human cognition, require reasoning about what might have happened at time t, while conditioning on multiple contexts from the relative past and future. However, simultaneous incorporation of past and future contexts using generative language models (LMs) can be challenging...

Introduction
  • Everyday causal reasoning requires reasoning about likely explanations for a partially observable past and future (abductive reasoning; Peirce, 1960), and about alternative futures given a counterfactual past (a rough formalization follows this list).
  • For example: "Ray hung a tire on a rope to make his daughter a swing." → "She hit the rope and the tire fell on top of her." → "Ray ran to his daughter to make sure she was okay."
  • Most NLP benchmarks have focused on reasoning about information that is entailed from the premise.
  • It has been noted that human reasoning often goes the other way: hypotheses frequently contain new information that was not available in the premise but is plausibly true.
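A rough formalization of the abductive setting (a sketch only, using the X/Y/Z notation that appears in the Methods summary below; the paper's exact decoding objective may differ): the hypothesis Y is a generation squeezed between a past observation X and a future observation Z.

```latex
% Abductive generation, informally: find a hypothesis Y that a left-to-right LM
% finds likely as a continuation of X, while also making the future observation Z
% likely afterwards. This is an approximation, not the paper's exact loss.
\[
  Y^{*} \;\approx\; \arg\max_{Y}\;
    \log P_{\mathrm{LM}}(Y \mid X) \;+\; \log P_{\mathrm{LM}}(Z \mid X, Y)
\]
```

Roughly speaking, the counterfactual setting is analogous, except that the future-side constraint asks the rewritten ending Y to stay close to the original ending Z rather than to predict it.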
Highlights
  • Everyday causal reasoning requires reasoning about likely explanations for a partially observable past and future (abductive reasoning; Peirce, 1960), and about alternative futures given a counterfactual past
  • We investigate an alternative path toward language-based nonmonotonic reasoning using pre-trained language models as is
  • This paper presents DELOREAN (DEcoding for nonmonotonic LOgical REAsoNing), an unsupervised decoding algorithm that assumes only off-the-shelf left-to-right language models and no supervision
  • The results in Table 1 show that DELOREAN performs best among the unsupervised systems across all metrics
  • We presented DELOREAN, an unsupervised LM-based approach to generate text conditioned on past context as well as future constraints, through forward and backward passes considering each condition (a code sketch follows this list)
  • We demonstrated its effectiveness for abductive and counterfactual reasoning, on which it performed substantially better than unsupervised baselines
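Concretely, the forward/backward interplay referenced above can be pictured with a heavily simplified sketch. This is not the authors' released implementation: it assumes an off-the-shelf HuggingFace GPT-2, omits the paper's initialization, temperature and mixing schedules, and candidate-ranking step, and the hyperparameters (y_len, num_iters, gamma, step_size) are illustrative placeholders.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in lm.parameters():           # we only need gradients w.r.t. the hypothesis
    p.requires_grad_(False)
wte = lm.get_input_embeddings()     # token embedding table, shape (vocab, dim)

def delorean_style_decode(past_text, future_text, y_len=15,
                          num_iters=10, gamma=0.5, step_size=1.0):
    x_ids = tok.encode(past_text, return_tensors="pt")    # past context X
    z_ids = tok.encode(future_text, return_tensors="pt")  # future constraint Z
    x_emb, z_emb = wte(x_ids), wte(z_ids)
    lx, lz, vocab = x_ids.shape[1], z_ids.shape[1], wte.weight.shape[0]

    # The hypothesis Y is kept "soft": a matrix of logits over the vocabulary.
    y_logits = torch.zeros(1, y_len, vocab)

    for _ in range(num_iters):
        y_logits = y_logits.detach().requires_grad_(True)
        y_emb = F.softmax(y_logits, dim=-1) @ wte.weight   # soft embeddings of Y

        # Backward pass: backpropagate the NLL of the future constraint Z,
        # computed with (X, soft Y) as context, into the Y logits.
        logits = lm(inputs_embeds=torch.cat([x_emb, y_emb, z_emb], dim=1)).logits
        z_pred = logits[:, lx + y_len - 1 : lx + y_len + lz - 1, :]
        nll_z = F.cross_entropy(z_pred.reshape(-1, vocab), z_ids.reshape(-1))
        nll_z.backward()
        y_backward = y_logits.detach() - step_size * y_logits.grad

        # Forward pass: the LM's ordinary left-to-right logits for Y given X,
        # then mix the forward and backward estimates.
        with torch.no_grad():
            logits = lm(inputs_embeds=torch.cat([x_emb, y_emb], dim=1)).logits
            y_forward = logits[:, lx - 1 : lx + y_len - 1, :]
            y_logits = gamma * y_forward + (1 - gamma) * y_backward

    return tok.decode(y_logits.argmax(dim=-1)[0])

# Abductive example from the introduction above: generate a hypothesis
# between the two observations.
x = "Ray hung a tire on a rope to make his daughter a swing."
z = "Ray ran to his daughter to make sure she was okay."
print(delorean_style_decode(x, z))
```

The point the sketch preserves is that Y is represented as soft logits, so the likelihood of the future constraint Z is differentiable with respect to it, letting an ordinary left-to-right LM be steered by information from the future.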
Methods
  • Baselines: The authors compare the method against baselines from Bhagavatula et al. (2019). The unsupervised baselines use a pre-trained GPT-2 model to generate Y given a prompt text—either the observation X alone (Zero-ShotX) or Z e X (Zero-ShotZX), where e denotes a special end-of-text token (a prompting sketch follows this list).
  • Example abductive inputs and generations, with "?" marking the hypothesis Y to be generated:

    "Ray drove his car on a steep mountain road. ? Ray was fine but his car was totaled." → "As he drives the car to the top of the mountain his car is hit by a car."

    "Peter was excited to go to the Sanders rally in New Hampshire. ? He couldn't wait to vote for him." → "He has a long history of supporting Bernie Sanders and was excited to see him in person."
  • For the counterfactual task, the zero-shot baseline uses the pre-trained GPT-2 model to generate Y as a continuation of the counterfactual condition X.
  • The authors also experiment with two baselines that fine-tune GPT-2 on the original story X_ori Z to fit the model to the story domain, either with an LM objective (FT) or a tailored conditional objective that encourages minimal edits of Z (Recon+CF). In addition, they report the performance of a supervised baseline (Sup), in which GPT-2 is fine-tuned to produce the gold Y from X_ori Z and X.
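For concreteness, a minimal sketch of the Zero-ShotX / Zero-ShotZX prompting described above, assuming a HuggingFace GPT-2; the decoding settings (max_new_tokens, top_p) are illustrative and not taken from the paper.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def zero_shot(prompt):
    ids = tok.encode(prompt, return_tensors="pt")
    out = lm.generate(ids, max_new_tokens=30, do_sample=True, top_p=0.9,
                      pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

x = "Ray drove his car on a steep mountain road."   # observation X
z = "Ray was fine but his car was totaled."         # observation Z

print(zero_shot(x))                       # Zero-ShotX: prompt with X alone
print(zero_shot(z + tok.eos_token + x))   # Zero-ShotZX: Z, end-of-text token, then X
```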
Results
  • Automatic evaluation (abductive task): The authors report the same metrics as Bhagavatula et al. (2019): BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004), and BERTSCORE (Zhang et al., 2019) (a metrics sketch follows this list).
  • Automatic evaluation (counterfactual task): Following Qin et al. (2019a), the authors report BERTSCORE (Zhang et al., 2019), which was shown to correlate best with human judges' notion of counterfactual coherence, as well as BLEU-4 and ROUGE-L, which better measure minimal edits.
  • Human evaluation: Presented with the original story, the counterfactual condition X, and the generated ending Y, workers were asked to judge (1) the coherence of Y with respect to X and (2) to what extent the generated ending minimally edits the original ending. To judge both criteria jointly, the authors report the weighted harmonic mean Hβ of these scores across a range of weights β (Figure 4).
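For reference, the reported metrics can be computed with common open-source packages (nltk, rouge-score, bert-score); this is a generic sketch rather than the authors' evaluation scripts, and the h_beta helper assumes the usual F-beta-style weighted harmonic mean (which of the two scores β favors is an assumption).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def bleu4(reference, candidate):
    return sentence_bleu([reference.split()], candidate.split(),
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)

def rouge_l(reference, candidate):
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, candidate)["rougeL"].fmeasure

def bertscore_f1(references, candidates):
    _, _, f1 = bert_score(candidates, references, lang="en")  # (P, R, F1) tensors
    return f1.mean().item()

def h_beta(coherence, min_edit, beta=1.0):
    # Weighted harmonic mean of two scores in [0, 1], analogous to F-beta.
    return (1 + beta**2) * coherence * min_edit / (beta**2 * coherence + min_edit)
```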
Conclusion
  • The authors presented DELOREAN, an unsupervised LM-based approach to generate text conditioned on past context as well as future constraints, through forward and backward passes considering each condition.
  • The authors demonstrated its effectiveness for abductive and counterfactual reasoning, on which it performed substantially better than unsupervised baselines.
  • The authors' method is general and can be adapted for other generative reasoning tasks
Tables
  • Table 1: Automatic evaluation results on the abductive task, using the test set of ART
  • Table 2: Human calibration results on the test set of ART. All scores are normalized to [0, 1]
  • Table 3: Human pairwise comparison results on the test set of ART, between DELOREAN and each of the baselines, jointly considering all three criteria from Table 2. "Neutral" means "equally good/bad"
  • Table 4: Automatic evaluation results of counterfactual story rewriting, on the test set of TIMETRAVEL
  • Table 5: Human pairwise comparison results on the counterfactual task, between our best model and each baseline, with respect to coherence and min-edits
Related Work
  • Unsupervised text generation. Unsupervised approaches are often applied to problems that copy information from a source text into the decoded text. Unsupervised paraphrasing requires repeating this information (Miao et al., 2019; Bao et al., 2019), as does translation, but with a bilingual transformation (Artetxe et al., 2017; Lample et al., 2018). In summarization there is the additional task of selecting a subset of the original text (Baziotis et al., 2019; Schumann et al., 2020; West et al., 2019). In cases where information is mostly copied from the original, auto-encoding objectives can ensure the correct information is captured (Bao et al., 2019; Baziotis et al., 2019; Artetxe et al., 2017). This work tackles problems where generation is more open-ended: rather than reproducing information from the prompt, generations should agree with and expand on it, making auto-encoding less applicable.
Funding
  • This research was supported in part by DARPA CwC through ARO (W911NF15-1-0543), the DARPA MCS program through NIWC Pacific (N66001-192-4031), and the Allen Institute for AI.
Study Subjects and Analysis
Test examples evaluated by crowdworkers: 100. The authors note that their ranking step improves both the performance of their model and that of the zero-shot baselines. They conduct two sets of human evaluations on 100 test examples using crowdworkers from Amazon Mechanical Turk. In the scoring setting (Table 2), workers were presented a pair of observations (X and Z) and a generated hypothesis Y, and asked to rate the coherence of the hypothesis with respect to observation X (X-Y), observation Z (Y-Z), and both (X-Y-Z), on a 4-point Likert scale.

Workers per example: 3. In the pairwise comparison setting (Table 3), workers were presented the outputs of a pair of systems (DELOREAN and a baseline) and asked to choose the better output in terms of the same coherence criteria; each example was labeled by 3 workers. In both evaluation setups, the method substantially outperforms the unsupervised baselines, achieving a relative improvement of 36%–215% with respect to Y-Z coherence.

References
  • Henning Andersen. 1973. Abductive and deductive change. Language, pages 765–793.
  • Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.
  • Yu Bao, Hao Zhou, Shujian Huang, Lei Li, Lili Mou, Olga Vechtomova, Xinyu Dai, and Jiajun Chen. 2019. Generating sentences from disentangled syntactic and semantic spaces. arXiv preprint arXiv:1907.05789.
  • Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, and Alexandros Potamianos. 2019. Seq3: Differentiable sequence-to-sequence-to-sequence autoencoder for unsupervised abstractive sentence compression. In NAACL-HLT.
  • Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. 2019. Abductive commonsense reasoning. In International Conference on Learning Representations.
  • Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics.
  • Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
  • Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797.
  • Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • C. Donahue, M. Lee, and P. Liang. 2020. Enabling language models to fill in the blanks. In ACL.
  • Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In ACL.
  • Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104.
  • Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.
  • Matthew L Ginsberg. 1986. Counterfactuals. Artificial Intelligence, 30(1):35–79.
  • Nelson Goodman. 1947. The problem of counterfactual conditionals. The Journal of Philosophy, 44(5):113–128.
  • Mark Granroth-Wilding and Stephen Clark. 2016. What happens next? Event prediction using a compositional neural network model. In Thirtieth AAAI Conference on Artificial Intelligence.
  • Jerry R Hobbs, Mark E Stickel, Douglas E Appelt, and Paul Martin. 1993. Interpretation as abduction. Artificial Intelligence, 63(1-2):69–142.
  • Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In ICML.
  • Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2391–2401.
  • Steve D Isard. 1974. What would you have done if...? Theoretical Linguistics, 1(1-3):233–256.
  • Philip Nicholas Johnson-Laird. 2006. How We Reason. Oxford University Press, USA.
  • Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
  • Rik Koncel-Kedziorski, Ioannis Konstas, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2016. A theme-rewriting approach for generating algebra word problems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1617–1628.
  • Sarit Kraus, Daniel Lehmann, and Menachem Magidor. 1990. Nonmonotonic reasoning, preferential models and cumulative logics. Artificial Intelligence, 44(1-2):167–207.
  • Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755.
  • Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2019. Multiple-attribute text rewriting. In ICLR.
  • Carolin Lawrence and Stefan Riezler. 2018. Improving a neural semantic parser by counterfactual learning from human bandit feedback. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1820–1830.
  • Carolin Lawrence, Artem Sokolov, and Stefan Riezler. 2017. Counterfactual learning from bandit feedback under deterministic logging: A case study in statistical machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2566–2576.
  • Zhongyang Li, Xiao Ding, and Ting Liu. 2018. Constructing narrative event evolutionary graph for script event prediction. In IJCAI.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
  • Charles Sanders Peirce. 1960. Collected Papers of Charles Sanders Peirce, volume 2. Harvard University Press.
  • Karl Pichotta and Raymond Mooney. 2014. Statistical script learning with multi-argument events. In EACL, pages 220–229.
  • Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, and Yejin Choi. 2019a. Counterfactual story reasoning and generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5046–5056.
  • Lianhui Qin, Michel Galley, Chris Brockett, Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin Choi, and Jianfeng Gao. 2019b. Conversing by reading: Contentful neural conversation with on-demand machine reading. In ACL, pages 5427–5436.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019a. Language models are unsupervised multitask learners.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019b. Language models are unsupervised multitask learners. OpenAI Blog, 1:8.
  • Hannah Rashkin, Antoine Bosselut, Maarten Sap, Kevin Knight, and Yejin Choi. 2018. Modeling naive psychology of characters in simple commonsense stories. arXiv preprint arXiv:1805.06533.
  • Hugo Mercier and Dan Sperber. 2017. The Enigma of Reason. Harvard University Press.
  • Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6834–6842.
  • Raymond Reiter. 1988. Nonmonotonic reasoning. In Exploring Artificial Intelligence, pages 439–481. Elsevier.
  • Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
  • Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016a. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696.
  • Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. 2016b. A corpus and cloze evaluation for deeper understanding of commonsense stories. In HLT-NAACL.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL, pages 311–318.
  • Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books.
  • Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. 2019. ATOMIC: An atlas of machine commonsense for if-then reasoning. In AAAI.
  • Roger C Schank and Robert P Abelson. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Lawrence Erlbaum.
  • Raphael Schumann, Lili Mou, Yao Lu, Olga Vechtomova, and Katja Markert. 2020. Discrete optimization for unsupervised sentence summarization with word-level extraction. arXiv preprint arXiv:2005.01791.
  • Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6830–6841.
  • Youngseo Son, Anneke Buffone, Joe Raso, Allegra Larche, Anthony Janocko, Kevin Zembroski, H Andrew Schwartz, and Lyle Ungar. 2017. Recognizing counterfactual thinking in social media texts. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 654–658.
  • William Starr. 2019. Counterfactuals. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, fall 2019 edition. Metaphysics Research Lab, Stanford University.
  • Qing Sun, Stefan Lee, and Dhruv Batra. 2017. Bidirectional beam search: Forward-backward inference in neural sequence models for fill-in-the-blank image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6961–6969.
  • Niket Tandon, Bhavana Dalvi Mishra, Keisuke Sakaguchi, Antoine Bosselut, and Peter Clark. 2019. WIQA: A dataset for "what if..." reasoning over procedural text. In EMNLP.
  • Peter West, Ari Holtzman, Jan Buys, and Yejin Choi. 2019. BottleSum: Unsupervised and self-supervised sentence summarization using the information bottleneck principle. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3743–3752.
  • Yoel Zeldes, Dan Padnos, and Barak Peleg. 2020. HAIM-1.5 - the next generation.
  • Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In Advances in Neural Information Processing Systems, pages 9051–9062.
  • Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. CoRR, abs/1904.09675.
  • Wanrong Zhu, Zhiting Hu, and Eric Xing. 2019. Text infilling. arXiv preprint arXiv:1901.00158.
Authors
Vered Shwartz
Peter West
Jena D. Hwang
Antoine Bosselut