Learning to Rationalize for Nonmonotonic Reasoning with Distant Supervision
Abstract:
The black-box nature of neural models has motivated a line of research that aims to generate natural language rationales to explain why a model made certain predictions. Such rationale generation models, to date, have been trained on dataset-specific crowdsourced rationales, but this approach is costly and is not generalizable to new tasks and domains. […]
Introduction
- Deep neural models perform increasingly well across NLP tasks, but due to their black-box nature, their success comes at the cost of our understanding of the system.
- Recognizing Textual Entailment (RTE; Dagan et al. 2013), or, in its newer variant, Natural Language Inference (NLI; Bowman et al. 2015), is defined as a 3-way classification task.
- The authors focus on the Stanford Natural Language Inference dataset (SNLI; Bowman et al. 2015), in which image captions serve as premises and hypotheses were crowdsourced.
Highlights
- Deep neural models perform increasingly well across NLP tasks, but due to their black-box nature, their success comes at the cost of our understanding of the system.
- We focus on the Defeasible Inference task (δ-NLI; Rudinger et al. 2020), illustrated in Figure 1.
- We focus on the Stanford Natural Language Inference dataset (SNLI; Bowman et al. 2015), in which image captions serve as premises and hypotheses were crowdsourced.
- The results of the automatic metrics are reported in Table 3.
- We study the quality of rationales in the e-δ-NLI dataset through human evaluation.
- As we show in Section 6.2, most generated rationales follow the format of e-SNLI rationales, which might explain the discrepancy between the quality of the generated rationales and that of the training data.
Results
- For each rationale generation training setup, the authors generated a rationale for each instance in the test set using beam search with 5 beams (see the first sketch after this list).
- The authors evaluated the generated rationales both with automatic metrics and with human evaluation.
- The authors used standard n-gram overlap metrics: the precision-oriented BLEU score (Papineni et al. 2002) and the recall-oriented ROUGE score (Lin 2004).
- The authors used BLEU-4, which measures overlap of n-grams up to n = 4, and ROUGE-L, which measures the longest matching sequence, and compared the predictions against multiple distantly supervised rationales as references (see the second sketch after this list).
- The results of the automatic metrics are reported in Table 3.
- The authors observe an additive gain from the multi-task setup on both BLEU and ROUGE scores.
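A minimal sketch of decoding with beam search (5 beams), assuming a Hugging Face seq2seq checkpoint; the base facebook/bart-large model stands in for the authors' fine-tuned rationale generator, and the bracketed field markers and their order are illustrative assumptions loosely following Table 2.

```python
# Sketch only: the checkpoint, special markers, and input order are assumptions,
# not the authors' released model or exact format.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

# Hypothetical input with bracketed markers for each span, in the spirit of Table 2.
text = ("[premise] Four individuals are sitting on a small dock by the water. "
        "[hypothesis] Four people sitting near the ocean. "
        "[update] They're in Egypt. [rationale]")
inputs = tokenizer(text, return_tensors="pt")

# Beam search with 5 beams, as described above.
output_ids = model.generate(**inputs, num_beams=5, max_length=64, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```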
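A minimal sketch of scoring a generated rationale against multiple distantly supervised references, assuming NLTK's sentence-level BLEU-4 and a simple LCS-based ROUGE-L taken as the maximum over references; this is not the authors' evaluation script.

```python
# Sketch only: the reference handling (max over references for ROUGE-L) is an assumption.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 from the longest common subsequence of two token lists."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if candidate[i] == reference[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)

def score_rationale(prediction, references):
    """BLEU-4 against all references at once; ROUGE-L as the best single reference."""
    pred = prediction.split()
    refs = [r.split() for r in references]
    bleu = sentence_bleu(refs, pred, weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)
    rouge = max(rouge_l_f1(pred, r) for r in refs)
    return bleu, rouge

print(score_rationale("a person in egypt is not near the ocean",
                      ["people in egypt are far from the ocean",
                       "egypt is not near the ocean"]))
```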
Conclusion
- The authors presented an approach for generating rationales for the defeasible inference task, i.e., explaining why a given update either strengthened or weakened the hypothesis.
- The authors experimented with various training setups categorized into post-hoc rationalization and joint prediction and rationalization.
- The results indicated that the post-hoc rationalization setup is easier than the joint setup, with many of the post-hoc generated rationales judged by humans to be explanatory.
- The authors hope that future work will focus on jointly predicting a label and generating a rationale, a more realistic setup that may yield less trivial and more faithful rationales.
Tables
- Table 1: Examples of rationales generated from each of the sources. W stands for a weakener update and S for a strengthener.
- Table 2: The different training setups we experiment with. We add special tokens to mark the boundaries of each input and output span, e.g., [premise] marks the beginning of the premise.
- Table 3: Automatic and human evaluation of rationale generation for the test set. Human evaluation results are presented for strengtheners and weakeners separately (S/W).
- Table 4: Human evaluation of the distant supervision rationales in the test set. Results (percentages) are presented for strengtheners and weakeners separately (S/W).
- Table 5: Patterns of rationales generated by Rationale BART-L that were considered explanatory. H, S, and W stand for Hypothesis, Strengthener, and Weakener.
- Table 6: Examples of the common error types.
- Table 7: Percentage of rationales with each error type.
- Table 8: Human evaluation of the ablation studies. Results (percentages) are presented for strengtheners and weakeners (S/W).
Findings
- If we map entailment to strengthener and contradiction to weakener, we get 64% accuracy on update type prediction.
- The best performance is achieved by Rationale BART-L, for which 80% of the rationales were considered relevant, over 55% correct, and between 33% (weakeners) and 47% (strengtheners) explanatory.
Study subjects and analysis
most similar pairs: 3
In addition, we generate the relationship between pairs of spans. We take the top 3 most similar pairs of su (subset of SG originated from U ) and sh (subset of SG originated from H), judged by the cosine similarity between their word2vec embeddings (Mikolov et al 2013).3. We prompt the LM with “P+ U + H
workers: 3
To ensure the quality of annotations, we required that the workers be located in the US, UK, or Canada, and have a 99% approval rate for at least 5,000 prior tasks. We aggregated annotations from 3 workers using majority vote. The annotations yielded fair levels of agreement, with Fleiss’ Kappa (Landis and Koch 1977) between κ = 0.22 for relevance and κ = 0.37 for being explanatory
individuals: 4
We manually analyzed the rationales generated by the best model (Rationale Bart-L) that were considered grammatical, relevant, and correct by humans. P: Four individuals are sitting on a small dock by the water as a boat sails by. (1) H: Four people sitting near the ocean. W: They’re in Egypt
people: 4
W: They’re in Egypt. R: Before, four people needed to go to the beach. P: Two men in orange uniforms stand before a train and do some work
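A minimal sketch of selecting the top 3 most similar (update span, hypothesis span) pairs by cosine similarity; the embed function and the toy random vectors stand in for word2vec embeddings and are assumptions rather than the authors' code.

```python
# Sketch only: `embed` is a stand-in for averaged word2vec vectors (Mikolov et al. 2013).
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def top_k_pairs(update_spans, hypothesis_spans, embed, k=3):
    """Return the k most similar (update span, hypothesis span) pairs by cosine similarity."""
    scored = [(cosine(embed(su), embed(sh)), su, sh)
              for su in update_spans
              for sh in hypothesis_spans]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(su, sh) for _, su, sh in scored[:k]]

# Toy usage with random vectors in place of real embeddings.
rng = np.random.default_rng(0)
_vectors = {}
def embed(span):
    if span not in _vectors:
        _vectors[span] = rng.standard_normal(300)
    return _vectors[span]

print(top_k_pairs(["in Egypt", "on a small dock"],
                  ["near the ocean", "four people"], embed))
```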
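A minimal sketch of the majority-vote aggregation and Fleiss' kappa agreement described above, reimplemented from the standard definitions over toy labels; it is not the authors' analysis code.

```python
# Sketch only: toy annotations; each item is labeled by 3 workers.
from collections import Counter
import numpy as np

def majority_vote(labels):
    """Most common label among a single item's worker annotations."""
    return Counter(labels).most_common(1)[0][0]

def fleiss_kappa(annotations, categories):
    """Fleiss' kappa for items annotated by the same number of raters."""
    n_raters = len(annotations[0])
    counts = np.array([[labels.count(c) for c in categories] for labels in annotations])
    p_j = counts.sum(axis=0) / counts.sum()              # per-category proportions
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_i.mean(), np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

items = [["yes", "yes", "no"], ["yes", "yes", "yes"],
         ["no", "no", "yes"], ["no", "yes", "no"]]
print([majority_vote(labels) for labels in items])
print(round(fleiss_kappa(items, ["yes", "no"]), 3))
```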
Reference
- Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR.
- Bastings, J.; Aziz, W.; and Titov, I. 2019. Interpretable Neural Predictions with Differentiable Binary Variables. In ACL.
- Bhagavatula, C.; Le Bras, R.; Malaviya, C.; Sakaguchi, K.; Holtzman, A.; Rashkin, H.; Downey, D.; Yih, W.-t.; and Choi, Y. 2019. Abductive Commonsense Reasoning. In ICLR.
- Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; and Choi, Y. 2019. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. In ACL.
- Bowman, S.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
- Camburu, O.-M.; Rocktäschel, T.; Lukasiewicz, T.; and Blunsom, P. 2018. e-SNLI: Natural Language Inference with Natural Language Explanations. In NeurIPS.
- Dagan, I.; Roth, D.; Sammons, M.; and Zanzotto, F. M. 2013. Recognizing Textual Entailment: Models and Applications. Morgan & Claypool Publishers.
- Davison, J.; Feldman, J.; and Rush, A. 2019. Commonsense Knowledge Mining from Pretrained Models. In EMNLP-IJCNLP.
- Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.
- Dodge, J.; Liao, Q. V.; Zhang, Y.; Bellamy, R. K. E.; and Dugan, C. 2019. Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment. In IUI.
- Gazzaniga, M. S.; and LeDoux, J. E. 2013. The integrated mind. Springer Science & Business Media.
- Guan, J.; Huang, F.; Zhao, Z.; Zhu, X.; and Huang, M. 2020. A knowledge-enhanced pretraining model for commonsense story generation. TACL.
- Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S.; and Smith, N. A. 2018. Annotation Artifacts in Natural Language Inference Data. In NAACL.
- Hase, P.; and Bansal, M. 2020. Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? In ACL.
- Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; and Choi, Y. 2020. The Curious Case of Neural Text Degeneration. In ICLR.
- Honnibal, M.; and Montani, I. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
- Jacovi, A.; and Goldberg, Y. 2020. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In ACL.
- Jain, S.; and Wallace, B. C. 2019. Attention is not Explanation. In NAACL.
- Jain, S.; Wiegreffe, S.; Pinter, Y.; and Wallace, B. C. 2020. Learning to Faithfully Rationalize by Construction. In ACL.
- Kumar, S.; and Talukdar, P. P. 2020. NILE: Natural Language Inference with Faithful Natural Language Explanations. In ACL.
- Landis, J. R.; and Koch, G. G. 1977. The measurement of observer agreement for categorical data. Biometrics.
- Latcinnik, V.; and Berant, J. 2020. Explaining question answering models through text generation. arXiv.
- Lei, T.; Barzilay, R.; and Jaakkola, T. 2016. Rationalizing Neural Predictions. In EMNLP.
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL.
- Lin, C.-Y. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv.
- Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv.
- Narang, S.; Raffel, C.; Lee, K.; Roberts, A.; Fiedel, N.; and Malkan, K. 2020. WT5?! Training Text-to-Text Models to Explain their Predictions. arXiv.
- Nie, Y.; Williams, A.; Dinan, E.; Bansal, M.; Weston, J.; and Kiela, D. 2020. Adversarial NLI: A New Benchmark for Natural Language Understanding. In ACL.
- Novikova, J.; Dusek, O.; Curry, A. C.; and Rieser, V. 2017. Why We Need New Evaluation Metrics for NLG. In EMNLP.
- Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
- Petroni, F.; Rocktäschel, T.; Riedel, S.; Lewis, P.; Bakhtin, A.; Wu, Y.; and Miller, A. 2019. Language Models as Knowledge Bases? In EMNLP-IJCNLP.
- Poliak, A.; Naradowsky, J.; Haldar, A.; Rudinger, R.; and Van Durme, B. 2018. Hypothesis Only Baselines in Natural Language Inference. In *SEM.
- Qin, L.; Bosselut, A.; Holtzman, A.; Bhagavatula, C.; Clark, E.; and Choi, Y. 2019. Counterfactual Story Reasoning and Generation. In EMNLP-IJCNLP.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog.
- Raffel, C.; Luong, M.; Liu, P. J.; Weiss, R. J.; and Eck, D. 2017. Online and Linear-Time Attention by Enforcing Monotonic Alignments. In ICML.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR.
- Rajani, N. F.; McCann, B.; Xiong, C.; and Socher, R. 2019. Explain Yourself! Leveraging Language Models for Commonsense Reasoning. In ACL.
- Reiter, R. 1980. A logic for default reasoning. Artificial intelligence.
- Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In ACM SIGKDD.
- Rudinger, R.; Shwartz, V.; Hwang, J. D.; Bhagavatula, C.; Forbes, M.; Le Bras, R.; Smith, N. A.; and Choi, Y. 2020. Thinking Like a Skeptic: Defeasible Inference in Natural Language. In Findings of ACL: EMNLP.
- Sap, M.; Le Bras, R.; Allaway, E.; Bhagavatula, C.; Lourie, N.; Rashkin, H.; Roof, B.; Smith, N. A.; and Choi, Y. 2019. Atomic: An atlas of machine commonsense for if-then reasoning. In AAAI.
- Serrano, S.; and Smith, N. A. 2019. Is Attention Interpretable? In ACL.
- Shwartz, V.; and Dagan, I. 2018. Paraphrase to Explicate: Revealing Implicit Noun-Compound Relations. In ACL.
- Shwartz, V.; West, P.; Le Bras, R.; Bhagavatula, C.; and Choi, Y. 2020. Unsupervised Commonsense Question Answering with Self-Talk. In EMNLP.
- Speer, R.; Chin, J.; and Havasi, C. 2017. ConceptNet 5.5: an open multilingual graph of general knowledge. In AAAI.
- Talmor, A.; Herzig, J.; Lourie, N.; and Berant, J. 2019. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In NAACL.
- Tamborrino, A.; Pellicano, N.; Pannier, B.; Voitot, P.; and Naudin, L. 2020. Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning. In ACL.
- Tenney, I.; Wexler, J.; Bastings, J.; Bolukbasi, T.; Coenen, A.; Gehrmann, S.; Jiang, E.; Pushkarna, M.; Radebaugh, C.; Reif, E.; and Yuan, A. 2020. The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models. In EMNLP.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In NeurIPS.
- Wang, C.; Liang, S.; Zhang, Y.; Li, X.; and Gao, T. 2019. Does it Make Sense? And Why? A Pilot Study for Sense Making and Explanation. In ACL.
- Wiegreffe, S.; and Pinter, Y. 2019. Attention is not not Explanation. In EMNLP-IJCNLP.
- Wiegreffe, S.; Marasovic, A.; and Smith, N. A. 2020. Measuring Association Between Labels and Free-Text Rationales. arXiv.
- Williams, A.; Nangia, N.; and Bowman, S. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In NAACL-HLT.
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; and Brew, J. 2019. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv.
- Zellers, R.; Holtzman, A.; Rashkin, H.; Bisk, Y.; Farhadi, A.; Roesner, F.; and Choi, Y. 2019. Defending Against Neural Fake News. In NeurIPS.