Learning to Rationalize for Nonmonotonic Reasoning with Distant Supervision

Faeze Brahman
Vered Shwartz
Rachel Rudinger

Abstract:

The black-box nature of neural models has motivated a line of research that aims to generate natural language rationales to explain why a model made certain predictions. Such rationale generation models, to date, have been trained on dataset-specific crowdsourced rationales, but this approach is costly and is not generalizable to new tasks and domains. […]

Introduction
  • Deep neural models perform increasingly well across NLP tasks, but due to their black-box nature, their success comes at the cost of our understanding of the system.
  • Recognizing Textual Entailment (RTE; Dagan et al 2013), or, in its newer variant, Natural Language Inference (NLI; Bowman et al 2015), is defined as a 3-way classification task.
  • The authors focus on the Stanford Natural Language Inference dataset (SNLI; Bowman et al 2015), in which image captions serve as premises and hypotheses were crowdsourced.
Highlights
  • Deep neural models perform increasingly well across NLP tasks, but due to their black-box nature, their success comes at the cost of our understanding of the system.
  • We focus on the Defeasible Inference task (δ-NLI; Rudinger et al 2020), illustrated in Figure 1.
  • We focus on the Stanford Natural Language Inference dataset (SNLI; Bowman et al 2015), in which image captions serve as premises and hypotheses were crowdsourced.
  • The results of the automatic metrics are reported in Table 3.
  • We study the quality of rationales in the e-δ-NLI dataset through human evaluation.
  • As we show in Section 6.2, most generated rationales are in the format of e-SNLI rationales, which might explain the discrepancy between the quality of the generated rationales and that of the training data.
Results
  • For each combination of rationale generation training setup, the authors generated a rationale for each instance in the test set using beam search with 5 beams.
  • The authors evaluated the generated rationales using both automatic metrics and human evaluation.
  • The authors used standard n-gram overlap metrics: the precision-oriented BLEU score (Papineni et al 2002) and the recall-oriented ROUGE score (Lin 2004).
  • The authors used BLEU-4, which measures overlap of n-grams up to n = 4, and ROUGE-L, which measures the longest matching sequence, and compared multiple predictions against multiple distantly supervised rationales as references.
  • The results of the automatic metrics are reported in Table 3.
  • The authors observe an additive gain from the multi-task setup on both BLEU and ROUGE scores (an illustrative evaluation sketch follows this list).
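
The sketch below shows one plausible way to wire up the decoding and scoring pipeline described above, using the transformers, nltk, and rouge-score libraries. It is our illustration under assumptions (the checkpoint name, output length, and smoothing choice are not given in this summary), not the authors' released code.

```python
# Illustrative sketch (not the authors' code): 5-beam decoding with BART and
# multi-reference BLEU-4 / ROUGE-L scoring. Checkpoint and lengths assumed.
from transformers import BartForConditionalGeneration, BartTokenizer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def generate_rationale(input_text: str) -> str:
    """Decode one rationale with beam search using 5 beams."""
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True)
    output_ids = model.generate(
        **inputs, num_beams=5, max_length=64, early_stopping=True
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def score(prediction: str, references: list[str]) -> dict:
    """BLEU-4 over all references; ROUGE-L as the best match among them."""
    bleu4 = sentence_bleu(
        [ref.split() for ref in references],
        prediction.split(),
        weights=(0.25, 0.25, 0.25, 0.25),  # n-grams up to n = 4
        smoothing_function=SmoothingFunction().method1,
    )
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = max(
        scorer.score(ref, prediction)["rougeL"].fmeasure for ref in references
    )
    return {"bleu4": bleu4, "rougeL": rouge_l}
```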
Conclusion
  • The authors presented an approach for generating rationales for the defeasible inference task, i.e., explaining why a given update either strengthened or weakened the hypothesis.
  • The authors experimented with various training setups categorized into post-hoc rationalization and joint prediction and rationalization.
  • The results indicated that the post-hoc rationalization setup is easier than the joint setup, with many of the post-hoc generated rationales considered explanatory by humans.
  • The authors hope that future work will focus on jointly predicting a label and generating a rationale, which is a more realistic setup and may yield less trivial and more faithful rationales.
Tables
  • Table 1: Examples of rationales generated from each of the sources. W stands for a weakener update and S for a strengthener.
  • Table 2: The different training setups we experiment with. We add special tokens to mark the boundaries of each input and output span, e.g. [premise] marks the beginning of the premise (see the input-formatting sketch after this list).
  • Table 3: Automatic and human evaluation of rationale generation for the test set. Human evaluation results are presented for strengtheners and weakeners separately (S/W).
  • Table 4: Human evaluation for the distant supervision rationales in the test set. Results (percentages) are presented for strengtheners and weakeners separately (S/W).
  • Table 5: Patterns of rationales generated by Rationale BART-L that were considered explanatory. H, S, and W stand for Hypothesis, Strengthener, and Weakener.
  • Table 6: Examples of the common error types.
  • Table 7: Percentage of rationales with each error type.
  • Table 8: Human evaluation for the ablation studies. Results (percentages) are presented for strengtheners and weakeners (S/W).
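
As a rough sketch of the Table 2 input format: the caption confirms only the [premise] token, so the remaining token names below are our assumptions for illustration.

```python
# Sketch of boundary-token input formatting (Table 2). Only [premise] is
# confirmed by the caption; [hypothesis] and [update] are assumed names.
from transformers import BartTokenizer

SPECIAL_TOKENS = ["[premise]", "[hypothesis]", "[update]"]

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
tokenizer.add_special_tokens({"additional_special_tokens": SPECIAL_TOKENS})
# A model fine-tuned on this input would also need:
# model.resize_token_embeddings(len(tokenizer))

def format_input(premise: str, hypothesis: str, update: str) -> str:
    """Mark the beginning of each span with its special token."""
    return f"[premise] {premise} [hypothesis] {hypothesis} [update] {update}"
```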
Key findings
  • If we map entailment to strengthener and contradiction to weakener, we get 64% accuracy on update type prediction.
  • The best performance is achieved by Rationale BART-L, in which 80% of the rationales were considered relevant, over 55% correct, and between 33% (weakeners) and 47% (strengtheners) explanatory.
Study subjects and analysis
most similar pairs: 3
In addition, we generate the relationship between pairs of spans. We take the top 3 most similar pairs of s_u (a subset of SG originating from U) and s_h (a subset of SG originating from H), judged by the cosine similarity between their word2vec embeddings (Mikolov et al 2013). We prompt the LM with “P + U + H […]”.
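
A minimal sketch of this pairing step, assuming gensim's pretrained word2vec vectors and spans embedded by averaging their word vectors; the function names and span lists are ours, for illustration only.

```python
# Sketch of the top-3 span pairing by word2vec cosine similarity.
# Span embedding = mean of word vectors; gensim vectors assumed.
import itertools
import numpy as np
import gensim.downloader

w2v = gensim.downloader.load("word2vec-google-news-300")

def embed(span: str) -> np.ndarray:
    """Average the word2vec vectors of in-vocabulary tokens."""
    vecs = [w2v[tok] for tok in span.lower().split() if tok in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def top_pairs(spans_u, spans_h, k=3):
    """Return the k most similar (s_u, s_h) span pairs."""
    scored = [
        (cosine(embed(su), embed(sh)), su, sh)
        for su, sh in itertools.product(spans_u, spans_h)
    ]
    return sorted(scored, key=lambda t: t[0], reverse=True)[:k]
```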

workers: 3
To ensure the quality of annotations, we required that the workers be located in the US, UK, or Canada, and have a 99% approval rate for at least 5,000 prior tasks. We aggregated annotations from 3 workers using majority vote. The annotations yielded fair levels of agreement, with Fleiss’ Kappa (Landis and Koch 1977) ranging from κ = 0.22 for relevance to κ = 0.37 for being explanatory.
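
For illustration, the majority vote and Fleiss' kappa described above can be computed as follows; this is a sketch with a made-up toy annotation matrix, using statsmodels' fleiss_kappa.

```python
# Sketch: aggregate 3 workers' binary judgments by majority vote and
# measure agreement with Fleiss' kappa. The annotation matrix is toy data.
from collections import Counter
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = annotated items, columns = the 3 workers (1 = "explanatory")
annotations = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
])

majority = [Counter(row).most_common(1)[0][0] for row in annotations]

table, _ = aggregate_raters(annotations)  # item-by-category count table
kappa = fleiss_kappa(table)
print(majority, round(kappa, 2))
```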

individuals: 4
We manually analyzed the rationales generated by the best model (Rationale BART-L) that were considered grammatical, relevant, and correct by humans. P: Four individuals are sitting on a small dock by the water as a boat sails by. H: Four people sitting near the ocean. W: They’re in Egypt.

people: 4
W: They’re in Egypt. R: Before, four people needed to go to the beach. P: Two men in orange uniforms stand before a train and do some work.

Reference
  • Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR.
  • Bastings, J.; Aziz, W.; and Titov, I. 2019. Interpretable Neural Predictions with Differentiable Binary Variables. In ACL.
  • Bhagavatula, C.; Le Bras, R.; Malaviya, C.; Sakaguchi, K.; Holtzman, A.; Rashkin, H.; Downey, D.; Yih, W.-t.; and Choi, Y. 2019. Abductive Commonsense Reasoning. In ICLR.
  • Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; and Choi, Y. 2019. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. In ACL.
  • Bowman, S.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
  • Camburu, O.-M.; Rocktäschel, T.; Lukasiewicz, T.; and Blunsom, P. 2018. e-SNLI: Natural Language Inference with Natural Language Explanations. In NeurIPS.
  • Dagan, I.; Roth, D.; Sammons, M.; and Zanzotto, F. M. 2013. Recognizing Textual Entailment: Models and Applications. Morgan & Claypool Publishers.
  • Davison, J.; Feldman, J.; and Rush, A. 2019. Commonsense Knowledge Mining from Pretrained Models. In EMNLP-IJCNLP.
  • Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.
  • Dodge, J.; Liao, Q. V.; Zhang, Y.; Bellamy, R. K. E.; and Dugan, C. 2019. Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment. In IUI.
  • Gazzaniga, M. S.; and LeDoux, J. E. 2013. The integrated mind. Springer Science & Business Media.
  • Guan, J.; Huang, F.; Zhao, Z.; Zhu, X.; and Huang, M. 2020. A knowledge-enhanced pretraining model for commonsense story generation. TACL.
  • Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S.; and Smith, N. A. 2018. Annotation Artifacts in Natural Language Inference Data. In NAACL.
  • Hase, P.; and Bansal, M. 2020. Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? In ACL.
  • Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; and Choi, Y. 2020. The Curious Case of Neural Text Degeneration. In ICLR.
  • Honnibal, M.; and Montani, I. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
  • Jacovi, A.; and Goldberg, Y. 2020. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In ACL.
  • Jain, S.; and Wallace, B. C. 2019. Attention is not Explanation. In NAACL.
  • Jain, S.; Wiegreffe, S.; Pinter, Y.; and Wallace, B. C. 2020. Learning to Faithfully Rationalize by Construction. In ACL.
  • Kumar, S.; and Talukdar, P. P. 2020. NILE: Natural Language Inference with Faithful Natural Language Explanations. In ACL.
  • Landis, J. R.; and Koch, G. G. 1977. The measurement of observer agreement for categorical data. Biometrics.
  • Latcinnik, V.; and Berant, J. 2020. Explaining question answering models through text generation. arXiv.
  • Lei, T.; Barzilay, R.; and Jaakkola, T. 2016. Rationalizing Neural Predictions. In EMNLP.
  • Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL.
  • Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
  • Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv.
  • Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv.
  • Narang, S.; Raffel, C.; Lee, K.; Roberts, A.; Fiedel, N.; and Malkan, K. 2020. WT5?! Training Text-to-Text Models to Explain their Predictions. arXiv.
  • Nie, Y.; Williams, A.; Dinan, E.; Bansal, M.; Weston, J.; and Kiela, D. 2020. Adversarial NLI: A New Benchmark for Natural Language Understanding. In ACL.
  • Novikova, J.; Dusek, O.; Curry, A. C.; and Rieser, V. 2017. Why We Need New Evaluation Metrics for NLG. In EMNLP.
  • Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
  • Petroni, F.; Rocktäschel, T.; Riedel, S.; Lewis, P.; Bakhtin, A.; Wu, Y.; and Miller, A. 2019. Language Models as Knowledge Bases? In EMNLP-IJCNLP.
  • Poliak, A.; Naradowsky, J.; Haldar, A.; Rudinger, R.; and Van Durme, B. 2018. Hypothesis Only Baselines in Natural Language Inference. In *SEM.
  • Qin, L.; Bosselut, A.; Holtzman, A.; Bhagavatula, C.; Clark, E.; and Choi, Y. 2019. Counterfactual Story Reasoning and Generation. In EMNLP-IJCNLP.
  • Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog.
  • Raffel, C.; Luong, M.; Liu, P. J.; Weiss, R. J.; and Eck, D. 2017. Online and Linear-Time Attention by Enforcing Monotonic Alignments. In ICML.
  • Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR.
  • Rajani, N. F.; McCann, B.; Xiong, C.; and Socher, R. 2019. Explain Yourself! Leveraging Language Models for Commonsense Reasoning. In ACL.
  • Reiter, R. 1980. A logic for default reasoning. Artificial Intelligence.
  • Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. “Why should I trust you?” Explaining the predictions of any classifier. In ACM SIGKDD.
  • Rudinger, R.; Shwartz, V.; Hwang, J. D.; Bhagavatula, C.; Forbes, M.; Le Bras, R.; Smith, N. A.; and Choi, Y. 2020. Thinking Like a Skeptic: Defeasible Inference in Natural Language. In Findings of ACL: EMNLP.
  • Sap, M.; Le Bras, R.; Allaway, E.; Bhagavatula, C.; Lourie, N.; Rashkin, H.; Roof, B.; Smith, N. A.; and Choi, Y. 2019. ATOMIC: An atlas of machine commonsense for if-then reasoning. In AAAI.
  • Serrano, S.; and Smith, N. A. 2019. Is Attention Interpretable? In ACL.
  • Shwartz, V.; and Dagan, I. 2018. Paraphrase to Explicate: Revealing Implicit Noun-Compound Relations. In ACL.
  • Shwartz, V.; West, P.; Le Bras, R.; Bhagavatula, C.; and Choi, Y. 2020. Unsupervised Commonsense Question Answering with Self-Talk. In EMNLP.
  • Speer, R.; Chin, J.; and Havasi, C. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI.
  • Talmor, A.; Herzig, J.; Lourie, N.; and Berant, J. 2019. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In NAACL.
  • Tamborrino, A.; Pellicano, N.; Pannier, B.; Voitot, P.; and Naudin, L. 2020. Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning. In ACL.
  • Tenney, I.; Wexler, J.; Bastings, J.; Bolukbasi, T.; Coenen, A.; Gehrmann, S.; Jiang, E.; Pushkarna, M.; Radebaugh, C.; Reif, E.; and Yuan, A. 2020. The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models. In EMNLP.
  • Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In NeurIPS.
  • Wang, C.; Liang, S.; Zhang, Y.; Li, X.; and Gao, T. 2019. Does it Make Sense? And Why? A Pilot Study for Sense Making and Explanation. In ACL.
  • Wiegreffe, S.; and Pinter, Y. 2019. Attention is not not Explanation. In EMNLP-IJCNLP.
  • Wiegreffe, S.; Marasović, A.; and Smith, N. A. 2020. Measuring Association Between Labels and Free-Text Rationales. arXiv.
  • Williams, A.; Nangia, N.; and Bowman, S. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In NAACL-HLT.
  • Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; and Brew, J. 2019. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv.
  • Zellers, R.; Holtzman, A.; Rashkin, H.; Bisk, Y.; Farhadi, A.; Roesner, F.; and Choi, Y. 2019. Defending Against Neural Fake News. In NeurIPS.