Uncertain Natural Language Inference

arXiv, 2020.


Abstract:

We propose a refinement of Natural Language Inference (NLI), called Uncertain Natural Language Inference (UNLI), that shifts away from categorical labels, targeting instead the direct prediction of subjective probability assessments. Chiefly, we demonstrate the feasibility of collecting annotations for UNLI by relabeling a portion of th…

Introduction
Highlights
  • Variants of entailment tasks have been used for decades in benchmarking systems for natural language understanding
  • Recognizing Textual Entailment (RTE) or Natural Language Inference (NLI) is traditionally a categorical classification problem: predict which of a set of discrete labels applies to an inference pair, consisting of a premise (p) and a hypothesis (h)
  • We propose Uncertain Natural Language Inference (UNLI), a refinement of Natural Language Inference that captures more subtle distinctions in meaning by shifting away from categorical labels to the direct prediction of human subjective probability assessments
  • We proposed Uncertain Natural Language Inference (UNLI), a new task of directly predicting human likelihood judgments on Natural Language Inference premise-hypothesis pairs
  • We demonstrated that (1) eliciting supporting data is feasible, and (2) annotations in the data can be used for improving a scalar regression model beyond the information contained in existing categorical labels, using recent contextualized word embeddings, e.g. BERT
  • Humans are able to make finer distinctions between meanings than is captured by current annotation approaches; we advocate that the community strive for systems that can do the same, shifting away from categorical Natural Language Inference labels toward something more fine-grained, such as our Uncertain Natural Language Inference protocol
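As a rough illustration of the scalar-regression setup described above, the sketch below replaces the actual BERT [CLS] embedding with a stand-in random vector; the head weights, gold value, and loss are hypothetical, untrained placeholders, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in for the encoder's [CLS] vector (768-dim, as in BERT-base);
# a real system would obtain this by encoding the premise-hypothesis pair.
cls_vec = rng.normal(size=768)

# Hypothetical regression head: one linear layer squashed to [0, 1],
# trained against the human subjective-probability judgment.
w = rng.normal(size=768) * 0.01
b = 0.0
pred = sigmoid(cls_vec @ w + b)   # model's probability estimate for the pair

gold = 0.8                        # a made-up human likelihood judgment
loss = (pred - gold) ** 2         # per-example squared-error regression loss
```

The key contrast with categorical NLI is the output: a single value in [0, 1] fit with a regression loss, rather than a softmax over three labels.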
Results
  • Training on just 55,517 u-SNLI examples yields a Pearson r of 62.71% on the test set.
  • The hypothesis-only baseline achieved a correlation around 40%.
  • This result corroborates the findings that a hidden bias exists in the SNLI dataset’s hypotheses, and shows that this bias may also exist in u-SNLI. Here f(p, h) = BERT([CLS]; p; [SEP]; h; [SEP])[0], i.e. each pair is scored from the [CLS] embedding of BERT.
  • Owing to the concerns raised about annotation artifacts in SNLI (Gururangan et al., 2018; Tsuchiya, 2018; Poliak et al., 2018), the authors include a hypothesis-only baseline.
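The Pearson r metric reported in these results can be computed directly; a minimal sketch, where the gold/pred arrays are made-up toy numbers, not actual u-SNLI annotations or model outputs:

```python
import numpy as np

def pearson_r(pred, gold):
    """Pearson correlation between predicted and gold probabilities."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    pc, gc = pred - pred.mean(), gold - gold.mean()
    return float((pc @ gc) / np.sqrt((pc @ pc) * (gc @ gc)))

# Toy illustration: gold human probability judgments vs. model outputs.
gold = [0.95, 0.70, 0.20, 0.01]
pred = [0.90, 0.60, 0.35, 0.05]
r = pearson_r(pred, gold)  # close to 1 when ranking and spacing agree
```

Unlike categorical accuracy, this metric rewards models whose scalar outputs track the fine-grained ordering and spacing of human judgments.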
Conclusion
  • The authors proposed Uncertain Natural Language Inference (UNLI), a new task of directly predicting human likelihood judgments on NLI premise-hypothesis pairs.
  • The authors have shown that not all NLI contradictions are created equal, nor neutrals, nor entailments.
  • Humans are able to make finer distinctions between meanings than is captured by current annotation approaches; the authors advocate that the community strive for systems that can do the same, shifting away from categorical NLI labels toward something more fine-grained, such as the UNLI protocol
Summary
  • Introduction:

    Variants of entailment tasks have been used for decades in benchmarking systems for natural language understanding.
  • The FraCaS consortium offered the task as an evaluation mechanism, along with a small challenge set (Cooper et al, 1996), which was followed by the RTE challenges (Dagan et al, 2005)
  • Despite differences between these and recent NLI datasets (Marelli et al., 2014; Lai et al., 2017; Williams et al., 2018; Khot et al., 2018, i.a.), NLI has remained a categorical prediction problem.
Tables
  • Table1: Probability assessments on NLI pairs. The NLI and UNLI columns respectively indicate the categorical label (from SNLI) and the subjective probability for the corresponding pair
  • Table2: A premise in SNLI with its 5 hypotheses (labeled as neutral in SNLI) annotated in u-SNLI
  • Table3: Selected u-SNLI dev examples where BERT predictions greatly deviate from gold assessments
  • Table4: Metrics for training on u-SNLI
  • Table5: Metrics for training only on mapped SNLI or fine-tuning on u-SNLI
  • Table6: Statistics of SNLI data re-annotated under UNLI
Related work
  • The probabilistic nature and the uncertainty of NLI have been considered from a variety of perspectives. Glickman et al. (2005) modified the task to explicitly include the probabilistic aspect of NLI, stating that “p probabilistically entails h ... if p increases the likelihood of h being true,” while Lai and Hockenmaier (2017) noted how predicting the conditional probability of one phrase given another would be helpful in predicting textual entailment. Other prior work has elicited ordinal annotations (e.g. on a Likert scale) reflecting likelihood judgments (Pavlick and Callison-Burch, 2016; Zhang et al., 2017), but then collapsed the annotations into coarse categorical labels for modeling. Vulić et al. (2017) proposed graded lexical entailment, which is similar to our idea but applied to lexical-level inference, asking “to what degree x is a type of y.” Additionally, Lalor et al. (2016, 2018) tried to capture the uncertainty of each inference pair via item response theory (IRT), showing fine-grained differences in the discriminative power of each label. Pavlick and Kwiatkowski (2019) recently argued that models should “explicitly capture the full distribution of plausible human judgments,” as plausible human judgments exhibit inherent disagreements. Our concern is different, as we are interested in the uncertain and probabilistic nature of NLI itself. We are the first to propose a method for direct elicitation of subjective probability judgments on NLI pairs and direct prediction of these scalars, as opposed to reducing the task to categorical classification.
References
  • Jean-Philippe Bernardy, Rasmus Blanck, Stergios Chatzikyriakidis, and Shalom Lappin. 2018. A compositional Bayesian semantics for natural language. In Proceedings of the First International Workshop on Language Cognition and Computational Models, pages 1–10. Association for Computational Linguistics.
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.
  • Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Johan Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, et al. 1996. Using the framework. Technical report, The FraCaS Consortium.
  • Robin Cooper, Simon Dobnik, Shalom Lappin, and Stefan Larsson. 2015. Probabilistic type theory and natural language semantics. Linguistic Issues in Language Technology, 10(1):1–43.
  • Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, First PASCAL Machine Learning Challenges Workshop.
  • Donald Davidson. 1967. Truth and meaning. Synthese, 17(1):304–323.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 4171–4186.
  • Jan van Eijck and Shalom Lappin. 2014. Probabilistic semantics for natural language. In Zoe Christoff, Paulo Galeazzi, Nina Gierasimczuk, Alexandru Marcoci, and Sonja Smets, editors, The Logic and Interactive Rationality Yearbook 2012, volume II.
  • Oren Glickman, Ido Dagan, and Moshe Koppel. 2005. A probabilistic classification approach for lexical textual entailment. In Proceedings of AAAI, pages 1050–1055. AAAI Press.
  • Noah D. Goodman and Daniel Lassiter. 2015. Probabilistic semantics and pragmatics: Uncertainty in language and thought. In Shalom Lappin and Chris Fox, editors, The Handbook of Contemporary Semantic Theory, 2nd edition.
  • Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2, pages 107–112.
  • Daniel Kahneman and Amos Tversky. 1979. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–292.
  • Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.
  • Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Alice Lai, Yonatan Bisk, and Julia Hockenmaier. 2017. Natural language inference from multiple premises. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, Volume 1, pages 100–109.
  • Alice Lai and Julia Hockenmaier. 2017. Learning to predict denotational probabilities for modeling entailment. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Volume 1, pages 721–730.
  • John P. Lalor, Hao Wu, Tsendsuren Munkhdalai, and Hong Yu. 2018. Understanding deep learning performance through an examination of test set difficulty: A psychometric case study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4711–4716.
  • John P. Lalor, Hao Wu, and Hong Yu. 2016. Building an evaluation scale using item response theory. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 648–657.
  • Kenton Lee, Yoav Artzi, Yejin Choi, and Luke Zettlemoyer. 2015. Event detection and factuality assessment with non-expert supervision. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1643–1648.
  • Zhongyang Li, Tongfei Chen, and Benjamin Van Durme. 2019. Learning to rank for plausible plausibility. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 4818–4823.
  • Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. The SICK (Sentences Involving Compositional Knowledge) dataset for relatedness and entailment.
  • Richard Montague. 1973. The proper treatment of quantification in ordinary English. In K. J. J. Hintikka, J. M. E. Moravcsik, and P. Suppes, editors, Approaches to Natural Language: Proceedings of the 1970 Stanford Workshop on Grammar and Semantics, pages 221–242. Springer Netherlands, Dordrecht.
  • Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849.
  • Ellie Pavlick and Chris Callison-Burch. 2016. Most "babies" are "little" and most "problems" are "huge": Compositional entailment in adjective-nouns. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 2164–2173.
  • Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694.
  • Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. CoRR, abs/1811.01088.
  • Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191.
  • Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium.
  • Rachel Rudinger, Aaron Steven White, and Benjamin Van Durme. 2018. Neural models of factuality. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 731–744.
  • Keisuke Sakaguchi and Benjamin Van Durme. 2018. Efficient online scalar annotation with bounded support. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 208–218.
  • Amram Shapiro, Louise Firth Campbell, and Rosalind Wright. 2014. Book of Odds: From Lightning Strikes to Love at First Sight, the Odds of Everyday Life. William Morrow Paperbacks.
  • Gabriel Stanovsky, Judith Eckle-Kohler, Yevgeniy Puzikov, Ido Dagan, and Iryna Gurevych. 2017. Integrating deep linguistic features in factuality prediction over unified datasets. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Volume 2, pages 352–357.
  • Adam R. Teichert, Adam Poliak, Benjamin Van Durme, and Matthew R. Gormley. 2017. Semantic proto-role labeling. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 4459–4466.
  • Masatoshi Tsuchiya. 2018. Performance impact caused by hidden bias of training data for recognizing textual entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation.
  • Amos Tversky and Daniel Kahneman. 1981. The framing of decisions and the psychology of choice. Science, 211(4481):453–458.
  • Amos Tversky and Daniel Kahneman. 1992. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297–323.
  • Ivan Vulić, Daniela Gerz, Douwe Kiela, Felix Hill, and Anna Korhonen. 2017. HyperLex: A large-scale evaluation of graded lexical entailment. Computational Linguistics, 43(4).
  • Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 1112–1122.
  • Sheng Zhang, Rachel Rudinger, Kevin Duh, and Benjamin Van Durme. 2017. Ordinal common-sense inference. Transactions of the Association for Computational Linguistics, 5:379–395.
  • Annotators were given a qualification test to ensure non-expert workers were able to give reasonable subjective probability estimates. We first extracted seven statements from Book of Odds (Shapiro et al., 2014), and manually split each statement into a bleached premise and hypothesis. We then wrote three easy premise-hypothesis pairs with definite probabilities, like (p = “A girl tossed a coin.”, h =
  • Workers qualified if their overall correlation exceeded 0.7, with Spearman ρ > 0.4. This qualification test led to a pool of 40 trusted annotators, who were employed for the entirety of our dataset creation.
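The per-worker check described above can be simulated as follows; the gold and worker numbers are invented for illustration, only the ρ > 0.4 threshold comes from the text, and the simple rank transform assumes no tied values:

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation: Pearson r computed on ranks (no ties assumed)."""
    def ranks(x):
        order = np.argsort(x)
        r = np.empty(len(x))
        r[order] = np.arange(len(x), dtype=float)
        return r
    ra, rb = ranks(np.asarray(a, float)), ranks(np.asarray(b, float))
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

# Invented reference probabilities for seven qualification items
# vs. one hypothetical worker's estimates.
gold   = [0.50, 0.99, 0.01, 0.80, 0.30, 0.60, 0.10]
worker = [0.60, 0.95, 0.05, 0.70, 0.25, 0.50, 0.15]
qualified = spearman_rho(worker, gold) > 0.4  # threshold from the text above
```

A rank correlation is a sensible screening metric here because it rewards workers who get the ordering of likelihoods right even when their absolute probability calibration differs from the reference.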