Not All Claims are Created Equal: Choosing the Right Approach to Assess Your Hypotheses

arXiv, 2020.

Abstract:

Empirical research in Natural Language Processing (NLP) has adopted a narrow set of principles for assessing hypotheses, relying mainly on p-value computation, which suffers from several known issues. While alternative proposals have been well-debated and adopted in other fields, they remain rarely discussed or used within the NLP community.

Introduction
  • Empirical fields, such as Natural Language Processing (NLP), must follow scientific principles for assessing hypotheses and drawing conclusions from experiments.
  • While S1 has higher accuracy than S2 on both datasets (ARC-easy and ARC-challenge; see Table 1), the gap is moderate and the datasets are of limited size.
  • Can this apparent difference in performance be explained by random chance, or do the authors have sufficient evidence to conclude that S1 is inherently different from S2 on these datasets?
Highlights
  • Empirical fields, such as Natural Language Processing (NLP), must follow scientific principles for assessing hypotheses and drawing conclusions from experiments.
  • Our goal is to provide a unifying view of the pitfalls and best practices, and to equip NLP researchers with Bayesian hypothesis assessment approaches as an important alternative tool in their toolkit.
  • Using well-founded mechanisms for assessing the validity of hypotheses is crucial for any field that relies on empirical work.
  • Our survey indicates that the NLP community is not fully utilizing scientific methods geared towards such assessment, with only a relatively small number of papers using such methods, and most of them relying on p-values.
  • Our goal was to review different alternatives, especially a few often ignored in NLP.
  • A researcher should pick the right approach according to their needs and intentions, with a proper understanding of the techniques.
Results
  • A radius of one percent for the ROPE (region of practical equivalence) implies that improvements of less than one percentage point are not considered notable; a sketch of this decision rule follows below.
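As an illustration of the ROPE decision rule, here is a minimal sketch in the style of Kruschke's HDI+ROPE procedure (Kruschke, 2018), applied to the ARC-easy counts from Table 1. This is not the authors' HyBayes package; the uniform Beta(1, 1) priors and the sample-based 95% HDI computation are assumptions made for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# ARC-easy results from Table 1: correct answers out of 2376 instances.
s1, s2, n = 1721, 1637, 2376

# Posterior over each system's accuracy under a uniform Beta(1, 1) prior.
draws = 100_000
acc1 = rng.beta(1 + s1, 1 + n - s1, draws)
acc2 = rng.beta(1 + s2, 1 + n - s2, draws)
diff = acc1 - acc2  # posterior samples of the accuracy difference

# 95% highest-density interval: the shortest interval containing 95% of
# the samples (valid for a unimodal posterior like this one).
sorted_diff = np.sort(diff)
k = int(np.ceil(0.95 * draws))
widths = sorted_diff[k:] - sorted_diff[:draws - k]
i = int(np.argmin(widths))
hdi_low, hdi_high = sorted_diff[i], sorted_diff[i + k]

# ROPE with a one-percentage-point radius, as in the paper's example.
rope_low, rope_high = -0.01, 0.01
inside = np.mean((diff > rope_low) & (diff < rope_high))

print(f"95% HDI of accuracy difference: [{hdi_low:.4f}, {hdi_high:.4f}]")
print(f"Posterior mass inside ROPE: {inside:.3f}")
# Decision rule: reject the null value if the HDI lies entirely outside
# the ROPE, accept practical equivalence if it lies entirely inside,
# and withhold judgment otherwise.
```

With these counts, the lower end of the HDI lands near the ROPE boundary, precisely the kind of borderline case where the choice of ROPE radius matters.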
Conclusion
  • Using well-founded mechanisms for assessing the validity of hypotheses is crucial for any field that relies on empirical work.
  • The authors' survey indicates that the NLP community is not fully utilizing scientific methods geared towards such assessment, with only a relatively small number of papers using such methods, and most of them relying on p-values.
  • The authors surfaced various issues and potential dangers of careless use and interpretations of different approaches.
  • A researcher should pick the right approach according to their needs and intentions, with a proper understanding of the techniques.
  • Incorrect use of any technique can result in misleading conclusions.
Tables
  • Table 1: Performance of two systems (Devlin et al., 2019; Sun et al., 2018) on the ARC question-answering dataset (Clark et al., 2018). ARC-easy & ARC-challenge have 2376 & 1172 instances, respectively. Acc.: accuracy as a percentage
  • Table 2: Various classes of methods for statistical assessment of hypotheses
  • Table 3: A comparison of different statistical methods for evaluating the credibility of a hypothesis given a set of observations. The total number of published papers at the ACL-2018 conference is 439
  • Table 4: Select models supported by our package HyBayes at the time of this publication
Related work
  • While there is an abundant discussion of significance testing in other fields, only a handful of NLP efforts address it. For instance, Chinchor (1992) defined the principles of using hypothesis testing in the context of NLP problems. Most notably, there are works studying various randomized tests (Koehn, 2004; Ojala and Garriga, 2010; Graham et al., 2014) or metric-specific tests (Evert, 2004). More recently, Dror et al. (2018) and Dror and Reichart (2018) provide a thorough review of frequentist tests. While an important step in better informing the community, it covers a subset of statistical tools. Our work complements this effort by pointing out alternative tests.

    3: https://www.aclweb.org/anthology/events/acl-2018/
Funding
  • This work was partly supported by a gift from the Allen Institute for AI and by DARPA contracts FA8750-19-2-1004 and FA8750-19-2-0201.
References
  • Valentin Amrhein, Franzi Korner-Nievergelt, and Tobias Roth. 2017. The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. PeerJ, 5:e3544.
  • Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. An empirical investigation of statistical significance in NLP. In Proceedings of EMNLP, pages 995–1005.
  • James O Berger and Thomas Sellke. 1987. Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association, 82(397):112–122.
  • Nancy Chinchor. 1992. The statistical significance of the MUC-4 results. In Proceedings of the 4th Conference on Message Understanding, pages 30–50. Association for Computational Linguistics.
  • Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. CoRR, abs/1803.05457.
  • Janez Demšar. 2008. On the appropriateness of statistical tests in machine learning. In Workshop on Evaluation Methods for Machine Learning in conjunction with ICML, page 65.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171–4186.
  • Zoltan Dienes. 2008. Understanding psychology as a science: An introduction to scientific and statistical inference. Macmillan International Higher Education.
  • Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker's guide to testing statistical significance in natural language processing. In Proceedings of ACL, pages 1383–1392.
  • Rotem Dror and Roi Reichart. 2018. Recommended statistical significance tests for NLP tasks. arXiv preprint arXiv:1809.01448.
  • Stefan Evert. 2004. Significance tests for the evaluation of ranking methods. In Proceedings of COLING.
  • Dani Gamerman and Hedibert F Lopes. 2006. Markov chain Monte Carlo: stochastic simulation for Bayesian inference. CRC Press.
  • Andrew Gelman. 2013. The problem with p-values is how they're used.
  • Steven Goodman. 2008. A dirty dozen: Twelve p-value misconceptions. Seminars in Hematology, 45(3):135–140.
  • Yvette Graham, Nitika Mathur, and Timothy Baldwin. 2014. Randomized significance tests in machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 266–274.
  • Rink Hoekstra, Richard D Morey, Jeffrey N Rouder, and Eric-Jan Wagenmakers. 2014. Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5):1157–1164.
  • Jeehyoung Kim and Heejung Bang. 2016. Three common misuses of p values. Dental Hypotheses, 7(3):73.
  • Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP.
  • John K Kruschke. 2010. Bayesian data analysis. Wiley Interdisciplinary Reviews: Cognitive Science, 1(5):658–676.
  • John K Kruschke. 2018. Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science, 1(2):270–280.
  • John K Kruschke and Torrin M Liddell. 2018. The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1):178–206.
  • Charles C Liu and Murray Aitkin. 2008. Bayes factors: Prior sensitivity and model generalizability. Journal of Mathematical Psychology, 52(6):362–375.
  • N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. 1953. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21:1087.
  • Markus Ojala and Gemma C Garriga. 2010. Permutation tests for studying classifier performance. JMLR, 11(Jun):1833–1863.
  • Travis E Oliphant. 2006. A Bayesian perspective on estimating mean, variance, and standard-deviation from data. Technical report, Brigham Young University. https://scholarsarchive.byu.edu/facpub/278/.
  • Jean-Baptist du Prel, Gerhard Hommel, Bernd Röhrig, and Maria Blettner. 2009. Confidence interval or p-value?: Part 4 of a series on evaluation of scientific publications. Deutsches Ärzteblatt International, 106(19):335.
  • Stefan Riezler and John T Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 57–64.
  • Sandip Sinharay and Hal S Stern. 2002. On the sensitivity of Bayes factors to the prior distributions. The American Statistician, 56(3):196–201.
  • Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy, and Héctor Martínez Alonso. 2014. What's in a p-value in NLP? In Proceedings of CoNLL, pages 1–10.
  • Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2018. Improving machine reading comprehension with general reading strategies. In Proceedings of NAACL.
  • David Trafimow and Michael Marks. 2015. Editorial. Basic and Applied Social Psychology, 37(1):1–2.
  • Ronald L Wasserstein and Nicole A Lazar. 2016. The ASA's statement on p-values: context, process, and purpose. The American Statistician, 70(2):129–133.
  • Donna M Windish, Stephen J Huot, and Michael L Green. 2007. Medicine residents' understanding of the biostatistics and results in the medical literature. JAMA, 298(9):1010–1022.
Appendix
  • Here we use a one-sided z-test to compare s1 = 1721 correct answers out of n1 = 2376 against s2 = 1637 out of n2 = 2376. We start by calculating the standard two-proportion z-score,

    z = (p̂1 − p̂2) / √( p̂ (1 − p̂) (1/n1 + 1/n2) ),

    where p̂1 = s1/n1, p̂2 = s2/n2, and p̂ = (s1 + s2)/(n1 + n2) is the pooled proportion.
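A quick way to reproduce this computation (a minimal sketch using only the Python standard library; the pooled-variance form of the test is an assumption of this sketch):

```python
from math import erf, sqrt

# ARC-easy counts from the appendix example (Table 1).
s1, s2, n = 1721, 1637, 2376

p1, p2 = s1 / n, s2 / n
p_pool = (s1 + s2) / (2 * n)               # pooled success proportion
se = sqrt(p_pool * (1 - p_pool) * 2 / n)   # standard error under the null
z = (p1 - p2) / se

# One-sided p-value from the standard normal CDF.
p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
print(f"z = {z:.3f}, one-sided p = {p_value:.4f}")
```

With these counts, z comes out to about 2.68, for a one-sided p-value of roughly 0.004.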