# Not All Claims are Created Equal: Choosing the Right Approach to Assess Your Hypotheses

arXiv, 2020.

Abstract:

Empirical research in Natural Language Processing (NLP) has adopted a narrow set of principles for assessing hypotheses, relying mainly on p-value computation, which suffers from several known issues. While alternative proposals have been well-debated and adopted in other fields, they remain rarely discussed or used within the NLP community.

Introduction

- Empirical fields, such as Natural Language Processing (NLP), must follow scientific principles for assessing hypotheses and drawing conclusions from experiments.
- While S1 has higher accuracy than S2 in both cases, the gap is moderate and the datasets are of limited size
- Can this apparent difference in performance be explained by random chance, or do the authors have sufficient evidence to conclude that S1 is inherently different than S2 on these datasets?

Highlights

- Empirical fields, such as Natural Language Processing (NLP), must follow scientific principles for assessing hypotheses and drawing conclusions from experiments
- Our goal is to provide a unifying view of the pitfalls and best practices, and equip Natural Language Processing researchers with Bayesian hypothesis assessment approaches as an important alternative tool in their toolkit
- Using well-founded mechanisms for assessing the validity of hypotheses is crucial for any field that relies on empirical work
- Our survey indicates that the Natural Language Processing community is not fully utilizing scientific methods geared towards such assessment, with only a relatively small number of papers using such methods, and most of them relying on p-value
- Our goal was to review different alternatives, especially a few often ignored in Natural Language Processing
- A researcher should pick the right approach according to their needs and intentions, with a proper understanding of the techniques

Results

- A ROPE radius of 1% implies that improvements smaller than 1% are not considered notable.
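The ROPE (Region of Practical Equivalence) decision rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's HyBayes implementation: the posterior samples `delta` are simulated here for demonstration, and the simple `hdi` helper assumes a unimodal posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical posterior samples of the accuracy difference between two systems
# (simulated for illustration; in practice these come from Bayesian inference)
delta = rng.normal(loc=0.04, scale=0.012, size=50_000)

rope = (-0.01, 0.01)  # ROPE with a 1-point radius: differences inside are negligible

def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples (unimodal case)."""
    s = np.sort(samples)
    n_in = int(np.floor(mass * len(s)))
    widths = s[n_in:] - s[: len(s) - n_in]
    lo = int(np.argmin(widths))
    return s[lo], s[lo + n_in]

lo, hi = hdi(delta)
if lo > rope[1] or hi < rope[0]:
    decision = "difference is notable (HDI entirely outside ROPE)"
elif lo >= rope[0] and hi <= rope[1]:
    decision = "difference is negligible (HDI entirely inside ROPE)"
else:
    decision = "undecided (HDI overlaps ROPE)"
print(f"95% HDI: [{lo:.3f}, {hi:.3f}] -> {decision}")
```

The three-way outcome (notable / negligible / undecided) is what distinguishes this approach from a binary reject/fail-to-reject p-value decision.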

Conclusion

- Using well-founded mechanisms for assessing the validity of hypotheses is crucial for any field that relies on empirical work.
- The authors' survey indicates that the NLP community is not fully utilizing scientific methods geared towards such assessment, with only a relatively small number of papers using such methods, and most of them relying on p-value.
- The authors surfaced various issues and potential dangers of careless use and interpretations of different approaches.
- A researcher should pick the right approach according to their needs and intentions, with a proper understanding of the techniques.
- Incorrect use of any technique can result in misleading conclusions


- Table1: Performance of two systems (Devlin et al., 2019; Sun et al., 2018) on the ARC question-answering dataset (Clark et al., 2018). ARC-easy & ARC-challenge have 2376 & 1172 instances, respectively. Acc.: accuracy as a percentage
- Table2: Various classes of methods for statistical assessment of hypotheses
- Table3: A comparison of different statistical methods for evaluating the credibility of a hypothesis given a set of observations. The total number of published papers at the ACL-2018 conference is 439
- Table4: Select models supported by our package HyBayes at the time of this publication

Related work

- While there is abundant discussion of significance testing in other fields, only a handful of NLP efforts address it. For instance, Chinchor (1992) defined the principles of using hypothesis testing in the context of NLP problems. Most notably, there are works studying various randomized tests (Koehn, 2004; Ojala and Garriga, 2010; Graham et al., 2014), or metric-specific tests (Evert, 2004). More recently, Dror et al. (2018) and Dror and Reichart (2018) provide a thorough review of frequentist tests. While an important step in better informing the community, it covers a subset of statistical tools. Our work complements this effort by pointing out alternative tests.

3: https://www.aclweb.org/anthology/events/acl-2018/

Funding

- This work was partly supported by a gift from the Allen Institute for AI and by DARPA contracts FA8750-19-2-1004 and FA8750-19-2-0201

References

- Valentin Amrhein, Franzi Korner-Nievergelt, and Tobias Roth. 2017. The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. PeerJ, 5:e3544.
- Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. An empirical investigation of statistical significance in NLP. In Proceedings of EMNLP, pages 995–1005.
- James O Berger and Thomas Sellke. 1987. Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association, 82(397):112–122.
- Nancy Chinchor. 1992. The statistical significance of the MUC-4 results. In Proceedings of the 4th conference on Message understanding, pages 30–50. Association for Computational Linguistics.
- Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. CoRR, abs/1803.05457.
- Janez Demsar. 2008. On the appropriateness of statistical tests in machine learning. In Workshop on Evaluation Methods for Machine Learning in conjunction with ICML, page 65.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171– 4186.
- Zoltan Dienes. 2008. Understanding psychology as a science: An introduction to scientific and statistical inference. Macmillan International Higher Education.
- Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhikers guide to testing statistical significance in natural language processing. In Proceedings of ACL, pages 1383–1392.
- Rotem Dror and Roi Reichart. 2018. Recommended statistical significance tests for NLP tasks. arXiv preprint arXiv:1809.01448.
- Stefan Evert. 2004. Significance tests for the evaluation of ranking methods. In Proceedings of COLING.
- Dani Gamerman and Hedibert F Lopes. 2006. Markov chain Monte Carlo: stochastic simulation for Bayesian inference. CRC Press.
- Andrew Gelman. 2013. The problem with p-values is how they’re used.
- Steven Goodman. 2008. A dirty dozen: Twelve pvalue misconceptions. Seminars in Hematology, 45(3):135–140.
- Yvette Graham, Nitika Mathur, and Timothy Baldwin. 2014. Randomized significance tests in machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 266–274.
- Rink Hoekstra, Richard D Morey, Jeffrey N Rouder, and Eric-Jan Wagenmakers. 2014. Robust misinterpretation of confidence intervals. Psychonomic bulletin & review, 21(5):1157–1164.
- Jeehyoung Kim and Heejung Bang. 2016. Three common misuses of p values. Dental hypotheses, 7(3):73.
- Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP.
- John K Kruschke. 2010. Bayesian data analysis. Wiley Interdisciplinary Reviews: Cognitive Science, 1(5):658–676.
- John K Kruschke. 2018. Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science, 1(2):270–280.
- John K Kruschke and Torrin M Liddell. 2018. The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1):178–206.
- Charles C Liu and Murray Aitkin. 2008. Bayes factors: Prior sensitivity and model generalizability. Journal of Mathematical Psychology, 52(6):362–375.
- N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. 1953. Equation of state calculations by fast computing machines. The journal of chemical physics, 21:1087.
- Markus Ojala and Gemma C Garriga. 2010. Permutation tests for studying classifier performance. JMLR, 11(Jun):1833–1863.
- Travis E Oliphant. 2006. A Bayesian perspective on estimating mean, variance, and standard-deviation from data. Technical report, Brigham Young University. https://scholarsarchive.byu.edu/facpub/278/.
- Jean-Baptist du Prel, Gerhard Hommel, Bernd Röhrig, and Maria Blettner. 2009. Confidence interval or p-value?: Part 4 of a series on evaluation of scientific publications. Deutsches Ärzteblatt International, 106(19):335.
- Stefan Riezler and John T Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for mt. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 57– 64.
- Sandip Sinharay and Hal S Stern. 2002. On the sensitivity of Bayes factors to the prior distributions. The American Statistician, 56(3):196–201.
- Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy, and Héctor Martínez Alonso. 2014. What’s in a p-value in NLP? In Proceedings of CoNLL, pages 1–10.
- Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2018. Improving machine reading comprehension with general reading strategies. In Proceedings of NAACL.
- David Trafimow and Michael Marks. 2015. Editorial. Basic and Applied Social Psychology, 37(1):1–2.
- Ronald L Wasserstein, Nicole A Lazar, et al. 2016. The ASA's statement on p-values: context, process, and purpose. The American Statistician, 70(2):129–133.
- Donna M Windish, Stephen J Huot, and Michael L Green. 2007. Medicine residents’ understanding of the biostatistics and results in the medical literature. Jama, 298(9):1010–1022.
- Here we use a one-sided z-test to compare $s_1 = 1721$ out of $n_1 = 2376$ vs. $s_2 = 1637$ out of $n_2 = 2376$. With $\hat{p}_i = s_i / n_i$ and pooled proportion $\hat{p} = (s_1 + s_2)/(n_1 + n_2)$, the z-score is $z = (\hat{p}_1 - \hat{p}_2) \,/\, \sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}$. (3)
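The z-test calculation above can be reproduced in a few lines. This is a minimal sketch of the standard two-proportion one-sided z-test, not the paper's exact code; it uses only the Python standard library.

```python
import math

def one_sided_z_test(s1, n1, s2, n2):
    """One-sided two-proportion z-test; alternative hypothesis is p1 > p2."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)  # pooled success rate under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # 1 - Phi(z)
    return z, p_value

# Accuracies from Table1: 1721/2376 vs. 1637/2376 correct on ARC-easy
z, p = one_sided_z_test(1721, 2376, 1637, 2376)
print(f"z = {z:.3f}, one-sided p = {p:.4f}")
```

Note that a small p-value here only quantifies surprise under the null hypothesis; as the paper argues, it does not by itself establish that the difference is practically meaningful.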
