With Little Power Comes Great Responsibility
EMNLP 2020, pp. 9263–9274.
Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, an…
- Despite its importance to empirical evaluation, relatively little attention has been paid to statistical power in NLP.
- Power is the probability that a statistical test will successfully detect a true effect.
- The authors will need multiple people to evaluate the systems.
- Once the authors have collected data, a statistical test will tell them whether they can reject the null hypothesis that the systems are equally good.
- Assuming the systems are not identical, statistical power is the probability that the experiment will return a significant result.
- Power depends on multiple factors, including the statistical test used, the significance threshold, true effect size, variance, and sample size.
- Note that if the authors do find a significant difference, this does not imply that the experiment had high power.
- We introduce a novel approach to power analysis for machine translation and characterize power in experiments testing for differences in BLEU (§4)
- We have presented evidence that underpowered experiments are widespread in NLP
- For comparisons based on small samples, there is little reason to think that such an evaluation could reliably provide evidence of a significant improvement, and good reason to believe that improvements found to be significant will exaggerate or reverse the true effect
- Going forward, a combination of larger test sets, simple power analyses, and wider sharing of code, data, and experimental details will help to build the foundation for a higher standard of experimental methodology in NLP
- Recent progress in NLP has been extraordinarily rapid, sometimes at the cost of experimental rigor.
- Table1: A contingency table representing the distribution of possible outcomes for two models (M1 and M2)
- Table2: Estimated minimum detectable effect (MDE) using a regression-based estimate of likely agreement with leaderboard SOTA as of May 6th, 2020. |∆acc| is the average improvement over baseline per task among surveyed papers that claimed SOTA. For future comparisons, unless the expected improvement is larger than the estimated MDE, an experiment is unlikely to be adequately powered, and researchers should instead choose a different (larger) dataset. Note that this likely applies to the vast majority of experiments on WNLI, MRPC, and SST-2, based on recent trends. † indicates that the SQuAD 2.0 average was based on leaderboard improvements, which weren’t necessarily reported in a publication. See Appendix E for full table and details
- Table3: Relevant parameters from four MT evaluations. TF are Transformer-based (Ott et al., 2018; Edunov et al., 2018; Ng et al., 2019) and Conv are convolutional models (Gehring et al., 2017) from FAIRSEQ. Test sets are from WMT shared tasks for En-De translation. ∆B is the reported difference in BLEU, whereas P0 and b0 are estimated. * indicates ensembles
- Table4: A possible distribution corresponding to the case where models M1 and M2 will agree on 90% of examples (Pa) and M2 achieves a 2% improvement over M1 (∆acc). Note that the on-diagonal terms here will be dictated by the accuracy of M1 (or equivalently, by M2), but for our purposes, only need to be non-negative and sum to Pa for the sake of McNemar’s test, which only looks at the off-diagonal elements
- Table5: OLS Regression Results for predicting GLUE model overlap from baseline accuracy and effect size
- Table6: OLS Regression Results for predicting SQuAD 2.0 model overlap
- Table7: OLS regression for predicting effect size for GLUE tasks
- Table8: OLS Regression Results for predicting effect size from baseline accuracy for SQuAD 2.0 improvements
- Table9: The minimum detectable effect (MDE) for various datasets given the current top accuracy on the leaderboard on May 6th, 2020. See Appendix E for expanded details. How to use this table? Suppose you are building a model to get SOTA on any of these datasets. If you don’t have a reasonable expectation that your model will exceed the MDE, then it is not worth proceeding with the study on a dataset of this size; instead, either more data should be collected or a different (larger) dataset used. MDE (Lachenbruch, 1992) provides a mid-point and upper/lower bounds using the most conservative and generous estimates of model agreement. MDE Binomial uses the binomial test as the assumed statistical test and calculates the MDE using the exact mechanism from Appendix E.5. See also discussion by Arend and Schafer (2019). ∆ is the expected effect from fitting a regression to all SOTA improvement claims found in reviewed papers. |∆| (std. err., n) is the average improvement in surveyed papers that claimed SOTA and had a positive effect size reported for the dataset (with standard error and the number of papers in parentheses). † indicates that the SQuAD 2.0 average improvement was based on improvements to the SQuAD leaderboard, which weren’t necessarily reported as improvements in a publication. ∆SOTA is the gap between the SOTA model (ALBERT + DAAF + NAS) on GLUE and the next best model (ERNIE) – this was not included in the regression
- Table10: We examine the claims of SOTA improvement in surveyed GLUE papers and use a leave-one-out regression-based estimate of effect size and overlap to simulate how many authors would have found their study to be well-powered. We also examine how many of the observed effects were likely significant based on predicted model overlap. We note that if we use the observed effect in a post-hoc analysis, the proportion of studies falling below the MDE is even higher
- Table11: A contingency table representing the distribution of possible outcomes for two models (M1 and M2) on the instances of a single class of labels. The cells of this table should sum to 1.0 for each class
- Table12: Number of workers and items in each of our convenience sampled datasets
- Table13: Fit fixed effect coefficients for each model along with the residual model variance. If only one model is compared to a baseline, there is a value for intercept and β1. If more than one model, there is an additional parameter for each model. Because we use contrast coding, each coefficient can be interpreted as the difference from the grand mean
- Table14: Fit random effects standard deviations for worker. As in the equations above, σ0w is the worker intercept and the rest of the parameters are worker slopes for each model
- Table15: Fit random effects standard deviations for item. As in the equations above, σ0i is the item intercept and the rest of the parameters are item slopes for each model
- Table16: An example of high variance and low variance settings. The standard deviations correspond to the variance parameters for worker intercept, worker slope, item intercept, item slope, and sigma, respectively
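The scenario described in Table 4 can be checked directly by simulation. The sketch below (an illustration, not the paper's exact setup) estimates the power of the exact McNemar test when two models disagree on 10% of items and the new model is 2 accuracy points better; only the two off-diagonal cell probabilities matter:

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)

def mcnemar_power(n_items, p10=0.06, p01=0.04, alpha=0.05, n_sims=1000):
    """Estimate power of the exact McNemar (sign) test by simulation.

    p10 = P(M2 correct, M1 wrong); p01 = P(M1 correct, M2 wrong).
    With p10 + p01 = 0.10 (10% disagreement) and p10 - p01 = 0.02,
    this matches a 2-point accuracy gain with 90% model agreement.
    """
    hits = 0
    for _ in range(n_sims):
        # Draw the two discordant cell counts for a test set of n_items.
        b, c = rng.multinomial(n_items, [p10, p01, 1.0 - p10 - p01])[:2]
        if b + c == 0:
            continue
        # Exact McNemar test: is b plausible under Binomial(b + c, 0.5)?
        if binomtest(int(b), int(b + c), 0.5).pvalue < alpha:
            hits += 1
    return hits / n_sims

for n in [500, 2000, 10000]:
    print(f"n={n:>5}  power≈{mcnemar_power(n):.2f}")
```

Even though the on-diagonal cells are unconstrained, power here is fully determined by the off-diagonal probabilities and the test-set size, which is why small test sets struggle to detect 1–2 point gains.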
- Proceeding with a test that is underpowered (i.e., too few subjects or items; often taken to mean less than 80% power; Cohen, 1962) means that one is less likely to be able to draw any useful statistical conclusion from the experiment, and has contributed, in part, to the replication crisis in other fields (Button et al, 2013; Szucs and Ioannidis, 2017; Ioannidis et al, 2017)
Given these parameters, we can assess the likely power and MDE for a typical model improvement against a given baseline accuracy level. To fit a regression to predict typical improvements to SOTA, we gather data from GLUE papers and manually label 119 accuracy comparisons and 57 claims of improvement (as denoted by bolding of a result and a claim of SOTA in text) across 14 papers (selected as being at or above the BERT score on the GLUE leaderboard with an accompanying paper). In regressing ∆acc on baseline accuracy and task, we achieve an R² = 0.69, which is not a perfect fit, but still provides a prior on likely effect size
on models comparing to even weaker baselines, we would expect most future improvements to be even smaller. Thus, most future experiments involving these three datasets will not have adequate power to test for improvements over the current SOTA in the way that they are routinely used. Moreover, alternative analyses give even more pessimistic estimates of likely improvements relative to MDE, as described in Appendix E.4
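To make the MDE logic concrete, here is a rough sketch of how a minimum detectable effect could be computed for a paired accuracy comparison. It uses a normal approximation to the sign (McNemar) test rather than the paper's exact mechanism from Appendix E.5, and the 10% disagreement rate is an illustrative assumption:

```python
import math
from scipy.stats import norm

def sign_test_power(n, delta, disagree=0.10, alpha=0.05):
    """Normal-approximation power of the paired sign (McNemar) test.

    n: test-set size; delta: accuracy gain of the new model;
    disagree: assumed fraction of items the two models disagree on.
    """
    m = n * disagree                                 # expected discordant items
    p = min(0.5 + delta / (2 * disagree), 1.0 - 1e-9)  # P(new model wins | disagreement)
    z_crit = norm.ppf(1 - alpha / 2)
    se0 = 0.5 / math.sqrt(m)                         # s.e. under the null
    se1 = math.sqrt(p * (1 - p) / m)                 # s.e. under the alternative
    return 1.0 - norm.cdf((z_crit * se0 - (p - 0.5)) / se1)

def mde(n, target_power=0.80, disagree=0.10):
    """Smallest accuracy gain detectable at the target power (bisection)."""
    lo, hi = 0.0, disagree
    for _ in range(50):
        mid = (lo + hi) / 2
        if sign_test_power(n, mid, disagree) >= target_power:
            hi = mid
        else:
            lo = mid
    return hi

for n in [250, 1000, 10000]:
    print(f"n={n:>6}  MDE ≈ {mde(n):.3f}")
```

The qualitative pattern matches the tables above: small test sets can only reliably detect multi-point gains, so expected improvements below the MDE argue for a larger dataset.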
To generalize across studies, we restrict our analysis to Likert-scale comparisons, which was the most commonly reported type of evaluation. We extracted all cases where a new model was being compared to the best-performing baseline on one or more metrics (117 comparisons from 41 papers) and normalized all ratings to be on a 0–1 scale. One takeaway from this meta-analysis is that the reported effect sizes (that is, the difference between the novel model and the best-performing baseline) vary widely (s.d. = .12 on a [0, 1] scale)
From this analysis, we highlight a few key takeaways:
- Many human evaluation studies are likely underpowered: using the “high variance” parameters (which are typical of most of the datasets we used), the most common design at EMNLP 2019 (3 workers, 100 items) is underpowered unless the effect size is quite large (0.2 or higher on the [0, 1] scale).
- Even with low variance, typical designs are underpowered to detect small effects: using our estimated parameters for the low variance setting, experiments will be underpowered to detect small effects (0.05 on the [0, 1] scale) unless an unusually large number of ratings per item are collected (10+ for 100 items).
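A simplified version of this kind of power simulation for human evaluations can be sketched as follows. This is not the paper's mixed-effects procedure: every worker rates every item under both models, per-model worker and item slopes stand in for the full random-effects structure, significance comes from a t-test on per-item mean rating differences, and all variance parameters are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)

def human_eval_power(n_workers, n_items, effect,
                     sd_wslope=0.15, sd_islope=0.15, sd_noise=0.3,
                     alpha=0.05, n_sims=1000):
    """Simulated power of a two-model human evaluation on a [0, 1] scale."""
    hits = 0
    for _ in range(n_sims):
        s_w = rng.normal(0, sd_wslope, size=(n_workers, 1))  # worker slopes
        t_i = rng.normal(0, sd_islope, size=(1, n_items))    # item slopes
        # Shared worker/item intercepts cancel in the paired difference,
        # so simulate the per-rating difference (new model minus baseline).
        diff = (effect + s_w + t_i
                + rng.normal(0, np.sqrt(2) * sd_noise,
                             size=(n_workers, n_items)))
        per_item = diff.mean(axis=0)
        _, p = ttest_1samp(per_item, 0.0)
        # Count only significant results in the correct direction.
        if p < alpha and per_item.mean() > 0:
            hits += 1
    return hits / n_sims

# The most common design at EMNLP 2019: 3 workers rating 100 items.
for eff in [0.05, 0.10, 0.20]:
    print(f"effect={eff:.2f}  power≈{human_eval_power(3, 100, eff):.2f}")
```

Because per-model worker slopes do not cancel in the paired comparison, adding workers (not just items) is what shrinks that variance component, which is the intuition behind collecting more ratings per item.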
- Hua Ai, Antoine Raux, Dan Bohus, Maxine Eskenazi, and Diane Litman. 2007. Comparing spoken dialog corpora collected with recruited subjects versus real users. In Proceedings of SIGdial.
- Frank J. Anscombe. 1954. Fixed-sample-size analysis of sequential observations. Biometrics, 10:89–100.
- Matthias G. Arend and Thomas Schafer. 2019. Statistical power in two-level models: A tutorial based on Monte Carlo simulation. Psychological methods, 24(1):1–19.
- Erfan Sadeqi Azer, Daniel Khashabi, Ashish Sabharwal, and Dan Roth. 2020. Not all claims are created equal: Choosing the right statistical approach to assess hypotheses. In Proceedings of ACL.
- Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment.
- Dale J. Barr, Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3):255–278.
- Colin B. Begg and Madhuchhanda Mazumdar. 1994. Operating characteristics of a rank correlation test for publication bias. Biometrics, 50(4):1088–1101.
- Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In Proceedings of TAC.
- Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. An empirical investigation of statistical significance in NLP. In Proceedings of EMNLP.
- Katherine S. Button, John P. A. Ioannidis, Claire Mokrysz, Brian A. Nosek, Jonathan Flint, Emma S. J. Robinson, and Marcus R. Munafo. 2013. Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5):365–376.
- Boxing Chen and Colin Cherry. 2014. A systematic comparison of smoothing techniques for sentence-level BLEU. In Proceedings of WMT.
- Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D. Manning, and Quoc V. Le. 2019a. BAM! Born-again multi-task networks for natural language understanding. In Proceedings of ACL.
- Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2019b. ELECTRA: Pretraining text encoders as discriminators rather than generators. In Proceedings of ICLR.
- Jacob Cohen. 1962. The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3):145–153.
- John E. Connett, Judith A. Smith, and Richard B. McHugh. 1987. Sample size and power for pair-matched case-control studies. Statistics in Medicine, 6(1):53–59.
- Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Proceedings of the Machine Learning Challenges Workshop.
- Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and play language models: A simple approach to controlled text generation. In Proceedings of ICLR.
- Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL.
- Thomas G. Dietterich. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923.
- Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. 2019. Show your work: Improved reporting of experimental results. In Proceedings of EMNLP.
- William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
- Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of ACL.
- Stephen W. Duffy. 1984. Asymptotic and exact power for the McNemar test and its analogue with R controls per case. Biometrics, 40:1005–1015.
- Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of EMNLP.
- Morten W. Fagerland, Stian Lydersen, and Petter Laake. 2013. The McNemar test for binary matched-pairs data: mid-p and asymptotic are better than exact conditional. BMC Medical Research Methodology, 13.
- Cristina Garbacea, Samuel Carton, Shiyan Yan, and Qiaozhu Mei. 2019. Judge the judges: A large-scale evaluation study of neural language models for online review generation. In Proceedings of EMNLP.
- Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of ICML.
- Andrew Gelman. 2019. Don’t calculate post-hoc power using observed estimate of effect size. Annals of Surgery, 269(1):e9–e10.
- Andrew Gelman and John Carlin. 2014. Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6):641–651.
- Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
- Andrew Gelman and Eric Loken. 2013. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.
- Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.
- Yvette Graham, Nitika Mathur, and Timothy Baldwin. 2014. Randomized significance tests in machine translation. In Proceedings of WMT.
- Tatsunori B Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of NAACL.
- John M. Hoenig and Dennis M. Heisey. 2001. The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55(1):19–24.
- Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In Proceedings of ICLR.
- John P. A. Ioannidis. 2019. What have we (not) learnt from millions of scientific papers with P values? The American Statistician, 73(sup1):20–25.
- John P. A. Ioannidis, T. D. Stanley, and Hristos Doucouliagos. 2017. The power of bias in economics research. The Economic Journal, 127(605):F236–F265.
- Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. First Quora dataset release: Question pairs.
- Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP.
- Helena C. Kraemer and Christine Blasey. 2015. How Many Subjects?: Statistical Power Analysis in Research. SAGE.
- Peter A Lachenbruch. 1992. On the sample size for studies based upon McNemar’s test. Statistics in Medicine, 11(11):1521–1525.
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of ICLR.
- Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of INLG.
- Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Proceedings of KR.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository, arXiv:1907.11692.
- R. Thomas McCoy, Junghyun Min, and Tal Linzen. 2019. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. Computing Research Repository, arXiv:1911.02969.
- Blakeley B. McShane, David Gal, Andrew Gelman, Christian Robert, and Jennifer L. Tackett. 2019. Abandon statistical significance. The American Statistician, 73(sup1):235–245.
- Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR’s WMT19 news translation task submission. In Proceedings of WMT.
- Daniel J. O’Keefe. 2007. Brief report: Post hoc power, observed power, a priori power, retrospective power, prospective power, achieved power: Sorting out appropriate uses of statistical power analyses. Communication Methods and Measures, 1(4):291–299.
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. FAIRSEQ: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL.
- Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proceedings of WMT.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL.
- Jason Phang, Thibault Fevry, and Samuel R Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. Computing Research Repository, arXiv:1811.01088.
- Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of WMT.
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. Computing Research Repository, arXiv:1910.10683.
- Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of ACL.
- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP.
- Stefan Riezler and John T. Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
- Jeffrey D. Scargle. 1999. Publication bias: The “file-drawer” problem in scientific inference. arXiv, arXiv:physics/9909033.
- James J. Schlesselman. 1982. Case-control studies: Design, conduct, analysis. Oxford University Press.
- Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2019. Green AI. Computing Research Repository, arXiv:1907.10597.
- Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.
- Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy, and Héctor Martínez Alonso. 2014. What’s in a p-value in NLP? In Proceedings of CoNLL.
- Samy Suissa and Jonathan J. Shuster. 1991. The 2 × 2 matched-pairs trial: Exact unconditional design and analysis. Biometrics, 47(2):361–372.
- Denes Szucs and John P. A. Ioannidis. 2017. Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biology, 15(3).
- Eric-Jan Wagenmakers. 2007. A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14:779–804.
- Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the Workshop on BlackboxNLP.
- Jacob Westfall, David A. Kenny, and Charles M. Judd. 2014. Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General, 143(5):2020–2045.
- Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL.
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of NeurIPS.
- Georgios N. Yannakakis and Héctor P. Martínez. 2015. Ratings are overrated! Frontiers in ICT, 2.
- Null hypothesis significance testing: In this paper, we work within the framework of null hypothesis significance testing (NHST). NHST is not free from problems, in that certain systematic processes within the practice of scientific research and publishing can undermine its advantages, many of which have been explored in the literature (Gelman and Loken, 2013; Ioannidis, 2019; McShane et al., 2019). Nevertheless, it would be premature to discard the entire paradigm, and we believe there is still some value in considering power within NHST for several reasons.
- First, despite its flaws, NHST remains a commonly used experimental framework in NLP research. Whether implicit or explicit, most experimental comparisons in the NLP literature have the structure of an experiment in the NHST framework, where having equivalent performance to an existing baseline is treated as a null hypothesis and the new model is argued to be significantly better (the typical case) or significantly worse (far rarer). But, whereas many fields that run experiments have standardized procedures for assessing statistical significance, NLP papers vary as to how formally they use a hypothesis testing framework to evaluate their results (Berg-Kirkpatrick et al., 2012; van der Lee et al., 2019; Azer et al., 2020).
- Finally, there is also a great need for additional clarity with respect to precisely what claims are being made by NLP papers. In this work, we are primarily focused on claims made about trained models (i.e. in testing whether one particular instantiation of a model is significantly better than a particular instantiation of another model). It is, of course, also important to consider broader claims that might be made, such as about expected performance or computational budget (Dodge et al., 2019; Schwartz et al., 2019), and everything we have to say can be extended to incorporate such considerations. For the purpose of clarity, however, we restrict ourselves to the simplest sort of statistical claim.
- Importantly, proper experiment design requires specifying these parameters in advance of data collection, or otherwise using a valid stopping rule. One can always obtain a significant result by progressively collecting data until a significant result is found (“sampling to a foregone conclusion”), but this is not a valid procedure (Anscombe, 1954; Wagenmakers, 2007). Similarly, post-hoc power analysis, using estimates derived from the experiment itself, provides no additional information beyond a transformation of the observed p-value, and is thus not recommended (though see below).
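The cost of an invalid stopping rule is easy to demonstrate: under the null hypothesis of no preference, testing repeatedly as data accumulate and stopping at the first significant result inflates the false-positive rate well beyond the nominal alpha. A minimal sketch, with an assumed peeking schedule:

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(2)

def peeking_false_positive_rate(peek_every=20, max_n=400,
                                alpha=0.05, n_sims=1000):
    """Under the null (raters flip a fair coin between two systems),
    run the binomial test after every batch of ratings and stop at the
    first significant result.  Returns the fraction of runs that ever
    reach significance, i.e., the realized false-positive rate."""
    false_pos = 0
    for _ in range(n_sims):
        wins = 0
        for n in range(peek_every, max_n + 1, peek_every):
            wins += int(rng.binomial(peek_every, 0.5))
            if binomtest(wins, n, 0.5).pvalue < alpha:
                false_pos += 1
                break
    return false_pos / n_sims

print("nominal alpha:            0.05")
print(f"rate with repeated peeks: {peeking_false_positive_rate():.2f}")
```

With twenty looks at the data, the realized Type I error rate is several times the nominal 5%, which is why the sample size (or a valid sequential stopping rule) must be fixed in advance.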
- Expanding on the algorithm in Figure 2, a simulation-based power analysis involves the following:
- 1. First, determine the statistical test, T, which will be used. For the example of comparing models depicted in Figure 1, we will use the binomial test to compare the systems (Dror et al., 2018).
- 2. Come up with a generative process which could be used to generate data like that which we will collect. In this step, we need to make assumptions about the comparison of interest. Since the binomial test requires only the counts of how many people prefer each system, we need to specify a prior on generating those counts. For example, we might assume that 60% of people will prefer system B, so the generative process will be cB ∼ Binomial(p = 0.6, n), where n is the total number of people to be sampled.
- 3. Choose a value of n for which we want to calculate power. Repeatedly (e.g., 10,000 times) draw many samples from our assumed generative process for that size of n.
- 4. For each simulated dataset of size n, run the chosen statistical test to check whether the difference between the observed counts is significant, and compute the proportion of simulations found to be significant. This is our estimate of power.
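The four steps above can be sketched directly in code, using the assumed generative process cB ∼ Binomial(p = 0.6, n):

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(3)

def binomial_power(n, p_true=0.6, alpha=0.05, n_sims=10000):
    """Fraction of simulated experiments (n raters each, 60% of the
    population preferring system B) in which the two-sided binomial
    test rejects the null of p = 0.5."""
    counts = rng.binomial(n, p_true, size=n_sims)
    # Each distinct count needs only one p-value computation.
    uniq, freq = np.unique(counts, return_counts=True)
    n_sig = sum(f for c, f in zip(uniq, freq)
                if binomtest(int(c), n, 0.5).pvalue < alpha)
    return n_sig / n_sims

for n in [50, 100, 200]:
    print(f"n={n:>3}  power≈{binomial_power(n):.2f}")
```

Repeating this over a grid of n values gives a power curve, from which one can read off the smallest sample size that achieves, say, 80% power under the assumed preference rate.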
- 4. Type-M error ≈ mean(|estimated effect| over significant simulations) / |true effect|, i.e., the exaggeration ratio of Gelman and Carlin (2014).
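The Type-M (exaggeration) error of Gelman and Carlin (2014), i.e., the factor by which statistically significant estimates overstate the true effect, can be estimated with a similar simulation; the true effect and standard error below are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def type_m_error(true_effect, se, alpha=0.05, n_sims=100000):
    """Exaggeration ratio: among simulated effect estimates that reach
    two-sided significance, the mean |estimate| over |true effect|."""
    est = rng.normal(true_effect, se, size=n_sims)
    z_crit = norm.ppf(1.0 - alpha / 2.0)
    sig = np.abs(est) > z_crit * se           # two-sided z-test rejections
    return np.abs(est[sig]).mean() / abs(true_effect)

# An underpowered setting: true effect equal to its standard error
# (power is only about 0.17); significant estimates overshoot badly.
print(f"Type-M error ≈ {type_m_error(1.0, 1.0):.2f}")
```

This is the mechanism behind the claim above that, in small-sample comparisons, improvements found to be significant will tend to exaggerate the true effect.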