Cross-validation Confidence Intervals for Test Error

NeurIPS 2020

Abstract

This work develops central limit theorems for cross-validation and consistent estimators of its asymptotic variance under weak stability conditions on the learning algorithm. Together, these results provide practical, asymptotically-exact confidence intervals for $k$-fold test error and valid, powerful hypothesis tests of whether one learning algorithm has smaller $k$-fold test error than another.
Introduction
  • Cross-validation (CV) [48, 25] is a de facto standard for estimating the test error of a prediction rule.
  • To meet these needs, the authors characterize the asymptotic distribution of CV error and develop consistent estimates of its variance under weak stability conditions on the learning algorithm.
  • The $L_2$ asymptotic linearity condition (2.2) holds with $\bar h_n(z) = \mathbb{E}[h_n(z, Z_{1:n(1-1/k)})]$ if the loss stability satisfies $\gamma_{\mathrm{loss}}(h_n) = o(\sigma_n^2/n)$; a sketch of the shape of this condition is given below.
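The following display is a minimal sketch of what such an asymptotic linearity condition looks like, assuming the paper's notation with $\hat R_n$ the $k$-fold CV error, $R_n$ the $k$-fold test error, and $\sigma_n$ the paper's variance scale; the exact statement and constants are those of Eq. (2.2) in the paper.

\[
\frac{\sqrt{n}}{\sigma_n}\left(\hat R_n - R_n - \frac{1}{n}\sum_{i=1}^{n}\bigl(\bar h_n(Z_i) - \mathbb{E}[\bar h_n(Z_i)]\bigr)\right) \xrightarrow{L_2} 0,
\qquad
\bar h_n(z) = \mathbb{E}\bigl[h_n(z, Z_{1:n(1-1/k)})\bigr].
\]

In words, once an average of $n$ i.i.d. terms $\bar h_n(Z_i)$ is subtracted, the centered CV error is negligible at the $\sigma_n/\sqrt{n}$ scale, which is what lets a standard central limit theorem apply despite the dependence across folds.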
Highlights
  • Cross-validation (CV) [48, 25] is a de facto standard for estimating the test error of a prediction rule
  • By partitioning a dataset into k equal-sized validation sets, fitting a prediction rule with each validation set held out, evaluating each prediction rule on its corresponding held-out set, and averaging the k error estimates, CV produces an unbiased estimate of the test error with lower variance than a single train-validation split could provide (a code sketch of this procedure and the resulting CLT-based interval appears after this list).
  • In predictive cancer prognosis and mortality prediction for instance, scientists and clinicians rely on test error confidence intervals based on CV and other repeated sample splitting estimators to avoid spurious findings and improve reproducibility [41, 44]
  • The difficulty comes from the dependence across the k averaged error estimates: if the estimates were independent, one could derive an asymptotically exact confidence interval for test error using a standard central limit theorem
  • We prove in Section 2 that k-fold CV error is asymptotically normal around its test error under an abstract asymptotic linearity condition
  • In Appendix G, we present a simple learning task for which our central limit theorem provably holds with $\sigma_n^2$ converging to a non-zero constant, but the central limit theorem in [5, Eq. (15)] is inapplicable because its variance parameter is infinite.
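To make the estimator and interval described above concrete, here is a short Python sketch (not the authors' implementation): it pools the n held-out losses from k-fold CV, averages them for the point estimate, and studentizes by their sample standard deviation to form a CLT-style interval. The learner, the zero-one loss, and the function name are illustrative assumptions.

import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def kfold_cv_error_ci(X, y, k=10, alpha=0.05, seed=0):
    """k-fold CV error with a CLT-style confidence interval (illustrative sketch).

    Pools the n held-out zero-one losses, averages them to get the CV error
    r_hat, and forms r_hat +/- z_{1-alpha/2} * s / sqrt(n), where s is the
    sample standard deviation of the pooled held-out losses.
    """
    n = len(y)
    losses = np.empty(n)
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        # Fit with the validation fold held out, then record per-example losses on it.
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        losses[val_idx] = (model.predict(X[val_idx]) != y[val_idx]).astype(float)
    r_hat = losses.mean()                     # k-fold CV estimate of test error
    s = losses.std(ddof=1)                    # simple pooled variance estimate
    half_width = stats.norm.ppf(1 - alpha / 2) * s / np.sqrt(n)
    return r_hat, (r_hat - half_width, r_hat + half_width)

For example, r_hat, ci = kfold_cv_error_ci(X, y) on a labeled NumPy dataset (X, y) returns the point estimate and a two-sided 95% interval; the paper's theorems spell out the stability conditions under which intervals of this shape are asymptotically exact.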
Results
  • In Appendix G, the authors detail a simple learning problem in which the loss stability is infinite but Theorems 1 and 3 together provide a valid CLT with convergent variance $\sigma_n^2$.
  • When k is constant, as in 10-fold CV, the conditional variance assumptions in Section 3.2 are weaker still and hold even for algorithms with infinite loss stability.
  • The authors' results allow for asymmetric learning algorithms, accommodate growing, vanishing, and non-convergent variance parameters $\sigma_n^2$, and do not require the second-order mean-square stability condition (3.6).
  • A primary application of the central limit theorems is the construction of asymptotically-exact confidence intervals (CIs) for the unknown k-fold test error.
  • The CLT of Austern and Zhou [5, Thm. 1] requires $h_n$ to be symmetric in its training points, convergence of the variance parameter $\sigma_n^2$ (3.5) to a non-zero constant, control over a fourth-moment analogue of mean-square stability of order $o(\sigma_n^4/n^2)$ rather than the smaller fourth-moment loss stability $\gamma_4(h_n)$, and the more restrictive fourth-moment condition $\mathbb{E}[(h_n(Z_0, Z_{1:m})/\sigma_n)^4] = O(1)$. By Proposition 2, their assumptions further imply that $\sigma_n^2$ converges to a non-zero constant.
  • This mean-square stability condition is especially mild when $k = \Omega(n)$ and ensures that two training sets differing in only $n/k$ points produce prediction rules with comparable test losses.
  • The authors compare the test error confidence intervals (4.1) and tests for algorithm improvement (4.2) with the most popular alternatives from the literature: the hold-out test described in [5, Eq. (17)] based on a single train-validation split, the cross-validated t-test of [21], the repeated train-validation t-test of [42], and the 5 × 2-fold CV test of [21]. These procedures are commonly used and admit both two-sided CIs and one-sided tests, but, unlike the authors' proposals, none except the hold-out method is known to be valid (a sketch of such an improvement test follows this list).
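To illustrate the kind of test for algorithm improvement referenced in (4.2), the following hedged sketch trains both algorithms on the same folds and applies a one-sided normal test to the pooled per-example loss differences; the helper names fit_a, fit_b, and per_example_loss are placeholders, not part of the paper's code.

import numpy as np
from scipy import stats
from sklearn.model_selection import KFold

def cv_improvement_test(X, y, fit_a, fit_b, per_example_loss, k=10, seed=0):
    """One-sided test of whether algorithm A has smaller k-fold test error than B
    (illustrative sketch, not the authors' code).

    Both algorithms are refit on the same k training folds; the pooled per-example
    held-out loss differences d_i = loss_A(i) - loss_B(i) are studentized as
    sqrt(n) * mean(d) / std(d) and compared with a standard normal reference.
    """
    n = len(y)
    diffs = np.empty(n)
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        model_a = fit_a(X[train_idx], y[train_idx])
        model_b = fit_b(X[train_idx], y[train_idx])
        diffs[val_idx] = (per_example_loss(model_a, X[val_idx], y[val_idx])
                          - per_example_loss(model_b, X[val_idx], y[val_idx]))
    t_stat = np.sqrt(n) * diffs.mean() / diffs.std(ddof=1)
    p_value = stats.norm.cdf(t_stat)  # small p-value: evidence that A improves on B
    return t_stat, p_value

Here fit_a and fit_b are callables that return fitted models and per_example_loss returns a vector of held-out losses, so any pair of learners and any bounded loss can be plugged in; a small p-value is evidence that algorithm A has smaller k-fold test error than B.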
Conclusion
  • Confidence intervals for test error (Section 5.1): in Appendix L.1, the authors compare the coverage and width of each procedure's 95% CI for each of the described algorithms, datasets, and training set sizes.
  • The authors' central limit theorems and consistent variance estimators provide new, valid tools for testing algorithm improvement and generating test error intervals under algorithmic stability.
  • Another promising direction for future work is developing analogous tools for the expected test error $\mathbb{E}[R_n]$ instead of the k-fold test error $R_n$; Austern and Zhou [5] make significant progress in this direction, but more work on variance estimation is needed.
Related Work
  • Despite the ubiquity of CV, we are aware of only three prior efforts to characterize the precise distribution of cross-validation error. The cross-validation CLT of Dudoit and van der Laan [22] requires considerably stronger assumptions than our own and is not paired with the consistent estimate of variance needed to construct a valid confidence interval or test. LeDell et al. [35] derive both a CLT and a consistent estimate of variance for CV, but these apply only to the area under the ROC curve (AUC) performance measure. Finally, in very recent work, Austern and Zhou [5] derive a CLT and a consistent estimate of variance for CV under more stringent assumptions than our own. We compare our results with each of these works in detail in Section 3.3. We note also that another work [36] aims to test the difference in test error between two learning algorithms using cross-validation but only proves the validity of its procedure for a single train-validation split rather than for CV.
References
  • (2015). FlightDelays dataset. https://www.kaggle.com/usdot/flight-delays.
  • Abou-Moustafa, K. and Szepesvari, C. (2019a). An exponential tail bound for Lq stable learning rules. In Garivier, A. and Kale, S., editors, Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pages 31–63, Chicago, Illinois. PMLR.
  • Abou-Moustafa, K. and Szepesvari, C. (2019b). An exponential tail bound for the deleted estimate. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, pages 42–50.
  • Arsov, N., Pavlovski, M., and Kocarev, L. (2019). Stability of decision trees and logistic regression. arXiv preprint arXiv:1903.00816v1.
  • Austern, M. and Zhou, W. (2020). Asymptotics of Cross-Validation. arXiv preprint arXiv:2001.11111v2.
  • Baldi, P., Sadowski, P., and Whiteson, D. (2014a). Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5.
  • Baldi, P., Sadowski, P., and Whiteson, D. (2014b). Higgs dataset. https://archive.ics.uci.edu/ml/datasets/HIGGS.
  • Beirami, A., Razaviyayn, M., Shahrampour, S., and Tarokh, V. (2017). On optimal generalizability in parametric learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 3455–3465, Red Hook, NY, USA. Curran Associates Inc.
  • Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5:1089–1105.
  • Billingsley, P. (1995). Probability and Measure, Third Edition.
  • Blum, A., Kalai, A., and Langford, J. (1999). Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proc. COLT, pages 203–208.
  • Boucheron, S., Bousquet, O., Lugosi, G., and Massart, P. (2005). Moment inequalities for functions of independent random variables. Annals of Probability, 33(2):514–560.
  • Bouckaert, R. R. and Frank, E. (2004). Evaluating the replicability of significance tests for comparing learning algorithms. In PAKDD, pages 3–12. Springer.
  • Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2:499–526.
  • Brown, L. D., Cai, T. T., and DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science, 16(2):101–133.
  • Celisse, A. and Guedj, B. (2016). Stability revisited: new generalisation bounds for the Leave-One-Out. arXiv preprint arXiv:1608.06412v1.
  • Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA. Association for Computing Machinery.
  • Demsar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30.
  • Devroye, L. and Wagner, T. (1979a). Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202–207.
  • Devroye, L. and Wagner, T. (1979b). Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604.
  • Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923.
  • Dudoit, S. and van der Laan, M. J. (2005). Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Statistical Methodology, 2(2):131–154.
  • Durrett, R. (2019). Probability: Theory and Examples, Version 5.
  • Elisseeff, A., Evgeniou, T., and Pontil, M. (2005). Stability of randomized learning algorithms. Journal of Machine Learning Research, 6:55–79.
  • Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70(350):320–328.
  • Ghosh, S., Stephenson, W. T., Nguyen, T. D., Deshpande, S. K., and Broderick, T. (2020). Approximate Cross-Validation for Structured Models. arXiv preprint arXiv:2006.12669v1.
  • Giordano, R., Stephenson, W., Liu, R., Jordan, M., and Broderick, T. (2019). A swiss army infinitesimal jackknife. In Chaudhuri, K. and Sugiyama, M., editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 1139–1147. PMLR.
  • Hardt, M., Recht, B., and Singer, Y. (2016). Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pages 1225–1234. JMLR.org.
  • Jiang, W., Varma, S., and Simon, R. (2008). Calculating Confidence Intervals for Prediction Error in Microarray Classification Using Resampling. Statistical Applications in Genetics and Molecular Biology, 7(1).
  • Kale, S., Kumar, R., and Vassilvitskii, S. (2011). Cross-validation and mean-square stability. In Proceedings of the Second Symposium on Innovations in Computer Science (ICS2011). Citeseer.
  • Kearns, M. and Ron, D. (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453.
  • Koh, P. W., Ang, K.-S., Teo, H. H. K., and Liang, P. (2019). On the Accuracy of Influence Functions for Measuring Group Effects. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS19, pages 5254–5264.
  • Kumar, R., Lokshtanov, D., Vassilvitskii, S., and Vattani, A. (2013). Near-optimal bounds for cross-validation via loss stability. In International Conference on Machine Learning, pages 27–35.
  • Kutin, S. and Niyogi, P. (2002). Almost-everywhere algorithmic stability and generalization error. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, UAI '02, pages 275–282, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • LeDell, E., Petersen, M., and van der Laan, M. (2015). Computationally efficient confidence intervals for cross-validated area under the ROC curve estimates. Electronic Journal of Statistics, 9(1):1583–1607.
  • Lei, J. (2019). Cross-validation with confidence. Journal of the American Statistical Association, pages 1–20.
  • Lim, T.-S., Loh, W.-Y., and Shih, Y.-S. (2000). A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40(3):203–228.
  • Markatou, M., Tian, H., Biswas, S., and Hripcsak, G. (2005). Analysis of variance of cross-validation estimators of the generalization error. Journal of Machine Learning Research, 6:1127–1168.
  • McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12:153–157.
  • Meyer, P.-A. (1966). Probability and Potentials. Blaisdell Publishing Co, N.Y.
  • Michiels, S., Koscielny, S., and Hill, C. (2005). Prediction of cancer outcome with microarrays: a multiple random validation strategy. The Lancet, 365(9458):488–492.
  • Nadeau, C. and Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52(3):239–281.
  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • Pirracchio, R., Petersen, M. L., Carone, M., Rigon, M. R., Chevret, S., and van der Laan, M. J. (2015). Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study. The Lancet Respiratory Medicine, 3(1):42–52.
  • Rad, K. R. and Maleki, A. (2020). A scalable estimate of the out-of-sample prediction error via approximate leave-one-out cross-validation. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
  • Steele, J. M. (1986). An Efron-Stein inequality for nonsymmetric statistics. Annals of Statistics, 14(2):753–758.
  • Stephenson, W. and Broderick, T. (2020). Approximate cross-validation in high dimensions with guarantees. In Chiappa, S. and Calandra, R., editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 2424–2434, Online. PMLR.
  • Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):111–147.
  • Wilson, A., Kasy, M., and Mackey, L. (2020). Approximate cross-validation: Guarantees for model assessment and selection. In Chiappa, S. and Calandra, R., editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 4530–4540, Online. PMLR.
  • Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158):209–212.
Procedures compared in the experiments:
  • 1. The authors' 10-fold CV CLT-based test, with $\sigma_n$ being either $\sigma_{n,\mathrm{in}}$ (Theorem 4) or $\sigma_{n,\mathrm{out}}$ (Theorem 5). The curve for $\sigma_{n,\mathrm{in}}$ is not displayed in the plots since its results are almost identical to those for $\sigma_{n,\mathrm{out}}$ and the curves overlap.
  • 2. Hold-out test described, for instance, in Austern and Zhou [5, Eq. (17)].
  • 3. Cross-validated t-test of Dietterich [21], 10 folds.
  • 4. Repeated train-validation t-test of Nadeau and Bengio [42], 10 repetitions of 90-10 train-validation splits.
  • 5. Corrected repeated train-validation t-test of Nadeau and Bengio [42], 10 repetitions of 90-10 train-validation splits.