Sharpened Generalization Bounds based on Conditional Mutual Information and an Application to Noisy, Iterative Algorithms

NeurIPS 2020


Abstract

The information-theoretic framework of Russo and Zou (2016) and Xu and Raginsky (2017) provides bounds on the generalization error of a learning algorithm in terms of the mutual information between the algorithm's output and the training sample. In this work, we study the proposal, by Steinke and Zakynthinou (2020), to reason about t…

Introduction
  • Let D be an unknown distribution on a space Z, and let W be a set of parameters that index a set of predictors with a bounded loss function ℓ : Z × W → [0, 1] (the corresponding generalization error is written out after this list).
  • The authors study bounds on generalization error in terms of information-theoretic measures of dependence between the data and the output of the learning algorithm.
  • Bu, Zou, and Veeravalli (2019) obtain a tighter bound by replacing IOMI_D(A) with the mutual information between W and a single training data point.
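For concreteness, the quantities these bullets refer to can be written as follows; this is a standard restatement in this summary's notation (a training sample S of n i.i.d. draws from D and algorithm output W = A(S)), not a verbatim excerpt from the paper:

    \[
      L_D(w) = \mathbb{E}_{Z \sim D}\,\ell(Z, w),
      \qquad
      L_S(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(Z_i, w),
      \qquad
      \mathrm{gen}(D, A) = \mathbb{E}\bigl[L_D(W) - L_S(W)\bigr].
    \]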
Highlights
  • Let D be an unknown distribution on a space Z, and let W be a set of parameters that index a set of predictors with a bounded loss function ℓ : Z × W → [0, 1]
  • The basic result in this line of work is that the generalization error can be bounded in terms of the mutual information I(W; S) between the data and the learned parameter, a quantity that has been called the information usage or input–output mutual information of A with respect to D, which we denote by IOMI_D(A) (both this bound and the CMI-based analogue are restated after this list)
  • In Section 3, we establish two novel upper bounds on generalization error using the same index and super sample structure exploited by Steinke and Zakynthinou, and we show that both of our bounds are tighter than the bound based on CMI_D^k(A)
  • In Section 4, we provide a general recipe for constructing generalization error bounds for noisy, iterative algorithms using the generalization bound proposed in Section 3
  • Our main results (Theorems 2.1 and 2.2) show that for any learning algorithm and any data distribution, conditional mutual information provides a tighter measure of dependence than mutual information, and that one can recover the mutual-information–based bounds in the limit, at least for finite parameter spaces
  • We present two novel generalization bounds and show that they provide a tighter characterization of the generalization error compared to Theorem 1.3 by Steinke and Zakynthinou (2020)
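For reference, the two baseline bounds being sharpened take roughly the following form for a loss bounded in [0, 1]; the statements are paraphrased from the cited works, and the constants should be checked against the originals:

    \[
      \bigl|\mathbb{E}[\mathrm{gen}(D, A)]\bigr|
      \;\le\; \sqrt{\frac{I(W; S)}{2n}}
      \qquad \text{(Xu and Raginsky, 2017)},
    \]
    \[
      \bigl|\mathbb{E}[\mathrm{gen}(D, A)]\bigr|
      \;\le\; \sqrt{\frac{2\,\mathrm{CMI}_D(A)}{n}},
      \qquad
      \mathrm{CMI}_D(A) = I\bigl(W; U \mid Z^{(2)}\bigr)
      \qquad \text{(Steinke and Zakynthinou, 2020)},
    \]

where Z^{(2)} is the super sample and U are the indices that select the actual training sample from it.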
Results
  • The authors' main results (Theorems 2.1 and 2.2) show that for any learning algorithm and any data distribution, conditional mutual information provides a tighter measure of dependence than mutual information, and that one can recover the mutual-information–based bounds in the limit, at least for finite parameter spaces.
  • Theorem 3.1 bounds the expected generalization error in terms of the mutual information between the output parameter and a random subsequence of the indices U, given the super-sample.
  • In Theorem 3.4, the authors derive a generalization bound that is constructed in terms of the mutual information between each individual element of U and the output of the learning algorithm, W .
  • The authors present the following well-known result that allows one to bound mutual information by the expectation of the KL divergence of a conditional distribution (“posterior”) with respect to a “prior” (a generic statement of this fact follows this list).
  • Given another random element Z, it follows immediately by the disintegration theorem (Kallenberg, 2006, Thm. 6.4) that, for all Z-measurable random probability measures P on the same space as Y, I^Z(X; Y) ≤ E^Z[KL(P_{X,Z}[Y] ‖ P)] a.s., with a.s. equality for P = E^Z[P_{X,Z}[Y]] = P_Z[Y].
  • The authors demonstrate that the generalization bound in Theorem 3.1 can be upper bounded using KL(Q ‖ P), where the prior P has access to the information in the training set, i.e., S.
  • The KL divergence based on P can exploit the information in the training set to obtain tighter bounds on the mutual information.
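The “posterior/prior” step described in the last few bullets is an instance of the following standard variational fact, stated here in generic notation chosen for this summary:

    \[
      I(X; Y)
      \;=\; \mathbb{E}_{X}\bigl[\mathrm{KL}\bigl(P_{Y \mid X} \,\|\, P_Y\bigr)\bigr]
      \;\le\; \mathbb{E}_{X}\bigl[\mathrm{KL}\bigl(P_{Y \mid X} \,\|\, Q\bigr)\bigr]
      \quad \text{for every fixed probability measure } Q,
    \]

with equality if and only if Q equals the marginal P_Y; conditioning every term on the super sample yields the data-dependent-prior version used in the bullets above.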
Conclusion
  • The authors formally state the chain rule for KL divergence that is the main ingredient of the method to obtain generalization error bounds for iterative algorithms (a generic statement and a small numerical sketch follow this list).
  • For the case with m = 1, the authors provide a tighter bound compared to Eq. (43) by showing that one can pull the expectation over both U_{J^c} and J outside the concave square-root function.
  • As in prior work (Bu, Zou, and Veeravalli, 2019; Negrea et al., 2019; Li, Luo, and Qiao, 2020), by choosing a non-constant θ, the generalization bound exploits the optimization trajectory as well as the data to obtain a tighter bound.
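The chain rule referred to above decomposes the KL divergence between two processes over iterates W_1, …, W_T into per-step terms; in generic notation (this summary's, not necessarily the paper's):

    \[
      \mathrm{KL}\bigl(P_{W_{1:T}} \,\|\, Q_{W_{1:T}}\bigr)
      \;=\; \sum_{t=1}^{T}
      \mathbb{E}_{P}\Bigl[\mathrm{KL}\bigl(P_{W_t \mid W_{1:t-1}} \,\|\, Q_{W_t \mid W_{1:t-1}}\bigr)\Bigr].
    \]

When both processes add Gaussian noise with covariance σ²I, each per-step KL reduces to ‖μ_t − ν_t‖²/(2σ²), the squared distance between the two conditional means. The following sketch only illustrates this accumulation under assumptions made here for the example (a least-squares objective, an SGLD-style update, and a plain weight-decay drift for the data-free process); it is not the paper's construction or its actual bound:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, T = 100, 5, 200          # sample size, dimension, number of iterations
    eta, sigma = 0.05, 0.1         # step size and noise scale of the update

    # Synthetic least-squares data (an assumption of this sketch).
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    def data_drift(w):
        # Gradient of the average squared loss on the training set ("posterior" drift).
        return X.T @ (X @ w - y) / n

    def prior_drift(w):
        # Data-free surrogate drift (weight decay); a hypothetical choice for
        # illustration, not the prior constructed in the paper.
        return w

    w = np.zeros(d)
    kl_sum = 0.0
    for t in range(T):
        mu_post = w - eta * data_drift(w)    # conditional mean under the training process
        mu_prior = w - eta * prior_drift(w)  # conditional mean under the data-free process
        # KL between two Gaussians with identical covariance sigma^2 * I.
        kl_sum += np.sum((mu_post - mu_prior) ** 2) / (2 * sigma ** 2)
        # Advance the actual (training) dynamics; kl_sum is therefore a
        # single-trajectory Monte Carlo estimate of the trajectory-level KL.
        w = mu_post + sigma * rng.normal(size=d)

    print("accumulated per-step KL:", kl_sum)
    # Plugging such a KL estimate into a square-root expression of the form
    # sqrt(KL / (2n)) mimics the shape of the bounds above; shown only to
    # illustrate the plumbing, since the squared loss here is not in [0, 1].
    print("illustrative square-root term:", np.sqrt(kl_sum / (2 * n)))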
Funding
  • JN is supported by an NSERC Vanier Canada Graduate Scholarship, and by the Vector Institute
  • DMR is supported by an NSERC Discovery Grant and an Ontario Early Researcher Award
References
  • Han, T. S. (1978). “Nonnegative entropy measures of multivariate symmetric correlations”. Information and Control 36, pp. 133–156.
  • Gelfand, S. B. and S. K. Mitter (1991). “Recursive stochastic algorithms for global optimization in R^d”. SIAM Journal on Control and Optimization 29.5, pp. 999–1018.
  • Kallenberg, O. (2006). Foundations of modern probability. Springer Science & Business Media.
  • Boucheron, S., G. Lugosi, and P. Massart (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
  • Russo, D. and J. Zou (2015). How much does your data exploration overfit? Controlling bias via information usage. arXiv: 1511.05219.
  • Raginsky, M., A. Rakhlin, M. Tsao, Y. Wu, and A. Xu (2016). “Information-theoretic analysis of stability and bias of learning algorithms”. In: 2016 IEEE Information Theory Workshop (ITW). IEEE, pp. 26–30.
  • Russo, D. and J. Zou (2016). “Controlling Bias in Adaptive Data Analysis Using Information Theory”. In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics. Ed. by A. Gretton and C. C. Robert. Vol. 51. Proceedings of Machine Learning Research. Cadiz, Spain: PMLR, pp. 1232–1240.
  • Jiao, J., Y. Han, and T. Weissman (2017). “Dependence measures bounding the exploration bias for general measurements”. In: IEEE International Symposium on Information Theory.
  • Shokri, R., M. Stronati, C. Song, and V. Shmatikov (2017). “Membership inference attacks against machine learning models”. In: 2017 IEEE Symposium on Security and Privacy (SP). IEEE, pp. 3–18.
  • Xu, A. and M. Raginsky (2017). “Information-theoretic analysis of generalization capability of learning algorithms”. In: Advances in Neural Information Processing Systems, pp. 2524–2533.
  • Lopez, A. and V. Jog (2018). “Generalization error bounds using Wasserstein distances”. In: IEEE Information Theory Workshop.
  • Asadi, A., E. Abbe, and S. Verdú (2018). “Chaining mutual information and tightening generalization bounds”. In: Advances in Neural Information Processing Systems, pp. 7234–7243.
  • Bassily, R., S. Moran, I. Nachum, J. Shafer, and A. Yehudayoff (2018). “Learners that Use Little Information”. In: Algorithmic Learning Theory, pp. 25–55.
  • Pensia, A., V. Jog, and P.-L. Loh (2018). “Generalization error bounds for noisy, iterative algorithms”. In: 2018 IEEE International Symposium on Information Theory (ISIT), pp. 546–550.
  • Bu, Y., S. Zou, and V. V. Veeravalli (2019). “Tightening mutual information based bounds on generalization error”. In: 2019 IEEE International Symposium on Information Theory (ISIT). IEEE, pp. 587–591.
  • Durrett, R. (2019). Probability: theory and examples. Vol. 49. Cambridge University Press.
  • Negrea, J., M. Haghifam, G. K. Dziugaite, A. Khisti, and D. M. Roy (2019). “Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates”. In: Advances in Neural Information Processing Systems, pp. 11013–11023.
  • Li, J., X. Luo, and M. Qiao (2020). “On Generalization Error Bounds of Noisy Gradient Methods for Non-Convex Learning”. In: International Conference on Learning Representations.
  • Steinke, T. and L. Zakynthinou (2020). “Reasoning About Generalization via Conditional Mutual Information”. arXiv: 2001.09122.
  • Some interpretation of our result is helpful. Consider an adversary who has access to the supersample Z^{(k)} and wishes to identify the training set that was used for training after observing the output W of the learning algorithm. Our result here shows that the CMI upper bounds the success probability of every such adversary. Also, recall that the CMI upper bounds the expected generalization error. In the literature on data privacy in machine learning, this problem is known as a membership inference attack (Shokri et al., 2017), and it has been empirically observed that a machine learning model leaks information about its training set when the generalization error is large (Shokri et al., 2017). Our result in this section provides a formal connection between generalization and this specific membership attack problem.
  • For any two random measures P(Z^{(2)}, U_{J^c}, J) and Q(Z^{(2)}, U) on W, the Donsker–Varadhan variational formula (Boucheron, Lugosi, and Massart, 2013, Prop. 4.15) and the disintegration theorem (Kallenberg, 2006, Thm. 6.4) give that, with probability one, …
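For reference, the Donsker–Varadhan formula itself can be stated generically (in notation chosen for this summary) as

    \[
      \mathrm{KL}(P \,\|\, Q)
      \;=\; \sup_{f}\,\Bigl\{ \mathbb{E}_{P}[f(X)] - \log \mathbb{E}_{Q}\bigl[e^{f(X)}\bigr] \Bigr\},
    \]

where the supremum ranges over measurable functions f for which both expectations exist; in generalization arguments, f is typically taken proportional to a (centered) loss difference so that the log-moment-generating-function term can be controlled via boundedness or subgaussianity.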