PAC-Bayesian Bound for the Conditional Value at Risk

NeurIPS 2020


Abstract:

Conditional Value at Risk (CVaR) is a family of "coherent risk measures" which generalize the traditional mathematical expectation. Widely used in mathematical finance, it is garnering increasing interest in machine learning, e.g., as an alternate approach to regularization, and as a means for ensuring fairness. This paper presents a generalization bound for learning algorithms that minimize the CVaR of the empirical loss. The bound is of PAC-Bayesian type and is guaranteed to be small when the empirical CVaR is small. This is achieved by reducing the problem of estimating CVaR to that of merely estimating an expectation, which, as a by-product, also yields concentration inequalities for CVaR even when the random variable in question is unbounded.

Introduction
  • The goal in statistical learning is to learn hypotheses that generalize well, which is typically formalized by seeking to minimize the expected risk associated with a given loss function.
  • When instantiated with bounded random variables, existing concentration inequalities for CVaR have sub-optimal dependence on the level α.
  • As a by-product of the analysis, the authors derive a new way of obtaining concentration bounds for the conditional value at risk by reducing the problem to estimating expectations using empirical means.
Highlights
  • The goal in statistical learning is to learn hypotheses that generalize well, which is typically formalized by seeking to minimize the expected risk associated with a given loss function
  • In Section 3, we recall the statistical learning setting and present our PAC-Bayesian bound for Conditional Value at Risk (CVaR)
  • Let X be an arbitrary set, and f ∶ X → R be some fixed measurable function
  • We derived a first PAC-Bayesian bound for CVaR by reducing the task of estimating CVaR to that of merely estimating an expectation (a NumPy sketch of the empirical CVaR estimator appears after this list)
  • We note that the only steps in the proof of our main bound in Theorem 1 that are specific to CVAR are Lemmas 2 and 3, and so the question is whether our overall approach can be extended to other coherent risk measures to achieve (4)
  • In Appendix B, we discuss how our results may be extended to a rich class of coherent risk measures known as φ-entropic risk measures
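  • To make the estimated quantity concrete, here is a minimal NumPy sketch (our own illustration, not the paper's code; the function names and the Gaussian test sample are assumptions) of the empirical CVaR at level α, computed both as the average of the ⌈nα⌉ largest sample points and via the Rockafellar–Uryasev variational formula Cα[Z] = inf_c {c + E[(Z − c)+]/α} with the expectation replaced by an empirical mean:

    import numpy as np

    def empirical_cvar(z, alpha):
        # Average of the ceil(n * alpha) largest order statistics.
        z = np.sort(np.asarray(z))[::-1]
        k = int(np.ceil(len(z) * alpha))
        return z[:k].mean()

    def empirical_cvar_variational(z, alpha):
        # Rockafellar-Uryasev form: min over c of c + mean((z - c)_+) / alpha.
        # The objective is piecewise linear and convex in c, so the minimum
        # over all reals is attained at a sample point.
        z = np.asarray(z)
        return min(c + np.maximum(z - c, 0).mean() / alpha for c in z)

    rng = np.random.default_rng(0)
    z = rng.normal(size=2_000)
    alpha = 0.05
    print(empirical_cvar(z, alpha))              # ~2.06 for N(0, 1) at alpha = 0.05
    print(empirical_cvar_variational(z, alpha))  # agrees up to sample discreteness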
Results
  • Even though not explicitly done before, a PAC-Bayesian bound of the form (4) can be derived for a risk measure R using an existing technique due to McAllester [2003], as soon as, for any fixed hypothesis h, the difference R[l(h, X)] − R̂[l(h, X)] between the risk and its empirical counterpart is sub-exponential with a sufficiently fast tail decay as a function of the sample size n.
  • This type of improved PAC-Bayesian bound, where the empirical error appears multiplying the complexity term inside the square-root, has been derived for the expected risk in works such as [Catoni, 2007, Langford and Shawe-Taylor, 2003, Maurer, 2004, Seeger, 2002]; these are arguably the state-of-the-art generalization bounds.
  • In Subsection 4.2, the authors introduce an auxiliary random variable Y whose expectation equals Cα[Z] (as in (8)) and whose empirical mean is bounded from above by the estimator Ĉα[Z] introduced in Subsection 4.1; this enables the reduction described at the end of Section 3.
  • Note that in Lemma 9 the authors assume that Z is a zero-mean random variable, so some additional work is still needed to derive a concentration inequality for Ĉα[Z].
  • When Z is a σ-sub-Gaussian random variable with σ > 0, an immediate consequence of Theorem 10 is that, by setting t = 2σ² ln(2/δ) in (25), the authors obtain a bound on the deviation Cα[Z] − Ĉα[Z] that holds with probability at least 1 − 2δ (a small simulation illustrating this type of concentration follows this list).
  • Generalization bounds of the form (4) for unbounded but sub-Gaussian or sub-exponential l(h, X), h ∈ H, can be obtained using the PAC-Bayesian analysis of [McAllester, 2003, Theorem 1] and the concentration inequalities in Theorem 10.
  • The authors conjecture that the dependence on α in the concentration bounds of Theorem 10 can be improved by swapping α² for α in the argument of the exponentials; in the sub-Gaussian case, this would move α inside the square-root on the RHS of (26).
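  • As a quick sanity check on this kind of statement, the following self-contained simulation (our own sketch, not from the paper) records the deviation of Ĉα[Z] from Cα[Z] for a standard normal (hence sub-Gaussian) Z; the observed deviations shrink at roughly the familiar 1/√n rate and grow as α decreases, consistent with the α-dependence discussed above:

    import numpy as np
    from scipy.stats import norm

    def true_cvar_gaussian(alpha):
        # For Z ~ N(0, 1): C_alpha[Z] = pdf(q) / alpha, with q the (1 - alpha)-quantile.
        q = norm.ppf(1 - alpha)
        return norm.pdf(q) / alpha

    def empirical_cvar(z, alpha):
        # Average of the ceil(n * alpha) largest sample points.
        z = np.sort(z)[::-1]
        return z[: int(np.ceil(len(z) * alpha))].mean()

    rng = np.random.default_rng(0)
    alpha, trials = 0.05, 500
    for n in (500, 2000, 8000):
        devs = [abs(empirical_cvar(rng.normal(size=n), alpha) - true_cvar_gaussian(alpha))
                for _ in range(trials)]
        print(f"n={n:5d}  95th-percentile |Chat - C|: {np.quantile(devs, 0.95):.3f}")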
Conclusion
  • The inequality in (28) essentially replaces the range of the random variable Z typically present under the square-root in other concentration bounds [Brown, 2007, Wang and Gao, 2010] by the smaller quantity Cα[Z].
  • The authors note that the only steps in the proof of the main bound in Theorem 1 that are specific to CVAR are Lemmas 2 and 3, and so the question is whether the overall approach can be extended to other coherent risk measures to achieve (4).
  • These CRMs are often used in the context of robust optimization [Namkoong and Duchi, 2017], and are perfect candidates to consider in the context of this paper (a sketch of one such measure, the entropic value-at-risk, follows this list).
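  • One concrete member of the φ-entropic family discussed in Appendix B is, we believe, the entropic value-at-risk (EVaR) of Ahmadi-Javid [2012]. The sketch below is our own illustration, under the convention EVaRα[Z] = inf_{t>0} (1/t) ln(E[e^{tZ}]/α), which should be checked against the cited paper; it estimates EVaR from a sample by a plug-in grid search over t:

    import numpy as np

    def empirical_evar(z, alpha, t_grid=None):
        # Plug-in estimate of EVaR_alpha[Z] = inf_{t > 0} (1/t) * log(E[exp(t Z)] / alpha),
        # with the moment generating function replaced by an empirical mean.
        # Caution: the empirical MGF is unreliable at large t; this is only a sketch.
        if t_grid is None:
            t_grid = np.logspace(-3, 1, 200)
        z = np.asarray(z)
        return min(np.log(np.exp(t * z).mean() / alpha) / t for t in t_grid)

    rng = np.random.default_rng(0)
    z = rng.normal(size=100_000)
    # For N(0, 1) the closed form is sqrt(2 * ln(1 / alpha)) ~ 2.45 at alpha = 0.05,
    # which upper-bounds CVaR_alpha ~ 2.06, as expected of this larger risk measure.
    print(empirical_evar(z, alpha=0.05))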
Related work
  • Deviation bounds for CVaR were first presented by Brown [2007]. However, their approach only applies to bounded continuous random variables, and their lower deviation bound has a sub-optimal dependence on the level α. Wang and Gao [2010] later refined this analysis to recover the "correct" dependence on α, although their technique still requires a two-sided bound on the random variable Z. Thomas and Learned-Miller [2019] derived new concentration inequalities for CVaR with very sharp empirical performance, even though the dependence on α in their bound is sub-optimal; further, they only require a one-sided bound on Z, without a continuity assumption.

    Kolla et al. [2019] were the first to provide concentration bounds for CVaR when the random variable Z is unbounded but either sub-Gaussian or sub-exponential. Bhat and Prashanth [2019] used a bound on the Wasserstein distance between the true and empirical cumulative distribution functions to substantially tighten the bounds of Kolla et al. [2019] when Z has finite exponential or k-th-order moments; they also apply their results to other coherent risk measures. However, when instantiated with bounded random variables, their concentration inequalities have sub-optimal dependence on α.
Derivation of Theorem 7
  • In addition to Zh in (20), define Yh ∶= l(h, X) ⋅ E[q⋆ ∣ X] and Yh,i ∶= l(h, Xi) ⋅ E[q⋆ ∣ Xi] for i ∈ [n]. Then, setting γ = ηn and Rh ∶= EP[Yh] − (1/n) ∑i∈[n] Yh,i − η κ(η/α) Ĉα[Zh]/α, Lemma 5 guarantees that E[exp(γ Rh)] ≤ 1, and so Lemma 6 yields Theorem 7, which holds for α, δ ∈ (0, 1) and η ∈ [0, α] (a numerical sketch of this tail reweighting follows).
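  • As a rough numerical illustration of this reweighting device (our own sketch; we take the dual weight to be q⋆ = 1{Z ≥ VaRα[Z]}/α for a continuous Z, which matches the construction above only in spirit), note that Y = Z ⋅ q⋆(Z) satisfies E[Y] = Cα[Z], so estimating CVaR reduces to estimating an ordinary expectation:

    import numpy as np
    from scipy.stats import norm

    alpha = 0.05
    var_level = norm.ppf(1 - alpha)          # VaR_alpha for Z ~ N(0, 1)
    true_cvar = norm.pdf(var_level) / alpha  # closed-form CVaR, ~2.0627

    rng = np.random.default_rng(1)
    z = rng.normal(size=1_000_000)

    # The dual weight puts mass 1/alpha on the upper alpha-tail and 0 elsewhere,
    # so the reweighted variable Y has expectation C_alpha[Z].
    y = z * (z >= var_level) / alpha

    print(f"mean(Y) = {y.mean():.4f}   true C_alpha = {true_cvar:.4f}")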

References
  • Karim T. Abou-Moustafa and Csaba Szepesvári. An exponential tail bound for lq stable learning rules. In Aurélien Garivier and Satyen Kale, editors, Algorithmic Learning Theory, ALT 2019, 22-24 March 2019, Chicago, Illinois, USA, volume 98 of Proceedings of Machine Learning Research, pages 31–63. PMLR, 2019. URL http://proceedings.mlr.press/v98/abou-moustafa19a.html.
  • Amir Ahmadi-Javid. Entropic value-at-risk: A new coherent risk measure. Journal of Optimization Theory and Applications, 155(3):1105–1123, 2012.
  • Maurice Allais. Le comportement de l'homme rationnel devant le risque: critique des postulats et axiomes de l'école américaine. Econometrica: Journal of the Econometric Society, pages 503–546, 1953.
  • Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.
  • Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
  • Sanjay P. Bhat and L.A. Prashanth. Concentration of risk measures: A Wasserstein distance approach. In Advances in Neural Information Processing Systems, pages 11739–11748, 2019.
  • Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
  • Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
  • Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. CoRR, abs/1910.07833, 2019. URL http://arxiv.org/abs/1910.07833.
  • David B. Brown. Large deviations bounds for estimating conditional value-at-risk. Operations Research Letters, 35(6):722–730, 2007.
  • Olivier Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Lecture Notes-Monograph Series. IMS, 2007.
  • Alain Celisse and Benjamin Guedj. Stability revisited: new generalisation bounds for the leave-one-out. arXiv preprint arXiv:1608.06412, 2016.
  • Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • Youhua (Frank) Chen, Minghui Xu, and Zhe George Zhang. A risk-averse newsvendor model under the CVaR criterion. Operations Research, 57(4):1040–1044, 2009. URL http://www.jstor.org/stable/25614814.
  • Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3509–3517. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5246-algorithms-for-cvar-optimization-in-mdps.pdf.
  • I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3:146–158, 1975.
  • I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3:146–158, 1975.
  • M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time, III. Communications on Pure and Applied Mathematics, 29(4):389–461, 1976.
  • John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.
  • Daniel Ellsberg. Risk, ambiguity, and the Savage axioms. The Quarterly Journal of Economics, pages 643–669, 1961.
  • Benjamin Guedj. A primer on PAC-Bayesian learning. arXiv preprint arXiv:1901.05353, 2019.
  • Xiaoguang Huo and Feng Fu. Risk-aware multi-armed bandit problem with application to portfolio selection. Royal Society Open Science, 4(11), 2017.
  • Ravi Kumar Kolla, L.A. Prashanth, Sanjay P. Bhat, and Krishna Jagannathan. Concentration bounds for empirical conditional value-at-risk: The unbounded case. Operations Research Letters, 47(1):16–20, 2019.
  • John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems, pages 439–446, 2003.
  • Andreas Maurer. A note on the PAC-Bayesian theorem. arXiv preprint cs/0411099, 2004.
  • Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings of COLT 2009, 2009.
  • David A. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
  • Zakaria Mhammedi, Peter Grünwald, and Benjamin Guedj. PAC-Bayes Un-Expected Bernstein Inequality. In Advances in Neural Information Processing Systems, pages 12180–12191, 2019.
  • Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning, pages 799–806, 2010.
  • Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.
  • Georg Ch. Pflug. Some remarks on the value-at-risk and the conditional value-at-risk. In Probabilistic Constrained Optimization: Methodology and Applications. Springer US, Boston, MA, 2000. ISBN 978-1-4757-3150-7. doi: 10.1007/978-1-4757-3150-7_15. URL https://doi.org/10.1007/978-1-4757-3150-7_15.
  • Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 2817–2826. JMLR.org, 2017.
  • L.A. Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, pages 252–260, 2013.
  • R. Tyrrell Rockafellar and Stan Uryasev. The fundamental risk quadrangle in risk management, optimization and statistical estimation. Surveys in Operations Research and Management Science, 18(1-2):33–53, 2013.
  • R. Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2(3):21–41, 2000.
  • Matthias Seeger. PAC-Bayesian generalization error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233–269, 2002.
  • Akiko Takeda and Takafumi Kanamori. A robust approach based on conditional value-at-risk measure to statistical learning problems. European Journal of Operational Research, 198(1):287–296, 2009. URL http://www.sciencedirect.com/science/article/pii/S0377221708005614.
  • Akiko Takeda and Masashi Sugiyama. ν-support vector machine as conditional value-at-risk minimization. In Proceedings of the 25th International Conference on Machine Learning, pages 1056–1063, 2008.
  • Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • Philip Thomas and Erik Learned-Miller. Concentration inequalities for conditional value at risk. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6225–6233. PMLR, 2019.
  • Ilya O. Tolstikhin and Yevgeny Seldin. PAC-Bayes-Empirical-Bernstein inequality. In Advances in Neural Information Processing Systems 26, pages 109–117, 2013. URL http://papers.nips.cc/paper/4903-pac-bayes-empirical-bernstein-inequality.
  • M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
  • Ying Wang and Fuqing Gao. Deviation inequalities for an estimator of the conditional value-at-risk. Operations Research Letters, 38(3):236–239, 2010.
  • Robert Williamson and Aditya Menon. Fairness risk measures. In International Conference on Machine Learning, pages 6786–6797, 2019.
Appendix proof excerpts
  • Step (30) holds because {x ∈ R ∶ φ(x) < +∞} = [0, 1/α], and (31) follows by setting μ ∶= η − γ and noting that the inf in (30) is always attained at a point (η, γ) ∈ [0, ∞)² satisfying η ⋅ γ = 0, in which case η + γ = μ; this is true because, by the positivity of εn, if η, γ > 0 then (η + γ)εn can always be made smaller while keeping the difference η − γ fixed. Finally, since the primal problem is feasible (q = π is a feasible solution), there is no duality gap (see the proof of [Beck and Teboulle, 2003]).
  • The inequality in (15) follows from (32) and the fact that μ = Z(⌈nα⌉), an order statistic of the sample (see the proof of [Brown, 2007, Proposition 4.1]).
  • The proofs also invoke [Cesa-Bianchi and Lugosi, 2006, Lemma A.5] and [Mhammedi et al., 2019, Proposition 10].
  • Proof sketch: let Z̄ ∶= Z − E[Z] and suppose Z̄ is (σ, b)-sub-exponential. Then, by Lemma 9, the random variable Y ∶= Z̄ ⋅ E[q⋆ ∣ Z̄] satisfies (24), and so [Wainwright, 2019, Theorem 2.19] applies.