# PAC-Bayesian Bound for the Conditional Value at Risk

NeurIPS 2020

Abstract

Conditional Value at Risk (CVaR) is a family of "coherent risk measures" which generalizes the traditional mathematical expectation. Widely used in mathematical finance, it is garnering increasing interest in machine learning, e.g., as an alternative approach to regularization and as a means for ensuring fairness. This paper presents a PAC-Bayesian generalization bound for CVaR.

Introduction

- The goal in statistical learning is to learn hypotheses that generalize well, which is typically formalized by seeking to minimize the expected risk associated with a given loss function.
- When instantiated with bounded random variables, prior concentration inequalities for CVaR (e.g., those of Kolla et al. [2019] and Bhat and Prashanth [2019]) have sub-optimal dependence on the level α.
- As a by-product of the analysis, the authors derive a new way of obtaining concentration bounds for the conditional value at risk by reducing the problem to estimating expectations using empirical means.
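
To make the estimation task concrete, here is a minimal sketch (illustrative, not the authors' code) of the standard empirical CVaR estimator via the Rockafellar–Uryasev formula Cα[Z] = inf_c {c + E[(Z − c)₊]/α}; the function name and the Gaussian example are assumptions for the demo.

```python
import numpy as np

def empirical_cvar(z, alpha):
    """Empirical CVaR at level alpha: roughly the average of the worst
    alpha-fraction of the losses z, computed via the Rockafellar-Uryasev
    form c + mean((z - c)_+) / alpha, with c set to the empirical
    (1 - alpha)-quantile (the empirical VaR)."""
    c = np.quantile(z, 1.0 - alpha)
    return c + np.mean(np.maximum(z - c, 0.0)) / alpha

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)
# For a standard normal, the exact value is phi(z_0.95)/0.05 ~ 2.063.
print(empirical_cvar(z, 0.05))
```

Note that the estimator always dominates the empirical quantile (VaR), since the correction term is non-negative.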

Highlights

- The goal in statistical learning is to learn hypotheses that generalize well, which is typically formalized by seeking to minimize the expected risk associated with a given loss function
- In Section 3, we recall the statistical learning setting and present our PAC-Bayesian bound for Conditional Value at Risk (CVaR)
- Let X be an arbitrary set, and f ∶ X → R be some fixed measurable function
- We derived a first PAC-Bayesian bound for CVAR by reducing the task of estimating CVAR to that of merely estimating an expectation
- We note that the only steps in the proof of our main bound in Theorem 1 that are specific to CVAR are Lemmas 2 and 3, and so the question is whether our overall approach can be extended to other coherent risk measures to achieve (4)
- In Appendix B, we discuss how our results may be extended to a rich class of coherent risk measures known as φ-entropic risk measures
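
As standard background (not quoted from the summary above), CVaR at level α ∈ (0, 1] admits two equivalent representations that the reduction relies on:

```latex
% Rockafellar--Uryasev variational form:
C_\alpha[Z] = \inf_{c \in \mathbb{R}} \left\{ c + \frac{1}{\alpha}\,\mathbb{E}\big[(Z - c)_{+}\big] \right\},
% dual (distributionally robust) form, over densities q w.r.t. the law of Z:
C_\alpha[Z] = \sup \left\{ \mathbb{E}[Z q] \;:\; 0 \le q \le \tfrac{1}{\alpha},\ \mathbb{E}[q] = 1 \right\}.
```

The maximizer q⋆ of the dual form, which reweights the worst α-fraction of outcomes by a factor of up to 1/α, is the q⋆ that appears in the quoted proof fragments later in this page.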

Results

- Even though not explicitly done before, a PAC-Bayesian bound of the form (4) can be derived for a risk measure R using an existing technique due to McAllester [2003] as soon as, for any fixed hypothesis h, the difference R̂[l(h, X)] − R[l(h, X)] between the empirical and the true risk is sub-exponential with a sufficiently fast tail decay as a function of n.
- This type of improved PAC-Bayesian bound, where the empirical error multiplies the complexity term inside the square-root, has been derived for the expected risk in works such as [Catoni, 2007, Langford and Shawe-Taylor, 2003, Maurer, 2004, Seeger, 2002]; these are arguably the state-of-the-art generalization bounds.
- In Subsection 4.2, the authors introduce an auxiliary random variable Y whose expectation equals Cα[Z] (as in (8)) and whose empirical mean is bounded from above by the estimator Ĉα[Z] introduced in Subsection 4.1; this enables the reduction described at the end of Section 3.
- Note that in Lemma 9 the authors have assumed that Z is a zero-mean random variable, and so the authors still need to do some work to derive a concentration inequality for Ĉα[Z].
- When Z is a σ-sub-Gaussian random variable with σ > 0, an immediate consequence of Theorem 10 is that, by setting t = σ√(2 ln(2/δ)) in (25), with probability at least 1 − 2δ the deviation Cα[Z] − Ĉα[Z] is bounded as in (26).
- Generalization bounds of the form (4) for unbounded but sub-Gaussian or sub-exponential l(h, X), h ∈ H, can be obtained using the PAC-Bayesian analysis of [McAllester, 2003, Theorem 1] and the concentration inequalities in Theorem 10.
- The authors conjecture that the dependence on α in the concentration bounds of Theorem 10 can be improved by swapping α² for α in the argument of the exponentials; in the sub-Gaussian case, this would move α inside the square-root on the RHS of (26).
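
A quick Monte Carlo sanity check (illustrative, not from the paper) of the 1/√n concentration rate: the spread of the empirical CVaR over repeated samples should roughly halve when the sample size quadruples. The helper names and constants are assumptions for the demo.

```python
import numpy as np

def empirical_cvar(z, alpha):
    # Rockafellar-Uryasev form: c + mean((z - c)_+) / alpha at the empirical VaR c.
    c = np.quantile(z, 1.0 - alpha)
    return c + np.mean(np.maximum(z - c, 0.0)) / alpha

def deviation_std(n, alpha, trials=2000, seed=0):
    # Standard deviation of the estimator over repeated size-n samples from N(0, 1).
    rng = np.random.default_rng(seed)
    return np.std([empirical_cvar(rng.normal(size=n), alpha) for _ in range(trials)])

s_small, s_large = deviation_std(500, 0.1), deviation_std(2000, 0.1)
print(s_small / s_large)  # close to 2 = sqrt(2000/500), consistent with a 1/sqrt(n) rate
```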

Conclusion

- The inequality in (28) essentially replaces the range of the random variable Z typically present under the square-root in other concentration bounds [Brown, 2007, Wang and Gao, 2010] by the smaller quantity Cα[Z].
- The authors note that the only steps in the proof of the main bound in Theorem 1 that are specific to CVAR are Lemmas 2 and 3, and so the question is whether the overall approach can be extended to other coherent risk measures to achieve (4).
- These CRMs are often used in the context of robust optimization [Namkoong and Duchi, 2017], and are perfect candidates to consider in the context of this paper.

Related work

- Deviation bounds for CVaR were first presented by Brown [2007]. However, their approach only applies to bounded continuous random variables, and their lower deviation bound has a sub-optimal dependence on the level α. Wang and Gao [2010] later refined this analysis to recover the "correct" dependence on α, although their technique still requires a two-sided bound on the random variable Z. Thomas and Learned-Miller [2019] derived new concentration inequalities for CVaR that are empirically very sharp, even though the dependence on α in their bound is sub-optimal; on the other hand, they only require a one-sided bound on Z, without a continuity assumption.

Kolla et al. [2019] were the first to provide concentration bounds for CVaR when the random variable Z is unbounded but either sub-Gaussian or sub-exponential. Bhat and Prashanth [2019] used a bound on the Wasserstein distance between the true and empirical cumulative distribution functions to substantially tighten the bounds of Kolla et al. [2019] when Z has finite exponential or k-th-order moments; they also apply their results to other coherent risk measures. However, when instantiated with bounded random variables, their concentration inequalities have sub-optimal dependence on α.

Study subjects and analysis


In addition to Zh in (20), define Yh ∶= l(h, X) ⋅ E[q⋆ ∣ X] and Yh,i ∶= l(h, Xi) ⋅ E[q⋆ ∣ Xi], for i ∈ [n]. Then, setting γ = ηn and

Rh = E_P[Yh] − (1/n) ∑_{i=1}^{n} Yh,i − η κ(η/α) Cα[Zh]/α,

Lemma 5 guarantees that E[exp(γRh)] ≤ 1, and so by Lemma 6 the authors obtain the following result (Theorem 7), stated for α, δ ∈ (0, 1) and η ∈ [0, α].
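
The step from E[exp(γRh)] ≤ 1 to a high-probability guarantee is the standard Chernoff–Markov argument (sketched here as background, not quoted from the paper):

```latex
\Pr[R_h \ge t]
= \Pr\!\big[e^{\gamma R_h} \ge e^{\gamma t}\big]
\le e^{-\gamma t}\,\mathbb{E}\big[e^{\gamma R_h}\big]
\le e^{-\gamma t},
```

so taking t = ln(1/δ)/γ with γ = ηn makes the failure probability at most δ; the PAC-Bayesian version then handles all posteriors over h simultaneously via the Donsker–Varadhan change of measure [Donsker and Varadhan, 1976].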

References

- Karim T. Abou-Moustafa and Csaba Szepesvári. An exponential tail bound for lq stable learning rules. In Aurélien Garivier and Satyen Kale, editors, Algorithmic Learning Theory, ALT 2019, 22-24 March 2019, Chicago, Illinois, USA, volume 98 of Proceedings of Machine Learning Research, pages 31–63. PMLR, 2019. URL http://proceedings.mlr.press/v98/abou-moustafa19a.html.
- Amir Ahmadi-Javid. Entropic value-at-risk: A new coherent risk measure. Journal of Optimization Theory and Applications, 155(3):1105–1123, 2012.
- Maurice Allais. Le comportement de l’homme rationnel devant le risque: critique des postulats et axiomes de l’école américaine. Econometrica: Journal of the Econometric Society, pages 503– 546, 1953.
- Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical finance, 9(3):203–228, 1999.
- Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
- Sanjay P Bhat and LA Prashanth. Concentration of risk measures: A Wasserstein distance approach. In Advances in Neural Information Processing Systems, pages 11739–11748, 2019.
- Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
- Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
- Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. CoRR, abs/1910.07833, 2019. URL http://arxiv.org/abs/1910.07833.
- David B. Brown. Large deviations bounds for estimating conditional value-at-risk. Operations Research Letters, 35(6):722–730, 2007.
- Olivier Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Lecture Notes-Monograph Series. IMS, 2007.
- Alain Celisse and Benjamin Guedj. Stability revisited: new generalisation bounds for the leave-oneout. arXiv preprint arXiv:1608.06412, 2016.
- Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
- Youhua (Frank) Chen, Minghui Xu, and Zhe George Zhang. A Risk-Averse Newsvendor Model Under the CVaR Criterion. Operations Research, 57(4):1040–1044, 2009. ISSN 0030364X, 15265463. URL http://www.jstor.org/stable/25614814.
- Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR Optimization in MDPs. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3509–3517. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5246-algorithms-for-cvar-optimization-in-mdps.pdf.
- I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3:146–158, 1975.
- M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time — III. Communications on pure and applied Mathematics, 29(4):389–461, 1976.
- John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.
- Daniel Ellsberg. Risk, ambiguity, and the savage axioms. The quarterly journal of economics, pages 643–669, 1961.
- Benjamin Guedj. A Primer on PAC-Bayesian Learning. arXiv preprint arXiv:1901.05353, 2019. URL https://arxiv.org/abs/1901.05353.
- Xiaoguang Huo and Feng Fu. Risk-aware multi-armed bandit problem with application to portfolio selection. Royal Society Open Science, 4(11), 2017.
- Ravi Kumar Kolla, LA Prashanth, Sanjay P Bhat, and Krishna Jagannathan. Concentration bounds for empirical conditional value-at-risk: The unbounded case. Operations Research Letters, 47(1): 16–20, 2019.
- John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems, pages 439–446, 2003.
- Andreas Maurer. A note on the PAC-Bayesian theorem. arXiv preprint cs/0411099, 2004.
- Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings COLT 2009, 2009.
- David A. McAllester. PAC-Bayesian Stochastic Model Selection. In Machine Learning, volume 51, pages 5–21, 2003.
- Zakaria Mhammedi, Peter Grünwald, and Benjamin Guedj. PAC-Bayes Un-Expected Bernstein Inequality. In Advances in Neural Information Processing Systems, pages 12180–12191, 2019.
- Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pages 799– 806, 2010.
- Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.
- Georg Ch. Pflug. Some remarks on the value-at-risk and the conditional value-at-risk. In Probabilistic Constrained Optimization. Springer US, Boston, MA, 2000. ISBN 978-1-4757-3150-7. doi: 10.1007/978-1-4757-3150-7_15. URL https://doi.org/10.1007/978-1-4757-3150-7_15.
- Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2817–2826. JMLR. org, 2017.
- LA Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in neural information processing systems, pages 252–260, 2013.
- R Tyrrell Rockafellar and Stan Uryasev. The fundamental risk quadrangle in risk management, optimization and statistical estimation. Surveys in Operations Research and Management Science, 18(1-2):33–53, 2013.
- R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2(3):21–41, 2000.
- Matthias Seeger. PAC-Bayesian generalization error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233–269, 2002.
- Akiko Takeda and Takafumi Kanamori. A robust approach based on conditional value-at-risk measure to statistical learning problems. European Journal of Operational Research, 198(1): 287 – 296, 2009. ISSN 0377-2217. doi: https://doi.org/10.1016/j.ejor.2008.07.027. URL http://www.sciencedirect.com/science/article/pii/S0377221708005614.
- Akiko Takeda and Masashi Sugiyama. ν-support vector machine as conditional value-at-risk minimization. In Proceedings of the 25th international conference on Machine learning, pages 1056– 1063, 2008.
- Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In TwentyNinth AAAI Conference on Artificial Intelligence, 2015.
- Philip Thomas and Erik Learned-Miller. Concentration inequalities for conditional value at risk. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 6225–6233, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
- Ilya O. Tolstikhin and Yevgeny Seldin. PAC-Bayes-Empirical-Bernstein inequality. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 109–117, 2013. URL http://papers.nips.cc/paper/4903-pac-bayes-empirical-bernstein-inequality.
- M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
- Ying Wang and Fuqing Gao. Deviation inequalities for an estimator of the conditional value-at-risk. Operations Research Letters, 38(3):236–239, 2010.
- Robert Williamson and Aditya Menon. Fairness risk measures. In International Conference on Machine Learning, pages 6786–6797, 2019.
- where (30) is due to {x ∈ R ∶ φ(x) < +∞} = [0, 1/α], and (31) follows by setting μ ∶= η − γ and noting that the inf in (30) is always attained at a point (η, γ) ∈ R²≥0 satisfying η ⋅ γ = 0, in which case η + γ = μ; this is true because, by the positivity of ǫn, if η, γ > 0, then (η + γ)ǫn can always be made smaller while keeping the difference η − γ fixed. Finally, since the primal problem is feasible (q = π is a feasible solution), there is no duality gap (see the proof of [Beck and Teboulle, 2003, Theorem …]).
- The inequality in (15) follows from (32) and the fact that μ = Z(⌈nα⌉), the ⌈nα⌉-th order statistic of the sample (see the proof of [Brown, 2007, Proposition 4.1]).
- [Cesa-Bianchi and Lugosi, 2006, Lemma A.5] and [Mhammedi et al., 2019, Proposition 10 …].
- Proof: Let Z̄ ∶= Z − E[Z], and suppose that Z̄ is (σ, b)-sub-exponential. In this case, by Lemma 9 the random variable Y ∶= Z̄ ⋅ E[q⋆ ∣ Z̄] satisfies (24), and so by [Wainwright, 2019, Theorem 2.19], we have …
