Large-Scale Methods for Distributionally Robust Optimization

NeurIPS 2020

Abstract

We propose and analyze algorithms for distributionally robust optimization of convex losses with conditional value at risk (CVaR) and $\chi^2$ divergence uncertainty sets. We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters, making them suitable for large-scale …
Introduction
  • The growing role of machine learning in high-stakes decision-making raises the need to train reliable models that perform robustly across subpopulations and environments [10, 24, 61, 50, 31, 46, 34].
  • Some of the results extend to more general φ-divergence balls [63]
  • Minimizers of these objectives enjoy favorable statistical properties [17, 29], but finding them is more challenging than standard ERM.
  • Stochastic gradient methods solve ERM with a number of ∇ computations independent of both N, the support size of P0, and d, the dimension of x
  • These guarantees do not directly apply to DRO because the supremum over Q in (1) makes cheap sampling-based gradient estimates biased (see the sketch after this list).
  • As a consequence, existing techniques for minimizing the χ2 objective [1, 15, 2, 4, 40, 17] require computation that grows with the training set size N.
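To make the bias concrete: the natural mini-batch approach computes the robust loss of the batch's own empirical distribution, and its expectation is not the population objective. Below is a minimal PyTorch sketch of this plug-in estimator for the CVaR objective; it is an illustration under our own simplifications (integer top-k truncation), not the authors' released implementation.

```python
import math
import torch

def cvar_minibatch_loss(losses: torch.Tensor, alpha: float) -> torch.Tensor:
    """Plug-in CVaR_alpha of a mini-batch of per-example losses.

    The CVaR of the batch's empirical distribution is (approximately, when
    alpha * n is not an integer) the average of its worst ceil(alpha * n)
    losses. Its expectation over batches is NOT the population CVaR, which
    is exactly the bias discussed above.
    """
    n = losses.numel()
    k = max(1, math.ceil(alpha * n))
    worst, _ = torch.topk(losses, k)
    return worst.mean()

# Hypothetical usage with any model that yields per-example losses:
# losses = torch.nn.functional.cross_entropy(model(x), y, reduction="none")
# cvar_minibatch_loss(losses, alpha=0.1).backward()  # autograd gives the plug-in (sub)gradient
```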
Highlights
  • The growing role of machine learning in high-stakes decision-making raises the need to train reliable models that perform robustly across subpopulations and environments [10, 24, 61, 50, 31, 46, 34]
  • The literature considers several uncertainty sets [2, 4, 6, 22], and we focus on two particular choices: (a) the set of distributions with bounded likelihood ratio to P0, so that L becomes the conditional value at risk (CVaR) [51, 58], and (b) the set of distributions with bounded χ2 divergence to P0 [2, 12]
  • To obtain algorithms with improved oracle complexities, in Section 4 we present a theoretically more efficient multi-level Monte Carlo (MLMC) [26, 27] gradient estimator, which is a slight modification of the general technique of Blanchet and Glynn [7] (see the sketch after this list)
  • To the best of our knowledge, the latter is the largest DRO problem solved to date. In both experiments DRO provides generalization improvements over ERM, and we show that our stochastic gradient estimators require far fewer ∇ computations than full-batch methods (between 9× and 36× fewer)
  • Our code, which is available at https://github.com/daniellevy/fast-dro/, implements our gradient estimators in PyTorch [48] and combines them seamlessly with the framework’s optimizers; we show an example code snippet in Appendix F.3
  • For ImageNet the effect is more modest: in the worst-performing 10 classes we observe improvements of 5–10% in log loss, as well as a roughly 4-point improvement in accuracy. These improvements come at the cost of degradation in average performance: the average loss increases by up to 10% and the average accuracy drops by roughly 1 point
  • We conclude the discussion by briefly touching upon the improvement that DRO yields in terms of generalization metrics; we provide additional detail in Appendix F.5
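As a rough illustration of the MLMC construction mentioned above, the sketch below builds a randomized multi-level estimate of a robust loss from batches of geometrically growing size, in the spirit of Blanchet and Glynn [7]. This is our own schematic, not the fast-dro code: `sampler`, `per_example_loss_fn`, and `robust_loss_fn` are hypothetical placeholders, and the paper's level distribution and truncation may differ.

```python
import torch

def mlmc_robust_loss(sampler, per_example_loss_fn, robust_loss_fn,
                     n0: int = 2, max_level: int = 10, decay: float = 0.5) -> torch.Tensor:
    """Randomized multi-level Monte Carlo estimate of a robust loss.

    sampler(n)             -> batch of n data points (placeholder interface)
    per_example_loss_fn(b) -> length-n tensor of losses on batch b
    robust_loss_fn(l)      -> robust objective (e.g. batch CVaR) of losses l
    """
    # Draw a level J in {1, ..., max_level} with P(J = j) proportional to decay**j.
    probs = torch.tensor([decay ** j for j in range(1, max_level + 1)])
    probs = probs / probs.sum()
    J = int(torch.multinomial(probs, 1)) + 1

    # Fine estimate on n0 * 2^J samples; coarse estimate averages the two halves of that batch.
    n = n0 * 2 ** J
    losses = per_example_loss_fn(sampler(n))
    fine = robust_loss_fn(losses)
    coarse = 0.5 * (robust_loss_fn(losses[: n // 2]) + robust_loss_fn(losses[n // 2:]))

    # Cheap base estimate plus an importance-weighted telescoping correction.
    base = robust_loss_fn(per_example_loss_fn(sampler(n0)))
    return base + (fine - coarse) / probs[J - 1]
```

Calling .backward() on the returned tensor yields the corresponding gradient estimate via autograd, so an estimator of this form can be dropped into a standard PyTorch optimizer loop.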
Methods
  • The authors' main focus is measuring how the total work in solving the DRO problems depends on different gradient estimators.
  • To ensure that they operate in practically meaningful settings, the experiments involve heterogeneous data and the authors tune the DRO hyperparameters for each configuration; to obtain guarantees that hold with high probability, one can take the median of a logarithmic number of independent repetitions.
  • Digits: Penalized-χ2 with λ = 0.05.
  • ImageNet: CVaR with α = 0.1.
  • ImageNet: Constrained-χ2 with ρ = 1.0.
  • ImageNet: Penalized-χ2 with λ = 0.4; batch size n = 10 (a sketch of the penalized-χ2 batch loss follows this list).
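For concreteness, here is a sketch of how a penalized-χ2 batch objective like the configurations above can be evaluated in closed form, assuming D_χ2(q ‖ p) = ½ Σ_i (q_i − p_i)²/p_i and a uniform reference distribution on the batch; the normalization and the estimator the authors actually use may differ, so treat this as an illustration rather than their implementation.

```python
import torch

def project_to_simplex(v: torch.Tensor) -> torch.Tensor:
    """Euclidean projection onto the probability simplex (standard sort-based routine)."""
    u, _ = torch.sort(v, descending=True)
    css = torch.cumsum(u, dim=0)
    ks = torch.arange(1, v.numel() + 1, dtype=v.dtype, device=v.device)
    rho = int((u - (css - 1.0) / ks > 0).nonzero().max())
    tau = (css[rho] - 1.0) / (rho + 1)
    return torch.clamp(v - tau, min=0.0)

def penalized_chi2_batch_loss(losses: torch.Tensor, lam: float) -> torch.Tensor:
    """sup_q { q . losses - lam * D_chi2(q || uniform) } over the batch simplex,
    with D_chi2(q || p) = 0.5 * sum_i (q_i - p_i)^2 / p_i (one common normalization)."""
    n = losses.numel()
    uniform = torch.full_like(losses, 1.0 / n)
    # Completing the square shows the maximizing q is a simplex projection.
    q = project_to_simplex(uniform + losses.detach() / (lam * n))
    # Danskin's theorem: with q detached, backward() returns sum_i q_i * grad(loss_i),
    # a valid (sub)gradient of the robust objective.
    penalty = 0.5 * n * torch.sum((q - uniform) ** 2)
    return torch.dot(q, losses) - lam * penalty
```

A call such as penalized_chi2_batch_loss(losses, lam=0.05).backward() then feeds a mini-batch DRO gradient to any standard optimizer, mirroring the Digits configuration listed above.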
Results
  • For ImageNet the effect is more modest: in the worst-performing 10 classes the authors observe improvements of 5–10% in log loss, as well as a roughly 4-point improvement in accuracy.
  • These improvements come at the cost of degradation in average performance: the average loss increases by up to 10% and the average accuracy drops by roughly 1 point
Conclusion
  • The authors' analysis in Section 3.1 bounds the suboptimality of solutions resulting from using a mini-batch estimator with batch size n, showing it must vanish as n increases.
  • This may appear confusing, since the MLMC convergence guarantees are optimal while the mini-batch estimator achieves the optimal rate only under certain assumptions
  • Recall that these assumptions are smoothness of the loss and, for CVaR, sufficiently rapid decay of the bias floor, which the authors verify empirically.
  • This work provides rigorous convergence guarantees for solving large-scale convex φ-divergence DRO problems with stochastic gradient methods, laying a foundation for their use in practice; the authors conclude by highlighting two directions for further research.
Tables
  • Table 1: Number of ∇ evaluations to obtain E[L(x; P0)] − inf_{x′∈X} L(x′; P0) ≤ ε when P0 is uniform on N training points. For simplicity we omit the Lipschitz constant of ℓ, the size of the domain X, and logarithmic factors
  • Table 2: Comparison of wall-clock time (in minutes) of the different algorithms, in terms of time per epoch and time to reach within 2% of the best training loss. In the last two columns, we report the number of epochs required to reach within 2% of the best training loss. We report ∞ for configurations that do not reach the sub-optimality goal for the duration of the experiment, and omit standard deviations when they are 0
  • Table 3: Parameter settings for Theorem 1. For L_KL-CVaR we take λ log(
  • Table 4: Stepsizes for the experiments we present in this work. We use momentum 0.9 for all configurations except MLMC, where we do not use momentum. We select the stepsizes according to the ‘coarse-to-fine’ strategy we describe in this section
  • Table 5: Empirical complexity for the Digits experiment in terms of the number of epochs required to reach within 2% of the optimal training objective value, averaged across 5 seeds ± one standard deviation. (For the full-batch experiments we only ran one seed.) The “speed-up” column gives the ratio between the full-batch complexity and the best mini-batch complexity
  • Table 6: Empirical complexity for the ImageNet experiment in terms of the number of epochs required to reach within 2% of the optimal training objective value, averaged across 5 seeds ± one standard deviation, whenever it is not zero. (For the full-batch experiments we only ran one seed.) The “speed-up” column gives the ratio between the full-batch complexity and the best mini-batch complexity
Related work
  • Distributionally robust optimization grows from the robust optimization literature in operations research [2, 1, 3, 4], and the fundamental uncertainty about the data distribution at test time makes its application to machine learning natural. Experiments in the papers [40, 23, 17, 29, 13, 35] show promising results for CVaR (2) and χ2-constrained (3) DRO, while other works highlight the importance of incorporating additional constraints into the uncertainty set definition [33, 19, 47, 53]. Below, we review the prior art on solving these DRO problems at scale.

    Full-batch subgradient method. When P0 has support of size N it is possible to compute a subgradient of the objective L(x; P0) by evaluating ℓ(x; si) and ∇ℓ(x; si) for i = 1, …, N, computing the q ∈ ∆N attaining the supremum in (1), whence g = Σ_{i=1}^N qi ∇ℓ(x; si) is a subgradient of L at x.

    As the Lipschitz constant of L is at most that of ℓ, we may use these subgradients in the subgradient method [44] and find an approximate solution in order ε^{-2} steps. This requires order N ε^{-2} evaluations of ∇ℓ, regardless of the uncertainty set.
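To make the cost of this paragraph concrete, the sketch below performs one full-batch subgradient step for the CVaR uncertainty set, where the maximizing q puts weight 1/(αN) on the αN largest losses (assuming αN is an integer). It is an illustration assuming a classification model with cross-entropy loss, not the paper's code; the point is that each step touches all N examples.

```python
import math
import torch

def full_batch_cvar_subgradient_step(model, data, targets, alpha: float, lr: float) -> float:
    """One subgradient step on the CVaR_alpha objective using the entire training set.

    Evaluating the loss and gradient on all N examples per step is the
    order-N cost discussed above.
    """
    losses = torch.nn.functional.cross_entropy(model(data), targets, reduction="none")
    n = losses.numel()
    k = max(1, math.ceil(alpha * n))     # worst alpha-fraction of the training set
    worst, _ = torch.topk(losses, k)
    objective = worst.mean()             # q·ℓ at the maximizing q

    model.zero_grad()
    objective.backward()                 # autograd forms g = sum_i q_i ∇ℓ(x; s_i)
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad         # plain (projection-free) subgradient update
    return float(objective)
```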
Funding
  • DL, YC and JCD were supported by the NSF under CAREER Award CCF-1553086 and HDR 1934578 (the Stanford Data Science Collaboratory) and Office of Naval Research YIP Award N00014-19-2288
  • YC was supported by the Stanford Graduate Fellowship
  • AS is supported by the NSF CAREER Award CCF-1844855
Study subjects and analysis
This scheme relies on MLMC estimators for both the gradient ∇Lχ2-pen and the derivative of Lχ2-pen with respect to λ. Proposition 4 guarantees that the second moment of our gradient estimators remains bounded by a quantity that depends logarithmically on n. For these estimators, Proposition 3 thus directly provides complexity guarantees for minimizing LCVaR and Lχ2-pen.

References
  • [1] A. Ben-Tal, L. E. Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
  • [2] A. Ben-Tal, D. den Hertog, A. D. Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
  • [3] D. Bertsimas, D. Brown, and C. Caramanis. Theory and applications of robust optimization. SIAM Review, 53(3):464–501, 2011.
  • [4] D. Bertsimas, V. Gupta, and N. Kallus. Data-driven robust optimization. Mathematical Programming, Series A, 167(2):235–292, 2018.
  • [5] J. Blanchet and Y. Kang. Semi-supervised Learning Based on Distributionally Robust Optimization, chapter 1, pages 1–33. John Wiley & Sons, Ltd, 2020. ISBN 9781119721871.
  • [6] J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857, 2019.
  • [7] J. H. Blanchet and P. W. Glynn. Unbiased Monte Carlo for optimization and functions of expectations via multi-level randomization. In 2015 Winter Simulation Conference (WSC), pages 3656–3667. IEEE, 2015.
  • [8] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: a Nonasymptotic Theory of Independence. Oxford University Press, 2013.
  • [9] G. Braun, C. Guzmán, and S. Pokutta. Lower bounds on the oracle complexity of nonsmooth convex optimization via information theory. IEEE Transactions on Information Theory, 63(7), 2017.
  • [10] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91, 2018.
  • [11] N. Cressie and T. R. Read. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B, pages 440–464, 1984.
  • [12] I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967.
  • [13] S. Curi, K. Levy, S. Jegelka, and A. Krause. Adaptive sampling for stochastic risk-averse learning. arXiv:1910.12511 [cs.LG], 2019.
  • [14] T. E. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. In Proceedings of the Fourth International Conference on Computer Vision Theory and Applications, February 2009.
  • [15] E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.
  • [16] J. C. Duchi. Introductory lectures on stochastic convex optimization. In The Mathematics of Data, IAS/Park City Mathematics Series. American Mathematical Society, 2018.
  • [17] J. C. Duchi and H. Namkoong. Learning models with uniform performance via distributionally robust optimization. Annals of Statistics, to appear, 2020.
  • [18] J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.
  • [19] J. C. Duchi, T. Hashimoto, and H. Namkoong. Distributionally robust losses against mixture covariate shifts. 2019.
  • [20] R. Durrett. Probability: Theory and Examples, volume 49. Cambridge University Press, 2019.
  • [21] B. Efron and C. Stein. The jackknife estimate of variance. The Annals of Statistics, 9(3):586–596, 1981.
  • [22] P. M. Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, Series A, 171(1–2):115–166, 2018.
  • [23] Y. Fan, S. Lyu, Y. Ying, and B. Hu. Learning with average top-k loss. In Advances in Neural Information Processing Systems 30, pages 497–505, 2017.
  • [24] A. Fuster, P. Goldsmith-Pinkham, T. Ramadorai, and A. Walther. Predictably unequal? The effects of machine learning on credit markets. Social Science Research Network: 3072038, 2018.
  • [25] S. Ghosh, M. Squillante, and E. Wollega. Efficient stochastic gradient descent for distributionally robust learning. arXiv:1805.08728 [stats.ML], 2018.
  • [26] M. B. Giles. Multilevel Monte Carlo path simulation. Operations Research, 56(3):607–617, 2008.
  • [27] M. B. Giles. Multilevel Monte Carlo methods. Acta Numerica, 24:259–328, 2015.
  • [28] C. Guzmán and A. Nemirovski. On lower complexity bounds for large-scale smooth convex optimization. Journal of Complexity, 31(1):1–14, 2015.
  • [29] T. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang. Fairness without demographics in repeated loss minimization. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • [30] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [31] D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In Proceedings of the Seventh International Conference on Learning Representations, 2019.
  • [32] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I. Springer, New York, 1993.
  • [33] W. Hu, G. Niu, I. Sato, and M. Sugiyama. Does distributionally robust supervised learning give robust classifiers? In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • [34] N. Kalra and S. M. Paddock. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice, 94:182–193, 2016.
  • [35] K. Kawaguchi and H. Lu. Ordered SGD: A new stochastic optimization framework for empirical risk minimization. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020.
  • [37] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, Series A, 133(1–2):365–397, 2012.
  • [38] Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, and V. Vapnik. Comparison of learning algorithms for handwritten digit recognition. In International Conference on Artificial Neural Networks, pages 53–60, 1995.
  • [39] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
  • [40] H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems 29, 2016.
  • [41] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
  • [42] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • [43] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983.
  • [44] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2004.
  • [45] Y. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, Series A, 103:127–152, 2005.
  • [46] L. Oakden-Rayner, J. Dunnmon, G. Carneiro, and C. Ré. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 151–159, 2020.
  • [47] Y. Oren, S. Sagawa, T. Hashimoto, and P. Liang. Distributionally robust language modeling. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
  • [48] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In Neural Information Processing Systems (NIPS) Workshop on Automatic Differentiation, 2017.
  • [49] J. Pitman. Probability. Springer-Verlag, 1993.
  • [50] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the 36th International Conference on Machine Learning, 2019.
  • [51] R. T. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
  • [52] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [53] S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In Proceedings of the Eighth International Conference on Learning Representations, 2020.
  • [54] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In Advances in Neural Information Processing Systems 19, 2006.
  • [55] S. Shalev-Shwartz and Y. Wexler. Minimizing the maximal loss: How and why? In Proceedings of the 33rd International Conference on Machine Learning, 2016.
  • [56] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pages 71–79, 2013.
  • [57] A. Shapiro. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258–2275, 2017.
  • [58] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM and Mathematical Programming Society, 2009.
  • [59] A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. In Proceedings of the Sixth International Conference on Learning Representations, 2018.
  • [60] M. Staib and S. Jegelka. Distributionally robust optimization and generalization in kernel methods. In Advances in Neural Information Processing Systems 32, pages 9134–9144, 2019.
  • [61] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1521–1528. IEEE, 2011.
  • [62] A. A. Trindade, S. Uryasev, A. Shapiro, and G. Zrazhevsky. Financial prediction with constrained tail risk. Journal of Banking & Finance, 31(11):3524–3538, 2007.
  • [63] T. van Erven and P. Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
  • [64] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
  • [65] S. Wang, W. Guo, H. Narasimhan, A. Cotter, M. Gupta, and M. I. Jordan. Robust optimization for fairness with noisy protected groups. arXiv:2002.09343 [cs.LG], 2020.
  • [66] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer-Verlag, 1997.
  • [67] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.