# Large-Scale Methods for Distributionally Robust Optimization

NeurIPS 2020


Abstract

We propose and analyze algorithms for distributionally robust optimization of convex losses with conditional value at risk (CVaR) and $\chi^2$ divergence uncertainty sets. We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters, making them suitable for large-scale …

Introduction
• The growing role of machine learning in high-stakes decision-making raises the need to train reliable models that perform robustly across subpopulations and environments [10, 24, 61, 50, 31, 46, 34].
• Some of the results extend to more general φ-divergence balls [63].
• Minimizers of these objectives enjoy favorable statistical properties [17, 29], but finding them is more challenging than standard ERM.
• Stochastic gradient methods solve ERM with a number of ∇ℓ computations independent of both N, the support size of P0, and d, the dimension of x.
• These guarantees do not directly apply to DRO because the supremum over Q in (1) makes cheap sampling-based gradient estimates biased.
• As a consequence, existing techniques for minimizing the χ2 objective [1, 15, 2, 4, 40, 17] do not attain this N-independent complexity.
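The bias mentioned in the bullets above can be made concrete with a small, self-contained example (illustrative numbers, not from the paper): averaging the plug-in CVaR of minibatches does not recover the full-data CVaR, which is why naive sampling-based gradient estimates are biased.

```python
from itertools import combinations

def cvar(losses, alpha):
    """CVaR at level alpha: average of the worst alpha-fraction of losses
    (assumes alpha * len(losses) is an integer)."""
    k = round(alpha * len(losses))
    return sum(sorted(losses)[-k:]) / k

losses = [1.0, 2.0, 3.0, 4.0]
full = cvar(losses, alpha=0.5)  # worst half: (3 + 4) / 2 = 3.5

# average the plug-in CVaR estimate over every size-2 minibatch
batches = list(combinations(losses, 2))
plug_in = sum(cvar(b, alpha=0.5) for b in batches) / len(batches)
# plug_in = 20/6 < 3.5: the minibatch plug-in underestimates the CVaR
```

Here the expected minibatch estimate is 20/6 ≈ 3.33 while the true objective value is 3.5, so the plug-in estimator is biased downward.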
Highlights
• The literature considers several uncertainty sets [2, 4, 6, 22], and we focus on two particular choices: (a) the set of distributions with bounded likelihood ratio to P0, so that L becomes the conditional value at risk (CVaR) [51, 58], and (b) the set of distributions with bounded χ2 divergence to P0 [2, 12]
• To obtain algorithms with improved oracle complexities, in Section 4 we present a theoretically more efficient multi-level Monte Carlo (MLMC) [26, 27] gradient estimator which is a slight modification of the general technique of Blanchet and Glynn [7]
• To the best of our knowledge, the latter is the largest distributionally robust optimization (DRO) problem solved to date. In both experiments DRO provides generalization improvements over empirical risk minimization (ERM), and we show that our stochastic gradient estimators require between 9× and 36× fewer gradient computations than full-batch methods
• Our code, which is available at https://github.com/daniellevy/fast-dro/, implements our gradient estimators in PyTorch [48] and combines them seamlessly with the framework’s optimizers; we show an example code snippet in Appendix F.3
• We conclude the discussion by briefly touching upon the improvement that DRO yields in terms of generalization metrics; we provide additional detail in Appendix F.5
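As a rough sketch of the minibatch CVaR estimator the highlights refer to (in plain NumPy rather than the paper's PyTorch implementation, and assuming αn is an integer): the batch gradients are weighted by the supremum-attaining distribution q, which puts uniform mass on the worst α-fraction of the batch.

```python
import numpy as np

def cvar_batch_weights(losses, alpha):
    """Supremum-attaining q for batch CVaR: uniform mass 1/(alpha*n)
    on the alpha*n largest losses (assumes alpha * n is an integer)."""
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    k = int(round(alpha * n))
    q = np.zeros(n)
    q[np.argsort(losses)[-k:]] = 1.0 / k
    return q

def cvar_batch_gradient(per_sample_grads, losses, alpha):
    """g = sum_i q_i * grad_i, the minibatch CVaR (sub)gradient estimate."""
    q = cvar_batch_weights(losses, alpha)
    return q @ np.asarray(per_sample_grads, dtype=float)
```

In an autodiff framework such as PyTorch, one common way to realize the same estimate is to compute per-sample losses, form q without tracking gradients, and backpropagate through `(q * losses).sum()`; the snippet in the paper's Appendix F.3 plays this role.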
Methods
• The authors' main focus is measuring how the total work required to solve the DRO problems depends on the choice of gradient estimator.
• To ensure that the authors operate in practically meaningful settings, the experiments involve heterogeneous data and tuned DRO hyperparameters; to make the guarantees hold with high probability, one can take the median of a logarithmic number of independent runs.
• Digits: Penalized-χ2 with λ = 0.05.
• ImageNet: CVaR with α = 0.1.
• ImageNet: Constrained-χ2 with ρ = 1.0.
• ImageNet: Penalized-χ2 with λ = 0.4, batch size n = 10.
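For the penalized-χ2 configurations listed above, the inner supremum has a closed-form best response up to a scalar. The sketch below (a hedged illustration, using one common normalization of the χ2 divergence, D(q‖uniform) = (1/(2n)) Σᵢ(nqᵢ − 1)²; the paper's constants may differ) finds the maximizing weights by bisection on the dual variable η, with nqᵢ = max(1 + (ℓᵢ − η)/λ, 0).

```python
import numpy as np

def chi2_penalized_weights(losses, lam, iters=100):
    """Best response q for sup_q <q, losses> - lam * D(q || uniform),
    with D(q || uniform) = (1/(2n)) * sum_i (n*q_i - 1)^2 (a common
    convention; the paper's normalization may differ by constants).
    Stationarity gives n*q_i = max(1 + (losses_i - eta)/lam, 0), where
    eta is chosen by bisection so the weights sum to one.
    """
    l = np.asarray(losses, dtype=float)
    n = l.size
    lo, hi = float(l.min()), float(l.max())  # sum >= n at lo, <= n at hi
    for _ in range(iters):
        eta = 0.5 * (lo + hi)
        if np.maximum(1.0 + (l - eta) / lam, 0.0).sum() > n:
            lo = eta
        else:
            hi = eta
    r = np.maximum(1.0 + (l - 0.5 * (lo + hi)) / lam, 0.0)
    return r / r.sum()  # renormalize to absorb residual bisection error
```

Larger λ pulls q toward the uniform distribution (recovering ERM), while small λ concentrates mass on the highest losses; this matches the role of λ in the Digits and ImageNet settings above.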
Results
• For ImageNet the effect is more modest: in the worst-performing 10 classes the authors observe improvements of 5–10% in log loss, as well as a roughly 4 point improvement in accuracy.
• These improvements come at the cost of degradation in average performance: the average loss increases by up to 10% and the average accuracy drops by roughly 1 point.
Conclusion
• The authors' analysis in Section 3.1 bounds the suboptimality of solutions obtained with a mini-batch estimator of batch size n, showing that it must vanish as n increases.
• This may appear confusing, since the MLMC convergence guarantees are optimal while the mini-batch estimator achieves the optimal rate only under certain assumptions.
• Recall that these assumptions are smoothness of the loss and, for CVaR, sufficiently rapid decay of the bias floor, which the authors verify empirically. This work provides rigorous convergence guarantees for solving large-scale convex φ-divergence DRO problems with stochastic gradient methods, laying a foundation for their use in practice; the authors conclude by highlighting two directions for further research.
Tables
• Table 1: Number of ∇ℓ evaluations to obtain $\mathbb{E}[L(x; P_0)] - \inf_{x' \in X} L(x'; P_0) \leq \epsilon$ when P0 is uniform on N training points. For simplicity we omit the Lipschitz constant of ℓ, the size of the domain X, and logarithmic factors
• Table 2: Comparison of wallclock time (in minutes) of the different algorithms, in terms of time per epoch and time to reach within 2% of the best training loss. In the last two columns, we report the number of epochs required to reach within 2% of the best training loss. We report ∞ for configurations that do not reach the sub-optimality goal for the duration of the experiment, and omit standard deviations when they are 0
• Table3: Parameter settings for Theorem 1. For Lkl-CVaR we take λ log(
• Table4: Stepsizes for the experiments we present in this work. We use momentum 0.9 for all configurations except MLMC, where we do not use momentum. We select the stepsizes according to the ‘coarse-to-fine’ strategy we describe in this section
• Table5: Empirical complexity for the Digits experiment in terms of number of epochs required to reach within 2% of the optimal training objective value, averaged across 5 seeds ± one standard deviation. (For the full-batch experiments we only ran one seed). The “speed-up” column gives the ratio between the full batch complexity and the best mini-batch complexity
• Table6: Empirical complexity for the ImageNet experiment in terms of number of epochs required to reach within 2% of the optimal training objective value, averaged across 5 seeds ± one standard deviation, whenever it is not zero. (For the full-batch experiments we only ran one seed). The “speedup” column gives the ratio between the full batch complexity and the best mini-batch complexity
Related work
• Distributionally robust optimization grows from the robust optimization literature in operations research [2, 1, 3, 4], and the fundamental uncertainty about the data distribution at test time makes its application to machine learning natural. Experiments in the papers [40, 23, 17, 29, 13, 35] show promising results for CVaR (2) and χ2-constrained (3) DRO, while other works highlight the importance of incorporating additional constraints into the uncertainty set definition [33, 19, 47, 53]. Below, we review the prior art on solving these DRO problems at scale.

Full-batch subgradient method. When P0 has support of size N it is possible to compute a subgradient of the objective L(x; P0) by evaluating $\ell(x; s_i)$ and $\nabla \ell(x; s_i)$ for $i = 1, \ldots, N$ and computing the $q \in \Delta^N$ attaining the supremum in (1), whence $g = \sum_{i=1}^{N} q_i \nabla \ell(x; s_i)$ is a subgradient of L at x.

As the Lipschitz constant of L is at most that of ℓ, we may use these subgradients in the subgradient method [44] and find an approximate solution in order $\epsilon^{-2}$ steps. This requires order $N \epsilon^{-2}$ evaluations of $\nabla \ell$, regardless of the uncertainty set.
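The full-batch procedure just described can be sketched end to end for the CVaR uncertainty set, using squared losses as an illustrative (not the paper's) choice of ℓ. Each step costs N gradient evaluations, which is exactly where the order-$N\epsilon^{-2}$ total count comes from.

```python
import numpy as np

def cvar_subgradient_method(A, b, alpha, steps=500, eta=1.0):
    """Full-batch subgradient method for the CVaR of squared losses
    l_i(x) = 0.5 * (a_i @ x - b_i)**2 (an illustrative loss; the text
    covers general convex Lipschitz losses). Assumes alpha * N is an
    integer. Returns the averaged iterate.
    """
    N, d = A.shape
    k = int(round(alpha * N))
    x = np.zeros(d)
    avg = np.zeros(d)
    for t in range(steps):
        r = A @ x - b                          # residuals: N gradient evals
        losses = 0.5 * r ** 2
        q = np.zeros(N)
        q[np.argsort(losses)[-k:]] = 1.0 / k   # sup-attaining q: worst alpha-fraction
        g = A.T @ (q * r)                      # g = sum_i q_i * grad l_i(x)
        x = x - (eta / np.sqrt(t + 1)) * g     # decaying-stepsize subgradient step
        avg += x
    return avg / steps
```

On a toy problem whose per-sample losses share a minimizer (A = I, b = (1, 1)), the averaged iterate approaches that minimizer, while every step still touches all N samples.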
Funding
• DL, YC and JCD were supported by the NSF under CAREER Award CCF-1553086 and HDR 1934578 (the Stanford Data Science Collaboratory) and Office of Naval Research YIP Award N00014-19-2288
• YC was supported by the Stanford Graduate Fellowship
• AS is supported by the NSF CAREER Award CCF-1844855
This scheme relies on MLMC estimators for both the gradient ∇Lχ2-pen and the derivative of Lχ2-pen with respect to λ. Proposition 4 guarantees that the second moment of our gradient estimators remains bounded by a quantity that depends logarithmically on n. For these estimators, Proposition 3 thus directly provides complexity guarantees for minimizing LCVaR and Lχ2-pen
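A minimal sketch of the multi-level Monte Carlo idea in the spirit of Blanchet and Glynn [7] (shown here for a scalar f(E[X]) rather than the paper's gradient estimator, with the geometric level distribution truncated for simplicity): a randomized level J and an antithetic-style correction term debias the plug-in estimate in expectation.

```python
import numpy as np

def mlmc_estimate(sample, f, rng, n0=2, max_level=20):
    """One draw of an MLMC estimator of f(E[X]) (illustrative sketch).
    `sample(k, rng)` must return k i.i.d. draws as a 1-D array; `f` is a
    smooth scalar function. The level J is geometric, P(J = j) = 2^{-(j+1)},
    truncated at max_level (a simplification that slightly skews the
    weight of the last level).
    """
    j = min(int(rng.geometric(0.5)) - 1, max_level)
    p_j = 0.5 ** (j + 1)
    n = n0 * 2 ** (j + 1)
    x = sample(n, rng)
    # correction: fine-level estimate minus the average of the estimates
    # on the two halves; summed over levels, its expectation telescopes
    # away the plug-in bias
    delta = f(x.mean()) - 0.5 * (f(x[: n // 2].mean()) + f(x[n // 2:].mean()))
    return f(sample(n0, rng).mean()) + delta / p_j
```

When X is degenerate (constant), every correction term vanishes and the estimator returns f(E[X]) exactly; for non-degenerate X one averages many independent draws.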

References
• A. Ben-Tal, L. E. Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
• A. Ben-Tal, D. den Hertog, A. D. Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341– 357, 2013.
• D. Bertsimas, D. Brown, and C. Caramanis. Theory and applications of robust optimization. SIAM Review, 53(3):464–501, 2011.
• D. Bertsimas, V. Gupta, and N. Kallus. Data-driven robust optimization. Mathematical Programming, Series A, 167(2):235–292, 2018.
• J. Blanchet and Y. Kang. Semi-supervised Learning Based on Distributionally Robust Optimization, chapter 1, pages 1–33. John Wiley & Sons, Ltd, 2020. ISBN 9781119721871.
• J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857, 2019.
• J. H. Blanchet and P. W. Glynn. Unbiased Monte Carlo for optimization and functions of expectations via multi-level randomization. In 2015 Winter Simulation Conference (WSC), pages 3656–3667. IEEE, 2015.
• S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: a Nonasymptotic Theory of Independence. Oxford University Press, 2013.
• G. Braun, C. Guzmán, and S. Pokutta. Lower bounds on the oracle complexity of nonsmooth convex optimization via information theory. IEEE Transactions on Information Theory, 63(7), 2017.
• J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91, 2018.
• N. Cressie and T. R. Read. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B, pages 440–464, 1984.
• I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientifica Mathematica Hungary, 2:299–318, 1967.
• S. Curi, K. Levy, S. Jegelka, A. Krause, et al. Adaptive sampling for stochastic risk-averse learning. arXiv:1910.12511 [cs.LG], 2019.
• T. E. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. In Proceedings of the Fourth International Conference on Computer Vision Theory and Applications, February 2009.
• E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.
• J. C. Duchi. Introductory lectures on stochastic convex optimization. In The Mathematics of Data, IAS/Park City Mathematics Series. American Mathematical Society, 2018.
• J. C. Duchi and H. Namkoong. Learning models with uniform performance via distributionally robust optimization. Annals of Statistics, to appear, 2020.
• J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.
• J. C. Duchi, T. Hashimoto, and H. Namkoong. Distributionally robust losses against mixture covariate shifts. 2019.
• R. Durrett. Probability: theory and examples, volume 49. Cambridge University Press, 2019.
• B. Efron and C. Stein. The jackknife estimate of variance. The Annals of Statistics, 9(3): 586–596, 1981.
• P. M. Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, Series A, 171(1–2):115–166, 2018.
• Y. Fan, S. Lyu, Y. Ying, and B. Hu. Learning with average top-k loss. In Advances in Neural Information Processing Systems 30, pages 497–505, 2017.
• A. Fuster, P. Goldsmith-Pinkham, T. Ramadorai, and A. Walther. Predictably unequal? the effects of machine learning on credit markets. Social Science Research Network: 3072038, 2018.
• S. Ghosh, M. Squillante, and E. Wollega. Efficient stochastic gradient descent for distributionally robust learning. arXiv:1805.08728 [stats.ML], 2018.
• M. B. Giles. Multilevel Monte Carlo path simulation. Operations research, 56(3):607–617, 2008.
• M. B. Giles. Multilevel Monte Carlo methods. Acta Numerica, 24:259–328, 2015.
• C. Guzmán and A. Nemirovski. On lower complexity bounds for large-scale smooth convex optimization. Journal of Complexity, 31(1):1–14, 2015.
• T. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang. Fairness without demographics in repeated loss minimization. In Proceedings of the 35th International Conference on Machine Learning, 2018.
• K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770– 778, 2016.
• D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In Proceedings of the Seventh International Conference on Learning Representations, 2019.
• J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I. Springer, New York, 1993.
• W. Hu, G. Niu, I. Sato, and M. Sugiayma. Does distributionally robust supervised learning give robust classifiers? In Proceedings of the 35th International Conference on Machine Learning, 2018.
• N. Kalra and S. M. Paddock. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice, 94:182–193, 2016.
• K. Kawaguchi and H. Lu. Ordered SGD: A new stochastic optimization framework for empirical risk minimization. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020.
• G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, Series A, 133(1–2):365–397, 2012.
• Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, and V. Vapnik. Comparison of learning algorithms for handwritten digit recognition. In International Conference on Artificial Neural Networks, pages 53–60, 1995.
• R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
• H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems 29, 2016.
• A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
• A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
• Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372–376, 1983.
• Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2004.
• Y. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, Series A, 103:127–152, 2005.
• L. Oakden-Rayner, J. Dunnmon, G. Carneiro, and C. Ré. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 151–159, 2020.
• Y. Oren, S. Sagawa, T. Hashimoto, and P. Liang. Distributionally robust language modeling. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
• A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In Neural Information Processing Systems (NIPS) Workshop on Automatic Differentiation, 2017.
• J. Pitman. Probability. Springer-Verlag, 1993.
• B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the 36th International Conference on Machine Learning, 2019.
• R. T. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
• O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
• S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In Proceedings of the Eighth International Conference on Learning Representations, 2020.
• S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In Advances in Neural Information Processing Systems 19, 2006.
• S. Shalev-Shwartz and Y. Wexler. Minimizing the maximal loss: How and why? In Proceedings of the 33rd International Conference on Machine Learning, 2016.
• O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pages 71–79, 2013.
• A. Shapiro. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258–2275, 2017.
• A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM and Mathematical Programming Society, 2009.
• A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. In Proceedings of the Sixth International Conference on Learning Representations, 2018.
• M. Staib and S. Jegelka. Distributionally robust optimization and generalization in kernel methods. In Advances in Neural Information Processing Systems 32, pages 9134–9144, 2019.
• A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1521–1528. IEEE, 2011.
• A. A. Trindade, S. Uryasev, A. Shapiro, and G. Zrazhevsky. Financial prediction with constrained tail risk. Journal of Banking & Finance, 31(11):3524–3538, 2007.
• T. van Erven and P. Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
• M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
• S. Wang, W. Guo, H. Narasimhan, A. Cotter, M. Gupta, and M. I. Jordan. Robust optimization for fairness with noisy protected groups. arXiv:2002.09343 [cs.LG], 2020.
• B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer-Verlag, 1997.
• M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
Author
Daniel Levy, Yair Carmon, John C. Duchi, Aaron Sidford