Distributionally Robust Local Non-parametric Conditional Estimation
NeurIPS 2020
Conditional estimation given specific covariate values (i.e., local conditional estimation or functional estimation) is ubiquitously useful, with applications in engineering and the social and natural sciences. Existing data-driven non-parametric estimators mostly focus on structured homogeneous data (e.g., weakly independent and stationary data) …
- The authors consider the estimation of conditional statistics of a response variable, Y ∈ Rm, given the value of a predictor or covariate X ∈ Rn.
- The authors propose the following distributionally robust local conditional estimation problem: min over β ∈ Rm of the worst-case conditional expected loss sup_{Q ∈ B} E_Q[ℓ(Y, β) | X near x0], where B is an ambiguity set of joint distributions (problem (2) in the paper).
- The authors demonstrate that when the ambiguity set is a type-∞ Wasserstein ball around the empirical measure, the proposed min-max estimation problem can be efficiently solved in many applicable settings, including notably the local conditional mean and quantile estimation.
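As context for the type-∞ Wasserstein ball: for two equal-weight empirical measures on the real line, the type-∞ (max-transport-cost) Wasserstein distance is attained by the monotone coupling, so it reduces to the largest gap between sorted samples. A minimal one-dimensional illustration (not the paper's algorithm, which works with joint measures over covariates and responses):

```python
import numpy as np

def winf_empirical_1d(xs, ys):
    """Type-infinity Wasserstein distance between two equal-weight
    empirical measures on the real line.  In 1-D the monotone
    (sorted) coupling is optimal, so the distance is the largest
    gap between corresponding order statistics."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    assert xs.shape == ys.shape, "equal sample sizes assumed"
    return float(np.max(np.abs(xs - ys)))

print(winf_empirical_1d([0.0, 1.0, 2.0], [0.1, 1.0, 2.5]))  # -> 0.5
```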
- In the classical formulation [22], the minimization is taken over the space of all measurable functions from Rn to Rm.
- We introduce a novel paradigm of non-parametric local conditional estimation based on distributionally robust optimization
- Since our contribution is primarily on introducing a novel conceptual paradigm powered by distributionally robust optimization (DRO), we focus on discussing well-understood estimators that encompass most of the conceptual ideas used to mitigate the challenges exposed earlier
- The conditional mean estimation problem is challenging when x0 is close to the jump points of the density function p(x), that is, at x0 = 0.3 or x0 = 0.7, because the data fall unequally on the two sides of such a point.
- 2 Local Conditional Estimate using Type-∞ Wasserstein Ambiguity Set
- To solve the estimation problem (2), the authors study the worst-case conditional expected loss function f(β).
- The distributionally robust local conditional estimation problem (2) is equivalent to a second-order cone program of the form min λ subject to second-order cone constraints on (λ, β) (stated in full in the paper).
- The values of α calculated in Theorem 2.3 are indicators: αi = 1 if it is optimal to perturb sample point i when computing the worst-case conditional expected loss.
- By decomposing the measure Q using the set of probability measures πi and exploiting the definition of the type-∞ Wasserstein distance as in the proof of Proposition 2.2, the authors obtain the desired reformulation.
- Let I and I1 be the index sets defined in (4a)-(4b); the value f(β) equals the optimal value of a linear-fractional program of the form f(β) = max over α of ( Σ_{i∈I} v_i(β) α_i ) / ( Σ_{i∈I} α_i ), subject to constraints on the weights α_i given in the paper.
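Linear-fractional programs of this kind can be reduced to ordinary linear programs via the Charnes-Cooper transformation, which the authors cite. A generic sketch with illustrative data (the vector c stands in for the values v_i(β); the instance is not taken from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def max_linear_fractional(c, d, A, b):
    """Maximise (c @ a) / (d @ a) over {a >= 0 : A @ a <= b}, assuming
    d @ a > 0 on the feasible set.  Charnes-Cooper substitution
    y = t*a with t = 1/(d @ a) yields the LP
        max c @ y   s.t.  A @ y - b*t <= 0,  d @ y = 1,  y, t >= 0."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    obj = np.concatenate([-np.asarray(c, dtype=float), [0.0]])  # linprog minimises
    A_ub = np.hstack([A, -np.asarray(b, dtype=float).reshape(-1, 1)])
    A_eq = np.concatenate([np.asarray(d, dtype=float), [0.0]]).reshape(1, -1)
    res = linprog(obj, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + 1))
    y, t = res.x[:n], res.x[n]
    return -res.fun, y / t  # optimal ratio and a maximiser a

# toy instance: maximise (a1 + 2*a2) / (a1 + a2) over a1 + a2 <= 1, a >= 0;
# the ratio 2 is attained with a1 = 0
val, a = max_linear_fractional([1, 2], [1, 1], [[1.0, 1.0]], [1.0])
print(round(val, 6))  # -> 2.0
```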
- Before proving Proposition 2.5, the authors need the following two results, which assert the analytical optimal value of maximizing a convex quadratic function over a norm ball.
- To facilitate the proof of Lemma B.3, the authors define the conditional ambiguity set Bx0,γ(B∞ρ) induced by B∞ρ.
- The last constraint defining the set Bx0,γ(B∞ρ) comes from the disintegration of the joint measure into a marginal distribution and the corresponding conditional distributions [39, Theorem 9.2.2].
- The proof of Lemma B.3 relies on the following two results which assert the convexity of the joint ambiguity set B∞ ρ and its induced conditional ambiguity set Bx0,γ(B∞ ρ ).
- By the definition of the conditional ambiguity set Bx0,γ(B∞ρ), it suffices to prove the equivalence between problem (2) and the problem min over β ∈ Rm of the supremum over μ0 ∈ Bx0,γ(B∞ρ) of the expected loss under μ0.
- The authors elaborate on the procedure of applying a golden-section search to solve a one-dimensional local conditional estimation problem with a convex loss function.
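The golden-section search itself is standard; a minimal sketch, with an illustrative quadratic standing in for the one-dimensional worst-case objective f(β):

```python
import math

def golden_section_minimize(f, lo, hi, tol=1e-8):
    """Minimise a unimodal (e.g. convex) function f on [lo, hi] by
    golden-section search: shrink the bracket by the inverse golden
    ratio each step, reusing one interior evaluation per iteration."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # 1/phi ~ 0.618
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                      # minimiser lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                            # minimiser lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

# illustrative convex loss with minimiser 1.7 on [0, 5]
print(round(golden_section_minimize(lambda b: (b - 1.7) ** 2, 0.0, 5.0), 6))  # -> 1.7
```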
- Table 1: Median of hyper-parameters (H.P.) obtained with cross-validation
- Table 2: Comparison of expected out-of-sample classification accuracy (in %)
- Table 3: Median of hyper-parameters (H.P.) for the synthetic data experiment obtained with cross-validation
- One can argue that every single prediction task in machine learning ultimately relates to conditional estimation. So, attempting to provide a full literature survey on non-parametric conditional estimation is an impossible task. Since our contribution is primarily on introducing a novel conceptual paradigm powered by DRO, we focus on discussing well-understood estimators that encompass most of the conceptual ideas used to mitigate the challenges exposed earlier.
The challenges of conditioning on zero-probability events and the fact that x0 may not be a part of the sample are addressed by averaging around a neighborhood of the point of interest and by smoothing. This gives rise to estimators such as k-NN, and kernel density estimators, including the Nadaraya-Watson estimator [31, 43] and the Epanechnikov estimator, among others. Additional averaging methods include random forests and Classification and Regression Trees (CARTs), among other techniques.
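As a concrete instance of the smoothing estimators above, a minimal Nadaraya-Watson sketch; the Gaussian kernel and the data here are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

def nadaraya_watson(x0, X, Y, bandwidth):
    """Nadaraya-Watson estimate of E[Y | X = x0]: a kernel-weighted
    average of the observed responses (Gaussian kernel here)."""
    w = np.exp(-0.5 * ((np.asarray(X, dtype=float) - x0) / bandwidth) ** 2)
    return float(np.sum(w * np.asarray(Y, dtype=float)) / np.sum(w))

# noiseless Y = X on a symmetric grid: the estimate at the centre is
# a symmetric weighted average, hence recovers 0.5
X = np.linspace(0.0, 1.0, 101)
print(round(nadaraya_watson(0.5, X, X, bandwidth=0.05), 6))  # -> 0.5
```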
- Material in this paper is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-20-1-0397
- Additional support is gratefully acknowledged from NSF grants 1915967, 1820942, 1838676 and from the China Merchant Bank.
A Additional Experiment Results
A.1
Finally, we report on an experiment that challenges the capacity of both the N-W and DRCME estimators to be resilient to adversarial corruption of the test images. This is done by exposing the two estimators to images from the training set (N = 100) that have been corrupted in a way that makes them resemble the closest differently-labeled image in the set. Figure 5 presents several visual examples of the progressively corrupted images and the resulting N-W and DRCME estimations.
- C. D. Aliprantis and K. C. Border. Infinite Dimensional Analysis: A Hitchhiker’s Guide. Springer, 2006.
- F. Alizadeh and D. Goldfarb. Second-order cone programming. Mathematical Programming, 95:3–51, 2003.
- D. P. Bertsekas. Control of Uncertain Systems with a Set-Membership Description of Uncertainty. PhD thesis, Massachusetts Institute of Technology, 1971.
- D. Bertsimas, V. Gupta, and N. Kallus. Data-driven robust optimization. Mathematical Programming, 167(2):235–292, 2018.
- D. Bertsimas, C. McCord, and B. Sturt. Dynamic optimization with side information. arXiv preprint arXiv:1907.07307, 2019.
- D. Bertsimas, S. Shtern, and B. Sturt. Two-stage sample robust optimization. arXiv preprint arXiv:1907.07142, 2019.
- R. Bhattacharjee and K. Chaudhuri. When are non-parametric methods robust? In International Conference on Machine Learning, 2020.
- J. Blanchet and K. Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600, 2019.
- L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
- L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.
- A. Charnes and W. W. Cooper. Programming with linear fractional functionals. Naval Research Logistics Quarterly, 9(3-4):181–186, 1962.
- E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.
- L. Devroye. The uniform convergence of nearest neighbor regression function estimators and their application in optimization. IEEE Transactions on Information Theory, 24(2):142–151, 1978.
- V. A. Epanechnikov. Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications, 14(1):153–158, 1969.
- R. Flamary and N. Courty. POT: Python Optimal Transport library, 2017.
- R. Gao and A. J. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
- N. García Trillos and D. Slepčev. On the rate of convergence of empirical measures in ∞-transportation distance. Canadian Journal of Mathematics, 67(6):1358–1383, 2015.
- C. Givens and R. Shortt. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.
- I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the Third International Conference on Learning Representations, 2015.
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
- M. Khoury and D. Hadfield-Menell. On the geometry of adversarial examples. arXiv preprint arXiv:1811.00525, 2018.
- S. Kruk and H. Wolkowicz. Pseudolinear programming. SIAM Review, 41(4):795–805, 1999.
- D. Kuhn, P. M. Esfahani, V. A. Nguyen, and S. Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. INFORMS TutORials in Operations Research, pages 130–166, 2019.
- A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. In Proceedings of the Fifth International Conference on Learning Representations, 2017.
- Y. LeCun and C. Cortes. The MNIST Database of Handwritten Digits, 1998 (accessed May 28, 2020).
- X. Li, Y. Chen, Y. He, and H. Xue. Advknn: Adversarial attacks on k-nearest neighbor classifiers with approximate gradients. arXiv preprint arXiv:1911.06591, 2019.
- A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In Proceedings of the Sixth International Conference on Learning Representations, 2018.
- P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166, 2018.
- MOSEK ApS. MOSEK Optimizer API for Python 9.2.10, 2019.
- E. A. Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.
- H. Namkoong and J. C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems 30, pages 2971–2980, 2017.
- V. A. Nguyen, D. Kuhn, and P. Mohajerin Esfahani. Distributionally robust inverse covariance estimation: The Wasserstein shrinkage estimator. arXiv preprint arXiv:1805.07194, 2018.
- A. Raghunathan, J. Steinhardt, and P. Liang. Certified defenses against adversarial examples. In International Conference on Learning Representations, 2018.
- S. Shafieezadeh-Abadeh, D. Kuhn, and P. M. Esfahani. Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1–68, 2019.
- A. Sinha, H. Namkoong, and J. Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
- M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.
- C. J. Stone. Consistent nonparametric regression. Annals of Statistics, 5(4):595–620, 1977.
- D. Stroock. Probability Theory: An Analytic View. Cambridge University Press, 2011.
- F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. D. McDaniel. Ensemble adversarial training: Attacks and defenses. In Proceedings of the Sixth International Conference on Learning Representations, 2018.
- L. Wang, X. Liu, J. Yi, Z.-H. Zhou, and C.-J. Hsieh. Evaluating the robustness of nearest neighbor classifiers: A primal-dual perspective. arXiv preprint arXiv:1906.03972, 2019.
- Y. Wang, S. Jha, and K. Chaudhuri. Analyzing the robustness of nearest neighbors to adversarial examples. In International Conference on Machine Learning, pages 5133–5142, 2018.
- G. S. Watson. Smooth regression analysis. Sankhya: The Indian Journal of Statistics, Series A, pages 359–372, 1964.
- W. Xie. Tractable reformulations of distributionally robust two-stage stochastic programs with ∞-Wasserstein distance. arXiv preprint arXiv:1908.08454, 2019.
- Y.-Y. Yang, C. Rashtchian, Y. Wang, and K. Chaudhuri. Robustness for non-parametric classification: A generic attack and defense. In International Conference on Artificial Intelligence and Statistics, 2020.
- G. Zhao and Y. Ma. Robust nonparametric kernel regression estimator. Statistics & Probability Letters, 116:72–79, 2016.