# High-Dimensional Robust Mean Estimation via Gradient Descent

ICML 2020.

Keywords:

non-convex formulation, good sample, positive semidefinite, robust estimation, projected gradient descent

Abstract:

We study the problem of high-dimensional robust mean estimation in the presence of a constant fraction of adversarial outliers. A recent line of work has provided sophisticated polynomial-time algorithms for this problem with dimension-independent error guarantees for a range of natural distribution families. In this work, we show tha...


Introduction

- Learning in the presence of outliers is an important goal in machine learning that has become a pressing challenge in a number of high-dimensional data analysis applications, including data poisoning attacks [BNJT10, BNL12, SKL17] and exploratory analysis of real datasets with natural outliers, e.g., in biology [RPW+02, PLJD10, LAT+08].
- The adversary is allowed to inspect the samples, remove up to εN of them, and replace them with arbitrary points
- This modified set of N points is given as input to the algorithm.
- Note that the spectral norm is not a differentiable function, so the authors need an alternative definition of stationarity
- To address this, using the variational definition of the spectral norm, the authors define a function F(w, u) = u⊤Σ_w u that takes two arguments: the weights w ∈ R^N and a unit vector u ∈ R^d. The non-convex objective min_w f(w) := ‖Σ_w‖₂ is then equivalent to the minimax problem min_w max_u F(w, u).
- The authors say that w ∈ K is a first-order stationary point if there exists some u ∈ arg max_v F(w, v) such that (∇_w F(w, u))⊤(w′ − w) ≥ 0 for all w′ ∈ K
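To make the objective concrete, here is a minimal numerical sketch (our own illustration, not the authors' released code), assuming Σ_w denotes the weighted sample covariance Σᵢ wᵢ(xᵢ − μ_w)(xᵢ − μ_w)⊤ with μ_w = Σᵢ wᵢxᵢ. For weights summing to one, ∂F(w, u)/∂wᵢ = ⟨u, xᵢ − μ_w⟩², so evaluating f and a subgradient only requires a top eigenvector of Σ_w:

```python
import numpy as np

def weighted_cov(X, w):
    """Sigma_w = sum_i w_i (x_i - mu_w)(x_i - mu_w)^T with mu_w = sum_i w_i x_i."""
    mu = X.T @ w
    Xc = X - mu
    return (Xc * w[:, None]).T @ Xc

def objective_and_subgradient(X, w):
    """f(w) = ||Sigma_w||_2 and a subgradient of f at w (weights summing to 1)."""
    Sigma = weighted_cov(X, w)
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    u = eigvecs[:, -1]            # maximizer of u^T Sigma_w u over unit vectors
    f = eigvals[-1]               # spectral norm: Sigma_w is PSD for w >= 0
    mu = X.T @ w
    g = ((X - mu) @ u) ** 2       # grad_w F(w, u): entries <u, x_i - mu_w>^2
    return f, g
```

The sign ambiguity of the eigenvector is harmless here, since the subgradient depends on u only through squared inner products.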

Highlights

- In Section 3, we prove our main structural result showing that any stationary point of the spectral norm objective yields a good solution
- In Section 4, we show that gradient descent converges to an approximate stationary point and yields a good solution in a polynomial number of iterations
- In Appendix C, we prove structural and algorithmic results for the softmax objective, showing that any approximate stationary point of the softmax objective yields a good solution, and we can find an approximate stationary point using projected gradient descent in a polynomial number of iterations
- Our proof is carried out in two steps: (1) We establish a structural lemma which states that every stationary point w must satisfy a bimodal subgradient property; (2) We show any point satisfying such property must have a small objective value
- The main technical contribution of this paper is in showing that any approximate stationary point of our non-convex objective suffices to solve the underlying learning problem
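The steps above can be assembled into a short projected-gradient sketch. The constraint set used here, K = {w : 0 ≤ wᵢ ≤ 1/((1 − ε)N), Σᵢ wᵢ = 1}, matches the weight set described in the paper, but the step size, iteration budget, best-iterate selection, and the bisection-based projection are our own illustrative choices rather than the authors' tuned algorithm:

```python
import numpy as np

def project_capped_simplex(v, cap):
    """Euclidean projection onto {w : 0 <= w_i <= cap, sum_i w_i = 1},
    found by bisecting on the shift tau in clip(v - tau, 0, cap)."""
    lo, hi = v.min() - cap, v.max()
    for _ in range(100):
        tau = (lo + hi) / 2.0
        if np.clip(v - tau, 0.0, cap).sum() > 1.0:
            lo = tau              # total mass too large: shift further down
        else:
            hi = tau
    return np.clip(v - (lo + hi) / 2.0, 0.0, cap)

def robust_mean_pgd(X, eps, step=0.01, iters=200):
    """Projected subgradient descent on f(w) = ||Sigma_w||_2 over the
    capped simplex K = {w : 0 <= w_i <= 1/((1-eps)N), sum_i w_i = 1}."""
    N = X.shape[0]
    cap = 1.0 / ((1.0 - eps) * N)
    w = np.full(N, 1.0 / N)
    best_w, best_f = w, np.inf
    for _ in range(iters):
        mu = X.T @ w
        Xc = X - mu
        Sigma = (Xc * w[:, None]).T @ Xc
        vals, vecs = np.linalg.eigh(Sigma)
        if vals[-1] < best_f:     # keep the iterate with the smallest objective
            best_f, best_w = vals[-1], w
        u = vecs[:, -1]
        g = (Xc @ u) ** 2         # subgradient of the spectral-norm objective
        w = project_capped_simplex(w - step * g, cap)
    return X.T @ best_w           # weighted mean mu_w of the best iterate
```

Returning the iterate with the smallest objective value is in the spirit of the structural result above: any weight vector with small spectral-norm objective yields a good mean estimate.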

Conclusion

- The main conceptual contribution of this work is to establish an intriguing connection between algorithmic high-dimensional robust statistics and non-convex optimization.
- The authors showed that high-dimensional robust mean estimation can be efficiently solved by directly applying a first-order method to a natural non-convex formulation of the problem.
- The main technical contribution of this paper is in showing that any approximate stationary point of the non-convex objective suffices to solve the underlying learning problem.
- The authors note that the upper bound is fairly loose, and they did not make an explicit effort to optimize the polynomial dependence


Related work

- The algorithmic question of designing efficient robust mean estimators in high-dimensions has been extensively studied in recent years. After the initial papers [DKK+16, LRV16], a number of works [DKK+17, SCV18, CDG18, DHL19, DL19, CDGW19] have obtained algorithms with improved asymptotic worst-case runtimes that work under weaker distributional assumptions on the good data. Moreover, efficient high-dimensional robust mean estimators have been used as primitives for robustly solving a range of machine learning tasks that can be expressed as stochastic optimization problems [PSBR18, DKK+19a].

We compare our approach with the works of [CDG18] and [DHL19], which give the asymptotically fastest known algorithms for robust mean estimation. At a high level, [CDG18], building on the convex programming relaxation of [DKK+16], proposed a primal-dual approach for robust mean estimation that reduces the problem to a poly-logarithmic number of packing and covering SDPs.

Each such SDP can be solved in nearly-linear time Õ(Nd) using mirror descent [ALO16, PTZ16]. [DHL19] builds on the iterative spectral approach of [DKK+16], using the matrix multiplicative weights update method with a specific regularization and dimension reduction to improve the worst-case runtime.

Reference

- [ALO16] Z. Allen-Zhu, Y. Lee, and L. Orecchia. Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In Proc. 27th Annual Symposium on Discrete Algorithms (SODA), pages 1824–1831, 2016.
- [BDLS17] S. Balakrishnan, S. S. Du, J. Li, and A. Singh. Computationally efficient robust sparse estimation in high dimensions. In Proc. 30th Annual Conference on Learning Theory (COLT), pages 169–212, 2017.
- [Bec17] A. Beck. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017.
- [BNJT10] M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar. The security of machine learning. Machine Learning, 81(2):121–148, 2010.
- [BNL12] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. In Proc. 29th International Conference on Machine Learning (ICML), 2012.
- [CDG18] Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. CoRR, abs/1811.09380, 2018. Conference version in SODA 2019, pages 2755–2771.
- [CDGW19] Y. Cheng, I. Diakonikolas, R. Ge, and D. P. Woodruff. Faster algorithms for highdimensional robust covariance estimation. In Conference on Learning Theory, COLT 2019, pages 727–757, 2019.
- [CLS15] E. J. Candes, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
- [DD18] D. Davis and D. Drusvyatskiy. Stochastic subgradient method converges at the rate O(k^{-1/4}) on weakly convex functions. arXiv preprint arXiv:1802.02988, 2018.
- [DHL19] Y. Dong, S. B. Hopkins, and J. Li. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. CoRR, abs/1906.11366, 2019. Conference version in NeurIPS 2019.
- [DK19] I. Diakonikolas and D. M. Kane. Recent advances in algorithmic high-dimensional robust statistics. CoRR, abs/1911.05911, 2019.
- [DKK+16] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), pages 655–664, 2016.
- [DKK+17] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in high dimensions) can be practical. In Proc. 34th International Conference on Machine Learning (ICML), pages 999–1008, 2017.
- [DKK+19a] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, J. Steinhardt, and A. Stewart. SEVER: A robust meta-algorithm for stochastic optimization. In Proc. 36th International Conference on Machine Learning (ICML), pages 1596–1606, 2019.
- [DKK+19b] I. Diakonikolas, S. Karmalkar, D. Kane, E. Price, and A. Stewart. Outlier-robust highdimensional sparse estimation via iterative filtering. In Advances in Neural Information Processing Systems 33, NeurIPS 2019, pages 10688–10699, 2019.
- [DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In Proc. 58th IEEE Symposium on Foundations of Computer Science (FOCS), pages 73–84, 2017.
- [DKS19] I. Diakonikolas, W. Kong, and A. Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proc. 30th Annual Symposium on Discrete Algorithms (SODA), pages 2745–2754, 2019.
- [DL19] J. Depersin and G. Lecue. Robust subgaussian estimation of a mean vector in nearly linear time. CoRR, abs/1906.03058, 2019.
- [Dru17] D. Drusvyatskiy. The proximal point method revisited. arXiv preprint arXiv:1712.06038, 2017.
- [GHJY15] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
- [HSK17] H. Hassani, M. Soltanolkotabi, and A. Karbasi. Gradient methods for submodular maximization. In Advances in Neural Information Processing Systems, pages 5841–5851, 2017.
- [Hub64] P. J. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 1964.
- [JK17] P. Jain and P. Kar. Non-convex optimization for machine learning. Foundations and Trends in Machine Learning, 10(3-4):142–336, 2017.
- [JNJ19] C. Jin, P. Netrapalli, and M. I. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? arXiv preprint arXiv:1902.00618, 2019.
- [KKM18] A. Klivans, P. Kothari, and R. Meka. Efficient algorithms for outlier-robust regression. In Proc. 31st Annual Conference on Learning Theory (COLT), pages 1420–1430, 2018.
- [KMO10] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE transactions on information theory, 56(6):2980–2998, 2010.
- [LAT+08] J. Z. Li, D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto, S. Ramachandran, H. M. Cann, G. S. Barsh, M. Feldman, L. L. Cavalli-Sforza, and R. M. Myers. Worldwide human relationships inferred from genome-wide patterns of variation. Science, 319:1100–1104, 2008.
- [LRV16] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), pages 665–674, 2016.
- [LW11] P. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pages 2726–2734, 2011.
- [PLJD10] P. Paschou, J. Lewis, A. Javed, and P. Drineas. Ancestry informative markers for fine-scale individual assignment to worldwide populations. Journal of Medical Genetics, 47:835–847, 2010.
- [PSBR18] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.
- [PTZ16] R. Peng, K. Tangwongsan, and P. Zhang. Faster and simpler width-independent parallel algorithms for positive semidefinite programming. arXiv preprint arXiv:1201.5135v3, 2016.
- [Roc70] R. T. Rockafellar. Convex analysis. Number 28. Princeton university press, 1970.
- [Roc81] R. T. Rockafellar. Favorable classes of Lipschitz continuous functions in subgradient optimization. 1981.
- [Roc15] R. T. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics and Physics. Princeton University Press, 2015.
- [RPW+02] N. Rosenberg, J. Pritchard, J. Weber, H. Cann, K. Kidd, L.A. Zhivotovsky, and M.W. Feldman. Genetic structure of human populations. Science, 298:2381–2385, 2002.
- [SCV18] J. Steinhardt, M. Charikar, and G. Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. In Proc. 9th Innovations in Theoretical Computer Science Conference (ITCS), pages 45:1–45:21, 2018.
- [SKL17] J. Steinhardt, P. W. Koh, and P. S. Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems 30, pages 3520–3532, 2017.
- [TBS+15] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
- [Tuk60] J. W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, 2:448–485, 1960.
- [Wil67] R. M. Wilcox. Exponential operators and parameter differentiation in quantum physics. Journal of Mathematical Physics, 8(4):962–982, 1967.
