High-Dimensional Robust Mean Estimation via Gradient Descent

Mahdi Soltanolkotabi

ICML 2020.

Keywords: non-convex formulation, good sample, positive semidefinite, robust estimation, projected gradient descent

Abstract:

We study the problem of high-dimensional robust mean estimation in the presence of a constant fraction of adversarial outliers. A recent line of work has provided sophisticated polynomial-time algorithms for this problem with dimension-independent error guarantees for a range of natural distribution families. In this work, we show that ...


Introduction
  • Learning in the presence of outliers is an important goal in machine learning that has become a pressing challenge in a number of high-dimensional data analysis applications, including data poisoning attacks [BNJT10, BNL12, SKL17] and exploratory analysis of real datasets with natural outliers, e.g., in biology [RPW+02, PLJD10, LAT+08].
  • The adversary is allowed to inspect the samples, remove up to εN of them, and replace them with arbitrary points.
  • This modified set of N points is given as input to the algorithm.
  • Note that the spectral norm is not a differentiable function, so the authors need an alternative notion of stationarity.
  • To address this issue, using the definition of the spectral norm, the authors define a function F(w, u) = u⊤ Σ_w u that takes two arguments: the weights w ∈ ℝ^N and a unit vector u ∈ ℝ^d. The authors' non-convex objective min_w f(w) := ‖Σ_w‖₂ is equivalent to solving the minimax problem min_w max_u F(w, u).
  • The authors say that w ∈ K is a first-order stationary point if there exists some u ∈ arg max_v F(w, v) such that (∇_w F(w, u))⊤(w′ − w) ≥ 0 for all w′ ∈ K; a minimal numerical sketch of the resulting projected gradient scheme is given below.
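To make the weighted objective and the projected-gradient iteration above concrete, here is a minimal NumPy sketch. It is an illustration, not the paper's exact algorithm: the constraint set K = {w : 0 ≤ w_i ≤ 1/((1−ε)N), Σ_i w_i = 1}, the step size, the iteration count, and the bisection-based projection are assumptions made for this example.

```python
import numpy as np


def project_capped_simplex(v, cap):
    """Project v onto {w : 0 <= w_i <= cap, sum_i w_i = 1} by bisecting on the shift."""
    lo, hi = np.min(v) - cap, np.max(v)
    for _ in range(60):
        lam = 0.5 * (lo + hi)
        if np.clip(v - lam, 0.0, cap).sum() > 1.0:
            lo = lam
        else:
            hi = lam
    return np.clip(v - 0.5 * (lo + hi), 0.0, cap)


def robust_mean_pgd(X, eps, iters=200, step=0.5):
    """Projected gradient descent on f(w) = ||Sigma_w||_2 (illustrative sketch)."""
    N, d = X.shape
    cap = 1.0 / ((1.0 - eps) * N)        # assumed per-point weight cap defining K
    w = np.full(N, 1.0 / N)              # start from uniform weights
    for _ in range(iters):
        mu = w @ X                       # weighted mean mu_w
        C = X - mu
        Sigma = C.T @ (w[:, None] * C)   # weighted covariance Sigma_w
        u = np.linalg.eigh(Sigma)[1][:, -1]      # top eigenvector: F(w, u) = ||Sigma_w||_2
        z = C @ u                        # z_i = u^T (x_i - mu_w)
        grad = z**2 - 2.0 * (X @ u) * (w @ z)    # exact gradient of F(., u) with respect to w
        w = project_capped_simplex(w - step * grad, cap)
    return w @ X                         # robust mean estimate mu_w


# Usage on an eps-corrupted Gaussian sample with true mean 0.
rng = np.random.default_rng(0)
N, d, eps = 2000, 20, 0.1
X = rng.standard_normal((N, d))
X[: int(eps * N)] += 5.0                 # shift an eps-fraction of the points
print(np.linalg.norm(X.mean(axis=0)))            # error of the naive mean
print(np.linalg.norm(robust_mean_pgd(X, eps)))   # error of the weighted estimate
```

With a large step size the projection effectively performs a soft selection of the points with the smallest deviation along the top eigenvector, which is why even this crude sketch recovers a much better mean estimate than the naive average on the toy data.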
Highlights
  • Learning in the presence of outliers is an important goal in machine learning that has become a pressing challenge in a number of high-dimensional data analysis applications, including data poisoning attacks [BNJT10, BNL12, SKL17] and exploratory analysis of real datasets with natural outliers, e.g., in biology [RPW+02, PLJD10, LAT+08]
  • In Section 3, we prove our main structural result showing that any stationary point of the spectral norm objective yields a good solution
  • In Section 4, we show that gradient descent converges to an approximate stationary point and yields a good solution in a polynomial number of iterations
  • In Appendix C, we prove structural and algorithmic results for the softmax objective, showing that any approximate stationary point of the softmax objective yields a good solution, and that we can find an approximate stationary point using projected gradient descent in a polynomial number of iterations (a toy illustration of softmax-style smoothing of the spectral norm appears after this list)
  • Our proof is carried out in two steps: (1) we establish a structural lemma stating that every stationary point w must satisfy a bimodal subgradient property; (2) we show that any point satisfying such a property must have a small objective value
  • The main technical contribution of this paper is in showing that any approximate stationary point of our non-convex objective suffices to solve the underlying learning problem
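As a toy illustration of the softmax idea mentioned in the Appendix C highlight above, the snippet below compares the spectral norm of a weighted covariance with a generic log-sum-exp (softmax) smoothing of the largest eigenvalue, f_s(w) = (1/s) log tr exp(s Σ_w). The exact softmax objective and its parameters in the paper may differ, so treat this purely as a numerical sanity check of the smoothing.

```python
import numpy as np
from scipy.special import logsumexp


def weighted_cov(X, w):
    """Weighted covariance Sigma_w = sum_i w_i (x_i - mu_w)(x_i - mu_w)^T."""
    mu = w @ X
    C = X - mu
    return C.T @ (w[:, None] * C)


def softmax_spectral_norm(Sigma, s=20.0):
    """Smooth surrogate (1/s) log tr exp(s * Sigma): it upper-bounds lambda_max(Sigma)
    and exceeds it by at most log(d)/s for a symmetric d x d matrix."""
    return logsumexp(s * np.linalg.eigvalsh(Sigma)) / s


rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10))
w = np.full(500, 1.0 / 500)
Sigma = weighted_cov(X, w)
print(np.linalg.norm(Sigma, 2))              # exact spectral norm ||Sigma_w||_2 (PSD case)
print(softmax_spectral_norm(Sigma, s=20.0))  # smooth surrogate, gap at most log(10)/20
```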
Conclusion
  • The main conceptual contribution of this work is to establish an intriguing connection between algorithmic high-dimensional robust statistics and non-convex optimization.
  • The authors showed that high-dimensional robust mean estimation can be efficiently solved by directly applying a first-order method to a natural non-convex formulation of the problem.
  • The main technical contribution of this paper is in showing that any approximate stationary point of the non-convex objective suffices to solve the underlying learning problem.
  • The authors note that the upper bound is fairly loose and that they did not make an explicit effort to optimize the polynomial dependence.
Related work
  • The algorithmic question of designing efficient robust mean estimators in high dimensions has been extensively studied in recent years. After the initial papers [DKK+16, LRV16], a number of works [DKK+17, SCV18, CDG18, DHL19, DL19, CDGW19] have obtained algorithms with improved asymptotic worst-case runtimes that work under weaker distributional assumptions on the good data. Moreover, efficient high-dimensional robust mean estimators have been used as primitives for robustly solving a range of machine learning tasks that can be expressed as stochastic optimization problems [PSBR18, DKK+19a].

    We compare our approach with the works of [CDG18] and [DHL19], which give the asymptotically fastest known algorithms for robust mean estimation. At a high level, [CDG18], building on the convex programming relaxation of [DKK+16], proposed a primal-dual approach for robust mean estimation that reduces the problem to a poly-logarithmic number of packing and covering SDPs.

    Each such SDP is known to be solvable in time O(Nd) using mirror descent [ALO16, PTZ16]. [DHL19] builds on the iterative spectral approach of [DKK+16]: that work uses the matrix multiplicative weights update method with a specific regularization and dimension reduction to improve the worst-case runtime (a schematic of this style of covariance-based outlier scoring is sketched below).
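As a rough schematic of the covariance-based scoring used by matrix-multiplicative-weights-style algorithms, the hypothetical snippet below scores each point against a normalized matrix exponential of the weighted covariance and performs one soft downweighting step. It is only in the spirit of [DHL19] and omits that algorithm's regularization, step-size schedule, and filtering guarantees.

```python
import numpy as np
from scipy.linalg import expm


def mmw_style_scores(X, w, alpha=2.0):
    """Illustrative sketch (not the actual [DHL19] algorithm): score points against
    M ~ exp(alpha * Sigma_w) / tr(...); directions in which the weighted covariance
    is inflated dominate M, so outliers receive large scores."""
    mu = w @ X
    C = X - mu
    Sigma = C.T @ (w[:, None] * C)
    M = expm(alpha * Sigma)
    M /= np.trace(M)                          # density matrix: PSD with unit trace
    return np.einsum('ij,jk,ik->i', C, M, C)  # tau_i = (x_i - mu_w)^T M (x_i - mu_w)


# One soft downweighting step on a toy contaminated sample.
rng = np.random.default_rng(2)
N, d = 1000, 15
X = rng.standard_normal((N, d))
X[:100] += 4.0                                # 10% of points shifted away from the mean
w = np.full(N, 1.0 / N)
tau = mmw_style_scores(X, w)
w = w * (1.0 - tau / tau.max())               # downweight proportionally to the score
w /= w.sum()
print(np.linalg.norm(X.mean(axis=0)), np.linalg.norm(w @ X))  # naive vs reweighted error
```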
Reference
  • [ALO16] Z. Allen-Zhu, Y. Lee, and L. Orecchia. Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In Proc. 27th Annual Symposium on Discrete Algorithms (SODA), pages 1824–1831, 2016.
  • [BDLS17] S. Balakrishnan, S. S. Du, J. Li, and A. Singh. Computationally efficient robust sparse estimation in high dimensions. In Proc. 30th Annual Conference on Learning Theory (COLT), pages 169–212, 2017.
  • [Bec17] A. Beck. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017.
  • [BNJT10] M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar. The security of machine learning. Machine Learning, 81(2):121–148, 2010.
  • [BNL12] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. In Proc. 29th International Conference on Machine Learning (ICML), 2012.
  • [CDG18] Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. CoRR, abs/1811.09380, 2018. Conference version in SODA 2019, pages 2755–2771.
  • [CDGW19] Y. Cheng, I. Diakonikolas, R. Ge, and D. P. Woodruff. Faster algorithms for high-dimensional robust covariance estimation. In Conference on Learning Theory (COLT), pages 727–757, 2019.
  • [CLS15] E. J. Candes, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
  • [DD18] D. Davis and D. Drusvyatskiy. Stochastic subgradient method converges at the rate O(k^{-1/4}) on weakly convex functions. arXiv preprint arXiv:1802.02988, 2018.
  • [DHL19] Y. Dong, S. B. Hopkins, and J. Li. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. CoRR, abs/1906.11366, 2019. Conference version in NeurIPS 2019.
  • [DK19] I. Diakonikolas and D. M. Kane. Recent advances in algorithmic high-dimensional robust statistics. CoRR, abs/1911.05911, 2019.
  • [DKK+16] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), pages 655–664, 2016.
  • [DKK+17] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in high dimensions) can be practical. In Proc. 34th International Conference on Machine Learning (ICML), pages 999–1008, 2017.
  • [DKK+19a] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, J. Steinhardt, and A. Stewart. SEVER: A robust meta-algorithm for stochastic optimization. In Proc. 36th International Conference on Machine Learning (ICML), pages 1596–1606, 2019.
  • [DKK+19b] I. Diakonikolas, S. Karmalkar, D. Kane, E. Price, and A. Stewart. Outlier-robust high-dimensional sparse estimation via iterative filtering. In Advances in Neural Information Processing Systems (NeurIPS 2019), pages 10688–10699, 2019.
  • [DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In Proc. 58th IEEE Symposium on Foundations of Computer Science (FOCS), pages 73–84, 2017.
  • [DKS19] I. Diakonikolas, W. Kong, and A. Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proc. 30th Annual Symposium on Discrete Algorithms (SODA), pages 2745–2754, 2019.
  • [DL19] J. Depersin and G. Lecue. Robust subgaussian estimation of a mean vector in nearly linear time. CoRR, abs/1906.03058, 2019.
  • [Dru17] D. Drusvyatskiy. The proximal point method revisited. arXiv preprint arXiv:1712.06038, 2017.
  • [GHJY15] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points: Online stochastic gradient for tensor decomposition. In Conference on Learning Theory (COLT), pages 797–842, 2015.
  • [HSK17] H. Hassani, M. Soltanolkotabi, and A. Karbasi. Gradient methods for submodular maximization. In Advances in Neural Information Processing Systems, pages 5841–5851, 2017.
  • [Hub64] P. J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 35(1):73–101, 1964.
  • [JK17] P. Jain and P. Kar. Non-convex optimization for machine learning. Foundations and Trends in Machine Learning, 10(3-4):142–336, 2017.
  • [JNJ19] C. Jin, P. Netrapalli, and M. I. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? arXiv preprint arXiv:1902.00618, 2019.
  • [KKM18] A. Klivans, P. Kothari, and R. Meka. Efficient algorithms for outlier-robust regression. In Proc. 31st Annual Conference on Learning Theory (COLT), pages 1420–1430, 2018.
  • [KMO10] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
  • [LAT+08] J. Z. Li, D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto, S. Ramachandran, H. M. Cann, G. S. Barsh, M. Feldman, L. L. Cavalli-Sforza, and R. M. Myers. Worldwide human relationships inferred from genome-wide patterns of variation. Science, 319:1100–1104, 2008.
  • [LRV16] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), pages 665–674, 2016.
  • [LW11] P. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pages 2726–2734, 2011.
  • [PLJD10] P. Paschou, J. Lewis, A. Javed, and P. Drineas. Ancestry informative markers for fine-scale individual assignment to worldwide populations. Journal of Medical Genetics, 47:835–847, 2010.
  • [PSBR18] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.
  • [PTZ16] R. Peng, K. Tangwongsan, and P. Zhang. Faster and simpler width-independent parallel algorithms for positive semidefinite programming. arXiv preprint arXiv:1201.5135v3, 2016.
  • [Roc70] R. T. Rockafellar. Convex Analysis. Number 28. Princeton University Press, 1970.
  • [Roc81] R. T. Rockafellar. Favorable classes of Lipschitz continuous functions in subgradient optimization. 1981.
  • [Roc15] R. T. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics and Physics. Princeton University Press, 2015.
  • [RPW+02] N. Rosenberg, J. Pritchard, J. Weber, H. Cann, K. Kidd, L. A. Zhivotovsky, and M. W. Feldman. Genetic structure of human populations. Science, 298:2381–2385, 2002.
  • [SCV18] J. Steinhardt, M. Charikar, and G. Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. In Proc. 9th Innovations in Theoretical Computer Science Conference (ITCS), pages 45:1–45:21, 2018.
  • [SKL17] J. Steinhardt, P. W. Koh, and P. S. Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems 30, pages 3520–3532, 2017.
  • [TBS+15] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
  • [Tuk60] J. W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, 2:448–485, 1960.
  • [Wil67] R. M. Wilcox. Exponential operators and parameter differentiation in quantum physics. Journal of Mathematical Physics, 8(4):962–982, 1967.