# Learning Polynomials of Few Relevant Dimensions

COLT, pp. 1161-1227, 2020.


Abstract:

Polynomial regression is a basic primitive in learning and statistics. In its most basic form the goal is to fit a degree $d$ polynomial to a response variable $y$ in terms of an $n$-dimensional input vector $x$. This is extremely well-studied with many applications and has sample and runtime complexity $\Theta(n^d)$. Can one achieve be…

Introduction
• Consider the classical polynomial regression problem in learning and statistics. In its most basic form, the authors receive samples of the form (x, y), with x ∈ Rn drawn from some distribution and y = P(x) for a polynomial P of degree at most d in x.
• Given samples (x, y = P(x)), where x ∼ N(0, Id_n) and P is an unknown degree-d, rank-r polynomial, can one approximately recover the subspace defining P efficiently?
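To make this setup concrete, here is a minimal synthetic-data sketch (illustrative only: the dimensions, the specific polynomial, and all names are assumptions, not the paper's) in which the response depends on x only through a hidden r-dimensional subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, N = 20, 2, 1000

# Hidden subspace U* is spanned by the columns of an n x r orthonormal matrix.
U_star, _ = np.linalg.qr(rng.standard_normal((n, r)))

def P(x):
    """A toy rank-2, degree-3 polynomial: it depends on x only through
    the r-dimensional projection z = U*^T x."""
    z = U_star.T @ x
    return z[0] ** 3 - 3.0 * z[0] + z[0] * z[1] ** 2

# Samples (x, y = P(x)) with x ~ N(0, Id_n).
X = rng.standard_normal((N, n))
y = np.array([P(x) for x in X])
```

Because P factors through U∗, replacing x by its projection onto U∗ leaves y unchanged; this invariance is exactly the low-dimensional structure that subspace recovery exploits.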
Highlights
• Consider the classical polynomial regression problem in learning and statistics
• We introduce a new filtered PCA approach to get a warm start for the true subspace and use geodesic SGD to boost to arbitrary accuracy; our techniques may be of independent interest, especially for problems dealing with subspace recovery or analyzing SGD on manifolds
• We argue that the top eigenvector of the above matrix will have most of its mass in U∗, and this gives us our vector v_{l+1}
• We argue that with high probability, the dominant term given by Y is large and the error from Taylor approximation is small
• While we have already seen that Lemma 7.2 is needed to prove Theorem 7.1, Lemmas 7.6 and 7.7 will be crucial to our arguments in later sections, where we argue that at each step t we make progress scaling with the distance d_P(V(t), V∗), and we need that this distance is comparable to the initial distance d_P(V(0), V∗)
• We argue that with high probability, the dominant term given by X is large, while the error terms from Taylor approximation and from the trigonometric corrections are small
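The highlights above can be pictured with a stylized sketch of the "filter, then PCA" warm start in the simplest rank-1, degree-2 case. The threshold direction, constants, and all names here are illustrative assumptions; the paper's TrimmedPCA and its analysis are considerably more involved:

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, tau = 10, 20000, 0.5

v_star = np.zeros(n); v_star[0] = 1.0        # hidden direction (assumption)
X = rng.standard_normal((N, n))
y = (X @ v_star) ** 2                        # toy rank-1, degree-2 responses

kept = X[y <= tau]                           # filter the data by thresholding y
M = kept.T @ kept / len(kept)                # conditional second moment
# In directions orthogonal to v*, the filter is independent of x, so M ~ I;
# along v*, conditioning on a small response shrinks the second moment.
# Hence the top eigenvector of I - M points (approximately) along v*.
w, V = np.linalg.eigh(np.eye(n) - M)         # eigenvalues in ascending order
v_hat = V[:, -1]                             # eigenvector of largest eigenvalue
```

In this toy instance the recovered direction v_hat aligns closely with v_star; in the paper's setting the analogous eigenvector is only guaranteed to have most of its mass in U∗, which is then refined by boosting.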
Results
• For all δ > 0 and ε ∈ (0, 1), there is an efficient algorithm that takes N = C0(r, d, α)(ln(n/δ))^{c0·d} · n log²(1/ε) samples (x, P(x)), where x ∼ N(0, Id_n) and P is an unknown α-non-degenerate rank-r, degree-d polynomial defined by a hidden subspace U∗, and outputs a subspace U such that, with probability at least 1 − δ, d_P(U, U∗) < ε.
• For all δ > 0 and ε ∈ (0, 1), there is an efficient algorithm that takes N = C(r, d, α) · n log(1/δ)/ε² samples (x, P(x)), for x ∼ N(0, Id_n) and unknown P which is α-non-degenerate of rank r, and outputs a subspace U such that, with probability at least 1 − δ, d_P(U, U∗) < ε.
• Let Θ∗ = (c∗, v∗) be one of the two possible realizations of D for which v∗ ∈ S^{n−1}, and suppose the authors already have a warm start Θ = (c, v), where the coefficients c and c∗ define the univariate degree-d polynomials p(z)
• The workaround for the issue posed in Section 2.2.1 is clear at least in the rank-1 case: to avoid moving in the wasteful direction along the current iterate v, which merely rescales v, compute the vanilla gradient and project it onto the orthogonal complement of v.
• The authors will let P^{ν_cond}_{n,r,d} denote the set of all ν_cond-non-degenerate rank-r polynomials P of degree at most d in n variables that satisfy the normalization conditions, and write P^{ν_cond}_{r,d} for P^{ν_cond}_{r,r,d}.
• The following says that if a set of r orthogonal unit vectors all have a large component in U∗, then their span is close to the true subspace in the sense of either of the distances above.
• Let D denote the distribution of (X, Y), where Y = P(X) for an α-non-degenerate polynomial P of rank r and degree at most d, as in the hypothesis of the theorem.
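In the rank-1 case, the projected-gradient workaround described above can be sketched as follows. This is a toy phase-retrieval-style least-squares loss with an assumed warm start; the paper's actual boosting algorithm takes geodesic SGD steps on the Grassmannian, which this simple project-and-renormalize scheme only approximates:

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 8, 5000
v_star = np.zeros(n); v_star[0] = 1.0      # hidden unit vector (assumption)
X = rng.standard_normal((N, n))
y = (X @ v_star) ** 2                      # noiseless rank-1, degree-2 responses

def projected_sgd_step(v, xb, yb, eta):
    """One SGD step on the unit sphere: compute the vanilla gradient of the
    least-squares loss on a minibatch, project it onto the orthogonal
    complement of v (the tangent space at v), move, and renormalize back
    onto S^{n-1}."""
    resid = (xb @ v) ** 2 - yb
    grad = 4.0 * ((resid * (xb @ v)) @ xb) / len(yb)  # Euclidean gradient
    grad_tan = grad - (grad @ v) * v                  # kill the radial component
    v_new = v - eta * grad_tan
    return v_new / np.linalg.norm(v_new)

# Refine a warm start by stochastic projected-gradient steps.
v = v_star + 0.1 * rng.standard_normal(n)
v /= np.linalg.norm(v)
for _ in range(200):
    idx = rng.integers(0, N, size=64)
    v = projected_sgd_step(v, X[idx], y[idx], eta=0.05)
```

Projecting out the radial component means every step is spent rotating v toward ±v_star rather than rescaling it, which is the point of the rank-1 workaround.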
Conclusion
• Note that Lemma 4.2 already gives a nontrivial algorithmic guarantee for l = 0: given exact access to M^τ_∅, the authors can recover a vector inside the true subspace by taking its top eigenvector.
• For c = {c_I} ∈ R^M, where I ranges over multisets of size at most d consisting of elements of [r], and V ∈ St^n_r (the Stiefel manifold of orthonormal r-frames), let the parameters Θ = (c, V) correspond to a rank-r polynomial
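A minimal concrete version of this parameterization (the helper name and the toy coefficient choice are hypothetical; only the functional form follows the text above):

```python
import numpy as np
from itertools import combinations_with_replacement

def eval_rank_r_poly(c, V, x):
    """Evaluate P_Theta(x) = sum_I c_I * prod_{i in I} <v_i, x>, where each
    index I is a multiset of size at most d over [r] (stored as a sorted
    tuple) and the columns v_1, ..., v_r of V are orthonormal, i.e. V lies
    on the Stiefel manifold St(n, r)."""
    z = V.T @ x                              # <v_i, x> for each column v_i
    return sum(c_I * np.prod([z[i] for i in I]) for I, c_I in c.items())

# Toy instance with n = 5, r = 2, d = 2: multisets of size 1 or 2 over {0, 1}.
n, r, d = 5, 2, 2
V, _ = np.linalg.qr(np.random.default_rng(3).standard_normal((n, r)))
c = {I: 1.0 for k in range(1, d + 1)
     for I in combinations_with_replacement(range(r), k)}
```

For r = 2 and d = 2 there are 2 + 3 = 5 multisets, so c has 5 entries; and since P_Θ depends on x only through V^T x, every such Θ defines a rank-r polynomial.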
Related work
• Filtering Data by Thresholding Our algorithm for obtaining a warm start (see Theorem 2.1) relies on filtering the data via some form of thresholding. This general paradigm has been used in other, unrelated contexts like robustness, see [SS19, SS18, DKK+19a, Li18b, DKK+19b, DKK+17] and the references therein, though typically the points which are smaller than some threshold are removed, whereas our algorithm, TrimmedPCA, is an intriguing case where the opposite kind of filter is applied.

Riemannian Optimization It is beyond the scope of this paper to reliably survey the vast literature on Riemannian optimization methods, and we refer the reader to the standard references on the subject [Udr94, AMS09], which mostly provide asymptotic convergence guarantees, as well as the thesis of Boumal [Bou14] and the references therein. Some notable lines of work include optimization with respect to orthogonality constraints [EAS98], applications to low-rank matrix and tensor completion [MMBS13, Van13, IAVHDL11, KSV14], dictionary learning [SQW16], independent component analysis [SJG09], canonical correlation analysis [LWW15], matrix equation solving [VV10], complexity theory and operator scaling [AZGL+18], subspace tracking [BNR10, ZB], and building a theory of geodesically convex optimization [ZS16, HS15, ZRS16].

We remark that the update rule we use in our boosting algorithm is very similar to that of [BNR10, ZB], as their and our work are based on geodesics on the Grassmannian manifold. That said, they solve a very different problem from ours, and the analysis is quite different.
Funding
• This work was supported in part by a Paul and Daisy Soros Fellowship, NSF CAREER Award CCF-1453261, and NSF Large CCF-1565235
References
• Anima Anandkumar, Rong Ge, and Majid Janzamin. Analyzing tensor power method dynamics: Applications to learning overcomplete latent variable models. arXiv preprint arXiv:1411.1488, 2014.
• Animashree Anandkumar, Rong Ge, and Majid Janzamin. Learning overcomplete latent variable models through tensor methods. In Conference on Learning Theory, pages 36–112, 2015.
• [ALPV12] Albert Ai, Alex Lapanowski, Yaniv Plan, and Roman Vershynin. One-bit compressed sensing with non-gaussian measurements. arXiv preprint arXiv:1208.6279, 2012.
• P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
• Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning sparse polynomial functions. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 500–510. SIAM, 2014.
• Zeyuan Allen-Zhu, Ankit Garg, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. Operator scaling via geodesically convex optimization, invariant theory and polynomial identity testing. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 172–181, 2018.
• Zeyuan Allen-Zhu and Yuanzhi Li. Lazysvd: Even faster svd decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, pages 974–982, 2016.
• Dmitry Babichev, Francis Bach, et al. Slice inverse regression with score functions. Electronic Journal of Statistics, 12(1):1507–1543, 2018.
• V Bentkus. An inequality for tail probabilities of martingales with differences bounded from one side. Journal of Theoretical Probability, 16(1):161–173, 2003.
• Ainesh Bakshi, Rajesh Jayaram, and David P Woodruff. Learning two layer rectified neural networks in polynomial time. arXiv preprint arXiv:1811.01885, 2018.
• Laura Balzano, Robert Nowak, and Benjamin Recht. Online identification and tracking of subspaces from highly incomplete information. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 704–711. IEEE, 2010.
• Nicolas Boumal. Optimization and estimation on manifolds. PhD thesis, 2014.
• Aldo Conca, Dan Edidin, Milena Hering, and Cynthia Vinzant. An algebraic characterization of injectivity in phase retrieval. Applied and Computational Harmonic Analysis, 38(2):346–356, 2015.
• Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
• Emmanuel J Candes, Thomas Strohmer, and Vladislav Voroninski. Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241–1274, 2013.
• Rishabh Dudeja and Daniel Hsu. Learning single-index models in gaussian space. In Sebastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1887–1930. PMLR, 06–09 Jul 2018.
• Arnak S Dalalyan, Anatoly Juditsky, and Vladimir Spokoiny. A new algorithm for estimating the effective dimension-reduction subspace. Journal of Machine Learning Research, 9(Aug):1647–1678, 2008.
• Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 999–1008. JMLR. org, 2017.
• [DKK+19a] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019.
• [DKK+19b] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Conference on Machine Learning, pages 1596–1606, 2019.
• Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1061–1073, 2018.
• Anindya De, Elchanan Mossel, and Joe Neeman. Is your function low dimensional? In Conference on Learning Theory, pages 979–993, 2019.
• Alan Edelman, Tomas A Arias, and Steven T Smith. The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
• Surbhi Goel and Adam Klivans. Learning neural networks with two nonlinear layers in polynomial time. arXiv preprint arXiv:1709.06010, 2017.
• [GKKT16] Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the relu in polynomial time. arXiv preprint arXiv:1611.10258, 2016.
• [GKLW18] Rong Ge, Rohith Kuditipudi, Zhize Li, and Xiang Wang. Learning two-layer neural networks with symmetric inputs. arXiv preprint arXiv:1810.06793, 2018.
• Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
• Rong Ge and Tengyu Ma. Decomposing overcomplete 3rd order tensors using sum-of-squares algorithms. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2015). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2015.
• Sivakant Gopi, Praneeth Netrapalli, Prateek Jain, and Aditya Nori. One-bit compressed sensing: Provable support and vector recovery. In International Conference on Machine Learning, pages 154–162, 2013.
• Marian Hristache, Anatoli Juditsky, Jorg Polzehl, Vladimir Spokoiny, et al. Structure adaptive approach for dimension reduction. The Annals of Statistics, 29(6):1537–1566, 2001.
• Marian Hristache, Anatoli Juditsky, and Vladimir Spokoiny. Direct estimation of the index coefficient in a single-index model. Annals of Statistics, pages 595–623, 2001.
• Reshad Hosseini and Suvrit Sra. Matrix manifold optimization for gaussian mixtures. In Advances in Neural Information Processing Systems, pages 910–918, 2015.
• Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-square proofs. In Conference on Learning Theory, pages 956–1006, 2015.
• Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 178–191, 2016.
• [IAVHDL11] Mariya Ishteva, P-A Absil, Sabine Van Huffel, and Lieven De Lathauwer. Best low multilinear rank approximation of higher-order tensors, based on the riemannian trust-region scheme. SIAM Journal on Matrix Analysis and Applications, 32(1):115–135, 2011.
• Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of nonconvexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
• Adam R Klivans, Ryan O’Donnell, and Rocco A Servedio. Learning intersections and thresholds of halfspaces. Journal of Computer and System Sciences, 68(4):808–840, 2004.
• Adam R Klivans, Ryan O’Donnell, and Rocco A Servedio. Learning geometric concepts via gaussian surface area. In 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pages 541–550. IEEE, 2008.
• Adam R Klivans and Rocco A Servedio. Learning dnf in time 2^{Õ(n^{1/3})}. Journal of Computer and System Sciences, 68(2):303–318, 2004.
• Subhash Khot and Rishi Saket. On hardness of learning intersection of two halfspaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 345–354, 2008.
• Daniel Kressner, Michael Steinlechner, and Bart Vandereycken. Low-rank tensor completion by riemannian optimization. BIT Numerical Mathematics, 54(2):447–468, 2014.
• Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
• Ker-Chau Li. On principal hessian directions for data visualization and dimension reduction: Another application of stein’s lemma. Journal of the American Statistical Association, 87(420):1025–1039, 1992.
• Chris Junchi Li. A note on concentration inequality for vector-valued martingales with weak exponential-type tails. arXiv preprint arXiv:1809.02495, 2018.
• Jerry Zheng Li. Principled approaches to robust machine learning and beyond. PhD thesis, Massachusetts Institute of Technology, 2018.
• Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, fourier transform, and learnability. Journal of the ACM (JACM), 40(3):607–620, 1993.
• Xin-Guo Liu, Xue-Feng Wang, and Wei-Guo Wang. Maximization of matrix trace function of product stiefel manifolds. SIAM Journal on Matrix Analysis and Applications, 36(4):1489–1506, 2015.
• [MMBS13] Bamdev Mishra, Gilles Meyer, Francis Bach, and Rodolphe Sepulchre. Low-rank optimization with trace norm penalty. SIAM Journal on Optimization, 23(4):2124–2149, 2013.
• Elchanan Mossel, Ryan O’Donnell, and Rocco A Servedio. Learning juntas. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 206–212, 2003.
• Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositions with sum-of-squares. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 438–446. IEEE, 2016.
• Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems, pages 2796–2804, 2013.
• Matey Neykov, Zhaoran Wang, and Han Liu. Agnostic estimation for misspecified phase retrieval models. In Advances in Neural Information Processing Systems, pages 4089–4097, 2016.
• Yaniv Plan and Roman Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013.
• Yaniv Plan and Roman Vershynin. The generalized lasso with non-linear observations. IEEE Transactions on information theory, 62(3):1528–1537, 2016.
• Yaniv Plan, Roman Vershynin, and Elena Yudovina. High-dimensional estimation with geometric constraints. Information and Inference: A Journal of the IMA, 6(1):1–40, 2017.
• Hao Shen, Stefanie Jegelka, and Arthur Gretton. Fast kernel-based independent component analysis. IEEE Transactions on Signal Processing, 57(9):3498–3511, 2009.
• Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere ii: Recovery by riemannian trust-region method. IEEE Transactions on Information Theory, 63(2):885–914, 2016.
• Tselil Schramm and David Steurer. Fast and robust tensor decomposition with applications to dictionary learning. In Conference on Learning Theory, pages 1760–1793, 2017.
• Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. arXiv preprint arXiv:1810.11874, 2018.
• Yanyao Shen and Sujay Sanghavi. Iterative least trimmed squares for mixed linear regression. In Advances in Neural Information Processing Systems, pages 6076–6086, 2019.
• Constantin Udriste. Convex functions and optimization methods on Riemannian manifolds, volume 297. Springer Science & Business Media, 1994.
• Bart Vandereycken. Low-rank matrix completion by riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013.
• Santosh S Vempala. Learning convex concepts from gaussian distributions with pca. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 124–130. IEEE, 2010.
• Santosh S Vempala. A random-sampling-based algorithm for learning intersections of halfspaces. Journal of the ACM (JACM), 57(6):1–14, 2010.
• Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
• Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
• Van H Vu. Concentration of non-lipschitz functions and applications. Random Structures & Algorithms, 20(3):262–316, 2002.
• Bart Vandereycken and Stefan Vandewalle. A riemannian optimization approach for computing low-rank solutions of lyapunov equations. SIAM Journal on Matrix Analysis and Applications, 31(5):2553–2579, 2010.
• Santosh S Vempala and Ying Xiao. Structure from local optima: Learning subspace juntas via higher order pca. arXiv preprint arXiv:1108.3329, 2011.
• Zhuoran Yang, Krishnakumar Balasubramanian, and Han Liu. On stein’s identity and near-optimal estimation in high-dimensional index models. arXiv preprint arXiv:1709.08795, 2017.
• Hongyi Zhang, Sashank J Reddi, and Suvrit Sra. Riemannian svrg: Fast stochastic optimization on riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4592–4600, 2016.
• Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617–1638, 2016.