Near-Optimal SQ Lower Bounds for Agnostically Learning Halfspaces and ReLUs under Gaussian Marginals

NeurIPS 2020.

Keywords:
good fitting, Rectified Linear Unit, Learning with Less Labels, marginal distribution, SQ lower bound

Abstract:

We study the fundamental problems of agnostically learning halfspaces and ReLUs under Gaussian marginals. In the former problem, given labeled examples $(\mathbf{x}, y)$ from an unknown distribution on $\mathbb{R}^d \times \{ \pm 1\}$, whose marginal distribution on $\mathbf{x}$ is the standard Gaussian and the labels $y$ can be arbitrary, …

Introduction
• The authors study the fundamental problems of agnostically learning halfspaces and ReLU regression in the distribution-specific agnostic PAC model.
• In both of these problems, the authors are given i.i.d. samples from a joint distribution D on labeled examples (x, y), where x ∈ R^d is the example and y ∈ R is the corresponding label, and the goal is to compute a hypothesis that is competitive with the best-fitting halfspace or ReLU respectively.
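To make the objective concrete, here is a minimal sketch (illustrative names, not from the paper) of the quantity being competed on in the halfspace case: the empirical 0-1 error of a candidate halfspace, which an agnostic learner must bring within ε of OPT, the error of the best halfspace.

```python
import numpy as np

def halfspace_error(w, X, y):
    """Empirical 0-1 error of the halfspace x -> sign(<w, x>) on labeled
    data (X, y). The agnostic goal is error at most OPT + eps, where OPT
    is the error of the best halfspace."""
    preds = np.sign(X @ w)
    preds[preds == 0] = 1        # break ties toward +1
    return float(np.mean(preds != y))

# tiny illustration: four points in R^2, labels given by the first coordinate
X = np.array([[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-2.0, 1.0]])
y = np.array([1, -1, 1, -1])
err = halfspace_error(np.array([1.0, 0.0]), X, y)   # a perfect halfspace
```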
Highlights
• 1.1 Background and Problem Motivation

We study the fundamental problems of agnostically learning halfspaces and Rectified Linear Unit (ReLU) regression in the distribution-specific agnostic PAC model
• In both of these problems, we are given i.i.d. samples from a joint distribution D on labeled examples (x, y), where x ∈ R^d is the example and y ∈ R is the corresponding label, and the goal is to compute a hypothesis that is competitive with the best-fitting halfspace or ReLU respectively
• [GKK19] gave a qualitatively similar reduction implying a computational lower bound of d^{Ω(log(1/ε))} for Problem 1.2
• Our lower bounds suggest that the accuracy-runtime tradeoff of known polynomial time approximation schemes (PTAS) for these problems [Dan15, DGK+20] that achieve error (1 + γ)OPT + ε, for all γ > 0, in time poly(d^{poly(1/γ)}, 1/ε) is qualitatively best possible
• Consider the set of distributions {P_v}, where v is any unit vector, such that the projection of P_v in the v-direction is equal to A and in the orthogonal complement P_v is an independent standard Gaussian
• This set of distributions has Statistical Query (SQ) dimension d^{Ω(k)}. By known results this implies that distinguishing such a distribution from the standard Gaussian, or learning a distribution with better than 1/poly(d^k) correlation with such a distribution, is hard in the SQ model
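The hidden-direction family {P_v} above can be sketched as a sampler. Assuming the one-dimensional distribution A is supplied as a callable drawing i.i.d. samples (a hypothetical interface; the paper specifies A analytically), the construction is just "replace the v-component of a standard Gaussian with a draw from A":

```python
import numpy as np

def sample_P_v(v, sample_A, n, rng=None):
    """Draw n points from P_v: the projection onto the unit vector v is
    distributed as A, and the orthogonal complement is an independent
    standard Gaussian. `sample_A(n)` returns n i.i.d. draws from A."""
    rng = np.random.default_rng(rng)
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)           # ensure v is a unit vector
    g = rng.standard_normal((n, v.shape[0]))
    proj = g @ v                        # current component along v
    a = sample_A(n)                     # desired component along v
    # swap the v-component: subtract the Gaussian part, add the A part
    return g + np.outer(a - proj, v)

# example: with A = N(0, 1), P_v is again just the standard Gaussian
x = sample_P_v(np.eye(5)[0], np.random.default_rng(0).standard_normal, 10000, rng=1)
```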
Results
• The authors' Results and Techniques

The authors are ready to formally state the main results. For Problem 1.1 the authors prove: Theorem 1.4.
• Let d ≥ 1 and ε ≥ d^{-c}, for some sufficiently small constant c > 0.
• Any SQ algorithm that agnostically learns halfspaces on R^d under Gaussian marginals within additive error ε > 0 requires at least d^{c/ε} many statistical queries to STAT(d^{-c/ε}).
• The above statement says that any SQ algorithm for Problem 1.1 requires time at least d^{Ω(1/ε)}.
• This comes close to the known upper bound of d^{O(1/ε²)} [KKMS08] and exponentially improves on the best known lower bound of d^{Ω(log(1/ε))} [KK14].
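For readers unfamiliar with the SQ model referenced in the theorem: an SQ algorithm never sees samples directly; it asks bounded queries q and receives E[q(x, y)] up to additive tolerance τ from a STAT(τ) oracle. A minimal simulation (an empirical-mean stand-in with illustrative names; a real oracle may answer adversarially anywhere within the tolerance):

```python
import numpy as np

def stat_oracle(X, y, q, tau):
    """Simulated STAT(tau) oracle: given a query q : (x, y) -> [-1, 1],
    any answer within additive error tau of E[q(x, y)] is legal. Here we
    return the empirical mean rounded to a grid of width tau; simulating
    one call faithfully needs on the order of 1/tau^2 samples."""
    est = np.mean([q(xi, yi) for xi, yi in zip(X, y)])
    return tau * float(np.round(est / tau))

# example: querying the mean label (q = y) with tolerance 0.1
X = np.zeros((100, 2))
y = np.ones(100)
ans = stat_oracle(X, y, lambda x, yy: yy, 0.1)
```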
Conclusion
• The reduction-based hardness results of [KK14, GKK19] imply SQ lower bounds of d^{Ω(log(1/ε))} for both problems.
• The authors' new SQ lower bounds are qualitatively optimal, nearly matching current algorithms.
• For both problems, the results show a sharp separation in the complexity of obtaining error O(OPT) + ε (achievable in poly(d/ε) time) versus optimal error OPT + ε.
• Consider the set of distributions {P_v}, where v is any unit vector, such that the projection of P_v in the v-direction is equal to A and in the orthogonal complement P_v is an independent standard Gaussian.
• By known results this implies that distinguishing such a distribution from the standard Gaussian, or learning a distribution with better than 1/poly(d^k) correlation with such a distribution, is hard in the SQ model.
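For context on the contrast with the upper bounds: the d^{O(1/ε²)} algorithm of [KKMS08] is L1 polynomial regression with degree O(1/ε²), followed by thresholding. A one-dimensional sketch, using iteratively reweighted least squares as a stand-in for the linear program that solves the L1 fit exactly (illustrative only, not the paper's method):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermevander

def l1_poly_fit(x, y, degree, iters=50):
    """Fit a polynomial of the given degree minimizing sum_i |p(x_i) - y_i|
    over the probabilists' Hermite basis, via iteratively reweighted least
    squares (IRLS): reweighting by ~1/|residual| drives the squared loss
    toward the L1 loss."""
    Phi = hermevander(x, degree)                   # (n, degree+1) features
    w = np.linalg.lstsq(Phi, y, rcond=None)[0]     # L2 warm start
    for _ in range(iters):
        r = np.sqrt(np.abs(Phi @ w - y) + 1e-8)    # IRLS row scalings
        w = np.linalg.lstsq(Phi / r[:, None], y / r, rcond=None)[0]
    return w

# sanity check: data generated by an exact degree-1 polynomial is recovered
x = np.random.default_rng(0).standard_normal(200)
w = l1_poly_fit(x, x.copy(), degree=3)             # target p(x) = x = He_1(x)
```

The classifier is then sign(p(x)); in d dimensions the feature map ranges over all monomials of the given degree, which is where a d^{O(degree)} runtime comes from.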
References
• [ABL17] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27, 2017.
• [BFJ+94] A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the Twenty-Sixth Annual Symposium on Theory of Computing, pages 253–262, 1994.
• [Dan15] A. Daniely. A PTAS for agnostically learning halfspaces. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, pages 484–502, 2015.
• [Dan16] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the 48th Annual Symposium on Theory of Computing, STOC 2016, pages 105–117, 2016.
• [DGJ+10] I. Diakonikolas, P. Gopalan, R. Jaiswal, R. Servedio, and E. Viola. Bounded independence fools halfspaces. SIAM Journal on Computing, 39(8):3441–3462, 2010.
• [DGK+20] I. Diakonikolas, S. Goel, S. Karmalkar, A. Klivans, and M. Soltanolkotabi. Approximation schemes for ReLU regression. In COLT 2020, to appear, 2020. Available at https://arxiv.org/abs/2005.12844.
• [DKKZ20] I. Diakonikolas, D. M. Kane, V. Kontonis, and N. Zarifis. Algorithms and SQ lower bounds for PAC learning one-hidden-layer ReLU networks. CoRR, abs/2006.12476, 2020. To appear in COLT 2020.
• [DKN10] I. Diakonikolas, D. M. Kane, and J. Nelson. Bounded independence fools degree-2 threshold functions. In 51th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, pages 11–20. IEEE Computer Society, 2010.
• [DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In FOCS, pages 73–84, 2017.
• [DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061–1073, 2018.
• [Fel09] V. Feldman. A complete characterization of statistical query learning with applications to evolvability. In Proc. 50th Symposium on Foundations of Computer Science (FOCS), pages 375–384, 2009.
• [Fel16] V. Feldman. Statistical query learning. In Encyclopedia of Algorithms, pages 2090–2095. 2016.
• [FGKP06] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities and halfspaces. In Proc. FOCS, pages 563–576, 2006.
• [FGR+13] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao. Statistical algorithms and a lower bound for detecting planted cliques. In Proceedings of STOC’13, pages 655–664, 2013.
• [FGV17] V. Feldman, C. Guzman, and S. S. Vempala. Statistical query algorithms for mean vector estimation and stochastic convex optimization. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, pages 1265–1277. SIAM, 2017.
• [FS97] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
• [GGJ+20] S. Goel, A. Gollakota, Z. Jin, S. Karmalkar, and A. R. Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. CoRR, abs/2006.12011, 2020. To appear in ICML 2020.
• [GGK20] S. Goel, A. Gollakota, and A. Klivans. Statistical-query lower bounds via functional gradients. Manuscript, 2020.
• [GKK19] S. Goel, S. Karmalkar, and A. R. Klivans. Time/accuracy tradeoffs for learning a ReLU with respect to Gaussian marginals. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pages 8582–8591, 2019.
• [GKKT17] S. Goel, V. Kanade, A. R. Klivans, and J. Thaler. Reliably learning the ReLU in polynomial time. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, volume 65 of Proceedings of Machine Learning Research, pages 1004–1042. PMLR, 2017.
• [GR06] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS), pages 543–552. IEEE Computer Society, 2006.
• [Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.
• [Kea98] M. J. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998.
• [KK14] A. R. Klivans and P. Kothari. Embedding hard learning problems into Gaussian space. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2014, pages 793–809, 2014.
• [KKMS08] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
• [KLS09] A. Klivans, P. Long, and R. Servedio. Learning halfspaces with malicious noise. In Proc. 36th Internat. Colloq. on Automata, Languages and Programming (ICALP), 2009.
• [KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward Efficient Agnostic Learning. Machine Learning, 17(2/3):115–141, 1994.
• [MR18] P. Manurangsi and D. Reichman. The computational complexity of training ReLU(s). arXiv preprint arXiv:1810.04207, 2018.
• [MT94] W. Maass and G. Turan. How fast can a threshold gate learn? In S. Hanson, G. Drastal, and R. Rivest, editors, Computational Learning Theory and Natural Learning Systems, pages 381–414. MIT Press, 1994.
• [Ros58] F. Rosenblatt. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
• [Sol17] M. Soltanolkotabi. Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems, pages 2007–2017, 2017.
• [Sze89] G. Szegö. Orthogonal Polynomials, volume XXIII of American Mathematical Society Colloquium Publications. A.M.S, Providence, 1989.
• [Val84] L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
• [Vap98] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.