# Approximation Schemes for ReLU Regression

COLT, pp. 1452-1485, 2020.

Abstract:

We consider the fundamental problem of ReLU regression, where the goal is to output the best fitting ReLU with respect to square loss given access to draws from some unknown distribution. We give the first efficient, constant-factor approximation algorithm for this problem assuming the underlying distribution satisfies some weak concent...

Introduction

- Finding the best-fitting ReLU with respect to square-loss – called “ReLU Regression” – is a fundamental primitive in the theory of neural networks.
- A recent result shows that finding a hypothesis achieving a loss of $O(\mathrm{opt}) + \epsilon$ is NP-hard when there are no distributional assumptions on $D_X$, the marginal of $D$ on the examples (Manurangsi and Reichman, 2018).
- Recent work due to Goel et al. (2019) gives hardness results for achieving error $\mathrm{opt} + \epsilon$, even if the underlying distribution is the standard Gaussian.
- For the problem of ReLU regression, is it possible to recover a hypothesis achieving error of $O(\mathrm{opt}) + \epsilon$ in time $\mathrm{poly}(d, 1/\epsilon)$?
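To make the objective in the bullets above concrete, here is a minimal sketch of the empirical square loss of a single ReLU hypothesis. The function names and the unnormalized mean-squared form (no factor of 1/2) are assumptions for illustration; the summary does not state the paper's exact normalization.

```python
import numpy as np

def relu(z):
    """Elementwise ReLU: max(z, 0)."""
    return np.maximum(z, 0.0)

def relu_square_loss(w, X, y):
    """Empirical square loss of the hypothesis x -> ReLU(<w, x>).

    X has shape (m, d), y has shape (m,), w has shape (d,).
    The normalization (plain mean of squared residuals) is an assumption.
    """
    residual = relu(X @ w) - y
    return float(np.mean(residual ** 2))

# Tiny hand-checkable example: three points in two dimensions.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y = np.array([1.0, 0.0, 0.0])

w_fit = np.array([1.0, -1.0])   # fits all three labels exactly
w_zero = np.zeros(2)            # predicts 0 everywhere

print(relu_square_loss(w_fit, X, y))
print(relu_square_loss(w_zero, X, y))
```

In this notation, opt is the smallest value of this loss (in expectation over the distribution) attained by any weight vector in the allowed class, and the question in the last bullet asks for a hypothesis whose loss is within a constant factor of opt, plus ε.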

Highlights

- Finding the best-fitting ReLU with respect to square-loss – called “ReLU Regression” – is a fundamental primitive in the theory of neural networks.
- We prove that the GLMtron algorithm of Kakade et al. (2011) achieves a constant-factor approximation for ReLU regression.
- We prove that optimizing a convex surrogate loss suffices to obtain approximation guarantees.
- We further propose a polynomial-time approximation scheme (PTAS) for ReLU regression under the assumption of sub-gaussianity, which refines the solution obtained from the surrogate loss using ideas from localization and polynomial approximation.
- The underlying surrogate-loss approach seems powerful, and exploring further applications is an interesting direction for future work.
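The GLMtron update mentioned in the highlights can be sketched as follows for a known ReLU link. This is a minimal illustration, not the paper's exact procedure: the step size, iteration count, synthetic Gaussian data, and the choice to select the best iterate by training loss (the original algorithm of Kakade et al. uses held-out data for selection) are all assumptions made here for brevity.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def glmtron_relu(X, y, n_iters=200, step=1.0):
    """GLMtron-style iteration with a fixed ReLU link.

    Update: w <- w + step * mean_i (y_i - ReLU(<w, x_i>)) x_i.
    Returns the iterate with the smallest training square loss
    (a simplification; held-out selection is the usual choice).
    """
    m, d = X.shape
    w = np.zeros(d)
    best_w = w.copy()
    best_loss = np.mean((relu(X @ w) - y) ** 2)
    for _ in range(n_iters):
        w = w + step * (X.T @ (y - relu(X @ w))) / m
        loss = np.mean((relu(X @ w) - y) ** 2)
        if loss < best_loss:
            best_w, best_loss = w.copy(), loss
    return best_w, float(best_loss)

# Synthetic check (hypothetical setup): Gaussian inputs, lightly noisy ReLU labels.
rng = np.random.default_rng(0)
d, m = 5, 2000
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(m, d))
y = relu(X @ w_star) + 0.05 * rng.normal(size=m)

w_hat, loss_hat = glmtron_relu(X, y)
print("training loss:", loss_hat)
print("distance to w_star:", np.linalg.norm(w_hat - w_star))
```

Note that the update never multiplies by the derivative of the ReLU, which is what distinguishes it from plain gradient descent on the square loss.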

Conclusion

- The authors give the first efficient constant-factor approximation algorithm for ReLU regression under the assumption of log-concavity.
- The authors prove that optimizing a convex surrogate loss suffices to obtain approximation guarantees.
- The authors further propose a PTAS for ReLU regression under the assumption of sub-gaussianity, which refines the solution obtained from the surrogate loss using ideas from localization and polynomial approximation.
- Designing approximation schemes for a linear combination of activation functions is an interesting open question.
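As an illustration of what a convex surrogate for the square loss can look like, the block below writes out the matching loss for the ReLU. Treating the matching loss as the surrogate is an assumption here (it is a standard choice for monotone activations); this summary does not spell out the paper's exact definition, and the symbol $\ell_{\mathrm{sur}}$ is introduced only for this sketch.

```latex
\ell_{\mathrm{sur}}(w; x, y)
  \;=\; \int_{0}^{\langle w, x\rangle} \bigl(\mathrm{ReLU}(z) - y\bigr)\, dz
  \;=\; \tfrac{1}{2}\,\mathrm{ReLU}(\langle w, x\rangle)^{2} \;-\; y\,\langle w, x\rangle ,
\qquad
\nabla_{w}\, \ell_{\mathrm{sur}}(w; x, y)
  \;=\; \bigl(\mathrm{ReLU}(\langle w, x\rangle) - y\bigr)\, x .
```

Because the ReLU is non-decreasing, the integrand is non-decreasing in $z$, so $\ell_{\mathrm{sur}}$ is convex in $w$ even though the square loss $(\mathrm{ReLU}(\langle w, x\rangle) - y)^2$ is not. A gradient step on the average of $\ell_{\mathrm{sur}}$ moves $w$ along the average of $(y - \mathrm{ReLU}(\langle w, x\rangle))\,x$.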


Related work

Here we provide an overview of the most relevant prior work. Goel et al. (2017) give an efficient algorithm for ReLU regression that succeeds with respect to any distribution supported on the unit sphere, but has sample complexity and running time exponential in $1/\epsilon$. Soltanolkotabi (2017) shows that SGD efficiently learns a ReLU in the realizable setting when the underlying distribution is assumed to be the standard Gaussian. Goel et al. (2018) give a learning algorithm for one convolutional layer of ReLUs for any symmetric distribution (including Gaussians). Goel et al. (2019) give an efficient algorithm for ReLU regression with an error guarantee of $O(\mathrm{opt}^{2/3}) + \epsilon$.

Yehudai and Shamir (2019) show that it is hard to learn a single ReLU activation via stochastic gradient descent when the hypothesis used to learn the ReLU function is of the form $N(x) := \sum_{i=1}^{r} u_i f_i(x)$ and the functions $f_i(x)$ are random feature maps drawn from a fixed distribution. In particular, they show that any $N(x)$ which approximates $\mathrm{ReLU}(\langle w^*, x\rangle + b)$ (where $\|w^*\|_2 = d^2$ and $b \in \mathbb{R}$) up to a small constant square loss must either have some coefficient $u_i$ that is exponentially large in $d$ or have exponentially many random features in the sum (i.e., $r \ge \exp(\Omega(d))$). Their paper makes the point that regression using random features cannot learn the ReLU function in polynomial time. Our results use different techniques to learn the unknown ReLU function that are not captured by this model.
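The random-features hypothesis class described above can be sketched as follows. The specific feature family (ReLUs of random Gaussian directions), the scaling, and the least-squares fit of the coefficients are assumptions chosen for illustration; the toy run only shows the form of the model and does not reproduce the lower bound.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fit_random_features(X, y, r, rng):
    """Fit N(x) = sum_i u_i * f_i(x) with frozen random features.

    Here f_i(x) = ReLU(<v_i, x>) with v_i drawn once and never trained
    (an illustrative choice); only the coefficients u are fit.
    """
    d = X.shape[1]
    V = rng.normal(size=(d, r)) / np.sqrt(d)   # frozen random directions
    Phi = relu(X @ V)                          # feature matrix, shape (m, r)
    u, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return V, u

def predict(V, u, X):
    return relu(X @ V) @ u

# Hypothetical toy data: a single ReLU target with a bias term.
rng = np.random.default_rng(0)
m, d, r = 500, 10, 50
X = rng.normal(size=(m, d))
w_star = rng.normal(size=d)
y = relu(X @ w_star + 0.5)

V, u = fit_random_features(X, y, r, rng)
train_loss = float(np.mean((predict(V, u, X) - y) ** 2))
print("training loss:", train_loss)
print("largest |u_i|:", float(np.max(np.abs(u))))
```

The quantities printed at the end (the number of features `r` and the size of the coefficients `u`) are the two resources that the lower bound says must blow up for hard targets.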

Funding

- ID was supported by NSF Award CCF-1652862 (CAREER), a Sloan Research Fellowship, and a DARPA Learning with Less Labels (LwLL) grant
- SG was supported by the JP Morgan AI PhD Fellowship
- SK was supported by NSF award CNS 1414082 and ID’s startup grant
- AK was supported by NSF awards CCF 1909204 and CCF 1717896
- MS was supported by the Packard Fellowship in Science and Engineering, a Sloan Research Fellowship in Mathematics, an NSF-CAREER under award #1846369, the Air Force Office of Scientific Research Young Investigator Program (AFOSR-YIP) under award #FA 9550-18-1-0078, DARPA Learning with Less Labels (LwLL) and FastNICs programs, an NSF-CIF award #1813877, and a Google faculty research award

Reference

- Adamczak, R., Litvak, A., Pajor, A., and Tomczak-Jaegermann, N. (2010). Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. Journal of the American Mathematical Society, 23(2):535–561.
- Auer, P., Herbster, M., and Warmuth, M. K. (1996). Exponentially many local minima for single neurons. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8, pages 316–322. MIT Press.
- Awasthi, P., Balcan, M. F., and Long, P. M. (2017). The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27.
- Bentkus, V. (2003). An inequality for tail probabilities of martingales with differences bounded from one side. Journal of Theoretical Probability, 16(1):161–173.
- Candes, E. J., Li, X., and Soltanolkotabi, M. (2015). Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007.
- Daniely, A. (2015). A PTAS for agnostically learning halfspaces. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, pages 484–502.
- De, A., Diakonikolas, I., Feldman, V., and Servedio, R. (2012). Near-optimal solutions for the Chow Parameters Problem and low-weight approximation of halfspaces. In Proc. 44th ACM Symposium on Theory of Computing (STOC), pages 729–746.
- Diakonikolas, I., Kane, D., and Manurangsi, P. (2019). Nearly tight bounds for robust proper learning of halfspaces with a margin. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pages 10473–10484.
- Diakonikolas, I., Kane, D. M., and Stewart, A. (2018). Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061–1073.
- Goel, S., Kanade, V., Klivans, A., and Thaler, J. (2017). Reliably learning the relu in polynomial time. In Conference on Learning Theory, pages 1004–1042.
- Goel, S., Karmalkar, S., and Klivans, A. (2019). Time/accuracy tradeoffs for learning a relu with respect to gaussian marginals. In Advances in Neural Information Processing Systems, pages 8582–8591.
- Goel, S., Klivans, A. R., and Meka, R. (2018). Learning one convolutional layer with overlapping patches. International Conference on Machine Learning.
- Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150.
- Kakade, S. M., Kanade, V., Shamir, O., and Kalai, A. (2011). Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, pages 927–935.
- Kalai, A., Klivans, A., Mansour, Y., and Servedio, R. (2005). Agnostically learning halfspaces. In Proceedings of the 46th IEEE Symposium on Foundations of Computer Science (FOCS), pages 11–20.
- Kalai, A. T. and Sastry, R. (2009). The isotron algorithm: High-dimensional isotonic regression. COLT.
- Kanade, V. (2018). Lecture notes: Learning real-valued functions.
- Kearns, M., Schapire, R., and Sellie, L. (1994). Toward Efficient Agnostic Learning. Machine Learning, 17(2/3):115–141.
- Klivans, A. R. and Meka, R. (2017). Learning graphical models using multiplicative weights. In Umans, C., editor, 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 343–354. IEEE Computer Society.
- Ledoux, M. and Talagrand, M. (2013). Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media.
- Manurangsi, P. and Reichman, D. (2018). The computational complexity of training ReLU(s). arXiv preprint arXiv:1810.04207.
- O’Donnell, R. and Servedio, R. (2008). The Chow Parameters Problem. In Proc. 40th STOC, pages 517–526.
- Sherstov, A. A. (2012). Making polynomials robust to noise. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, STOC ’12, pages 747–758, New York, NY, USA. Association for Computing Machinery.
- Shorack, G. R. and Wellner, J. A. (2009). Empirical processes with applications to statistics. SIAM.
- Soltanolkotabi, M. (2017). Learning relus via gradient descent. In Advances in neural information processing systems, pages 2007–2017.
- Valiant, L. G. (1984). A theory of the learnable. In Proc. 16th Annual ACM Symposium on Theory of Computing (STOC), pages 436–445. ACM Press.
- Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data: Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg.
- Yehudai, G. and Shamir, O. (2019). On the power and limitations of random features for understanding neural networks. CoRR, abs/1904.00687.
