# Non-Convex SGD Learns Halfspaces with Adversarial Label Noise

NeurIPS 2020.

Abstract:

We study the problem of agnostically learning homogeneous halfspaces in the distribution-specific PAC model. For a broad family of structured distributions, including log-concave distributions, we show that non-convex SGD efficiently converges to a solution with misclassification error $O(\mathrm{opt})+\epsilon$, where $\mathrm{opt}$ is the misclassification error of the best-fitting halfspace.


Introduction

- 1.1 Background and Motivation

- Learning in the presence of noisy data is a central challenge in machine learning.
- The authors' main result is that SGD on a non-convex surrogate of the zero-one loss solves the problem of learning a homogeneous halfspace with adversarial label noise when the underlying marginal distribution on the examples is well-behaved.
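The summary uses err and opt throughout without restating them. In the standard agnostic PAC setup the paper works in, they can be written as follows (a reconstruction from the surrounding statements, not a quotation of the paper's own definitions):

```latex
% Zero-one (misclassification) error of a hypothesis h under distribution D:
\mathrm{err}^{0\text{-}1}_{D}(h) \;=\; \Pr_{(x,y)\sim D}\bigl[h(x)\neq y\bigr]

% Benchmark: the best error achieved by any homogeneous halfspace
\mathrm{opt} \;=\; \min_{\|w\|_2 = 1} \mathrm{err}^{0\text{-}1}_{D}\bigl(\mathrm{sign}(\langle w, x\rangle)\bigr)
```

The algorithmic goal is then a hypothesis whose error is $O(\mathrm{opt}) + \epsilon$.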

Highlights

- 1.1 Background and Motivation
- Learning in the presence of noisy data is a central challenge in machine learning
- We study the problem of agnostically learning homogeneous halfspaces in the distribution-specific PAC model
- For a broad family of structured distributions, including log-concave distributions, we show that non-convex SGD efficiently converges to a solution with misclassification error O(opt) + ε, where opt is the misclassification error of the best-fitting halfspace
- We show that optimizing any convex surrogate inherently leads to misclassification error of ω(opt), even under Gaussian marginals
- We show that non-convex SGD efficiently learns homogeneous halfspaces in the presence of adversarial label noise with respect to a broad family of well-behaved distributions, including log-concave distributions
- Our main result is that SGD on a non-convex surrogate of the zero-one loss solves the problem of learning a homogeneous halfspace with adversarial label noise when the underlying marginal distribution on the examples is well-behaved

Results

- SGD on the objective (1) has the following performance guarantee: For any ε > 0, it draws m = O(d/ε^4) labeled examples from D, uses O(m) gradient evaluations, and outputs a hypothesis halfspace with misclassification error O(opt) + ε with probability at least 99%.
- The authors' lower bound result shows a strong statement about convex surrogates: Even under the nicest distribution possible, i.e., a Gaussian, there is some simple label noise, not depending on the convex loss ℓ(·), such that no convex objective can achieve O(opt) error.
- The authors exploit the fact that all optimal halfspaces lie in a small cone, and show that there exists a fixed noise distribution such that all convex loss functions have non-zero gradients inside this cone.
- [KLS09b] studied the problem of learning homogeneous halfspaces in the adversarial label noise model, when the marginal distribution on the examples is isotropic log-concave, and gave a polynomial-time algorithm with error guarantee O(opt^{1/3}) + ε.
- The authors draw an analogy with recent work [DGK+20], which established that convex surrogates suffice to obtain error O(opt) + ε for the related problem of agnostically learning ReLUs under well-behaved distributions.
- There is an algorithm with the following performance guarantee: For any ε > 0, it draws m = O(d log(1/δ)/ε^4) labeled examples from D, uses O(m) gradient evaluations, and outputs a hypothesis vector w that satisfies err^{0-1}_D(h_w) ≤ O(opt) + ε with probability at least 1 − δ, where opt is the minimum classification error achieved by any halfspace.
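The paper's exact surrogate objective (Eq (1)) is not reproduced in this summary. As an illustrative sketch only, the following minimizes a sigmoid-type non-convex surrogate of the zero-one loss by projected SGD over the unit sphere, under a Gaussian (hence log-concave) marginal with a fraction opt of flipped labels; the parameter choices (`sigma`, `eta`, `m`) and the specific noise pattern are assumptions for illustration, not the authors' construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 20000          # dimension, number of labeled examples
eta, sigma = 0.05, 1.0    # SGD step size, surrogate sharpness (illustrative choices)

# Gaussian marginal and a ground-truth homogeneous halfspace w*.
w_star = np.zeros(d)
w_star[0] = 1.0
X = rng.standard_normal((m, d))
y = np.sign(X @ w_star)

# Adversarial label noise sketched as flipping a fraction opt of the labels.
opt = 0.05
flip = rng.random(m) < opt
y[flip] *= -1.0

def surrogate_grad(w, x, yi):
    """Gradient in w of ell(-yi * <w, x>) with ell(t) = 1 / (1 + exp(-t / sigma))."""
    t = -yi * (x @ w)
    s = 1.0 / (1.0 + np.exp(-t / sigma))
    return (s * (1.0 - s) / sigma) * (-yi) * x

# Projected SGD: one pass over the sample, re-normalizing after each step
# so the hypothesis stays a homogeneous (unit-norm) halfspace.
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for i in range(m):
    w -= eta * surrogate_grad(w, X[i], y[i])
    w /= np.linalg.norm(w)

err = np.mean(np.sign(X @ w) != y)  # empirical 0-1 error against the noisy labels
print(f"empirical error {err:.3f} (noise level opt = {opt})")
```

On this synthetic instance the empirical error typically lands near opt: although the surrogate is non-convex in w, SGD recovers a direction close to w*, which is the qualitative behavior the theorem guarantees for well-behaved marginals.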

Conclusion

- There exists a distribution D on R^2 × {±1} and a halfspace w* such that err^{0-1}_D(w*) ≤ Pr_{x∼D_x}[‖x‖_2 ≥ Z], the x-marginal of D is D_x, and for every convex, non-decreasing, non-constant loss ℓ(·) and every w with θ(w, w*) ≤ θ it holds that ∇_w C(w) ≠ 0, where C is the convex objective defined in Eq (2).
- Let D_x be the standard normal distribution on R^d. There exists a distribution D on R^d × {±1} with x-marginal D_x such that, for every convex, non-decreasing loss ℓ(·), the objective C(w) = E_{(x,y)∼D}[ℓ(−y⟨x, w⟩)] is minimized at a halfspace h with error err^{0-1}_D(h) = Ω(opt · log(1/opt)).

Summary


- Table 1: Common well-behaved distribution families with their corresponding parameters U, R, t(·); see Definition 1.2. The last two columns show the best possible error achievable by convex objectives and by our non-convex objective of Eq (1).

Related work

- Here we provide a detailed summary of the most relevant prior work, with a focus on poly(d/ε)-time algorithms. [KLS09b] studied the problem of learning homogeneous halfspaces in the adversarial label noise model, when the marginal distribution on the examples is isotropic log-concave, and gave a polynomial-time algorithm with error guarantee O(opt^{1/3}) + ε. This error bound was improved by [ABL17], who gave an efficient localization-based algorithm that learns to accuracy O(opt) + ε for isotropic log-concave distributions. [DKS18] gave a localization-based algorithm that learns arbitrary halfspaces with error O(opt) + ε for Gaussian marginals. [BZ17] extended the algorithms of [ABL17] to the class of s-concave distributions, for s > −Ω(1/d). Inspired by the localization approach, [YZ17] gave a perceptron-like learning algorithm that succeeds under the uniform distribution on the sphere. The algorithm of [YZ17] takes O(d/ε) samples, runs in time O(d^2/ε), and achieves error O(log d · opt) + ε, scaling logarithmically with the dimension d. We also note that [DKTZ20] established a structural result regarding the sufficiency of stationary points for learning homogeneous halfspaces with Massart noise. Finally, we draw an analogy with recent work [DGK+20], which established that convex surrogates suffice to obtain error O(opt) + ε for the related problem of agnostically learning ReLUs under well-behaved distributions. This positive result for ReLUs stands in sharp contrast to the case of sign activations studied in this paper (as follows from our lower bound result). An interesting direction is to explore the effect of non-convexity for other common activation functions.

Reference

- [ABL17] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27, 2017.
- [BJM06] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- [BZ17] M.-F. Balcan and H. Zhang. Sample and computationally efficient learning algorithms under s-concave distributions. In Advances in Neural Information Processing Systems, pages 4796–4805, 2017.
- [Dan15] A. Daniely. A PTAS for agnostically learning halfspaces. In Proceedings of the 28th Conference on Learning Theory, COLT 2015, pages 484–502, 2015.
- [Dan16] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing, STOC 2016, pages 105–117, 2016.
- [DGK+20] I. Diakonikolas, S. Goel, S. Karmalkar, A. Klivans, and M. Soltanolkotabi. Approximation schemes for ReLU regression. In COLT 2020, to appear. Available at https://arxiv.org/abs/2005.12844.
- [DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061–1073, 2018.
- [DKTZ20] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning halfspaces with Massart noise under structured distributions. In COLT 2020, to appear. Available at https://arxiv.org/abs/2002.05632.
- [DKZ20] I. Diakonikolas, D. M. Kane, and N. Zarifis. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. Manuscript, 2020.
- [FGKP06] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities and halfspaces. In Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS), pages 563–576, 2006.
- [GR06] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS), pages 543–552. IEEE Computer Society, 2006.
- [Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.
- [KKMS08] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
- [KLS09a] A. Klivans, P. Long, and R. Servedio. Learning halfspaces with malicious noise. In Proc. 36th Internat. Colloq. on Automata, Languages and Programming (ICALP), 2009.
- [KLS09b] A. Klivans, P. Long, and R. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10:2715–2740, 2009.
- [KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2/3):115–141, 1994.
- [MT94] W. Maass and G. Turan. How fast can a threshold gate learn? In S. Hanson, G. Drastal, and R. Rivest, editors, Computational Learning Theory and Natural Learning Systems, pages 381–414. MIT Press, 1994.
- [Nov62] A. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on Mathematical Theory of Automata, volume XII, pages 615–622, 1962.
- [Ros58] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
- [Val84] L. G. Valiant. A theory of the learnable. In Proc. 16th Annual ACM Symposium on Theory of Computing (STOC), pages 436–445. ACM Press, 1984.
- [YZ17] S. Yan and C. Zhang. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. In Advances in Neural Information Processing Systems 30, pages 1056–1066, 2017.
