Learning Halfspaces with Massart Noise Under Structured Distributions
COLT, pp. 1486-1513, 2020.
Abstract:
We study the problem of learning halfspaces with Massart noise in the distribution-specific PAC model. We give the first computationally efficient algorithm for this problem with respect to a broad family of distributions, including log-concave distributions. This resolves an open question posed in a number of prior works. Our approach ...
Introduction
- The main result of this paper is the first polynomial-time algorithm for learning halfspaces with Massart noise with respect to a broad class of well-behaved distributions.
- There is a computationally efficient algorithm that learns halfspaces in the presence of Massart noise with respect to the class of (U, R, t)-bounded distributions on R^d. The algorithm draws m = poly(U/R, t(ε/2), 1/(1 − 2η)) · O(d/ε^4) samples from a noisy example oracle at noise rate η < 1/2, runs in time polynomial in its sample size, and outputs a hypothesis halfspace h that is ε-close to the target with probability at least 9/10 (the Massart noise model is recalled below).
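For readers unfamiliar with the noise model, the following is the standard textbook formulation of Massart (bounded) noise at rate η; it is restated here for convenience rather than quoted from the paper. The target halfspace is f(x) = sign(⟨w∗, x⟩) and D_x denotes the marginal distribution on examples.

    % Standard Massart (bounded) noise oracle at rate eta < 1/2 (textbook formulation).
    \[
    \text{each query returns } (x, y), \quad x \sim \mathcal{D}_x, \qquad
    y = \begin{cases}
          \phantom{-}f(x) & \text{with probability } 1 - \eta(x), \\
          -f(x)           & \text{with probability } \eta(x),
        \end{cases}
    \qquad \text{where } \eta(x) \le \eta < \tfrac{1}{2} \text{ for all } x.
    \]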
Highlights
- Halfspaces, or Linear Threshold Functions, are Boolean functions h_w : R^d → {±1} of the form h_w(x) = sign(⟨w, x⟩), where w ∈ R^d is the associated weight vector. (The univariate function sign(t) is defined as sign(t) = 1 for t ≥ 0, and sign(t) = −1 otherwise.) Halfspaces have been a central object of study in various fields, including complexity theory, optimization, and machine learning [MP68, Yao90, GHR92, STC00, O’D14].
- In the noiseless case, i.e., when all the labels are consistent with the target halfspace, this learning problem amounts to linear programming and can be solved in polynomial time.
- Our approach is extremely simple: We take an optimization view and leverage the structure of the learning problem to identify a simple non-convex surrogate loss L_σ(w) with the following property: any approximate stationary point w of L_σ defines a halfspace h_w that is close to the target halfspace f(x) = sign(⟨w∗, x⟩) (see the illustration after this list).
- Even though finding a global optimum of a non-convex function is hard in general, we show that a much weaker requirement suffices for our learning problem
- We prove that we can tune the parameter σ so that the stationary points of our non-convex loss are close to w∗
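To make the role of σ concrete, here is a hedged illustration of a σ-smoothed surrogate for the 0-1 loss and of what "approximate stationary point" means on the unit sphere; the paper's exact choice of L_σ and its parameter settings are not reproduced here.

    % A hedged illustration (not necessarily the paper's exact surrogate): a sigmoidal
    % sigma-smoothed relaxation of the misclassification error, restricted to unit-norm w.
    \[
    L_\sigma(w) \;=\; \mathbb{E}_{(x,y)\sim \mathcal{D}}\!\left[\, S\!\left(-\,\tfrac{y\,\langle w, x\rangle}{\sigma}\right) \right],
    \qquad S(t) = \frac{1}{1 + e^{-t}}, \qquad \|w\|_2 = 1 .
    \]
    % An alpha-approximate stationary point on the unit sphere is a w whose projected
    % (Riemannian) gradient is small:
    \[
    \big\| \left(I - w w^{\top}\right) \nabla L_\sigma(w) \big\|_2 \;\le\; \alpha .
    \]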
Results
- There exists a polynomial-time algorithm that learns halfspaces with Massart noise under any isotropic log-concave distribution.
- The authors' approach is extremely simple: The authors take an optimization view and leverage the structure of the learning problem to identify a simple non-convex surrogate loss L_σ(w) with the following property: any approximate stationary point w of L_σ defines a halfspace h_w that is close to the target halfspace f(x) = sign(⟨w∗, x⟩).
- That work gave the first polynomial-time algorithm for the problem that succeeds under the uniform distribution on the unit sphere, assuming the upper bound on the noise rate η is smaller than a sufficiently small constant (≈ 10^−6).
- Algorithm 2 has the following performance guarantee: It draws m = O((U/R)^12 · t^8(ε/2)/(1 − 2η)^10) · O(d/ε^4) labeled examples from D, uses O(m) gradient evaluations, and outputs a hypothesis vector w that satisfies err_{D_x}^{0−1}(h_w, f) ≤ ε with probability at least 1 − δ, where f is the target halfspace.
- The authors' algorithm proceeds by Projected Stochastic Gradient Descent (PSGD), with projection onto the ℓ2-unit sphere, to find an approximate stationary point of the non-convex surrogate loss (a minimal sketch of such a PSGD loop is given after this list).
- Algorithm 3 has the following performance guarantee: It draws m = O((U^12/R^18) · (t^8(ε/2)/c^6)) · O(d/ε^4) labeled examples from D, uses O(m) gradient evaluations, and outputs a hypothesis vector w that satisfies err_D^{0−1}(h_w) ≤ err_D^{0−1}(f) + ε with probability at least 1 − δ.
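The following is a minimal sketch, in Python/NumPy, of the kind of PSGD loop described above: stochastic gradient steps on a surrogate loss, each followed by projection back onto the ℓ2-unit sphere. It assumes the sigmoidal surrogate from the illustration in the Highlights section; the paper's exact surrogate, step size, σ, and the rule for selecting the output iterate are part of its analysis and are not reproduced here.

    import numpy as np

    def sigmoid(t):
        # Numerically stable logistic function for scalar t.
        if t >= 0:
            return 1.0 / (1.0 + np.exp(-t))
        z = np.exp(t)
        return z / (1.0 + z)

    def psgd_halfspace(samples, sigma=0.1, step=0.01, seed=0):
        """Projected SGD on the l2-unit sphere for a sigmoidal surrogate loss.

        A hedged sketch, not the paper's exact algorithm. `samples` is a list of
        (x, y) pairs with x a NumPy vector in R^d and y in {-1, +1}; the surrogate
        minimized is L_sigma(w) = E[ sigmoid(-y <w, x> / sigma) ] over unit-norm w.
        """
        rng = np.random.default_rng(seed)
        d = len(samples[0][0])
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)

        iterates = [w.copy()]
        for x, y in samples:
            t = -y * np.dot(w, x) / sigma
            # Per-sample gradient: d/dw sigmoid(t) = sigmoid(t)(1 - sigmoid(t)) * (-y x / sigma).
            s = sigmoid(t)
            grad = (-y / sigma) * s * (1.0 - s) * x
            # Gradient step, then project back onto the l2-unit sphere.
            w = w - step * grad
            w /= np.linalg.norm(w)
            iterates.append(w.copy())

        # Return the iterate with the smallest empirical surrogate loss,
        # a simple stand-in for selecting an approximate stationary point.
        def empirical_loss(v):
            return np.mean([sigmoid(-y * np.dot(v, x) / sigma) for x, y in samples])

        return min(iterates, key=empirical_loss)

The halfspace returned to the learner is then h_w(x) = sign(⟨w, x⟩) for the selected vector w.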
Conclusion
- The main structural result of this section generalizes Lemma 3.3: Lemma 5.3 (Stationary points of L_σ suffice with strong Massart noise).
- The authors denote by γ(x, y) the density of the 2-dimensional projection of the marginal distribution D_x onto V. Since the integrand is non-negative, the authors may bound from below the contribution of the region G to the gradient by integrating over φ ∈ (π/2, π) (this restriction step is restated in polar coordinates below).
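The restriction step above uses only the non-negativity of the integrand: writing the 2-dimensional projection in polar coordinates (x, y) = (r cos φ, r sin φ), discarding part of the angular range can only decrease the integral. The region G and the integrand g below are left abstract, since the paper's specific expressions are not reproduced here.

    % Generic restatement of the restriction step for a non-negative integrand g and density gamma.
    \[
    \iint_{G} g(x, y)\, \gamma(x, y)\, dx\, dy
    \;\ge\;
    \int_{\pi/2}^{\pi} \int_{0}^{\infty}
    \mathbf{1}_{G}(r\cos\varphi,\, r\sin\varphi)\,
    g(r\cos\varphi,\, r\sin\varphi)\,
    \gamma(r\cos\varphi,\, r\sin\varphi)\, r \, dr \, d\varphi .
    \]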
References
- [ABHU15] P. Awasthi, M. F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, pages 167–190, 2015.
- [ABHZ16] P. Awasthi, M. F. Balcan, N. Haghtalab, and H. Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, pages 152–192, 2016.
- [ABL17] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27, 2017.
- [ACD+19] Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds for non-convex stochastic optimization, 2019.
- [AL88] D. Angluin and P. Laird. Learning from noisy examples. Mach. Learn., 2(4):343–370, 1988.
- [Awa18] P. Awasthi. Noisy PAC learning of halfspaces. TTI Chicago, Summer Workshop on Robust Statistics, available at http://www.iliasdiakonikolas.org/tti-robust/Awasthi.pdf, 2018.
- [BFKV96] A. Blum, A. M. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. In 37th Annual Symposium on Foundations of Computer Science, FOCS ’96, pages 330–338, 1996.
- [BFKV97] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1/2):35–52, 1997.
- [BH20] M. F. Balcan and N. Haghtalab. Noise in classification. In T. Roughgarden, editor, Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press, 2020.
- [BZ17] M.-F. Balcan and H. Zhang. Sample and computationally efficient learning algorithms under s-concave distributions. In Advances in Neural Information Processing Systems, pages 4796–4805, 2017.
- [Dan16] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the 48th Annual Symposium on Theory of Computing, STOC 2016, pages 105–117, 2016.
- [DGT19] I. Diakonikolas, T. Gouleakis, and C. Tzamos. Distribution-independent PAC learning of halfspaces with Massart noise. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alche Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 4751–4762. Curran Associates, Inc., 2019.
- [DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061–1073, 2018.
- [DL01] L. Devroye and G. Lugosi. Combinatorial methods in density estimation. Springer Series in Statistics, Springer, 2001.
- [DS19] Y. Drori and O. Shamir. The complexity of finding stationary points with stochastic gradient descent, 2019.
- [FGKP06] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities and halfspaces. In Proc. FOCS, pages 563–576, 2006.
- [GHR92] M. Goldmann, J. Hastad, and A. Razborov. Majority gates vs. general weighted threshold gates. Computational Complexity, 2:277–300, 1992.
- [GR06] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS), pages 543–552. IEEE Computer Society, 2006.
- [Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.
- [KK14] A. R. Klivans and P. Kothari. Embedding hard learning problems into Gaussian space. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2014, pages 793–809, 2014.
- [KKMS08] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
- [KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward Efficient Agnostic Learning. Machine Learning, 17(2/3):115–141, 1994.
- [LV07] L. Lovasz and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.
- [MN06] P. Massart and E. Nedelec. Risk bounds for statistical learning. Ann. Statist., 34(5):2326–2366, 10 2006.
- [MP68] M. Minsky and S. Papert. Perceptrons: an introduction to computational geometry. MIT Press, Cambridge, MA, 1968.
- [MT94] W. Maass and G. Turan. How fast can a threshold gate learn? In S. Hanson, G. Drastal, and R. Rivest, editors, Computational Learning Theory and Natural Learning Systems, pages 381–414. MIT Press, 1994.
- [MV19] O. Mangoubi and N. K. Vishnoi. Nonconvex sampling with the Metropolis-adjusted Langevin algorithm. In Conference on Learning Theory, COLT 2019, pages 2259–2293, 2019.
- [O’D14] R. O’Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.
- [Pao06] G. Paouris. Concentration of mass on convex bodies. Geometric & Functional Analysis GAFA, 16(5):1021–1049, Dec 2006.
- [Ros58] F. Rosenblatt. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
- [RS94] R. Rivest and R. Sloan. A formal model of hierarchical concept learning. Information and Computation, 114(1):88–114, 1994.
- [Slo88] R. H. Sloan. Types of noise in data for concept learning. In Proceedings of the First Annual Workshop on Computational Learning Theory, COLT ’88, pages 91–96, San Francisco, CA, USA, 1988. Morgan Kaufmann Publishers Inc.
- [STC00] J. Shawe-Taylor and N. Cristianini. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
- [Vap82] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg, 1982.
- [Yao90] A. Yao. On ACC and threshold circuits. In Proceedings of the Thirty-First Annual Symposium on Foundations of Computer Science, pages 619–627, 1990.
- [YZ17] S. Yan and C. Zhang. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 1056–1066, 2017.
- [ZLC17] Y. Zhang, P. Liang, and M. Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 1980–2022, 2017.