Learning Halfspaces with Massart Noise Under Structured Distributions

COLT 2020, pp. 1486–1513.

Abstract

We study the problem of learning halfspaces with Massart noise in the distribution-specific PAC model. We give the first computationally efficient algorithm for this problem with respect to a broad family of distributions, including log-concave distributions. This resolves an open question posed in a number of prior works. Our approach is extremely simple: we identify a smooth non-convex surrogate loss with the property that any approximate stationary point of this loss defines a halfspace that is close to the target halfspace.

Introduction
  • The main result of this paper is the first polynomial-time algorithm for learning halfspaces with Massart noise with respect to a broad class of well-behaved distributions.
  • There is a computationally efficient algorithm that learns halfspaces in the presence of Massart noise with respect to the class of (U, R, t)-bounded distributions on ℝ^d. The algorithm draws m = poly(U/R, t(ε/2), 1/(1 − 2η)) · O(d/ε^4) samples from a noisy example oracle at noise rate η < 1/2, runs in time polynomial in the sample size, and outputs a hypothesis halfspace h that is ε-close to the target with probability at least 9/10. (A toy Massart example oracle is sketched below.)
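To make the noise model concrete, here is a minimal Python sketch of a Massart example oracle. The standard Gaussian marginal (one example of an isotropic log-concave distribution), the particular choice of the flip probability η(x), and the name massart_oracle are illustrative assumptions; the model only requires that each label is flipped with some x-dependent probability η(x) ≤ η < 1/2.

```python
import numpy as np


def massart_oracle(w_star, eta_max, n, rng):
    """Draw n labeled examples (x, y) with Massart noise at rate eta_max < 1/2.

    The marginal over x is an illustrative choice (standard Gaussian, which is
    isotropic log-concave); the flip probability eta(x) may depend on x
    arbitrarily, as long as eta(x) <= eta_max for every x.
    """
    d = w_star.shape[0]
    x = rng.standard_normal((n, d))
    clean = np.sign(x @ w_star)
    clean[clean == 0] = 1.0                    # sign(0) = +1, matching the convention below
    eta_x = eta_max * np.abs(np.sin(x[:, 0]))  # an arbitrary x-dependent rate in [0, eta_max]
    flip = rng.random(n) < eta_x
    return x, np.where(flip, -clean, clean)


# Example usage: eta_max = 0 recovers the realizable (noiseless) case.
rng = np.random.default_rng(0)
w_star = np.array([1.0, 0.0, 0.0])
x, y = massart_oracle(w_star, eta_max=0.3, n=5, rng=rng)
```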
Highlights
  • Halfspaces, or Linear Threshold Functions, are Boolean functions h_w : ℝ^d → {±1} of the form h_w(x) = sign(⟨w, x⟩), where w ∈ ℝ^d is the associated weight vector. (The univariate function sign(t) is defined as sign(t) = 1 for t ≥ 0, and sign(t) = −1 otherwise.) Halfspaces have been a central object of study in various fields, including complexity theory, optimization, and machine learning [MP68, Yao90, GHR92, STC00, O’D14].
  • In the noiseless (realizable) case, i.e., when all the labels are consistent with the target halfspace, this learning problem amounts to linear programming and can thus be solved in polynomial time.
  • Our approach is extremely simple: We take an optimization view and leverage the structure of the learning problem to identify a simple non-convex surrogate loss Lσ(w) with the following property: Any approximate stationary point w of Lσ defines a halfspace h_w which is close to the target halfspace f(x) = sign(⟨w∗, x⟩) (see the schematic formalization after this list).
  • Even though finding a global optimum of a non-convex function is hard in general, we show that a much weaker requirement suffices for our learning problem.
  • We prove that we can tune the parameter σ so that the stationary points of our non-convex loss are close to w∗.
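The surrogate-loss property described in the bullets above can be written schematically as follows. The sigmoidal form of ℓ_σ below is an illustrative assumption made for concreteness; the paper's exact choice of surrogate and of the thresholds may differ.

```latex
% Schematic (illustrative) form of a smooth non-convex surrogate loss over
% unit-norm weight vectors; \ell_\sigma is taken to be sigmoidal here.
\[
  L_\sigma(w) \;=\; \mathbf{E}_{(x,y)\sim D}\big[\,\ell_\sigma(-y\,\langle w, x\rangle)\,\big],
  \qquad \|w\|_2 = 1,
  \qquad \ell_\sigma(t) \;=\; \frac{1}{1 + e^{-t/\sigma}} .
\]
% The structural property used by the algorithm: for a suitable choice of
% \sigma (depending on U, R, t(\epsilon/2), and 1 - 2\eta), every approximate
% stationary point of L_\sigma on the sphere defines an accurate halfspace.
% (Here \nabla denotes the appropriate projected gradient on the sphere.)
\[
  \big\|\nabla L_\sigma(w)\big\|_2 \,\le\, \epsilon'
  \;\;\Longrightarrow\;\;
  \Pr_{x \sim D_x}\!\big[\operatorname{sign}(\langle w, x\rangle) \neq \operatorname{sign}(\langle w^{*}, x\rangle)\big] \,\le\, \epsilon .
\]
```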
Results
  • There exists a polynomial-time algorithm that learns halfspaces with Massart noise under any isotropic log-concave distribution.
  • The authors' approach is extremely simple: The authors take an optimization view and leverage the structure of the learning problem to identify a simple non-convex surrogate loss Lσ(w) with the following property: Any approximate stationary point w of Lσ defines a halfspace h_w which is close to the target halfspace f(x) = sign(⟨w∗, x⟩).
  • Prior work [ABHU15] gave the first polynomial-time algorithm for the problem that succeeds under the uniform distribution on the unit sphere, assuming the upper bound on the noise rate η is smaller than a sufficiently small constant (≈ 10^−6).
  • Algorithm 2 has the following performance guarantee: It draws m = O((U/R)^12 · t^8(ε/2)/(1 − 2η)^10) · O(d/ε^4) labeled examples from D, uses O(m) gradient evaluations, and outputs a hypothesis vector w that satisfies err_{D_x}^{0−1}(h_w, f) ≤ ε with probability at least 1 − δ, where f is the target halfspace.
  • The authors' algorithm proceeds by Projected Stochastic Gradient Descent (PSGD), with projection onto the ℓ2-unit sphere, to find an approximate stationary point of the non-convex surrogate loss (a toy PSGD sketch is given after this list).
  • Algorithm 3 has the following performance guarantee: It draws m = O((U^12/R^18) · (t^8(ε/2)/c^6)) · O(d/ε^4) labeled examples from D, uses O(m) gradient evaluations, and outputs a hypothesis vector w that satisfies err_{D}^{0−1}(h_w) ≤ err_{D}^{0−1}(f) + ε with probability at least 1 − δ.
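As a rough illustration of the PSGD step just described, the following minimal Python sketch runs projected SGD over the ℓ2-unit sphere on a logistic-type surrogate and measures the angle of the final iterate to w∗. The logistic choice of ℓ_σ, the parameter values (σ, step size), the constant flip rate η, and all function names are assumptions made for the sketch; they are not the paper's exact loss or parameters, and the guarantees above apply only to the actual algorithms.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def psgd_halfspace(x, y, sigma=0.1, step=0.05, rng=None):
    """Single-pass projected SGD on the l2-unit sphere.

    Minimizes an empirical proxy for L_sigma(w) = E[sigmoid(-y <w, x> / sigma)]
    (an illustrative surrogate), projecting w back onto the unit sphere after
    every stochastic update, and returns the final iterate.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n, _ = x.shape
    w = rng.standard_normal(x.shape[1])
    w /= np.linalg.norm(w)
    for i in rng.permutation(n):
        s = sigmoid(-y[i] * (w @ x[i]) / sigma)
        grad = s * (1.0 - s) * (-y[i] / sigma) * x[i]  # gradient of the per-example loss
        w -= step * grad
        w /= np.linalg.norm(w)                         # projection onto the unit sphere
    return w


# Toy run: Gaussian marginal (isotropic log-concave) with a constant flip rate
# eta < 1/2 (a special case of Massart noise, namely random classification
# noise); the final iterate should typically correlate well with w_star.
rng = np.random.default_rng(1)
d, n, eta = 10, 20000, 0.3
w_star = np.zeros(d)
w_star[0] = 1.0
x = rng.standard_normal((n, d))
clean = np.sign(x @ w_star)
clean[clean == 0] = 1.0
y = np.where(rng.random(n) < eta, -clean, clean)
w_hat = psgd_halfspace(x, y, rng=rng)
print("angle(w_hat, w_star) in radians:", np.arccos(np.clip(w_hat @ w_star, -1.0, 1.0)))
```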
Conclusion
  • The main structural result of this section, Lemma 5.3 (stationary points of Lσ suffice with strong Massart noise), generalizes Lemma 3.3.
  • The authors denote by γ(x, y) the density of the 2-dimensional projection onto V of the marginal distribution D_x. Since the integrand is non-negative, the authors may bound from below the contribution of the region G to the gradient by integrating over φ ∈ (π/2, π).
References
  • [ABHU15] P. Awasthi, M. F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. In Proceedings of the 28th Conference on Learning Theory, COLT 2015, pages 167–190, 2015.
  • [ABHZ16] P. Awasthi, M. F. Balcan, N. Haghtalab, and H. Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, pages 152–192, 2016.
  • [ABL17] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27, 2017.
  • [ACD+19] Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds for non-convex stochastic optimization, 2019.
  • [AL88] D. Angluin and P. Laird. Learning from noisy examples. Mach. Learn., 2(4):343–370, 1988.
  • [Awa18] P. Awasthi. Noisy PAC learning of halfspaces. TTI Chicago, Summer Workshop on Robust Statistics, available at http://www.iliasdiakonikolas.org/tti-robust/Awasthi.pdf, 2018.
  • [BFKV96] A. Blum, A. M. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. In 37th Annual Symposium on Foundations of Computer Science, FOCS ’96, pages 330–338, 1996.
  • [BFKV97] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1/2):35–52, 1997.
  • [BH20] M. F. Balcan and N. Haghtalab. Noise in classification. In T. Roughgarden, editor, Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press, 2020.
  • [BZ17] M.-F. Balcan and H. Zhang. Sample and computationally efficient learning algorithms under s-concave distributions. In Advances in Neural Information Processing Systems, pages 4796–4805, 2017.
  • [Dan16] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the 48th Annual Symposium on Theory of Computing, STOC 2016, pages 105–117, 2016.
  • [DGT19] I. Diakonikolas, T. Gouleakis, and C. Tzamos. Distribution-independent PAC learning of halfspaces with Massart noise. In Advances in Neural Information Processing Systems 32, pages 4751–4762. Curran Associates, Inc., 2019.
  • [DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061–1073, 2018.
  • [DL01] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics, Springer, 2001.
  • [DS19] Y. Drori and O. Shamir. The complexity of finding stationary points with stochastic gradient descent, 2019.
  • [FGKP06] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities and halfspaces. In Proc. FOCS, pages 563–576, 2006.
  • [GHR92] M. Goldmann, J. Hastad, and A. Razborov. Majority gates vs. general weighted threshold gates. Computational Complexity, 2:277–300, 1992.
  • [GR06] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS), pages 543–552. IEEE Computer Society, 2006.
  • [Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.
  • [KK14] A. R. Klivans and P. Kothari. Embedding hard learning problems into Gaussian space. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2014, pages 793–809, 2014.
  • [KKMS08] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
  • [KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2/3):115–141, 1994.
  • [LV07] L. Lovasz and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.
  • [MN06] P. Massart and E. Nedelec. Risk bounds for statistical learning. Ann. Statist., 34(5):2326–2366, 2006.
  • [MP68] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, 1968.
  • [MT94] W. Maass and G. Turan. How fast can a threshold gate learn? In Computational Learning Theory and Natural Learning Systems, pages 381–414. MIT Press, 1994.
  • [MV19] O. Mangoubi and N. K. Vishnoi. Nonconvex sampling with the Metropolis-adjusted Langevin algorithm. In Conference on Learning Theory, COLT 2019, pages 2259–2293, 2019.
  • [O’D14] R. O’Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.
  • [Pao06] G. Paouris. Concentration of mass on convex bodies. Geometric & Functional Analysis GAFA, 16(5):1021–1049, 2006.
  • [Ros58] F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
  • [RS94] R. Rivest and R. Sloan. A formal model of hierarchical concept learning. Information and Computation, 114(1):88–114, 1994.
  • [Slo88] R. H. Sloan. Types of noise in data for concept learning. In Proceedings of the First Annual Workshop on Computational Learning Theory, COLT ’88, pages 91–96. Morgan Kaufmann Publishers Inc., 1988.
  • [STC00] J. Shawe-Taylor and N. Cristianini. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
  • [Vap82] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg, 1982.
  • [Yao90] A. Yao. On ACC and threshold circuits. In Proceedings of the Thirty-First Annual Symposium on Foundations of Computer Science, pages 619–627, 1990.
  • [YZ17] S. Yan and C. Zhang. Revisiting Perceptron: Efficient and label-optimal learning of halfspaces. In Advances in Neural Information Processing Systems 30, pages 1056–1066, 2017.
  • [ZLC17] Y. Zhang, P. Liang, and M. Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 1980–2022, 2017.