Distribution-Independent PAC Learning of Halfspaces with Massart Noise.

Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 4751-4762, 2019

Abstract

We study the problem of distribution-independent PAC learning of halfspaces in the presence of Massart noise. Specifically, we are given a set of labeled examples (x, y) drawn from a distribution D on R^{d+1} such that the marginal distribution on the unlabeled points x is arbitrary and the labels y are generated by an unknown halfspace corrupted by Massart noise at noise rate η < 1/2…

Introduction
  • In the agnostic model [Hau92, KSS94] – where an adversary is allowed to corrupt an arbitrary η < 1/2 fraction of the labels – even weak learning is known to be computationally intractable [GR06, FGKP06, Dan16].
  • In the presence of Random Classification Noise (RCN) [AL88] – where each label is flipped independently with probability exactly η < 1/2 – a polynomial time algorithm is known [BFKV96, BFKV97].
  • Let C be a class of Boolean functions over X = R^d, let D_x be an arbitrary distribution over X, and let 0 ≤ η < 1/2.
  • A noisy example oracle, EX^{Mas}(f, D_x, η), works as follows: each time EX^{Mas}(f, D_x, η) is invoked, it returns a labeled example (x, y), where x ∼ D_x and y = f(x) with probability 1 − η(x) and y = −f(x) with probability η(x), for some unknown function η(x) ≤ η (a minimal simulation of this oracle is sketched after this list).
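
  For concreteness, here is a minimal Python sketch of such a Massart example oracle. It is an illustration, not code from the paper; the function and variable names (massart_oracle, flip_prob, etc.) are hypothetical, and the only constraint the model imposes is that the per-point flip probability η(x) never exceeds the bound η < 1/2.

    import numpy as np

    def massart_oracle(marginal_sampler, target_halfspace, flip_prob, eta, rng):
        # Draw one labeled example (x, y) with Massart noise at rate at most eta.
        # marginal_sampler() -> a point x (arbitrary, unknown marginal D_x)
        # target_halfspace(x) -> +1 or -1 (the unknown halfspace f)
        # flip_prob(x) -> eta(x) in [0, eta] (chosen adversarially per point)
        x = marginal_sampler()
        p = flip_prob(x)
        assert 0.0 <= p <= eta < 0.5, "Massart condition: eta(x) <= eta < 1/2"
        y = target_halfspace(x)
        if rng.random() < p:  # flip the clean label with probability eta(x)
            y = -y
        return x, y

    # Example usage with a hypothetical halfspace f(x) = sign(<w, x> - theta):
    rng = np.random.default_rng(0)
    w, theta, d = np.array([1.0, -2.0, 0.5]), 0.1, 3
    f = lambda x: 1 if np.dot(w, x) - theta >= 0 else -1
    sample_x = lambda: rng.standard_normal(d)     # any marginal distribution works
    eta_x = lambda x: 0.2 * abs(np.tanh(x[0]))    # some choice of eta(x) <= 0.2
    x, y = massart_oracle(sample_x, f, eta_x, eta=0.2, rng=rng)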
Highlights
  • Halfspaces, or Linear Threshold Functions (LTFs), are Boolean functions f : R^d → {±1} of the form f(x) = sign(⟨w, x⟩ − θ), where w ∈ R^d is the weight vector and θ ∈ R is the threshold. (The function sign : R → {±1} is defined as sign(u) = 1 if u ≥ 0 and sign(u) = −1 otherwise.) The problem of learning an unknown halfspace is as old as the field of machine learning — starting with Rosenblatt’s Perceptron algorithm [Ros58] — and has arguably been the most influential problem in the development of the field. (A small illustrative sketch of an LTF appears after this list.)
  • We focus on learning halfspaces with Massart noise [MN06]; see Definition 1.1 (Massart Noise Model).
  • The most obvious open problem is whether this error guarantee can be improved to f(OPT) + ε (for some function f : R → R such that lim_{x→0} f(x) = 0) or, ideally, to OPT + ε.
  • It is a plausible conjecture that obtaining better error guarantees is computationally intractable. This is left as an interesting open problem for future work. Another open question is whether there is an efficient proper learner matching the error guarantees of our algorithm.
  • What other concept classes admit non-trivial algorithms in the Massart noise model? Can one establish non-trivial reductions between the Massart noise model and the agnostic model? And are there other natural semi-random input models that allow for efficient PAC learning algorithms in the distribution-free setting?
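
  As a small illustration of the LTF definition above (a sketch following the sign convention stated in the Highlights, not code from the paper; the name ltf_predict is illustrative):

    import numpy as np

    def ltf_predict(w, theta, X):
        # Evaluate the halfspace f(x) = sign(<w, x> - theta) on each row of X.
        # Convention from the text: sign(u) = +1 if u >= 0 and -1 otherwise.
        margins = X @ w - theta
        return np.where(margins >= 0, 1, -1)

    # A point exactly on the boundary receives label +1 under this convention:
    w, theta = np.array([2.0, -1.0]), 1.0
    X = np.array([[1.0, 1.0],    # <w, x> - theta = 0  -> +1
                  [0.0, 0.0]])   # <w, x> - theta = -1 -> -1
    print(ltf_predict(w, theta, X))  # [ 1 -1]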
Results
  • The main result of this paper is the following: Theorem 1.2 (Main Result). There is an algorithm that, for all 0 < η < 1/2, on input a set of i.i.d. examples from a distribution D = EX^{Mas}(f, D_x, η) on R^{d+1}, where f is an unknown halfspace on R^d, runs in poly(d, b, 1/ε) time, where b is an upper bound on the bit complexity of the examples, and outputs a hypothesis h that with high probability satisfies Pr_{(x,y)∼D}[h(x) ≠ y] ≤ η + ε. (A small sketch of estimating this misclassification error on held-out examples appears after this list.)

    See Theorem 2.9 for a more detailed formal statement.
  • For large-margin halfspaces, the authors obtain a slightly better error guarantee; see Theorem 2.2 and Remark 2.6.
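
  The guarantee in Theorem 1.2 bounds the misclassification error Pr_{(x,y)∼D}[h(x) ≠ y] by η + ε. Below is a minimal sketch of how such a bound can be sanity-checked on a held-out sample; the Hoeffding-style sample size is a standard illustration, not part of the paper's algorithm, and the helper names empirical_error and holdout_size are hypothetical.

    import math
    import numpy as np

    def empirical_error(hypothesis, examples):
        # Fraction of held-out labeled examples that the hypothesis gets wrong.
        return float(np.mean([hypothesis(x) != y for x, y in examples]))

    def holdout_size(epsilon, delta):
        # Hoeffding bound: with this many i.i.d. examples the empirical error is
        # within epsilon/2 of the true error Pr[h(x) != y], except w.p. delta.
        return math.ceil(2.0 * math.log(2.0 / delta) / epsilon ** 2)

    # With eta = 0.2 and epsilon = 0.05, Theorem 1.2 promises true error at most
    # eta + epsilon = 0.25; an empirical error of about 0.225 or less on a holdout
    # of the size below is consistent with that guarantee.
    print(holdout_size(epsilon=0.05, delta=0.01))  # 4239 examples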
Conclusion
  • The authors note that the algorithm is non-proper, i.e., the hypothesis h itself is not a halfspace (see Section 1.2 for a discussion).
  • The main contribution of this paper is the first non-trivial learning algorithm for the class of halfspaces in the distribution-free PAC model with Massart noise.
  • It is a plausible conjecture that obtaining better error guarantees is computationally intractable.
  • This is left as an interesting open problem for future work.
  • Another open question is whether there is an efficient proper learner matching the error guarantees of the algorithm.
  • What other concept classes admit non-trivial algorithms in the Massart noise model? Can one establish non-trivial reductions between the Massart noise model and the agnostic model? And are there other natural semi-random input models that allow for efficient PAC learning algorithms in the distribution-free setting?
Summary
  • Objectives:

    The authors' goal is to design a poly(d, 1/ε, 1/γ) time learning algorithm in the presence of Massart noise, where γ is the margin parameter (a small sketch of the margin appears after this list).
  • The authors' goal is to find a hypothesis classifier h with low misclassification error.
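
  The margin parameter γ in this objective refers to large-margin halfspaces (cf. Theorem 2.2). Below is a small illustrative sketch, assuming the common normalization where the weight vector and the points are scaled to unit Euclidean norm; the helper name empirical_margin is not from the paper.

    import numpy as np

    def empirical_margin(w, X):
        # gamma = min_x |<w, x>| over the sample, after normalizing w and each
        # point to unit Euclidean norm; a gamma-margin halfspace satisfies
        # |<w, x>| >= gamma for every x in the support of the marginal.
        w = w / np.linalg.norm(w)
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        return float(np.min(np.abs(Xn @ w)))

    w = np.array([3.0, 4.0])
    X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
    print(empirical_margin(w, X))  # 0.6 (attained by the first point)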
Related Work
  • Bylander [Byl94] gave a polynomial time algorithm to learn large margin halfspaces with RCN (under an additional anti-concentration assumption). The work of Blum et al. [BFKV96, BFKV97] gave the first polynomial time algorithm for distribution-independent learning of halfspaces with RCN without any margin assumptions. Soon thereafter, [Coh97] gave a polynomial-time proper learning algorithm for the problem. Subsequently, Dunagan and Vempala [DV04b] gave a rescaled perceptron algorithm for solving linear programs, which translates to a significantly simpler and faster proper learning algorithm.

    The term “Massart noise” was coined after [MN06]. An equivalent version of the model was previously studied by Rivest and Sloan [Slo88, Slo92, RS94, Slo96], and a very similar asymmetric random noise model goes back to Vapnik [Vap82]. Prior to this work, essentially no efficient algorithms with non-trivial error guarantees were known in the distribution-free Massart noise model. It should be noted that polynomial time algorithms with error OPT + ε are known [ABHU15, ZLC17, YZ17] when the marginal distribution on the unlabeled data is uniform on the unit sphere. For the case that the unlabeled data comes from an isotropic log-concave distribution, [ABHZ16] give a d^{2^{poly(1/(1−2η))}}/poly(ε) sample and time algorithm.
Funding
  • Ilias Diakonikolas is supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.
References
  • [ABHU15] P. Awasthi, M. F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, pages 167–190, 2015.
  • [ABHZ16] P. Awasthi, M. F. Balcan, N. Haghtalab, and H. Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, pages 152–192, 2016.
  • [ABL17] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27, 2017.
  • [AL88] D. Angluin and P. Laird. Learning from noisy examples. Mach. Learn., 2(4):343–370, 1988.
  • [Ber06] T. Bernholt. Robust estimators are hard to compute. Technical report, University of Dortmund, Germany, 2006.
  • [BFKV96] A. Blum, A. M. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. In 37th Annual Symposium on Foundations of Computer Science, FOCS ’96, pages 330–338, 1996.
  • [BFKV97] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1/2):35–52, 1997.
  • [Blu03] A. Blum. Machine learning: My favorite results, directions, and open problems. In 44th Symposium on Foundations of Computer Science (FOCS 2003), pages 11–14, 2003.
  • [Byl94] T. Bylander. Learning linear threshold functions in the presence of classification noise. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, COLT 1994, pages 340–347, 1994.
  • [Coh97] E. Cohen. Learning noisy perceptrons by a perceptron in polynomial time. In Proceedings of the Thirty-Eighth Symposium on Foundations of Computer Science, pages 514–521, 1997.
  • [Dan16] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the 48th Annual Symposium on Theory of Computing, STOC 2016, pages 105–117, 2016.
  • [DKK+16] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In Proceedings of FOCS’16, pages 655–664, 2016.
  • [DKK+17] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pages 999–1008, 2017.
  • [DKK+18] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robustly learning a gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, pages 2683–2702, 2018.
  • [DKK+19] I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, and A. Stewart. Sever: A robust meta-algorithm for stochastic optimization. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, pages 1596–1606, 2019.
  • [DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061–1073, 2018.
  • [DKS19] I. Diakonikolas, W. Kong, and A. Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, pages 2745–2754, 2019.
  • [DKW56] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Mathematical Statistics, 27(3):642–669, 1956.
  • [Duc16] J. C. Duchi. Introductory lectures on stochastic convex optimization. Park City Mathematics Institute, Graduate Summer School Lectures, 2016.
  • [DV04a] J. Dunagan and S. Vempala. Optimal outlier removal in high-dimensional spaces. J. Computer & System Sciences, 68(2):335–373, 2004.
  • [DV04b] J. Dunagan and S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 315–320, 2004.
  • [FGKP06] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities and halfspaces. In Proc. FOCS, pages 563–576, 2006.
  • [GR06] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS), pages 543–552. IEEE Computer Society, 2006.
  • [Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.
  • [Kea93] M. J. Kearns. Efficient noise-tolerant learning from statistical queries. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, pages 392–401, 1993.
  • [Kea98] M. J. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998.
  • [KKM18] A. R. Klivans, P. K. Kothari, and R. Meka. Efficient algorithms for outlier-robust regression. In Conference On Learning Theory, COLT 2018, pages 1420–1430, 2018.
  • [KLS09] A. Klivans, P. Long, and R. Servedio. Learning halfspaces with malicious noise. To appear in Proc. 17th Internat. Colloq. on Algorithms, Languages and Programming (ICALP), 2009.
  • [KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward Efficient Agnostic Learning. Machine Learning, 17(2/3):115–141, 1994.
  • [LRV16] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Proceedings of FOCS’16, 2016.
  • [LS10] P. M. Long and R. A. Servedio. Random classification noise defeats all convex potential boosters. Machine Learning, 78(3):287–304, 2010.
  • [MN06] P. Massart and E. Nedelec. Risk bounds for statistical learning. Ann. Statist., 34(5):2326–2366, 2006.
  • [MT94] W. Maass and G. Turan. How fast can a threshold gate learn? In S. Hanson, G. Drastal, and R. Rivest, editors, Computational Learning Theory and Natural Learning Systems, pages 381–414. MIT Press, 1994.
  • [Ros58] F. Rosenblatt. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
  • [RS94] R. Rivest and R. Sloan. A formal model of hierarchical concept learning. Information and Computation, 114(1):88–114, 1994.
  • [Slo88] R. H. Sloan. Types of noise in data for concept learning. In Proceedings of the First Annual Workshop on Computational Learning Theory, COLT ’88, pages 91–96, San Francisco, CA, USA, 1988. Morgan Kaufmann Publishers Inc.
  • [Slo92] R. H. Sloan. Corrigendum to types of noise in data for concept learning. In Proceedings of the Fifth Annual ACM Conference on Computational Learning Theory, COLT 1992, page 450, 1992.
  • [Slo96] R. H. Sloan. Pac Learning, Noise, and Geometry, pages 21–41. Birkhäuser Boston, Boston, MA, 1996.
  • [Val84] L. G. Valiant. A theory of the learnable. In Proc. 16th Annual ACM Symposium on Theory of Computing (STOC), pages 436–445. ACM Press, 1984.
  • [Vap82] V. Vapnik. Estimation of Dependences Based on Empirical Data: Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg, 1982.
  • [YZ17] S. Yan and C. Zhang. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 1056–1066, 2017.
  • [ZLC17] Y. Zhang, P. Liang, and M. Charikar. A hitting time analysis of stochastic gradient langevin dynamics. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 1980–2022, 2017.
Best Paper
Received the NeurIPS Best Paper Award in 2019.