# Distribution-Independent PAC Learning of Halfspaces with Massart Noise

Advances in Neural Information Processing Systems 32 (NeurIPS 2019): 4751–4762


We study the problem of distribution-independent PAC learning of halfspaces in the presence of Massart noise. Specifically, we are given a set of labeled examples (x, y) drawn from a distribution D on Rd+1 such that the marginal distribution on the unlabeled points x is arbitrary and the labels y are generated by an unknown halfspace corrupted by Massart noise at rate η < 1/2.


• Halfspaces, or Linear Threshold Functions (LTFs), are Boolean functions f : Rd → {±1} of the form f(x) = sign(⟨w, x⟩ − θ), where w ∈ Rd is the weight vector and θ ∈ R is the threshold. (The function sign : R → {±1} is defined as sign(u) = 1 if u ≥ 0 and sign(u) = −1 otherwise.) The problem of learning an unknown halfspace is as old as the field of machine learning, starting with Rosenblatt’s Perceptron algorithm [Ros58], and has arguably been the most influential problem in the development of the field.
• We focus on learning halfspaces with Massart noise [MN06], formalized as Definition 1.1 (Massart Noise Model) below.
• The most obvious open problem is whether the η + ε error guarantee of Theorem 1.2 can be improved to f(OPT) + ε (for some function f : R → R such that limx→0 f(x) = 0) or, ideally, to OPT + ε.
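As a concrete illustration of the LTF definition above, here is a minimal sketch; the weight vector, threshold, and test points are arbitrary examples chosen for illustration, not anything from the paper:

```python
import numpy as np

def halfspace(w, theta):
    """Return the LTF f(x) = sign(<w, x> - theta), with sign(0) = +1."""
    def f(x):
        return 1 if float(np.dot(w, x)) - theta >= 0 else -1
    return f

# Arbitrary example: w = (2, -1), theta = 0.5.
f = halfspace(np.array([2.0, -1.0]), 0.5)
print(f(np.array([1.0, 0.0])))  # <w, x> - theta = 1.5 >= 0, so +1
print(f(np.array([0.0, 1.0])))  # <w, x> - theta = -1.5 < 0, so -1
```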


## Introduction:

In the agnostic model [Hau92, KSS94] – where an adversary is allowed to corrupt an arbitrary η < 1/2 fraction of the labels – even weak learning is known to be computationally intractable [GR06, FGKP06, Dan16].
• In the presence of Random Classification Noise (RCN) [AL88] – where each label is flipped independently with probability exactly η < 1/2 – a polynomial-time algorithm is known [BFKV96, BFKV97].
• Let C be a class of Boolean functions over X = Rd, Dx be an arbitrary distribution over X, and 0 ≤ η < 1/2.
• A noisy example oracle, EXMas(f, Dx, η), works as follows: Each time EXMas(f, Dx, η) is invoked, it returns a labeled example (x, y) with x ∼ Dx, where y = f(x) with probability 1 − η(x) and y = −f(x) with probability η(x), for an unknown function η(x) satisfying η(x) ≤ η for all x.
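The oracle just described can be simulated as follows. This is a sketch of the model, not code from the paper; the Gaussian marginal, target halfspace, and flip-rate function below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def ex_massart(f, sample_x, eta_of_x, eta):
    """One invocation of EX_Mas(f, D_x, eta): draw x ~ D_x and return (x, y),
    where y = f(x) with probability 1 - eta(x) and y = -f(x) with probability
    eta(x), for some eta(x) <= eta < 1/2 unknown to the learner."""
    x = sample_x()
    p = eta_of_x(x)
    assert 0.0 <= p <= eta < 0.5
    y = f(x) if rng.random() >= p else -f(x)
    return x, y

# Hypothetical instance: standard Gaussian marginal on R^2, target
# sign(x_1 + x_2), and a point-dependent flip rate bounded by eta = 0.3.
sample_x = lambda: rng.standard_normal(2)
f = lambda x: 1 if x[0] + x[1] >= 0 else -1
eta_of_x = lambda x: 0.3 / (1.0 + np.linalg.norm(x))

x, y = ex_massart(f, sample_x, eta_of_x, 0.3)
```

Note the two extremes: with eta_of_x ≡ 0 the oracle is noiseless, and with eta_of_x ≡ η it reduces to RCN.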
## Objectives:

The authors' goal is to design a poly(d, 1/ε, 1/γ) time learning algorithm in the presence of Massart noise.
• Concretely, the goal is to output a hypothesis h with low misclassification error Pr(x,y)∼D[h(x) ≠ y].
## Results:

The main result of this paper is the following: Theorem 1.2 (Main Result). There is an algorithm that, for all 0 < η < 1/2, on input a set of i.i.d. examples from a distribution D = EXMas(f, Dx, η) on Rd+1, where f is an unknown halfspace on Rd, runs in poly(d, b, 1/ε) time, where b is an upper bound on the bit complexity of the examples, and outputs a hypothesis h that with high probability satisfies Pr(x,y)∼D[h(x) ≠ y] ≤ η + ε.

• See Theorem 2.9 for a more detailed formal statement.
• For large-margin halfspaces, the authors obtain a slightly better error guarantee; see Theorem 2.2 and Remark 2.6.
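To see why η + ε is a natural benchmark: under Massart noise even the target halfspace f misclassifies each noisy example with probability η(x), so in the worst case η(x) ≡ η (i.e., RCN) the error of f itself is η, which the theorem's guarantee matches up to ε. A toy simulation with hypothetical parameters, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical instance: Gaussian marginal on R^2, target halfspace
# f(x) = sign(<w, x>), and worst-case Massart rate eta(x) = eta everywhere.
eta, n = 0.2, 200_000
w = np.array([1.0, -2.0])
X = rng.standard_normal((n, 2))
f_labels = np.where(X @ w >= 0, 1, -1)
y = np.where(rng.random(n) < eta, -f_labels, f_labels)

# The target f disagrees with each noisy label with probability eta, so its
# empirical error concentrates near eta = 0.2 (up to sampling error).
err_f = float(np.mean(f_labels != y))
print(round(err_f, 2))
```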
## Conclusion:

The authors note that the algorithm is non-proper, i.e., the hypothesis h itself is not a halfspace. (See Section 1.2 for a discussion.)
• The main contribution of this paper is the first non-trivial learning algorithm for the class of halfspaces in the distribution-free PAC model with Massart noise.
• It is a plausible conjecture that obtaining better error guarantees is computationally intractable; this is left as an interesting open problem for future work.
• Another open question is whether there is an efficient proper learner matching the error guarantees of the algorithm.
• What other concept classes admit non-trivial algorithms in the Massart noise model? Can one establish non-trivial reductions between the Massart noise model and the agnostic model? And are there other natural semi-random input models that allow for efficient PAC learning algorithms in the distribution-free setting?

• Bylander [Byl94] gave a polynomial time algorithm to learn large-margin halfspaces with RCN (under an additional anti-concentration assumption). The work of Blum et al. [BFKV96, BFKV97] gave the first polynomial time algorithm for distribution-independent learning of halfspaces with RCN without any margin assumptions. Soon thereafter, [Coh97] gave a polynomial-time proper learning algorithm for the problem. Subsequently, Dunagan and Vempala [DV04b] gave a rescaled perceptron algorithm for solving linear programs, which translates to a significantly simpler and faster proper learning algorithm.

The term “Massart noise” was coined after [MN06]. An equivalent version of the model was previously studied by Rivest and Sloan [Slo88, Slo92, RS94, Slo96], and a very similar asymmetric random noise model goes back to Vapnik [Vap82]. Prior to this work, essentially no efficient algorithms with non-trivial error guarantees were known in the distribution-free Massart noise model. It should be noted that polynomial time algorithms with error OPT + ε are known [ABHU15, ZLC17, YZ17] when the marginal distribution on the unlabeled data is uniform on the unit sphere. For the case that the unlabeled data comes from an isotropic log-concave distribution, [ABHZ16] give a d^(2^poly(1/(1−2η))) / poly(ε) sample and time algorithm.

• Ilias Diakonikolas is supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.

In 2019, the paper won a NeurIPS Outstanding Paper Award.