# Learnability with Indirect Supervision Signals

NeurIPS 2020.

Abstract:

Learning from indirect supervision signals is important in real-world AI applications when, often, gold labels are missing or too costly. In this paper, we develop a unified theoretical framework for multi-class classification when the supervision is provided by a variable that contains nonzero mutual information with the gold label. Th…

Introduction

- The authors are interested in the problem of multiclass classification where direct, gold annotations for the unlabeled instances are expensive or inaccessible, and the observation of a variable that depends on the true label is instead used as the supervision signal.
- 1. The authors decompose the learnability condition of a general indirect supervision problem into three aspects (complexity, consistency, and identifiability) and provide a unified learning bound for the problem (Theorem 4.2).
- 2. They propose a simple yet powerful concept called separation, which encodes prior knowledge about the transition as a statistical distance between distributions over the annotation space, and use it to characterize consistency and identifiability (Theorem 5.2).

Highlights

- We are interested in the problem of multiclass classification where direct, gold annotations for the unlabeled instances are expensive or inaccessible, and the observation of a variable that depends on the true label is instead used as the supervision signal
- To extract the information contained in a dependent variable, the learner should have certain prior knowledge about the relation between the true label and the supervision signal, which can be expressed in various forms
- Our goal is to develop a unified theoretical framework that can (i) provide learnability conditions for general indirect supervision problems, (ii) describe what prior knowledge is needed about the transition, and (iii) characterize the difficulty of learning with indirect supervision
- We provide a unified framework for analyzing the learnability of multiclass classification with indirect supervision
- Our theory builds upon two key components: (i) The construction of the induced hypothesis class and its complexity analysis, which allows us to indirectly supervise the learning by minimizing the annotation risk. (ii) A formal description of the prior knowledge about the transition and its encoding in the learning condition and bound, which allows us to bound the classification error by the annotation risk
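The first component above can be made concrete: composing a classifier with a transition model yields an induced hypothesis over annotations, and learning minimizes the empirical annotation risk on the observed indirect signals. Below is a minimal NumPy sketch, assuming for illustration a known, instance-independent transition matrix `T` (the paper also handles unknown and instance-dependent transitions); the cross-entropy form mirrors the annotation loss in (1) but is a simplified stand-in, not the paper's exact estimator.

```python
import numpy as np

def annotation_risk(probs_y, T, obs):
    """Empirical annotation risk under a cross-entropy-style loss.

    probs_y : (n, c) array, classifier's predicted label distribution P(y | x)
    T       : (c, m) transition matrix, T[y, o] = P(o | y)
    obs     : (n,) observed indirect annotations in {0, ..., m-1}
    """
    # induced hypothesis: distribution over annotations, P(o | x) = P(y | x) @ T
    probs_o = probs_y @ T
    eps = 1e-12  # numerical safety inside the log
    return -np.mean(np.log(probs_o[np.arange(len(obs)), obs] + eps))

# toy setting: 3 gold labels, a noisy annotation over the same 3 symbols
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
probs_y = np.array([[0.9, 0.05, 0.05],
                    [0.1, 0.8, 0.1]])
obs = np.array([0, 1])
risk = annotation_risk(probs_y, T, obs)
```

Minimizing this quantity over the classifier's parameters is what "indirectly supervising the learning by minimizing the annotation risk" means operationally: the gold labels never appear in the objective.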

Results

- The authors present Theorem 4.2, which decomposes the learnability of a general indirect supervision problem into three aspects: complexity, consistency, and identifiability.
- Bound (2) suggests that the difficulty of learning can be characterized by (i) the identifiability level η, which mainly depends on the nature of the indirect supervision and on the learner’s prior information about the transition hypothesis, and is studied further in the paper.
- 2. When all transition hypotheses in T are instance-independent and the annotation loss depends only on (T, y, o) (e.g., the cross-entropy loss defined in (1)), d_T can be trivially bounded by d_T ≤ c_s = |Y × O|, and d ≤ 2((d_H + c_s) log(6(d_H + c_s)) + 2 d_H log c).
- In this case, one can ensure learnability via ERM with a transition-independent annotation loss.
- A noisy annotation for multiclass classification may break condition (8) due to a large noise rate for certain labels, but it can still provide information to separate other labels if (8) is satisfied for the remaining pairs (i, j).
- The authors present the following result to characterize the learnability under joint supervision O: Proposition 5.10 (No Free Separation).
- If constraints do exist between the two transition classes, Proposition 5.10 no longer holds and joint supervision may create new separation.
- This example shows that it is necessary to model possible constraints between different supervision sources, which helps reduce the size of the joint transition class and may improve the separation degree.
- The authors' theory builds upon two key components: (i) The construction of the induced hypothesis class and its complexity analysis, which allows them to indirectly supervise the learning by minimizing the annotation risk.
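The separation idea above can be illustrated with a small check: for an instance-independent transition matrix, each label i induces an annotation distribution P(o | y=i), and a pair of labels is separable when those distributions are far apart under a statistical distance. The sketch below uses total variation distance as one natural choice (an illustration, not necessarily the exact distance used for condition (8) in the paper):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def separation_degrees(T):
    """Pairwise separation: TV distance between the annotation
    distributions P(o | y=i) and P(o | y=j) for each label pair (i, j)."""
    c = T.shape[0]
    return {(i, j): tv_distance(T[i], T[j])
            for i in range(c) for j in range(i + 1, c)}

# a noisy-label transition: label 2 is annotated almost at random,
# yet labels 0 and 1 remain well separated from each other
T = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.34, 0.33, 0.33]])
deg = separation_degrees(T)
```

This matches the observation in the bullets above: a large noise rate on one label (here label 2, whose pairwise separation is much smaller) degrades separation for pairs involving it, while the other pairs retain enough separation for the annotation to remain informative.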

Conclusion

- (ii) A formal description of the prior knowledge about the transition and its encoding in the learning condition and bound, which allows them to bound the classification error by the annotation risk.
- The authors need to scale the annotation loss to O/b in order to use the theorem (i.e., let f = O/b in the definition of E[Z(F)], equation (1.2) of [3]).
- The authors believe the concepts introduced are general, and that the analysis tools can be applied in many other supervision scenarios.

Related work

- Specific Indirect Supervision Problems. Our work is motivated by many previous studies on the problem of learning in the absence of gold labels. Specifically, the problem of classification under label noise dates back to [1] and has been studied extensively over the past decades. Our work is mostly related to (i) theoretical analysis of PAC guarantees and consistency of loss functions, including learning with bounded noise [18, 16, 2] and instance-dependent noise [25, 19, 7]; and (ii) algorithms for learning from noisy labels, including using the inverse information of the transition [21, 32] and inducing predictions of the noisy label (which is more similar to our formulation) [6, 30].

Superset (also called partial label) problems, where the annotation is given as a subset of the annotation space, arise in various forms in standard multiclass classification and structured prediction [11, 9, 15, 22]. While it is possible to extend some approaches in the theory of noisy problems to the superset case, the superset problem focuses on the case of a large and complex annotation space, and some of the assumptions (such as a “known transition") would be too strong in practice. On the theoretical side, [11] defines ambiguity degree to characterize the learning bound, and [17] provides an insightful discussion of the PAC-learnability of the superset problem and proposes the concept of induced hypothesis. These two papers motivate the approach pursued in this paper.
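In the superset setting described above, each instance comes with a candidate set guaranteed to contain the gold label, and a standard surrogate objective maximizes the total probability mass the classifier assigns to that set. The sketch below is an illustrative version of such a loss (a generic formulation for intuition, not the specific estimator of any cited paper):

```python
import numpy as np

def partial_label_loss(probs_y, candidate_sets):
    """Negative log of the probability mass the classifier places
    on each instance's candidate (superset) annotation."""
    eps = 1e-12  # numerical safety inside the log
    losses = [-np.log(probs_y[i, list(s)].sum() + eps)
              for i, s in enumerate(candidate_sets)]
    return float(np.mean(losses))

# toy example: the first instance's label is ambiguous between 0 and 1,
# the second instance is fully labeled (singleton candidate set)
probs_y = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.3, 0.6]])
candidate_sets = [{0, 1}, {2}]
loss = partial_label_loss(probs_y, candidate_sets)
```

A singleton candidate set recovers the usual cross-entropy on a gold label, which is one way to see the superset problem as strictly generalizing standard supervised classification.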

Reference

- Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4): 343–370, 1988. doi: 10.1023/A:1022873112823. URL https://doi.org/10.1023/A:1022873112823.
- Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 167–190, Paris, France, 03–06 Jul 2015. PMLR. URL http://proceedings.mlr.press/v40/Awasthi15b.html.
- Yannick Baraud. Bounding the expectation of the supremum of an empirical process over a (weak) vc-major class. Electron. J. Statist., 10(2):1709–1728, 2016. doi: 10.1214/15-EJS1055. URL https://doi.org/10.1214/15-EJS1055.
- Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3(null):463–482, March 2003. ISSN 1532-4435.
- Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In EMNLP, 2013.
- Jakramate Bootkrajang and Ata Kabán. Label-noise robust logistic regression and its applications. In Peter A. Flach, Tijl De Bie, and Nello Cristianini, editors, Machine Learning and Knowledge Discovery in Databases, pages 143–158, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-33460-3.
- Jiacheng Cheng, Tongliang Liu, Kotagiri Ramamohanarao, and Dacheng Tao. Learning with bounded instance- and label-dependent label noise. 09 2017.
- Jesús Cid-Sueiro. Proper losses for learning from partial labels. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, page 1565–1573, Red Hook, NY, USA, 2012. Curran Associates Inc.
- Jesús Cid-Sueiro, Darío García-García, and Raúl Santos-Rodríguez. Consistency of losses for learning from weak labels. In Toon Calders, Floriana Esposito, Eyke Hüllermeier, and Rosa Meo, editors, Machine Learning and Knowledge Discovery in Databases, pages 197–210, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg. ISBN 978-3-662-44848-9.
- James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. Driving semantic parsing from the world’s response. 2010. URL http://cogcomp.org/papers/CGCR10.pdf.
- Timothée Cour, Benjamin Sapp, and Ben Taskar. Learning from partial labels. J. Mach. Learn. Res., 12:1501–1536, 2011.
- Ilias Diakonikolas, Themis Gouleakis, and Christos Tzamos. Distribution-independent pac learning of halfspaces with massart noise. In NeurIPS, 2019.
- Mor Geva, Yoav Goldberg, and Jonathan Berant. Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. In EMNLP/IJCNLP, 2019.
- Aritra Ghosh, Naresh Manwani, and P.S. Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160:93 – 107, 2015. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2014.09.081. URL http://www.sciencedirect.com/science/article/pii/S0925231215001204.
- Takashi Ishida, Gang Niu, Masashi Sugiyama, and Weihua Hu. Learning from complementary labels. 05 2017.
- Michael Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6): 983–1006, November 1998. ISSN 0004-5411. doi: 10.1145/293347.293351. URL https://doi.org/10.1145/293347.293351.
- Li-Ping Liu and Thomas G. Dietterich. Learnability of the superset label learning problem. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages II–1629–II–1637. JMLR.org, 2014. URL http://dl.acm.org/citation.cfm?id=3044805.3045074.
- Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. Ann. Statist., 34 (5):2326–2366, 10 2006. doi: 10.1214/009053606000000786. URL https://doi.org/10.1214/009053606000000786.
- Aditya Krishna Menon, Brendan van Rooyen, and Nagarajan Natarajan. Learning from binary labels with instance-dependent noise. Machine Learning, 107(8):1561–1595, 2018. doi: 10.1007/s10994-018-5715-3. URL https://doi.org/10.1007/s10994-018-5715-3.
- B. K. Natarajan. On learning sets and functions. Machine Learning, 4(1):67–97, 1989. doi: 10.1007/BF00114804. URL https://doi.org/10.1007/BF00114804.
- Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS’13, pages 1196–1204, USA, 2013. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999611.2999745.
- Qiang Ning, Hangfeng He, Chuchu Fan, and Dan Roth. Partial or Complete, That’s The Question. 2019. URL http://cogcomp.org/papers/NHFR19.pdf.
- Aditi Raghunathan, Roy Frostig, John Duchi, and Percy Liang. Estimation from indirect supervision with linear moments. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2568–2577, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/raghunathan16.html.
- Liva Ralaivola, François Denis, and Christophe Magnan. CN = CPCN. volume 148, pages 721–728, 06 2006. doi: 10.1145/1143844.1143935.
- David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise. 05 2017.
- Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In Shai Shalev-Shwartz and Ingo Steinwart, editors, Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pages 489–511, Princeton, NJ, USA, 12–14 Jun 2013. PMLR. URL http://proceedings.mlr.press/v30/Scott13.html.
- Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014. ISBN 1107057132, 9781107057135.
- Jacob Steinhardt and Percy Liang. Learning with relaxed supervision. In NIPS, 2015.
- Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir D. Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv: Computer Vision and Pattern Recognition, 2014.
- A.B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer New York, 2008. ISBN 9780387790527.
- Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18(228):1–50, 2018. URL http://jmlr.org/papers/v18/16-315.html.
- Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. 1971.
