# Provably Consistent Partial-Label Learning

NeurIPS, 2020.

Abstract:

Partial-label learning (PLL) is a multi-class classification problem where each training example is associated with a set of candidate labels. Even though many practical PLL methods have been proposed in the last two decades, there lacks a theoretical understanding of the consistency of those methods: none of the PLL methods hitherto po…


Introduction

- Unlike supervised learning and unsupervised learning, weakly supervised learning [1] aims to learn with weak supervision.
- Examples include semi-supervised learning [2, 3, 4], multi-instance learning [5, 6], positive-unlabeled learning [7, 8], complementary-label learning [9, 10], noisy-label learning [11, 12, 13], positive-confidence learning [14], similar-unlabeled learning [15], and unlabeled-unlabeled learning [16, 17].
- In recent years, another weakly supervised learning framework called partial-label learning (PLL) [18, 19, 20, 21, 22, 23, 24] has gradually attracted attention from machine learning and data mining communities.
- Due to the difficulty in collecting accurately labeled data in many real-world scenarios, PLL has been successfully applied to a wide range of application domains, such as web mining [31], bird song classification [29], and automatic face naming [26].

Highlights

- Unlike supervised learning and unsupervised learning, weakly supervised learning [1] aims to learn with weak supervision
- Partial-label learning (PLL) aims to deal with the problem where each instance is provided with a set of candidate labels, only one of which is the correct label.
- We find that the candidate label sets with higher entropy better match our generation model, and on such datasets, our proposed PLL methods achieve better performance
- We further show via experiments that even when given candidate label sets do not match our proposed generation model well, our methods still significantly outperform other compared methods
- To the best of our knowledge, we provided the first risk-consistent PLL method
- Extensive experimental results clearly demonstrated the effectiveness of the proposed generation model and two PLL methods

Methods

- Based on the assumed partially labeled data distribution in Eq. (5), the authors present a novel risk-consistent method and a novel classifier-consistent method, and theoretically derive an estimation error bound for each of them.
- Both of the consistent methods are agnostic to the specific classification model and can be trained with stochastic optimization, which ensures their scalability to large-scale datasets.
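To fix intuition for the two consistent methods, here is a minimal NumPy sketch. The exact weighting and normalization are assumptions for illustration, not the authors' formulas from Eq. (5): RC is sketched as cross entropy reweighted by the model's own confidence normalized over the candidate set, and CC as the negative log of the total probability mass placed on the candidate set.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the label dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rc_loss(logits, candidate_mask):
    """Risk-consistent-style loss (sketch): cross entropy reweighted by the
    model's confidence renormalized over the candidate set.
    candidate_mask: (n, K) binary array, 1 where a label is a candidate."""
    p = softmax(logits)
    w = p * candidate_mask
    w = w / w.sum(axis=-1, keepdims=True)  # weights treated as fixed
    return -(w * np.log(p + 1e-12)).sum(axis=-1).mean()

def cc_loss(logits, candidate_mask):
    """Classifier-consistent-style loss (sketch): negative log of the total
    probability mass on the candidate set."""
    p = softmax(logits)
    return -np.log((p * candidate_mask).sum(axis=-1) + 1e-12).mean()
```

Both losses depend on the model only through its softmax output, so they plug into any classifier and any stochastic optimizer, matching the model-agnostic, scalable training described above.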

Results

- The authors can observe that RC always achieves the best performance, and in most cases it significantly outperforms the other compared methods.
- The authors further show via experiments that even when given candidate label sets do not match the proposed generation model well, the methods still significantly outperform other compared methods

Conclusion

- The authors for the first time provided an explicit mathematical formulation of the partially labeled data generation process for PLL.
- Based on the data generation model, the authors further derived a novel risk-consistent method and a novel classifier-consistent method.
- To the best of the authors' knowledge, this is the first risk-consistent PLL method.
- The authors theoretically derived an estimation error bound for each of the proposed methods.
- Extensive experimental results clearly demonstrated the effectiveness of the proposed generation model and two PLL methods


- Table1: Test performance (mean±std) of each method using neural networks on benchmark datasets. ResNet is trained on CIFAR-10, and MLP is trained on the other three datasets
- Table2: Test performance (mean±std) of each method using neural networks on benchmark datasets. DenseNet is trained on CIFAR-10, and LeNet is trained on the other three datasets
- Table3: Test performance (mean±std) of each method using linear model on UCI datasets
- Table4: Test performance (mean±std) of each method using linear model on real-world datasets
- Table5: Characteristics of the controlled datasets
- Table6: Characteristics of the real-world partially labeled datasets
- Table7: Transductive accuracy of each method using neural networks on benchmark datasets. ResNet is trained on CIFAR-10, and MLP is trained on the other three datasets
- Table8: Transductive accuracy of each method using neural networks on benchmark datasets. DenseNet is trained on CIFAR-10, and LeNet is trained on the other three datasets
- Table9: Test performance (mean±std) of the RC method using neural networks on benchmark datasets with different generation models. The best performance is highlighted in bold
- Table10: Test performance (mean±std) of the CC method using neural networks on benchmark datasets with different generation models. The best performance is highlighted in bold
- Table11: Test performance (mean±std) of each method using neural networks on benchmark datasets. DenseNet is trained on CIFAR-10, and LeNet is trained on the other three datasets. Candidate label sets are generated by the generation model in Case 1 (entropy=2.015)

Funding

- BH was supported by the Early Career Scheme (ECS) through the Research Grants Council of Hong Kong under Grant No. 22200720, an HKBU Tier-1 Start-up Grant, and an HKBU CSD Start-up Grant.
- GN and MS were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan

Study subjects and analysis

widely used benchmark datasets: 4

Datasets. We collect four widely used benchmark datasets, MNIST [43], Kuzushiji-MNIST [44], Fashion-MNIST [45], and CIFAR-10 [46], and five datasets from the UCI Machine Learning Repository. To generate candidate label sets on these datasets, following the motivation in Section 3.2, we uniformly sample for each instance a candidate label set from C that includes the correct label.
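The uniform sampling step above can be sketched as follows (an illustrative snippet, not the authors' code). Under the assumption made here, sampling uniformly over all 2^(K-1) candidate sets that contain the correct label is equivalent to flipping a fair coin for each incorrect label:

```python
import random

def sample_candidate_set(true_label, num_classes, rng=random):
    """Uniformly sample a candidate label set containing the true label.

    Including each incorrect label independently with probability 1/2 gives
    every set that contains the true label probability (1/2)^(K-1), i.e. a
    uniform distribution over those sets (illustrative assumption).
    """
    candidates = {true_label}
    for label in range(num_classes):
        if label != true_label and rng.random() < 0.5:
            candidates.add(label)
    return candidates
```

For example, with 10 classes and true label 3, repeated draws always contain label 3, while each of the other nine labels appears in roughly half of the draws.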

benchmark datasets: 4

Experimental Results. We run 5 trials on the four benchmark datasets and 10 trials (with a 90%/10% train/test split) on the UCI and real-world partially labeled datasets, and record the mean accuracy with standard deviation (mean±std). We also use a paired t-test at the 5% significance level, where •/◦ indicates that the best of RC and CC is significantly better/worse than a compared method.
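The significance testing described above is a standard paired t-test over matched trials. A minimal stdlib sketch (the per-trial accuracies below are made up for illustration, not the paper's numbers):

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic of a paired t-test on two matched lists of accuracies."""
    diffs = [x - y for x, y in zip(a, b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample std of the per-trial differences
    return mean / (sd / math.sqrt(len(diffs)))

# Hypothetical accuracies of two methods over 5 matched trials.
rc_acc = [0.90, 0.92, 0.91, 0.93, 0.90]
base_acc = [0.85, 0.86, 0.84, 0.87, 0.85]

t = paired_t_statistic(rc_acc, base_acc)
# With 5 trials (4 degrees of freedom), |t| > 2.776 is significant
# at the 5% two-sided level.
significant = abs(t) > 2.776
```

Pairing by trial removes the shared per-trial variation (same data split for both methods), which is why the test is run on the differences rather than on the two accuracy lists separately.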

Reference

- Z.-H. Zhou, “A brief introduction to weakly supervised learning,” National Science Review, vol. 5, no. 1, pp. 44–53, 2018.
- O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning. MIT Press, 2006.
- G. Niu, W. Jitkrittum, B. Dai, H. Hachiya, and M. Sugiyama, “Squared-loss mutual information regularization: A novel information-theoretic approach to semi-supervised learning,” in ICML, pp. 10–18, 2013.
- Y.-F. Li and D.-M. Liang, “Safe semi-supervised learning: a brief introduction,” Frontiers of Computer Science, vol. 13, no. 4, pp. 669–676, 2019.
- J. Amores, “Multiple instance classification: Review, taxonomy and comparative study,” Artificial Intelligence, vol. 201, pp. 81–105, 2013.
- Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, and Y.-F. Li, “Multi-instance multi-label learning,” Artificial Intelligence, vol. 176, no. 1, pp. 2291–2320, 2012.
- M. C. du Plessis, G. Niu, and M. Sugiyama, “Convex formulation for learning from positive and unlabeled data,” in ICML, pp. 1386–1394, 2015.
- R. Kiryo, G. Niu, M. C. du Plessis, and M. Sugiyama, “Positive-unlabeled learning with non-negative risk estimator.,” in NeurIPS, pp. 1674–1684, 2017.
- T. Ishida, G. Niu, W. Hu, and M. Sugiyama, “Learning from complementary labels,” in NeurIPS, pp. 5644–5654, 2017.
- T. Ishida, G. Niu, A. K. Menon, and M. Sugiyama, “Complementary-label learning for arbitrary losses and models.,” in ICML, pp. 2971–2980, 2019.
- B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” in NeurIPS, pp. 8527–8537, 2018.
- L. Feng, S. Shu, Z. Lin, F. Lv, L. Li, and B. An, “Can cross entropy loss be robust to label noise?,” in IJCAI, pp. 2206–2212, 2020.
- H. Wei, L. Feng, X. Chen, and B. An, “Combating noisy labels by agreement: A joint training method with co-regularization,” in CVPR, pp. 13726–13735, 2020.
- T. Ishida, G. Niu, and M. Sugiyama, “Binary classification for positive-confidence data.,” in NeurIPS, pp. 5917–5928, 2018.
- H. Bao, G. Niu, and M. Sugiyama, “Classification from pairwise similarity and unlabeled data.,” in ICML, pp. 452–461, 2018.
- N. Lu, G. Niu, A. K. Menon, and M. Sugiyama, “On the minimal supervision for training any binary classifier from only unlabeled data,” in ICLR, 2019.
- N. Lu, T. Zhang, G. Niu, and M. Sugiyama, “Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach,” in AISTATS, 2020.
- R. Jin and Z. Ghahramani, “Learning with multiple labels,” in NeurIPS, pp. 921–928, 2003.
- T. Cour, B. Sapp, and B. Taskar, “Learning from partial labels,” JMLR, vol. 12, no. 5, pp. 1501–1536, 2011.
- L. Liu and T. Dietterich, “Learnability of the superset label learning problem,” in ICML, pp. 1629–1637, 2014.
- Y.-C. Chen, V. M. Patel, R. Chellappa, and P. J. Phillips, “Ambiguously labeled learning using dictionaries,” TIFS, vol. 9, no. 12, pp. 2076–2088, 2014.
- M.-L. Zhang and F. Yu, “Solving the partial label learning problem: An instance-based approach.,” in IJCAI, pp. 4048–4054, 2015.
- L. Feng and B. An, “Partial label learning with self-guided retraining,” in AAAI, pp. 3542–3549, 2019.
- G. Lyu, S. Feng, T. Wang, C. Lang, and Y. Li, “GM-PLL: Graph matching based partial label learning,” TKDE, 2019.
- E. Hüllermeier and J. Beringer, “Learning from ambiguously labeled examples,” Intelligent Data Analysis, vol. 10, no. 5, pp. 419–439, 2006.
- Z.-N. Zeng, S.-J. Xiao, K. Jia, T.-H. Chan, S.-H. Gao, D. Xu, and Y. Ma, “Learning by associating ambiguously labeled images,” in CVPR, pp. 708–715, 2013.
- C.-H. Chen, V. M. Patel, and R. Chellappa, “Learning from ambiguously labeled face images,” TPAMI, vol. 40, no. 7, pp. 1653–1667, 2018.
- Y. Yao, C. Gong, J. Deng, X. Chen, J. Wu, and J. Yang, “Deep discriminative CNN with temporal ensembling for ambiguously-labeled image classification,” in AAAI, in press, 2020.
- L.-P. Liu and T. G. Dietterich, “A conditional multinomial mixture model for superset label learning,” in NeurIPS, pp. 548–556, 2012.
- C. Gong, T. Liu, Y. Tang, J. Yang, J. Yang, and D. Tao, “A regularization approach for instance-based superset label learning,” IEEE Transactions on Cybernetics, vol. 48, no. 3, pp. 967–978, 2018.
- J. Luo and F. Orabona, “Learning from candidate labeling sets,” in NeurIPS, pp. 1504–1512, 2010.
- N. Nguyen and R. Caruana, “Classification with partial labels,” in KDD, pp. 551–559, 2008.
- L. Feng and B. An, “Leveraging latent label distributions for partial label learning.,” in IJCAI, pp. 2107– 2113, 2018.
- X. Xia, T. Liu, N. Wang, B. Han, C. Gong, G. Niu, and M. Sugiyama, “Are anchor points really indispensable in label-noise learning?,” in NeurIPS, pp. 6835–6846, 2019.
- M.-L. Zhang, F. Yu, and C.-Z. Tang, “Disambiguation-free partial label learning,” TKDE, vol. 29, no. 10, pp. 2155–2167, 2017.
- X. Yu, T. Liu, M. Gong, and D. Tao, “Learning with biased complementary labels,” in ECCV, pp. 68–83, 2018.
- L. Feng, T. Kaneko, B. Han, G. Niu, B. An, and M. Sugiyama, “Learning with multiple complementary labels,” in ICML, 2020.
- B. Han, J. Yao, G. Niu, M. Zhou, I. Tsang, Y. Zhang, and M. Sugiyama, “Masking: A new perspective of noisy supervision,” in NeurIPS, pp. 5836–5846, 2018.
- A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf, “Covariate shift by kernel mean matching,” Dataset Shift in Machine Learning, vol. 3, no. 4, p. 5, 2009.
- P. L. Bartlett and S. Mendelson, “Rademacher and gaussian complexities: Risk bounds and structural results,” JMLR, vol. 3, no. 11, pp. 463–482, 2002.
- N. Golowich, A. Rakhlin, and O. Shamir, “Size-independent sample complexity of neural networks,” arXiv preprint arXiv:1712.06541, 2017.
- J. Lv, M. Xu, L. Feng, G. Niu, X. Geng, and M. Sugiyama, “Progressive identification of true labels for partial-label learning,” in ICML, 2020.
- Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha, “Deep learning for classical Japanese literature,” arXiv preprint arXiv:1812.01718, 2018.
- H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
- A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images,” tech. rep., Citeseer, 2009.
- F. Briggs, X. Z. Fern, and R. Raich, “Rank-loss support instance machines for miml instance annotation,” in KDD, pp. 534–542, 2012.
- M. Guillaumin, J. Verbeek, and C. Schmid, “Multiple instance metric learning from automatically labeled bags of faces,” Lecture Notes in Computer Science, vol. 63, no. 11, pp. 634–647, 2010.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, pp. 770–778, 2016.
- G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, pp. 4700–4708, 2017.
- C. Elkan and K. Noto, “Learning classifiers from only positive and unlabeled data,” in KDD, pp. 213–220, 2008.
- A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “Pytorch: An imperative style, high-performance deep learning library,” in NeurIPS, pp. 8024–8035, 2019.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
- M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. MIT Press, 2012.
- C. McDiarmid, “On the method of bounded differences,” in Surveys in Combinatorics, 1989.
- M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.
- G. Panis and A. Lanitis, “An overview of research activities in facial age estimation using the fg-net aging database,” in ECCV, pp. 737–750, 2014.
- N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions,” 2009.
- D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in NeurIPS, pp. 321–328, 2004.
