Extreme Multi-label Classification from Aggregated Labels

ICML 2020.

Keywords:
multi-label learning, multi-instance multi-label, extreme multi-label classification, large scale, XMC methods
Weibo:
We study XMC with aggregated labels and propose EAGLE (Efficient AGgregated Label lEarning), the first efficient algorithm for this setting, which improves over standard XMC methods in most settings

Abstract:

Extreme multi-label classification (XMC) is the problem of finding the relevant labels for an input, from a very large universe of possible labels. We consider XMC in the setting where labels are available only for groups of samples - but not for individual ones. Current XMC approaches are not built for such multi-instance multi-label (MIML) […]

Introduction
  • Extreme multi-label classification (XMC) is the problem of finding the relevant labels for an input from a very large universe of possible labels.
  • Modern machine learning applications often need to deal with various forms of weak supervision, such as partial or noisy labels and active labeling.
  • These scenarios have led to the exploration of advanced learning methods, including semi-supervised learning, robust learning, and active learning.
Highlights
  • Extreme multi-label classification (XMC) is the problem of finding the relevant labels for an input from a very large universe of possible labels
  • We study a typical weak-supervision setting for XMC, named Aggregated Label eXtreme Multi-label Classification (AL-XMC), where only aggregated labels are provided for groups of samples
  • We propose the Efficient AGgregated Label lEarning algorithm (EAGLE), which assigns labels to each sample by learning label embeddings based on the structure of the aggregation (a minimal sketch of this setting follows this list)
  • We find that EAGLE performs better than EAGLE-0 almost consistently, across all tasks and all grouping methods, and is much better than the Baseline that ignores the aggregation structure
  • The performance of EAGLE on multiple multi-instance multi-label (MIML) tasks is shown in Table 3
  • We study XMC with aggregated labels and propose EAGLE, the first efficient algorithm for this setting, which improves over standard XMC methods in most settings
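
To make the aggregated-label setting and the label-embedding idea above concrete, the following is a minimal Python sketch, not the authors' implementation: each group exposes only the union of its members' labels, each label is embedded as the mean feature vector of the groups that contain it, and every sample is then assigned the highest-scoring labels from its own group's aggregated label set. The function names, the embedding construction, and the inner-product scoring rule are all illustrative assumptions.

```python
import numpy as np

def learn_label_embeddings(X, groups, group_labels, n_labels):
    """Illustrative sketch: embed each label as the mean feature vector of
    the groups whose aggregated label set contains that label."""
    emb = np.zeros((n_labels, X.shape[1]))
    counts = np.zeros(n_labels)
    for members, labels in zip(groups, group_labels):
        centroid = X[members].mean(axis=0)
        for l in labels:
            emb[l] += centroid
            counts[l] += 1
    seen = counts > 0
    emb[seen] /= counts[seen, None]   # average over the groups containing each label
    return emb

def disaggregate(X, groups, group_labels, emb, k=1):
    """Assign each sample the k labels from its own group's aggregated label
    set whose embeddings score highest against the sample's features."""
    per_sample = {}
    for members, labels in zip(groups, group_labels):
        cand = np.array(sorted(labels))
        scores = X[members] @ emb[cand].T            # (|group|, |candidate labels|)
        top = np.argsort(-scores, axis=1)[:, :k]
        for i, row in zip(members, top):
            per_sample[i] = cand[row].tolist()
    return per_sample

# Toy usage: 6 samples in 2 groups; only the groups' label unions are observed.
X = np.random.default_rng(0).standard_normal((6, 8))
groups = [[0, 1, 2], [3, 4, 5]]
group_labels = [{0, 2}, {1, 3, 4}]
emb = learn_label_embeddings(X, groups, group_labels, n_labels=5)
print(disaggregate(X, groups, group_labels, emb, k=1))
```
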
Results
  • The authors empirically verify the effectiveness of EAGLE from multiple standpoints.
  • The authors run synthetic experiments on standard XMC datasets to understand the advantages of EAGLE under multiple aggregation rules (a sketch of this construction follows this list).
  • The authors find that EAGLE performs better than EAGLE-0 almost consistently, across all tasks and all grouping methods, and is much better than the Baseline, which ignores the aggregation structure.
  • This consistency persists when the within-group heterogeneity is changed by injecting noise into the feature representations used by the hierarchical clustering algorithm, as shown in Figure 4(c).
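
A hedged sketch of how such aggregated-label data can be simulated from a standard XMC dataset is shown below: samples are grouped either at random (as in R-4/R-10) or after perturbing their features to control within-group heterogeneity (as in Figure 4(c)), and each group observes only the union of its members' labels. The helper names and the noise model are assumptions for illustration, not the exact experimental pipeline.

```python
import numpy as np

def random_groups(n_samples, group_size, rng):
    """R-k style grouping: shuffle the samples and cut them into groups of size k."""
    perm = rng.permutation(n_samples)
    return [perm[i:i + group_size].tolist()
            for i in range(0, n_samples, group_size)]

def aggregate_labels(groups, sample_labels):
    """A group's observed label set is the union of its members' individual labels."""
    return [set().union(*(sample_labels[i] for i in g)) for g in groups]

def noisy_features(X, noise_std, rng):
    """Perturb features before clustering-based grouping to vary the
    within-group heterogeneity (cf. Figure 4(c))."""
    return X + noise_std * rng.standard_normal(X.shape)

# Toy usage: 8 singly-labeled samples grouped in fours; only group labels remain.
rng = np.random.default_rng(0)
sample_labels = [{i % 5} for i in range(8)]
groups = random_groups(8, group_size=4, rng=rng)
print(groups, aggregate_labels(groups, sample_labels))
```
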
Conclusion
  • The authors study XMC with aggregated labels and propose EAGLE, the first efficient algorithm for this setting, which improves over standard XMC methods in most settings.
  • The authors' work leaves open several interesting issues to study in the future.
  • First, while using positively labeled groups to learn label embeddings, what is the most efficient way to learn from, or sample, negatively labeled groups?
  • Second, is there a way to estimate the clustering quality and adjust the hyper-parameters accordingly?
  • Moving forward, the authors believe the co-attention […]
Tables
  • Table1: Statistics of the 4 XMC datasets. The ‘sample size’ column includes both the training and test sets. The last column reports precision on the clean datasets, which can be viewed as the oracle performance for an XMC dataset with aggregated labels
  • Table2: Comparing Baseline, EAGLE-0 (EAGLE without label learning), and EAGLE on small/mid/large-size XMC datasets with aggregated labels. ‘O’ stands for an oversized model (>5 GB). R-4/R-10 forms groups by randomly selecting 4 or 10 samples and observing only their aggregated labels. C forms groups by hierarchical k-means clustering, with the cluster depth chosen based on sample size (8 for EurLex-4k and Wiki-10k, 16 for AmazonCat-13k and Wiki-325k); a sketch of this grouping follows this list
  • Table3: Prediction accuracy on multiple MIML tasks
  • Table4: List of notations
  • Table5: EurLex-4k detailed performance results. We list the standard deviation of each precision score, computed over 5 random seeds
  • Table6: Hyper-parameter search for Deep-MIML on the Yelp dataset. We search over batch size (bs) in {32, ..., 512} and learning rate (lr) in {1e-4, ..., 0.2}, and select bs = 64, lr = 1e-4 for our experiments
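
The clustering-based grouping C referenced in Table 2 relies on hierarchical k-means with a fixed depth (8 or 16 levels). Below is a minimal sketch of such a depth-limited recursive 2-means partition, using scikit-learn's KMeans as an assumed stand-in; the paper's actual tree construction may differ in its splitting criterion and balancing.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans_groups(X, depth):
    """Recursively split the samples with 2-means for `depth` levels,
    yielding up to 2**depth leaf clusters that serve as groups."""
    groups = [np.arange(len(X))]
    for _ in range(depth):
        next_groups = []
        for idx in groups:
            if len(idx) < 2:                       # too small to split further
                next_groups.append(idx)
                continue
            labels = KMeans(n_clusters=2, n_init=5).fit_predict(X[idx])
            for c in (0, 1):
                part = idx[labels == c]
                if len(part):
                    next_groups.append(part)
        groups = next_groups
    return [g.tolist() for g in groups]

# Toy usage: depth 3 yields at most 8 groups; the paper uses depth 8 or 16.
X = np.random.default_rng(0).standard_normal((40, 16))
print(len(hierarchical_kmeans_groups(X, depth=3)))
```
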
Related work
  • Extreme multi-label classification (XMC). The most classic and straightforward approach for XMC is the One-Vs-All (OVA) method [YHR+16, BS17, LCWY17, YHD+17], which simply treats each label separately and learns a classifier for each label. OVA has been shown to achieve high accuracy, but its computation is too expensive for extremely large label sets. Tree-based methods, on the other hand, try to improve the efficiency of OVA by using hierarchical representations for samples [AGPV13, PV14, JPV16, SZK+17] or labels [PKH+18, JBCV19]. Among these approaches, label-partitioning-based methods, including Parabel [PKH+18], have achieved leading performance with training cost sub-linear in the number of labels. Apart from tree-based methods, embedding-based methods [ZYWZ18, CYZ+19, YZDZ19, GMW+19] have recently been studied in the context of XMC in order to make better use of textual features. In general, while embedding-based methods may learn a better representation and use contextual information better than tf-idf, their scalability is worse than that of tree-based methods. Very recently, Medini et al. [MHW+19] applied count-min sketching to learn XMC models with label sets at the scale of 50 million.
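
As a point of reference, the One-Vs-All reduction described above can be sketched in a few lines: one independent binary classifier is trained per label, and prediction scores every label for every sample. This is an illustrative scikit-learn stand-in, not the DiSMEC or PD-Sparse implementations, and it makes both training and prediction cost scale linearly with the number of labels, which is exactly what the tree-based and embedding-based methods aim to reduce.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ova(X, Y):
    """One-Vs-All: fit one independent binary classifier per label column of
    the (n_samples, n_labels) 0/1 indicator matrix Y. Assumes every label has
    at least one positive and one negative training example."""
    return [LogisticRegression(max_iter=200).fit(X, Y[:, l])
            for l in range(Y.shape[1])]

def predict_topk(classifiers, X, k=5):
    """Score every label for every sample and keep the k highest-scoring labels."""
    scores = np.column_stack([c.predict_proba(X)[:, 1] for c in classifiers])
    return np.argsort(-scores, axis=1)[:, :k]
```
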
Reference
  • [AGPV13] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web, pages 13–24, 2013.
  • [APZ17] Abubakar Abid, Ada Poon, and James Zou. Linear regression with shuffled labels. arXiv preprint arXiv:1705.01342, 2017.
  • [BS17] Rohit Babbar and Bernhard Schölkopf. DiSMEC: Distributed sparse machines for extreme multi-label classification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 721–729, 2017.
  • Olivier Collier and Arnak S Dalalyan. Minimax rates in permutation estimation for feature matching. The Journal of Machine Learning Research, 17(1):162–192, 2016.
  • [CSGR10] Ming-Wei Chang, Vivek Srikumar, Dan Goldwasser, and Dan Roth. Structured output learning with indirect supervision. In ICML, pages 199–206, 2010.
  • [CYZ+19] Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, and Inderjit Dhillon. X-BERT: extreme multi-label text classification using bidirectional encoder representations from transformers. arXiv preprint arXiv:1905.02331, 2019.
  • [DLLP97] Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence, 89(1-2):31–71, 1997.
  • [DWP+15] Emily Denton, Jason Weston, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. User conditional hashtag prediction for images. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 1731–1740, 2015.
  • [FCS+13] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pages 2121–2129, 2013.
  • Ji Feng and Zhi-Hua Zhou. Deep MIML network. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [GMW+19] Chuan Guo, Ali Mousavi, Xiang Wu, Daniel N Holtmann-Rice, Satyen Kale, Sashank Reddi, and Sanjiv Kumar. Breaking the glass ceiling for embedding-based classifiers for large output spaces. In Advances in Neural Information Processing Systems, pages 4944–4954, 2019.
  • Saeid Haghighatshoar and Giuseppe Caire. Signal recovery from unlabeled samples. IEEE Transactions on Signal Processing, 66(5):1242–1257, 2017.
  • [HSS17] Daniel J Hsu, Kevin Shi, and Xiaorui Sun. Linear regression without correspondence. In Advances in Neural Information Processing Systems, pages 1531–1540, 2017.
  • [ITW18] Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. In International Conference on Machine Learning, pages 2132–2141, 2018.
  • [JBCV19] Himanshu Jain, Venkatesh Balasubramanian, Bhanu Chunduri, and Manik Varma. Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 528–536, 2019.
  • [JPV16] Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935–944, 2016.
  • [LCWY17] Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124, 2017.
  • [LLK+19] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pages 3744–3753, 2019.
  • Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2015.
  • [McC99] Andrew McCallum. Multi-label text classification with a mixture model trained by EM. In AAAI workshop on Text Learning, pages 1–7, 1999.
  • [MHW+19] Tharun Kumar Reddy Medini, Qixuan Huang, Yiqiu Wang, Vijai Mohan, and Anshumali Shrivastava. Extreme classification in log memory using count-min sketch: A case study of Amazon search with 50M products. In Advances in Neural Information Processing Systems, pages 13244–13254, 2019.
  • [MLP98] Oded Maron and Tomás Lozano-Pérez. A framework for multiple-instance learning. In Advances in neural information processing systems, pages 570–576, 1998.
  • [NDRT13] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196–1204, 2013.
  • [PKB+15] Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artieres, George Paliouras, Eric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, and Patrick Galinari. LSHTC: A benchmark for large-scale text classification. arXiv preprint arXiv:1503.08581, 2015.
  • Yashoteja Prabhu, Anil Kag, Shilpa Gopinath, Kunal Dahiya, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. Extreme multi-label learning with label features for warm-start tagging, ranking & recommendation. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 441–449. ACM, 2018.
  • [PKH+18] Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, pages 993–1002, 2018.
  • [PV14] Yashoteja Prabhu and Manik Varma. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 263–272, 2014.
  • [PWC17a] Ashwin Pananjady, Martin J Wainwright, and Thomas A Courtade. Denoising linear models with permuted data. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 446–450. IEEE, 2017.
  • [PWC17b] Ashwin Pananjady, Martin J Wainwright, and Thomas A Courtade. Linear regression with shuffled data: Statistical and computational limits of permutation recovery. IEEE Transactions on Information Theory, 64(5):3286–3300, 2017.
  • Yanyao Shen and Sujay Sanghavi. Iterative least trimmed squares for mixed linear regression. In Advances in Neural Information Processing Systems, pages 6076–6086, 2019.
  • Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning, pages 5739–5748, 2019.
  • [SZK+17] Si Si, Huan Zhang, S Sathiya Keerthi, Dhruv Mahajan, Inderjit S Dhillon, and Cho-Jui Hsieh. Gradient boosted decision trees for high dimensional sparse output. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3182–3190. JMLR. org, 2017.
  • [WBU11] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
  • Hai Wang and Hoifung Poon. Deep probabilistic logic: A unifying framework for indirect supervision. arXiv preprint arXiv:1808.08485, 2018.
  • [YHD+17] Ian E. H. Yen, Xiangru Huang, Wei Dai, Pradeep Ravikumar, Inderjit Dhillon, and Eric Xing. PPDSparse: A parallel primal-dual sparse method for extreme classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 545–553, 2017.
  • [YHR+16] Ian En-Hsu Yen, Xiangru Huang, Pradeep Ravikumar, Kai Zhong, and Inderjit Dhillon. Pdsparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In International Conference on Machine Learning, pages 3069–3077, 2016.
  • [YJKD14] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label learning with missing labels. In International conference on machine learning, pages 593–601, 2014.
  • [YZDZ19] Ronghui You, Zihan Zhang, Suyang Dai, and Shanfeng Zhu. HAXMLNet: Hierarchical attention network for extreme multi-label text classification. arXiv preprint arXiv:1904.12578, 2019.
  • [ZYWZ18] Wenjie Zhang, Junchi Yan, Xiangfeng Wang, and Hongyuan Zha. Deep extreme multi-label learning. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pages 100–107. ACM, 2018.
  • Zhi-Hua Zhou and Min-Ling Zhang. Multi-instance multi-label learning with application to scene classification. In Advances in Neural Information Processing Systems, pages 1609–1616, 2007.
  • Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE transactions on knowledge and data engineering, 26(8):1819–1837, 2013.
  • [ZZHL12] Zhi-Hua Zhou, Min-Ling Zhang, Sheng-Jun Huang, and Yu-Feng Li. Multi-instance multi-label learning. Artificial Intelligence, 176(1):2291–2320, 2012.