Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification

ICML, pp. 10809-10819, 2020.


Abstract:

Extreme multi-label text classification (XMTC) is a task for tagging a given text with the most relevant labels from an extremely large label set. We propose a novel deep learning method called APLC-XLNet. Our approach fine-tunes the recently released generalized autoregressive pretrained model (XLNet) to learn a dense representation fo...

Introduction
  • Extreme classification is the problem of learning a classifier to annotate each instance with the most relevant labels from an extremely large label set.
  • Extreme multi-label text classification (XMTC) is a fundamental task of extreme classification where both the instances and labels are in text format.
  • One traditional approach to representing text features is bag-of-words (BOW), where each document is represented by a vector of word frequencies over a predefined vocabulary.
  • Traditional methods based on BOW or its variants ignore the positions of words and therefore cannot capture the contextual and semantic information of the text (see the sketch below).
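The sketch below (using scikit-learn, which the paper itself does not rely on for this purpose) illustrates the BOW and TF-IDF representations on a toy corpus; the corpus and all names are illustrative only, and the point is simply that word order is discarded.

```python
# A minimal sketch of bag-of-words and TF-IDF features, assuming scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "extreme multi-label text classification",
    "classification with an extremely large label set",
]

# Bag-of-words: each document becomes a vector of word counts over the
# vocabulary induced from the corpus; word positions are not recorded.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)        # sparse matrix of shape (2, |vocabulary|)

# TF-IDF re-weights the same counts; it still ignores word order,
# which is why such features miss contextual and semantic information.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

print(sorted(bow.vocabulary_))
print(X_bow.toarray())
print(X_tfidf.toarray().round(2))
```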
Highlights
  • Extreme classification is the problem of learning a classifier to annotate each instance with the most relevant labels from an extremely large label set
  • Note that the three deep learning approaches take the raw text as the input, and can make use of the contextual and semantic information of the text
  • We have proposed a novel deep learning approach for the XMTC problem
  • We have proposed Adaptive Probabilistic Label Clusters to deal with extreme labels efficiently
  • The application of Adaptive Probabilistic Label Clusters is not limited to XMTC
  • We believe that Adaptive Probabilistic Label Clusters are general enough to serve as the output layer for extreme classification problems in general, especially tasks where the class distribution is unbalanced
Methods
  • The authors conducted experiments on five standard benchmark datasets, including three medium-scale datasets, EURLex-4k, AmazonCat-13k and Wiki10-31k, and two large-scale datasets, Wiki-500k and Amazon-670k.
  • The term frequency-inverse document frequency (TF-IDF) features for the five datasets are publicly available at the Extreme Classification Repository.
  • The authors used the raw text of three datasets, AmazonCat-13k, Wiki10-31k and Amazon-670k, from the Extreme Classification Repository.
  • The authors obtained the raw text of EURLex and Wiki-500k from public websites.
Results
  • [Figure: P@1 for different numbers of clusters and for different partitions on the EURLex and Wiki10 datasets; a sketch of the P@k computation follows this list.]
  • The proposed approach performs worse than DisMEC on dataset Amazon-670k, with a drop of 2 percent.
  • Bonsai has the best performance on four datasets among the three tree-based approaches.
  • Note that the three deep learning approaches take the raw text as the input, and can make use of the contextual and semantic information of the text.
  • They utilize different models to learn the text representation
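Since the results above are reported as P@1, the following minimal sketch shows how Precision@k is typically computed for multi-label predictions; the scores and ground-truth labels are toy values, not numbers from the paper.

```python
# A minimal Precision@k sketch for multi-label predictions, assuming NumPy.
import numpy as np

def precision_at_k(scores, true_labels, k):
    """scores: (num_samples, num_labels) predicted relevance scores.
    true_labels: (num_samples, num_labels) binary ground-truth matrix."""
    topk = np.argsort(-scores, axis=1)[:, :k]             # indices of the k highest scores
    hits = np.take_along_axis(true_labels, topk, axis=1)  # 1 where a top-k label is relevant
    return hits.sum(axis=1).mean() / k                    # average fraction of relevant top-k labels

scores = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3]])
true_labels = np.array([[1, 0, 1], [0, 1, 1]])
print(precision_at_k(scores, true_labels, k=1))  # 1.0 on this toy example
print(precision_at_k(scores, true_labels, k=2))  # 1.0 on this toy example
```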
Conclusion
  • The authors have proposed a novel deep learning approach for the XMTC problem.
  • The authors have proposed APLC to deal with extreme labels efficiently.
  • The authors have carried out theoretical analysis on the model size and computation complexity for APLC.
  • The application of APLC is not limited to XMTC.
  • The authors believe that APLC is general enough to serve as the output layer for extreme classification problems in general, especially tasks where the class distribution is unbalanced (an illustrative sketch follows)
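To make the APLC idea concrete, the sketch below is an illustrative PyTorch reconstruction of an adaptive-cluster output layer: the most frequent labels sit in a full-dimension head, and each tail cluster is reached through a gate in the head and scored in a hidden space whose dimension shrinks by a factor q per cluster. The sigmoid gating, the binary cross-entropy loss and all sizes are assumptions made for this sketch, in the spirit of APLC and of adaptive softmax (Grave et al., 2017); it is not the authors' implementation.

```python
# An illustrative adaptive-cluster output layer in the spirit of APLC (assumed
# formulation, not the authors' code), using PyTorch.
import torch
import torch.nn as nn

class AdaptiveLabelClusters(nn.Module):
    def __init__(self, d_hidden, cluster_sizes, q=2):
        super().__init__()
        self.cluster_sizes = cluster_sizes
        # Head: the most frequent labels at full dimension, plus one gate per tail cluster.
        self.head = nn.Linear(d_hidden, cluster_sizes[0] + len(cluster_sizes) - 1)
        # Tail clusters: project the hidden state to a dimension reduced by q per
        # cluster, then score the labels inside that cluster.
        self.tails = nn.ModuleList()
        for i, size in enumerate(cluster_sizes[1:], start=1):
            d_proj = max(1, d_hidden // (q ** i))
            self.tails.append(nn.Sequential(
                nn.Linear(d_hidden, d_proj, bias=False),
                nn.Linear(d_proj, size),
            ))

    def forward(self, h):
        """h: (batch, d_hidden) text representation -> (batch, num_labels) probabilities."""
        head_out = self.head(h)
        n_head = self.cluster_sizes[0]
        probs = [torch.sigmoid(head_out[:, :n_head])]
        # Probability of a tail label = P(its cluster | x) * P(label | cluster, x).
        for i, tail in enumerate(self.tails):
            gate = torch.sigmoid(head_out[:, n_head + i : n_head + i + 1])
            probs.append(gate * torch.sigmoid(tail(h)))
        return torch.cat(probs, dim=1)

# Toy usage: 10,000 labels split into one head and two tail clusters.
aplc = AdaptiveLabelClusters(d_hidden=768, cluster_sizes=[1000, 3000, 6000], q=2)
probs = aplc(torch.randn(4, 768))                                   # shape (4, 10000)
loss = nn.functional.binary_cross_entropy(probs, torch.zeros(4, 10000))
```

In this sketch the savings come only from scoring most labels in the reduced tail dimensions; the paper's analysis of model size and computational complexity also depends on how samples are distributed over clusters, which this dense version does not model.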
Tables
  • Table 1: Statistics of the datasets. Ntrain is the number of training samples, Ntest is the number of test samples, D is the dimension of the feature vector, L is the cardinality of the label set, L̄ is the average number of labels per sample, and L̂ is the average number of samples per label.
  • Table 2: Implementation details of APLC. dh is the dimension of the input hidden state, q is the factor by which the hidden-state dimension decreases for each tail cluster, Ncls is the number of clusters, and Pnum is the proportion of labels assigned to each cluster.
  • Table 3: Hyperparameters for training the model. Lseq is the length of the input sequence. ηx, ηh, and ηa denote the learning rates of the XLNet model, the hidden layer and the APLC layer, respectively. Nb is the batch size and Ne is the number of training epochs (a sketch of these per-module learning rates follows this list).
  • Table4: Comparisons between APLC-XLNet and state-of-the-art baselines. The best result among all the methods is in bold
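Table 3 lists separate learning rates ηx, ηh and ηa for the XLNet encoder, the hidden layer and the APLC layer. The following minimal sketch shows how such per-module learning rates can be set up with PyTorch parameter groups; the stand-in modules and the concrete rate values are placeholders, not the paper's settings.

```python
# A minimal sketch of per-module learning rates via optimizer parameter groups,
# assuming PyTorch; the modules below merely stand in for XLNet, the hidden
# layer and the APLC layer, and the rates are illustrative placeholders.
import torch.nn as nn
from torch.optim import AdamW

encoder = nn.Linear(768, 768)   # stand-in for the pretrained XLNet encoder
hidden  = nn.Linear(768, 768)   # stand-in for the hidden layer
aplc    = nn.Linear(768, 4000)  # stand-in for the APLC output layer

optimizer = AdamW([
    {"params": encoder.parameters(), "lr": 1e-5},  # eta_x: smallest, protects pretrained weights
    {"params": hidden.parameters(),  "lr": 1e-4},  # eta_h
    {"params": aplc.parameters(),    "lr": 1e-3},  # eta_a: largest, trained from scratch
])
```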
Related work
  • Many effective methods have been proposed to address the challenges of XMTC. They can be broadly categorized into two types according to how the text features are represented. The traditional type uses BOW features and comprises three families of approaches: one-vs-all, embedding-based, and tree-based methods. The other type is the modern deep learning approach: deep learning models learn powerful text representations directly from the raw text and have shown great success on a range of NLP tasks.
References
  • Babbar, R. and Scholkopf, B. DisMEC: Distributed sparse machines for extreme multi-label classification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 721–729. ACM, 2017.
  • Bhatia, K., Jain, H., Kar, P., Varma, M., and Jain, P. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems, pp. 730–738, 2015.
  • Chen, Z., Trabelsi, M., Heflin, J., Xu, Y., and Davison, B. D. Table search using a deep contextualized language model. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2020.
  • Conneau, A. and Lample, G. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pp. 7057–7067, 2019.
  • Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, 2019.
  • Dekel, O. and Shamir, O. Multiclass-multilabel classification with more classes than examples. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 137–144, 2010.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.
  • Grave, E., Joulin, A., Cisse, M., Jegou, H., et al. Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1302–1310. JMLR.org, 2017.
  • Guo, Q., Qiu, X., Liu, P., Shao, Y., Xue, X., and Zhang, Z. Star-Transformer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1315–1325, 2019.
  • Howard, J. and Ruder, S. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339, 2018.
  • Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–944. ACM, 2016.
  • Jain, H., Balasubramanian, V., Chunduri, B., and Varma, M. Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 528–536. ACM, 2019.
  • Jasinska, K., Dembczynski, K., Busa-Fekete, R., Pfannschmidt, K., Klerx, T., and Hullermeier, E. Extreme F-measure maximization using sparse probability estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016.
  • Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431, 2017.
  • Khandagale, S., Xiao, H., and Babbar, R. Bonsai: Diverse and shallow trees for extreme multi-label classification. arXiv preprint arXiv:1904.08249, 2019.
  • Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, 2014.
  • Kudo, T. and Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71, 2018.
  • Lai, S., Xu, L., Liu, K., and Zhao, J. Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • Le, H.-S., Oparin, I., Allauzen, A., Gauvain, J.-L., and Yvon, F. Structured output layer neural network language model. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5524–5527. IEEE, 2011.
  • Liu, J., Chang, W.-C., Wu, Y., and Yang, Y. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115–124. ACM, 2017.
  • Liu, P., Qiu, X., and Huang, X. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2873–2879, 2016.
  • Mikolov, T., Kombrink, S., Burget, L., Cernocky, J., and Khudanpur, S. Extensions of recurrent neural network language model. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5528–5531. IEEE, 2011.
  • Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Mnih, A. and Teh, Y. W. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, pp. 419–426, 2012.
  • Morin, F. and Bengio, Y. Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pp. 246–252, 2005.
  • Prabhu, Y. and Varma, M. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 263–272. ACM, 2014.
  • Prabhu, Y., Kag, A., Harsola, S., Agrawal, R., and Varma, M. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, pp. 993–1002. International World Wide Web Conferences Steering Committee, 2018.
  • Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018.
  • Siblini, W., Kuntz, P., and Meyer, F. CRAFTML, an efficient clustering-based random forest for extreme multi-label learning. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • Tagami, Y. AnnexML: Approximate nearest neighbor search for extreme multi-label classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 455–464. ACM, 2017.
  • Vaswani, A., Zhao, Y., Fossum, V., and Chiang, D. Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1387–1392, 2013.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
  • Chang, W.-C., Yu, H.-F., Zhong, K., Yang, Y., and Dhillon, I. S. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from Transformers. In NeurIPS Science Meets Engineering of Deep Learning Workshop, 2019.
  • Wydmuch, M., Jasinska, K., Kuznetsov, M., Busa-Fekete, R., and Dembczynski, K. A no-regret generalization of hierarchical softmax to extreme multi-label classification. In Advances in Neural Information Processing Systems, pp. 6355–6366, 2018.
  • Xu, H., Liu, B., Shu, L., and Yu, P. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, 2019.
  • Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li, M., and Lin, J. End-to-end open-domain question answering with BERTserini. In NAACL-HLT (Demonstrations), 2019a.
  • Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489, 2016.
  • Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pp. 5753–5763, 2019b.
  • Yen, I. E., Huang, X., Dai, W., Ravikumar, P., Dhillon, I., and Xing, E. PPDSparse: A parallel primal-dual sparse method for extreme classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 545–553. ACM, 2017.
  • Yen, I. E.-H., Huang, X., Ravikumar, P., Zhong, K., and Dhillon, I. PD-Sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In International Conference on Machine Learning, pp. 3069–3077, 2016.
  • Yin, W. and Schütze, H. Attentive convolution: Equipping CNNs with RNN-style attention mechanisms. Transactions of the Association for Computational Linguistics, 6: 687–702, 2018.
  • You, R., Zhang, Z., Wang, Z., Dai, S., Mamitsuka, H., and Zhu, S. AttentionXML: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. In Advances in Neural Information Processing Systems, pp. 5812–5822, 2019.