META: Metadata Empowered Weak Supervision for Text Classification

EMNLP 2020, pp.8351-8361, (2020)


Abstract

Recent advances in weakly supervised learning enable training high-quality text classifiers by only providing a few user-provided seed words. Existing methods mainly use text data alone to generate pseudo-labels despite the fact that metadata information (e.g., author and timestamp) is widely available across various domains. Strong label...

Highlights
  • Weakly supervised text classification has recently gained much attention from researchers because it reduces the burden of annotating data
  • The major source of weak supervision lies in text data itself (Agichtein and Gravano, 2000; Kuipers et al., 2006; Riloff et al., 2003; Tao et al., 2015; Meng et al., 2018; Mekala and Shang, 2020)
  • We develop a unified, principled ranking mechanism to select label-indicative motif instances and words, forming expanded weak supervision
  • We explore incorporating metadata information as an additional source of weak supervision for text classification, alongside seed words
  • We propose META, a novel framework that leverages metadata information as an additional source of weak supervision and incorporates it into the classification framework
  • Experimental results and case studies demonstrate that our model outperforms previous methods significantly, thereby signifying the advantages of leveraging metadata as weak supervision
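As a minimal illustration of seed-word weak supervision (not the authors' META pipeline, which additionally ranks and expands label-indicative motif instances and words), pseudo-labels can be derived by counting seed-word occurrences; the function and seed lists below are hypothetical:

```python
from collections import Counter

def pseudo_label(doc_tokens, seed_words):
    """Assign the class whose seed words occur most often in the document.

    `seed_words` maps class name -> list of user-provided seed words.
    Returns None when no seed word matches (the document stays unlabeled).
    """
    counts = Counter()
    tokens = [t.lower() for t in doc_tokens]
    for label, seeds in seed_words.items():
        counts[label] = sum(tokens.count(s.lower()) for s in seeds)
    best_label, best_count = counts.most_common(1)[0]
    return best_label if best_count > 0 else None

seeds = {"sports": ["game", "team"], "politics": ["election", "senate"]}
print(pseudo_label("the team won the game".split(), seeds))  # sports
```

Documents with no seed-word match stay unlabeled, which is why expanding the seed set (with more words and, in META, metadata) increases coverage.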
Methods
  • The authors compare the proposed method with a wide range of baseline methods; for example, IR-TF-IDF treats the seed words as a query.
  • The authors present results of all the baselines on the metadata-augmented datasets, where a token for every relevant motif instance is appended to the text data of a document.
  • This is denoted by ++ in Table 2, e.g., WeSTClass++ represents the performance of WeSTClass on metadata-augmented datasets.
  • The reported HAN-Sup results are on the test set of an 80-10-10 train-dev-test split
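The metadata augmentation used for the ++ baselines (appending one token per relevant motif instance to a document's text) can be sketched as follows; the field names and token format here are illustrative assumptions, not the authors' exact scheme:

```python
def augment_with_metadata(text, metadata):
    """Append one synthetic token per metadata field (e.g. author, year)
    so that a text-only classifier can also see metadata signals.

    `metadata` maps field name -> value; spaces in values are replaced
    so each motif instance stays a single token.
    """
    motif_tokens = [
        f"{field}_{str(value).replace(' ', '_')}"
        for field, value in metadata.items()
    ]
    return text + " " + " ".join(motif_tokens)

doc = "a study of network motifs"
meta = {"author": "Jane Doe", "year": 2020}
print(augment_with_metadata(doc, meta))
# a study of network motifs author_Jane_Doe year_2020
```

Because the motif tokens enter the vocabulary like ordinary words, any text classifier can consume the augmented documents without architectural changes.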
Results
  • One can observe that META prunes out many motif instances, as the final selection ratio is far less than 100%.
Conclusion
  • The authors propose META, a novel framework that leverages metadata information as an additional source of weak supervision and incorporates it into the classification framework.
  • Experimental results and case studies demonstrate that the model outperforms previous methods significantly, thereby signifying the advantages of leveraging metadata as weak supervision.
  • There should also be negatively label-indicative combinations, which can eliminate some classes from the set of potential labels
  • Extending the method in this direction is another promising avenue for future work
Tables
  • Table1: Dataset statistics
  • Table2: Evaluation Results on Two Datasets. ++ represents that the input is metadata-augmented
  • Table3: Case Study: Expanded motif instances
  • Table4: Case Study: Percentage of motif instances expanded for Book Graph dataset. A stands for author, P for publisher and Y for year
  • Table5: Expanded seed words of comics, history, and mystery classes in Books dataset
Related work
  • We review the literature about (1) weakly supervised text classification methods, (2) text classification with metadata, and (3) document classifiers.

    5.1 Weakly Supervised Text Classification

    Due to the training data bottleneck in supervised classification, weakly supervised classification has recently attracted much attention from researchers. The majority of weakly supervised classification techniques require seeds in various forms, including label surface names (Li et al., 2018; Song and Roth, 2014; Tao et al., 2015), label-indicative words (Chang et al., 2008; Meng et al., 2018; Tao et al., 2015; Mekala and Shang, 2020), and labeled documents (Tang et al., 2015b; Xu et al., 2017; Miyato et al., 2016; Meng et al., 2018).

    Dataless (Song and Roth, 2014) considers label surface names as seeds and classifies documents by embedding both labels and documents in a semantic space and computing the semantic similarity between a document and each potential label. Along similar lines, Doc2Cube (Tao et al., 2015) expands label-indicative words using label surface names and performs multi-dimensional document classification by learning dimension-aware embeddings. WeSTClass (Meng et al., 2018) considers both word-level and document-level supervision sources: it first generates bag-of-words pseudo documents for neural model pre-training, then bootstraps the model on unlabeled data. This method was later extended to a hierarchical setting with a pre-defined hierarchy (Meng et al., 2019). ConWea (Mekala and Shang, 2020) leverages contextualized representation techniques to provide contextualized weak supervision for text classification. However, all these techniques consider only the text data and do not leverage metadata information for classification. In this paper, we focus on user-provided seed words and mine label-indicative words and metadata in an iterative manner.
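The dataless idea described above can be sketched minimally: embed the document and each label in a shared semantic space and pick the most cosine-similar label. The vectors below are toy illustrations, not representations from the original system:

```python
from math import sqrt

def dataless_classify(doc_vec, label_vecs):
    """Return the label whose embedding is most cosine-similar to the
    document embedding. `doc_vec` and the vectors in `label_vecs`
    (label -> vector) are assumed to live in one shared space,
    e.g. averaged word embeddings."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
        return dot / norm
    return max(label_vecs, key=lambda lbl: cos(doc_vec, label_vecs[lbl]))

labels = {"sports": [1.0, 0.1], "politics": [0.1, 1.0]}
print(dataless_classify([0.9, 0.2], labels))  # sports
```

Because classification reduces to nearest-label search, no labeled training documents are needed, which is what makes the approach "dataless".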
Funding
  • The research was sponsored in part by National Science Foundation CA-2040727
Study subjects and analysis
papers: 38,128
• DBLP dataset: The DBLP dataset contains a comprehensive set of research papers in computer science. We select 38,128 papers published in flagship venues. In addition to text data, it has information about the authors, publication year, and venue of each paper.

Reference
  • Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries, pages 85–94.
  • Austin R. Benson, David F. Gleich, and Jure Leskovec. 2016. Higher-order organization of complex networks. Science, 353(6295):163–166.
  • Ming-Wei Chang, Lev-Arie Ratinov, Dan Roth, and Vivek Srikumar. 2008. Importance of semantic representation: Dataless classification. In AAAI, volume 2, pages 830–835.
  • Huimin Chen, Maosong Sun, Cunchao Tu, Yankai Lin, and Zhiyuan Liu. 2016. Neural sentiment classification with user and product attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1650–1659.
  • Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 135–144.
  • A. Hensley, A. Doboli, R. Mangoubi, and S. Doboli. 2015. Generalized label propagation. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–8.
  • Rie Johnson and Tong Zhang. 2014. Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058.
  • Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • Benjamin J. Kuipers, Patrick Beeson, Joseph Modayil, and Jefferson Provost. 2006. Bootstrap learning of foundational representations. Connection Science, 18(2):145–158.
  • Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
  • Keqian Li, Hanwen Zha, Yu Su, and Xifeng Yan. 2018. Unsupervised neural categorization for scientific publications. In SIAM Data Mining, pages 37–45. SIAM.
  • Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. 2015. Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1729–1744.
  • Dheeraj Mekala and Jingbo Shang. 2020. Contextualized weak supervision for text classification. arXiv preprint arXiv:1612.06778.
  • Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2018. Weakly-supervised neural text classification. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 983–992. ACM.
  • Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2019. Weakly-supervised hierarchical text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6826–6833.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
  • Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. 2002. Network motifs: Simple building blocks of complex networks. Science, 298(5594):824–827.
  • Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. 2016. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725.
  • Ellen Riloff, Janyce Wiebe, and Theresa Wilson. 2003. Learning subjective nouns using extraction pattern bootstrapping. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 25–32. Association for Computational Linguistics.
  • Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2012. The author-topic model for authors and documents. arXiv preprint arXiv:1207.4169.
  • Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, and Jiawei Han. 2018. Automated phrase mining from massive text corpora. IEEE Transactions on Knowledge and Data Engineering, 30(10):1825–1837.
  • Jingbo Shang, Meng Qu, Jialu Liu, Lance M. Kaplan, Jiawei Han, and Jian Peng. 2016. Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. arXiv preprint arXiv:1610.09769.
  • Jingbo Shang, Xinyang Zhang, Liyuan Liu, Sha Li, and Jiawei Han. 2020. NetTaxo: Automated topic taxonomy construction from text-rich network. In Proceedings of The Web Conference 2020, pages 1908–1919.
  • Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.
  • Yangqiu Song and Dan Roth. 2014. On dataless hierarchical text classification. In AAAI.
  • Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, and Tianyi Wu. 2011. PathSim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment, 4(11):992–1003.
  • Duyu Tang, Bing Qin, and Ting Liu. 2015a. Learning semantic representations of users and products for document level sentiment classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1014–1023.
  • Jian Tang, Meng Qu, and Qiaozhu Mei. 2015b. PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1165–1174. ACM.
  • Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. ArnetMiner: Extraction and mining of academic social networks. In KDD'08, pages 990–998.
  • Fangbo Tao, Chao Zhang, Xiusi Chen, Meng Jiang, Tim Hanratty, Lance Kaplan, and Jiawei Han. 2015. Doc2Cube: Automated document allocation to text cube via dimension-aware joint embedding.
  • Mengting Wan and Julian J. McAuley. 2018. Item recommendation on monotonic behavior chains. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys 2018), pages 86–94. ACM.
  • Mengting Wan, Rishabh Misra, Ndapa Nakashole, and Julian J. McAuley. 2019. Fine-grained spoiler detection from large-scale review corpora. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL 2019), Volume 1: Long Papers, pages 2605–2610. Association for Computational Linguistics.
  • Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. 2017. Variational autoencoder for semi-supervised text classification. In AAAI.
  • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.
  • Yu Zhang, Wei Wei, Binxuan Huang, Kathleen M. Carley, and Yan Zhang. 2017. RATE: Overcoming noise and sparsity of textual features in real-time location estimation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 2423–2426.
  • Yu Zhang, Yu Meng, Jiaxin Huang, Frank F. Xu, Xuan Wang, and Jiawei Han. 2020. Minimally supervised categorization of text with metadata. arXiv preprint arXiv:2005.00624.
Author
Dheeraj Mekala
Xinyang Zhang