META: Metadata Empowered Weak Supervision for Text Classification
EMNLP 2020, pp.8351-8361, (2020)
Recent advances in weakly supervised learning enable training high-quality text classifiers by only providing a few user-provided seed words. Existing methods mainly use text data alone to generate pseudo-labels despite the fact that metadata information (e.g., author and timestamp) is widely available across various domains. Strong label…
- Weakly supervised text classification has recently gained much attention from researchers because it reduces the burden of annotating data
- The major source of weak supervision lies in the text data itself (Agichtein and Gravano, 2000; Kuipers et al., 2006; Riloff et al., 2003; Tao et al., 2015; Meng et al., 2018; Mekala and Shang, 2020)
- We develop a unified, principled ranking mechanism to select label-indicative motif instances and words, forming expanded weak supervision
- We explore incorporating metadata information as an additional source of weak supervision for text classification, alongside seed words
- We propose META, a novel framework that leverages metadata information as an additional source of weak supervision and incorporates it into the classification framework
- Experimental results and case studies demonstrate that our model outperforms previous methods significantly, thereby signifying the advantages of leveraging metadata as weak supervision
- The authors compare the proposed method with a wide range of baselines; for example, IR-TF-IDF treats the seed words as a query.
- The authors present results of all the baselines on the metadata-augmented datasets, where a token for every relevant motif instance is appended to the text data of a document.
- This is denoted by ++ in Table 2, e.g., WeSTClass++ represents the performance of WeSTClass on metadata-augmented datasets.
- The reported results of HAN-Sup are on the test set, which follows an 80-10-10 train-dev-test split.
- One can observe that META prunes out many motif instances, as the final selection ratio is far less than 100%.
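The metadata augmentation used for the "++" baselines can be sketched as follows. This is a minimal, hypothetical illustration (the function name and token format are assumptions, not the authors' code): one token per relevant motif instance is appended to the document's text, so that text-only classifiers can also see the metadata.

```python
# Hypothetical sketch of metadata augmentation for the "++" baselines:
# each relevant motif instance (e.g., an author or venue) becomes a single
# token appended to the document text.
def augment_with_metadata(text, motif_instances):
    """motif_instances: list of (motif_type, value) pairs, e.g. ("author", "j_han")."""
    tokens = [f"{m_type}_{value}" for m_type, value in motif_instances]
    return text + " " + " ".join(tokens)

doc = "weakly supervised neural text classification"
meta = [("author", "yu_meng"), ("venue", "cikm")]
print(augment_with_metadata(doc, meta))
# -> weakly supervised neural text classification author_yu_meng venue_cikm
```

In contrast, META itself ranks and prunes motif instances rather than appending all of them, which is why its final selection ratio is far below 100%.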
- There should also be negatively label-indicative combinations, which could eliminate some classes from the set of potential labels
- This is another potential direction for extending the method
- Table1: Dataset statistics
- Table2: Evaluation Results on Two Datasets. ++ represents that the input is metadata-augmented
- Table3: Case Study: Expanded motif instances
- Table4: Case Study: Percentage of motif instances expanded for Book Graph dataset. A stands for author, P for publisher, and Y for year
- Table5: Expanded seed words of comics, history, and mystery classes in Books dataset
- We review the literature about (1) weakly supervised text classification methods, (2) text classification with metadata, and (3) document classifiers.
5.1 Weakly Supervised Text Classification
Due to the training data bottleneck in supervised classification, weakly supervised classification has recently attracted much attention from researchers. The majority of weakly supervised classification techniques require seeds in various forms, including label surface names (Li et al., 2018; Song and Roth, 2014; Tao et al., 2015), label-indicative words (Chang et al., 2008; Meng et al., 2018; Tao et al., 2015; Mekala and Shang, 2020), and labeled documents (Tang et al., 2015b; Xu et al., 2017; Miyato et al., 2016; Meng et al., 2018).
Dataless (Song and Roth, 2014) considers label surface names as seeds and classifies documents by embedding both labels and documents in a semantic space and computing the semantic similarity between a document and each potential label. Along similar lines, Doc2Cube (Tao et al., 2015) expands label-indicative words using label surface names and performs multi-dimensional document classification by learning dimension-aware embeddings. WeSTClass (Meng et al., 2018) considers both word-level and document-level supervision sources: it first generates bag-of-words pseudo documents for neural model pre-training, then bootstraps the model on unlabeled data. This method was later extended to a hierarchical setting with a pre-defined hierarchy (Meng et al., 2019). ConWea (Mekala and Shang, 2020) leverages contextualized representation techniques to provide contextualized weak supervision for text classification. However, all these techniques consider only the text data and do not leverage metadata information for classification. In this paper, we focus on user-provided seed words and mine label-indicative words and metadata in an iterative manner.
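The Dataless idea above can be sketched in a few lines. This is a minimal, hypothetical illustration: a toy bag-of-words space stands in for the semantic space of Song and Roth (2014), and the label descriptions below are invented for the example.

```python
# Minimal sketch of Dataless-style classification: embed both the label
# (via a short description of its surface name) and the document in a shared
# space -- here a toy bag-of-words space -- and pick the label with the
# highest cosine similarity to the document.
from collections import Counter
import math

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc, label_descriptions):
    d = bow(doc)
    return max(label_descriptions,
               key=lambda lbl: cosine(d, bow(label_descriptions[lbl])))

labels = {
    "sports": "game team score player match",
    "politics": "election government vote policy",
}
print(classify("the team won the final match", labels))  # -> sports
```

Real dataless methods use dense semantic embeddings (e.g., explicit semantic analysis) rather than raw bag-of-words vectors, but the similarity-based decision rule is the same.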
- The research was sponsored in part by National Science Foundation CA-2040727
Study subjects and analysis
• DBLP dataset: The DBLP dataset contains a comprehensive set of research papers in computer science. We select 38,128 papers published in flagship venues. In addition to text data, it has information about authors, published year, and venue for each paper
- Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the fifth ACM conference on Digital libraries, pages 85–94.
- Austin R Benson, David F Gleich, and Jure Leskovec. 2016. Higher-order organization of complex networks. Science, 353(6295):163–166.
- Ming-Wei Chang, Lev-Arie Ratinov, Dan Roth, and Vivek Srikumar. 2008. Importance of semantic representation: Dataless classification. In AAAI, volume 2, pages 830–835.
- Huimin Chen, Maosong Sun, Cunchao Tu, Yankai Lin, and Zhiyuan Liu. 2016. Neural sentiment classification with user and product attention. In Proceedings of the 2016 conference on empirical methods in natural language processing, pages 1650–1659.
- Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 135–144.
- A. Hensley, A. Doboli, R. Mangoubi, and S. Doboli. 2015. Generalized label propagation. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–8.
- Rie Johnson and Tong Zhang. 2014. Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058.
- Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- Benjamin J Kuipers, Patrick Beeson, Joseph Modayil, and Jefferson Provost. 2006. Bootstrap learning of foundational representations. Connection Science, 18(2):145–158.
- Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-ninth AAAI conference on artificial intelligence.
- Keqian Li, Hanwen Zha, Yu Su, and Xifeng Yan. 2018. Unsupervised neural categorization for scientific publications. In SIAM Data Mining, pages 37–45. SIAM.
- Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. 2015. Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1729–1744.
- Dheeraj Mekala and Jingbo Shang. 2020. Contextualized weak supervision for text classification. arXiv preprint arXiv:2004.14723.
- Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2018. Weakly-supervised neural text classification. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 983–992. ACM.
- Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2019. Weakly-supervised hierarchical text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6826–6833.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. 2002. Network motifs: simple building blocks of complex networks. Science, 298(5594):824–827.
- Takeru Miyato, Andrew M Dai, and Ian Goodfellow. 2016. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725.
- Ellen Riloff, Janyce Wiebe, and Theresa Wilson. 2003. Learning subjective nouns using extraction pattern bootstrapping. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 25–32. Association for Computational Linguistics.
- Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2012. The author-topic model for authors and documents. arXiv preprint arXiv:1207.4169.
- Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, and Jiawei Han. 2018. Automated phrase mining from massive text corpora. IEEE Transactions on Knowledge and Data Engineering, 30(10):1825–1837.
- Jingbo Shang, Meng Qu, Jialu Liu, Lance M Kaplan, Jiawei Han, and Jian Peng. 2016. Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. arXiv preprint arXiv:1610.09769.
- Jingbo Shang, Xinyang Zhang, Liyuan Liu, Sha Li, and Jiawei Han. 2020. Nettaxo: Automated topic taxonomy construction from text-rich network. In Proceedings of The Web Conference 2020, pages 1908– 1919.
- Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
- Yangqiu Song and Dan Roth. 2014. On dataless hierarchical text classification. In AAAI.
- Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S Yu, and Tianyi Wu. 2011. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment, 4(11):992–1003.
- Duyu Tang, Bing Qin, and Ting Liu. 2015a. Learning semantic representations of users and products for document level sentiment classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1014–1023.
- Jian Tang, Meng Qu, and Qiaozhu Mei. 2015b. Pte: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1165–1174. ACM.
- Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. Arnetminer: Extraction and mining of academic social networks. In KDD'08, pages 990–998.
- Fangbo Tao, Chao Zhang, Xiusi Chen, Meng Jiang, Tim Hanratty, Lance Kaplan, and Jiawei Han. 2015. Doc2cube: Automated document allocation to text cube via dimension-aware joint embedding.
- Mengting Wan and Julian J. McAuley. 2018. Item recommendation on monotonic behavior chains. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018, pages 86–94. ACM.
- Mengting Wan, Rishabh Misra, Ndapa Nakashole, and Julian J. McAuley. 2019. Fine-grained spoiler detection from large-scale review corpora. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pages 2605–2610. Association for Computational Linguistics.
- Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. 2017. Variational autoencoder for semi-supervised text classification. In AAAI.
- Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.
- Yu Zhang, Wei Wei, Binxuan Huang, Kathleen M Carley, and Yan Zhang. 2017. Rate: Overcoming noise and sparsity of textual features in real-time location estimation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 2423–2426.
- Yu Zhang, Yu Meng, Jiaxin Huang, Frank F Xu, Xuan Wang, and Jiawei Han. 2020. Minimally supervised categorization of text with metadata. arXiv preprint arXiv:2005.00624.