Weakly-Supervised Neural Text Classification

CIKM, pp. 983-992, 2018.

Keywords:
convolutional neural networks, recurrent neural networks, pseudo document generation, weakly-supervised text classification

Abstract:

Deep neural networks are gaining increasing popularity for the classic text classification task, due to their strong expressive power and less requirement for feature engineering. Despite such attractiveness, neural text classification models suffer from the lack of training data in many real-world applications. Although many semi-supervised…

Introduction
  • Text classification plays a fundamental role in a wide variety of applications, ranging from sentiment analysis [27] to document categorization [32] and query intent classification [29].
  • Deep neural models — including convolutional neural networks (CNNs) [11, 12, 35, 36] and recurrent neural networks (RNNs) [22, 23, 32] — have demonstrated superiority for this classic task
  • The attractiveness of these neural models for text classification is mainly two-fold.
  • Training a deep neural model for text classification can consume millions of labeled documents.
  • Generating 500 to 1,000 pseudo documents per class for pre-training strikes a good balance between pre-training time and model performance; a rough sketch of the generation idea follows this list.
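The paper's generator builds pseudo documents for each class in word-embedding space and, roughly speaking, mixes class-related words with background words. The snippet below is only a minimal sketch of that idea under simplifying assumptions: the mean-seed-embedding class vector, the softmax temperature, and the mixing weight alpha are illustrative choices rather than the authors' exact model, and generate_pseudo_docs is a hypothetical helper name.

```python
import numpy as np

def generate_pseudo_docs(seed_words, vocab, embeddings, unigram_probs,
                         n_docs=500, doc_len=100, alpha=0.2, rng=None):
    """Sketch: sample pseudo documents for one class.

    seed_words    : class-related keywords (the seed supervision)
    vocab         : list of all words; embeddings[i] is the vector of vocab[i]
    unigram_probs : background word distribution estimated from the corpus
    alpha         : probability of drawing a background word instead of a
                    class-related word (illustrative mixing weight)
    """
    rng = rng or np.random.default_rng(42)
    word_to_idx = {w: i for i, w in enumerate(vocab)}

    # Represent the class by the normalized mean embedding of its seed words.
    seed_vecs = embeddings[[word_to_idx[w] for w in seed_words if w in word_to_idx]]
    class_vec = seed_vecs.mean(axis=0)
    class_vec /= np.linalg.norm(class_vec)

    # Turn cosine similarity to the class vector into a sampling distribution
    # over the vocabulary (softmax with an illustrative temperature).
    norms = np.linalg.norm(embeddings, axis=1) + 1e-8
    sims = embeddings @ class_vec / norms
    class_probs = np.exp(sims / 0.1)
    class_probs /= class_probs.sum()

    docs = []
    for _ in range(n_docs):
        use_background = rng.random(doc_len) < alpha
        class_ids = rng.choice(len(vocab), size=doc_len, p=class_probs)
        bg_ids = rng.choice(len(vocab), size=doc_len, p=unigram_probs)
        ids = np.where(use_background, bg_ids, class_ids)
        docs.append([vocab[i] for i in ids])
    return docs
```

Calling this with n_docs in the 500 to 1,000 range per class mirrors the balance between pre-training time and model quality noted above.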
Highlights
  • Text classification plays a fundamental role in a wide variety of applications, ranging from sentiment analysis [27] to document categorization [32] and query intent classification [29]
  • In almost every case, WeSTClass-CNN yields the best performance among all methods; WeSTClass-HAN performs slightly worse than WeSTClass-CNN but still outperforms the baseline methods
  • [Figure panels (e) WeSTClass-CNN and (f) WeSTClass-HAN on Yelp Review, plotted over self-training iterations] We find that the self-training module generally has the least effect when supervision comes from labeled documents
  • We have proposed a weakly-supervised text classification method built upon neural classifiers
  • With (1) a pseudo document generator for generating pseudo training data and (2) a self-training module that bootstraps on real unlabeled data for model refinement, our method effectively addresses the key bottleneck of existing neural text classifiers: the lack of labeled training data (a sketch of the self-training loop follows this list)
  • An interesting finding based on the experiments in Section 6 is that different types of weak supervision are all highly helpful for achieving good performance with neural models
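The self-training module mentioned above refines the pre-trained network on the real, unlabeled corpus. Below is a minimal sketch of such a bootstrapping loop, assuming a Keras-style classifier with predict and fit methods; the squared-and-renormalized targets and the label-change stopping criterion follow common self-training practice and are illustrative rather than the authors' exact formulation.

```python
import numpy as np

def self_train(model, unlabeled_docs, max_iters=50, tol=0.001):
    """Sketch of a self-training refinement loop on unlabeled documents.

    model          : classifier exposing predict(X) -> class probabilities and
                     a fit(X, y) update (e.g. a Keras model)
    unlabeled_docs : feature matrix of the real, unlabeled corpus
    tol            : stop when fewer than this fraction of predicted labels
                     change between iterations
    """
    prev_labels = None
    for _ in range(max_iters):
        q = model.predict(unlabeled_docs)             # current soft predictions
        # Sharpen predictions into training targets: emphasize confident
        # assignments, then normalize per document.
        weight = q ** 2 / q.sum(axis=0)
        p = weight / weight.sum(axis=1, keepdims=True)

        labels = q.argmax(axis=1)
        if prev_labels is not None and np.mean(labels != prev_labels) < tol:
            break                                     # labels are stable: stop
        prev_labels = labels

        # Retrain the network against its own sharpened targets.
        model.fit(unlabeled_docs, p, epochs=1, verbose=0)
    return model
```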
Methods
  • When fewer labeled documents are provided, PTE, CNN, and HAN exhibit an obvious performance drop and become very sensitive to the choice of seed documents.
  • WeSTClass-based models, especially WeSTClass-CNN, yield stable performance with varying amounts of labeled documents.
  • This phenomenon shows that the method can more effectively exploit the limited amount of seed information to achieve better performance
Results
  • The authors report the experimental results and the findings.

    6.4.1 Overall Text Classification Performance.
  • In the first set of experiments, the authors compare the classification performance of the method against all the baseline methods on the three datasets.
  • Both macro-F1 and micro-F1 metrics are used to quantify the performance of the different methods; a short computation example follows this list.
  • As shown in Tables 5 and 6, the proposed framework achieves the best overall performance among all the baselines on the three datasets with different weak supervision sources.
  • In almost every case, WeSTClass-CNN yields the best performance among all methods; WeSTClass-HAN performs slightly worse than WeSTClass-CNN but still outperforms the baseline methods
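For concreteness, both metrics can be computed with scikit-learn's f1_score; the labels below are toy values used only to show the two averaging modes.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2, 2]          # gold class labels (toy example)
y_pred = [0, 1, 1, 1, 2, 2, 0]          # predicted labels

# Macro-F1 averages the per-class F1 scores, weighting each class equally;
# micro-F1 pools all decisions, so frequent classes dominate.
macro_f1 = f1_score(y_true, y_pred, average="macro")
micro_f1 = f1_score(y_true, y_pred, average="micro")
print(f"macro-F1 = {macro_f1:.3f}, micro-F1 = {micro_f1:.3f}")
```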
Conclusion
  • The authors have proposed a weakly-supervised text classification method built upon neural classifiers.
  • The authors' method is flexible in incorporating different sources of weak supervision, and generic enough to support different neural models (CNN and RNN).
  • An interesting finding based on the experiments in Section 6 is that different types of weak supervision are all highly helpful for achieving good performance with neural models.
  • It would be interesting to study how to effectively integrate different types of seed information to further boost the performance of the method.
Tables
  • Table1: Dataset Statistics
  • Table2: Keyword Lists for The New York Times Dataset
  • Table3: Keyword Lists for AG’s News Dataset
  • Table4: Keyword Lists for Yelp Review Dataset
  • Table5: Macro-F1 scores for all methods on three datasets. LABELS, KEYWORDS, and DOCS denote that the type of seed supervision is label surface names, class-related keywords, and labeled documents, respectively
  • Table6: Micro-F1 scores for all methods on three datasets. LABELS, KEYWORDS, and DOCS denote that the type of seed supervision is label surface names, class-related keywords, and labeled documents, respectively
  • Table7: Keyword Lists at Top Percentages of Average Tf-idf
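Table 7 reports keywords ranked by their average tf-idf weight within each class. The snippet below is a minimal sketch of how such a ranking could be computed with scikit-learn, assuming the documents of one class are available as a list of strings; top_keywords_by_avg_tfidf and the sample documents are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords_by_avg_tfidf(docs, top_k=10):
    """Rank the vocabulary of one class's documents by average tf-idf weight."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(docs)                # (n_docs, n_terms)
    avg_weights = np.asarray(tfidf.mean(axis=0)).ravel()  # average over documents
    terms = np.array(vectorizer.get_feature_names_out())
    order = avg_weights.argsort()[::-1]
    return list(terms[order[:top_k]])

# Toy usage with a handful of sports-themed documents:
sports_docs = [
    "the team won the championship game",
    "the coach praised the players after the match",
    "a late goal decided the soccer game",
]
print(top_keywords_by_avg_tfidf(sports_docs, top_k=5))
```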
Related work
  • In this section, we review existing studies for weakly-supervised text classification, which can be categorized into two classes: (1) latent variable models; and (2) embedding-based models.

    2.1 Latent Variable Models

    Existing latent variable models for weakly-supervised text classification mainly extend topic models by incorporating user-provided seed information. Specifically, semi-supervised PLSA [16] extends the classic PLSA model by incorporating a conjugate prior based on expert review segments (topic keywords or phrases) to force the extracted topics to align with the provided review segments. [9] encodes prior knowledge and indirect supervision as constraints on the posteriors of latent variable probabilistic models. Descriptive LDA [6] uses an LDA model as the describing device to infer Dirichlet priors from given category labels and descriptions; the Dirichlet priors guide LDA to induce category-aware topics. The seed-guided topic model [14] takes a small set of seed words that are relevant to the semantic meaning of each category, and then predicts the category labels of documents through two kinds of topic influence: category-topics and general-topics; the labels of the documents are inferred from the posterior category-topic assignments. Our method differs from these latent variable models in that it is a weakly-supervised neural model. As such, it enjoys two advantages over latent variable models: (1) it is more flexible in handling different types of seed information, which can be a collection of labeled documents or a set of seed keywords related to each class; and (2) it does not need to impose assumptions on document-topic or topic-keyword distributions, but instead directly uses massive data to learn distributed representations that capture text semantics.
Funding
  • This research is sponsored in part by the U.S. Army Research Lab under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), DARPA under Agreement No. W911NF-17-C-0099, National Science Foundation grants IIS 16-18481, IIS 17-04532, and IIS 17-41317, DTRA HDTRA11810026, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov)
References
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014).
  • Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, and Suvrit Sra. 2005. Clustering on the Unit Hypersphere using von Mises-Fisher Distributions. Journal of Machine Learning Research (2005).
  • Kayhan Batmanghelich, Ardavan Saeedi, Karthik Narasimhan, and Samuel Gershman. 2016. Nonparametric Spherical Topic Modeling with Word Embeddings. In ACL.
  • David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. In NIPS.
  • Ming-Wei Chang, Lev-Arie Ratinov, Dan Roth, and Vivek Srikumar. 2008. Importance of Semantic Representation: Dataless Classification. In AAAI.
  • Xingyuan Chen, Yunqing Xia, Peng Jin, and John A. Carroll. 2015. Dataless Text Classification with Descriptive LDA. In AAAI.
  • Ronald Fisher. 1953. Dispersion on a sphere. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences (1953).
  • Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. In IJCAI.
  • Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior Regularization for Structured Latent Variable Models. Journal of Machine Learning Research (2010).
  • Siddharth Gopal and Yiming Yang. 2014. Von Mises-Fisher Clustering Models. In ICML.
  • Rie Johnson and Tong Zhang. 2015. Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. In HLT-NAACL.
  • Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In EMNLP.
  • Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. TACL (2015).
  • Chenliang Li, Jian Xing, Aixin Sun, and Zongyang Ma. 2016. Effective Document Labeling with Very Few Seed Words: A Topic Model Approach. In CIKM.
  • Keqian Li, Hanwen Zha, Yu Su, and Xifeng Yan. 2018. Unsupervised Neural Categorization for Scientific Publications. In SDM.
  • Yue Lu and Chengxiang Zhai. 2008. Opinion Integration through Semi-supervised Topic Modeling. In WWW.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS.
  • Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. 2016. Adversarial Training Methods for Semi-Supervised Text Classification.
  • Kamal Nigam and Rayid Ghani. 2000. Analyzing the Effectiveness and Applicability of Co-training. In CIKM.
  • Avital Oliver, Augustus Odena, Colin Raffel, Ekin D. Cubuk, and Ian J. Goodfellow. 2018. Realistic Evaluation of Semi-Supervised Learning Algorithms.
  • Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. 2005. Semi-Supervised Self-Training of Object Detection Models. In WACV/MOTION.
  • Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In NIPS.
  • Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In EMNLP.
  • Yangqiu Song and Dan Roth. 2014. On Dataless Hierarchical Text Classification. In AAAI.
  • Suvrit Sra. 2016. Directional statistics in machine learning: a brief review. arXiv preprint arXiv:1605.00316 (2016).
  • Suvrit Sra and Sharon K Sra. 2011. A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x).
  • Duyu Tang, Bing Qin, and Ting Liu. 2015. Document Modeling with Gated Recurrent Neural Network for Sentiment Classification. In EMNLP.
  • Jian Tang, Meng Qu, and Qiaozhu Mei. 2015. PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks. In KDD.
  • Gilad Tsur, Yuval Pinter, Idan Szpektor, and David Carmel. 2016. Identifying Web Queries with Question Intent. In WWW.
  • Junyuan Xie, Ross B. Girshick, and Ali Farhadi. 2016. Unsupervised Deep Embedding for Clustering Analysis. In ICML.
  • Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. 2017. Variational Autoencoder for Semi-Supervised Text Classification. In AAAI.
  • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical Attention Networks for Document Classification. In HLT-NAACL.
  • Chao Zhang, Liyuan Liu, Dongming Lei, Quan Yuan, Honglei Zhuang, Tim Hanratty, and Jiawei Han. 2017. TrioVecEvent: Embedding-Based Online Local Event Detection in Geo-Tagged Tweet Streams. In KDD, 595–604.
  • Xiang Zhang and Yann LeCun. 2015. Text Understanding from Scratch. CoRR abs/1502.01710 (2015).
  • Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In NIPS.