Convolutional Neural Networks for Sentence Classification

EMNLP, pp. 1746-1751, 2014.


Abstract:

We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.

Introduction
  • Deep learning models have achieved remarkable results in computer vision (Krizhevsky et al, 2012) and speech recognition (Graves et al, 2013) in recent years.
  • Word vectors, wherein words are projected from a sparse, 1-of-V encoding onto a lower dimensional vector space via a hidden layer, are essentially feature extractors that encode semantic features of words in their dimensions.
  • In such dense representations, semantically close words are likewise close—in Euclidean or cosine distance—in the lower dimensional vector space (a short similarity sketch follows this list).
  • Invented for computer vision, CNN models have subsequently been shown to be effective for NLP and have achieved excellent results in semantic parsing (Yih et al, 2014), search query retrieval (Shen et al, 2014), sentence modeling (Kalchbrenner et al, 2014), and other traditional NLP tasks (Collobert et al, 2011)
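
To make the notion of closeness in the dense space concrete, here is a minimal Python sketch; the words and low-dimensional vectors are made up purely for illustration (real word2vec vectors are learned and typically 300-dimensional):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two dense word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors, chosen only to illustrate that semantically
# related words end up close together in the embedded space.
vec_good  = np.array([0.8, 0.1, 0.3, 0.5])
vec_great = np.array([0.7, 0.2, 0.4, 0.5])
vec_table = np.array([-0.3, 0.9, -0.2, 0.1])

print(cosine_similarity(vec_good, vec_great))  # high similarity: related words
print(cosine_similarity(vec_good, vec_table))  # lower similarity: unrelated words
```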
Highlights
  • Deep learning models have achieved remarkable results in computer vision (Krizhevsky et al, 2012) and speech recognition (Graves et al, 2013) in recent years
  • Word vectors, wherein words are projected from a sparse, 1-of-V encoding onto a lower dimensional vector space via a hidden layer, are essentially feature extractors that encode semantic features of words in their dimensions
  • Invented for computer vision, convolutional neural network (CNN) models have subsequently been shown to be effective for NLP and have achieved excellent results in semantic parsing (Yih et al, 2014), search query retrieval (Shen et al, 2014), sentence modeling (Kalchbrenner et al, 2014), and other traditional NLP tasks (Collobert et al, 2011)
  • In the present work we have described a series of experiments with convolutional neural networks built on top of word2vec (a minimal architecture sketch follows this list)
  • Our results add to the well-established evidence that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP
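
As a rough illustration of the kind of architecture these highlights refer to, the sketch below wires pre-trained word vectors into a one-layer CNN with max-over-time pooling. It is written in PyTorch purely for concreteness; the class name, filter sizes, and dropout rate are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Minimal sketch of a one-layer CNN sentence classifier built on top of
    pre-trained word vectors: embedding lookup -> convolutions over word
    windows -> max-over-time pooling -> dropout -> linear classifier."""

    def __init__(self, pretrained, num_classes, filter_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        # pretrained: (vocab_size, embed_dim) tensor, e.g. word2vec vectors.
        # freeze=True corresponds to a "static" variant; freeze=False to "non-static".
        self.embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)
        embed_dim = pretrained.size(1)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=h) for h in filter_sizes]
        )
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)         # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                 # (batch, embed_dim, seq_len)
        # One feature map per filter width, then max-over-time pooling per map.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)
        return self.fc(self.dropout(features))
```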
Results
  • Results and Discussion

    Results of the models against other methods are listed in Table 2.
  • Even a simple model with static vectors (CNN-static) performs remarkably well, giving competitive results against the more sophisticated deep learning models that utilize complex pooling schemes (Kalchbrenner et al, 2014) or require parse trees to be computed beforehand (Socher et al, 2013).
  • Fine-tuning the pre-trained vectors for each task gives still further improvements (CNN-non-static); a minimal illustration of the static vs. non-static distinction is sketched below.
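
The static vs. non-static distinction in these results comes down to whether the pre-trained vectors are updated during training. A minimal sketch, again in PyTorch and with a random matrix standing in for real word2vec vectors:

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained word2vec matrix of shape (vocab_size, embed_dim).
pretrained = torch.randn(20000, 300)

# CNN-static: vectors are initialized from word2vec and kept fixed.
static_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

# CNN-non-static: same initialization, but the vectors are fine-tuned
# (updated by backpropagation) for each task.
non_static_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

print(static_emb.weight.requires_grad)      # False
print(non_static_emb.weight.requires_grad)  # True
```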
Conclusion
  • In the present work the authors have described a series of experiments with convolutional neural networks built on top of word2vec.
Tables
  • Table 1: Summary statistics for the datasets after tokenization. c: Number of target classes. l: Average sentence length. N: Dataset size. |V|: Vocabulary size. |Vpre|: Number of words present in the set of pre-trained word vectors. Test: Test set size (CV means there was no standard train/test split and thus 10-fold CV was used)
  • Table 2: Results of our CNN models against other methods. RAE: Recursive Autoencoders with pre-trained word vectors from Wikipedia (Socher et al, 2011). MV-RNN: Matrix-Vector Recursive Neural Network with parse trees (Socher et al, 2012). RNTN: Recursive Neural Tensor Network with tensor-based feature function and parse trees (Socher et al, 2013). DCNN: Dynamic Convolutional Neural Network with k-max pooling (Kalchbrenner et al, 2014). Paragraph-Vec: Logistic regression on top of paragraph vectors (Le and Mikolov, 2014). CCAE: Combinatorial Category Autoencoders with combinatorial category grammar operators (Hermann and Blunsom, 2013). Sent-Parser: Sentiment analysis-specific parser (Dong et al, 2014). NBSVM, MNB: Naive Bayes SVM and Multinomial Naive Bayes with uni-bigrams from Wang and Manning (2012). G-Dropout, F-Dropout: Gaussian Dropout and Fast Dropout from Wang and Manning (2013). Tree-CRF: Dependency tree with Conditional Random Fields (Nakagawa et al, 2010). CRF-PR: Conditional Random Fields with Posterior Regularization (Yang and Cardie, 2014). SVMS: SVM with uni-bi-trigrams, wh word, head word, POS, parser, hypernyms, and 60 hand-coded rules as features from Silva et al (2011)
  • Table 3: Top 4 neighboring words—based on cosine similarity—for vectors in the static channel (left) and fine-tuned vectors in the non-static channel (right) from the multichannel model on the SST-2 dataset after training (a short sketch of this nearest-neighbor computation follows below)
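
A nearest-neighbor listing like the one in Table 3 can be reproduced with a few lines of NumPy; the helper below is a hypothetical sketch assuming an embedding matrix and a word-to-index vocabulary dictionary:

```python
import numpy as np

def top_k_neighbors(word, embeddings, vocab, k=4):
    """Return the k words whose vectors are closest to `word` by cosine similarity.
    embeddings: (vocab_size, dim) array; vocab: dict mapping word -> row index."""
    inv_vocab = {i: w for w, i in vocab.items()}
    # Normalize rows so that a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = normed[vocab[word]]
    sims = normed @ query
    order = np.argsort(-sims)
    return [inv_vocab[i] for i in order if inv_vocab[i] != word][:k]
```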
Contributions
  • Reports on a series of experiments with convolutional neural networks trained on top of pre-trained word vectors for sentence-level classification tasks
  • Shows that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks
  • Proposes a simple modification to the architecture to allow for the use of both task-specific and static vectors
  • Describes a simple modification to the architecture to allow for the use of both pre-trained and task-specific vectors by having multiple channels
  • Has described the process by which one feature is extracted from one filter (the corresponding equations are reproduced below)
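
For reference, the single-filter computation mentioned in the last point can be summarized as follows, using the original paper's notation: $x_i \in \mathbb{R}^k$ is the vector of the $i$-th word, $\mathbf{w} \in \mathbb{R}^{hk}$ is a filter applied to a window of $h$ words, $b$ is a bias term, and $f$ is a non-linearity such as tanh:

$$ c_i = f(\mathbf{w} \cdot \mathbf{x}_{i:i+h-1} + b), \qquad \mathbf{c} = [c_1, c_2, \ldots, c_{n-h+1}], \qquad \hat{c} = \max\{\mathbf{c}\} $$

Each filter thus contributes exactly one feature $\hat{c}$ (the maximum over its feature map), and the final feature vector is built from many such filters with several window sizes.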
References
  • Y. Bengio, R. Ducharme, P. Vincent. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155.
  • R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493–2537.
  • J. Duchi, E. Hazan, Y. Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.
  • L. Dong, F. Wei, S. Liu, M. Zhou, K. Xu. 2014. A Statistical Parsing Framework for Sentiment Classification. CoRR, abs/1401.6330.
  • A. Graves, A. Mohamed, G. Hinton. 2013. Speech recognition with deep recurrent neural networks. In Proceedings of ICASSP 2013.
  • B. Pang, L. Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of ACL 2004.
  • B. Pang, L. Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL 2005.
  • A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson. 2014. CNN Features off-the-shelf: an Astounding Baseline for Recognition. CoRR, abs/1403.6382.
  • Y. Shen, X. He, J. Gao, L. Deng, G. Mesnil. 2014. Learning Semantic Representations Using Convolutional Neural Networks for Web Search. In Proceedings of WWW 2014.
  • J. Silva, L. Coheur, A. Mendes, A. Wichert. 2011. From symbolic to sub-symbolic information in question classification. Artificial Intelligence Review, 35(2):137–154.
  • G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.
  • R. Socher, J. Pennington, E. Huang, A. Ng, C. Manning. 2011. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proceedings of EMNLP 2011.
  • K. Hermann, P. Blunsom. 2013. The Role of Syntax in Vector Space Models of Compositional Semantics. In Proceedings of ACL 2013.
  • R. Socher, B. Huval, C. Manning, A. Ng. 2012. Semantic Compositionality through Recursive Matrix-Vector Spaces. In Proceedings of EMNLP 2012.
  • M. Hu, B. Liu. 2004. Mining and Summarizing Customer Reviews. In Proceedings of ACM SIGKDD 2004.
  • M. Iyyer, P. Enns, J. Boyd-Graber, P. Resnik. 2014. Political Ideology Detection Using Recursive Neural Networks. In Proceedings of ACL 2014.
  • N. Kalchbrenner, E. Grefenstette, P. Blunsom. 2014. A Convolutional Neural Network for Modelling Sentences. In Proceedings of ACL 2014.
  • A. Krizhevsky, I. Sutskever, G. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of NIPS 2012.
  • Q. Le, T. Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of ICML 2014.
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November.
  • X. Li, D. Roth. 2002. Learning Question Classifiers. In Proceedings of ACL 2002.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, C. Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of EMNLP 2013.
  • J. Wiebe, T. Wilson, C. Cardie. 2005. Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation, 39(2-3):165–210.
  • S. Wang, C. Manning. 2012. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. In Proceedings of ACL 2012.
  • S. Wang, C. Manning. 2013. Fast Dropout Training. In Proceedings of ICML 2013.
  • B. Yang, C. Cardie. 2014. Context-aware Learning for Sentence-level Sentiment Analysis with Posterior Regularization. In Proceedings of ACL 2014.
  • W. Yih, K. Toutanova, J. Platt, C. Meek. 2011. Learning Discriminative Projections for Text Similarity Measures. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pp. 247–256.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS 2013.
  • T. Nakagawa, K. Inui, S. Kurohashi. 2010. Dependency tree-based sentiment classification using CRFs with hidden variables. In Proceedings of ACL 2010.
  • W. Yih, X. He, C. Meek. 2014. Semantic Parsing for Single-Relation Question Answering. In Proceedings of ACL 2014.
  • M. Zeiler. 2012. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701.