Learning Structured Representation for Text Classification via Reinforcement Learning
AAAI, 2018.
Abstract:
Representation learning is a fundamental problem in natural language processing. This paper studies how to learn a structured representation for text classification. Unlike most existing representation models that either use no structure or rely on pre-specified structures, we propose a reinforcement learning (RL) method to learn sentence representation by discovering task-relevant structures automatically.
Introduction
- Representation learning is a fundamental problem in AI, and important for natural language processing (NLP) (Bengio, Courville, and Vincent 2013; Le and Mikolov 2014).
- Bag-of-words representation models, such as deep averaging networks (Iyyer et al. 2015; Joulin et al. 2017) and autoencoders (Liu et al. 2015), ignore the order of words.
- Sequence representation models such as convolutional neural networks (Kim 2014; Kalchbrenner, Grefenstette, and Blunsom 2014; Lei, Barzilay, and Jaakkola 2015) and recurrent neural networks (Hochreiter and Schmidhuber 1997; Chung et al. 2014) consider word order but do not use any structure.
- Structured representation models, such as tree-structured LSTMs (Zhu, Sobihani, and Guo 2015; Tai, Socher, and Manning 2015), rely on pre-specified parsing structures.
Highlights
- Representation learning is a fundamental problem in AI, and important for natural language processing (NLP) (Bengio, Courville, and Vincent 2013; Le and Mikolov 2014)
- We propose a reinforcement learning (RL) method to build structured sentence representations by identifying task-relevant structures without explicit structure annotations
- We propose two structured representation models: information distilled LSTM (ID-LSTM) and hierarchical structured LSTM (HS-LSTM)
- The model consists of three components: Policy Network (PNet), structured representation models, and Classification Network (CNet)
- This paper has presented a reinforcement learning method which learns sentence representation by discovering task-relevant structures.
Methods
- Overview
The goal of this paper is to learn structured representation for text classification by discovering important, task-relevant structures.
- PNet adopts a stochastic policy and samples an action at each state.
- The structured representation models translate the actions into a structured representation.
- CNet makes the classification based on the structured representation and provides the reward signal to PNet. Since the reward can only be computed once the final representation is available, the process is naturally addressed by the policy gradient method (Sutton et al. 2000); a minimal sketch of this loop is given below.
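To make the PNet, representation model, and CNet loop concrete, here is a minimal PyTorch sketch using a REINFORCE-style policy gradient: a per-word policy samples Retain/Delete actions, a simple mean-pooled representation stands in for the structured representation models, CNet classifies, and the log-probability of the gold label serves as the delayed reward. All names (`pnet`, `cnet`, `train_step`), sizes, and the mean-pooling representation are illustrative assumptions, not the paper's exact ID-LSTM/HS-LSTM architecture.

```python
# Minimal sketch of the PNet -> representation -> CNet -> reward loop,
# trained with a REINFORCE-style policy gradient (Sutton et al. 2000; Williams 1992).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB, HID, N_CLASSES, VOCAB = 50, 64, 2, 1000

embed = nn.Embedding(VOCAB, EMB)
pnet = nn.Linear(EMB, 2)           # stochastic policy: Retain / Delete per word (assumed form)
cnet = nn.Linear(EMB, N_CLASSES)   # classification network over the pooled representation
optim = torch.optim.Adam(list(embed.parameters()) +
                         list(pnet.parameters()) + list(cnet.parameters()), lr=1e-3)

def train_step(word_ids, label):
    x = embed(word_ids)                                   # (T, EMB)
    probs = F.softmax(pnet(x), dim=-1)                    # action distribution per word
    dist = torch.distributions.Categorical(probs)
    actions = dist.sample()                               # 1 = retain, 0 = delete
    log_pi = dist.log_prob(actions).sum()

    mask = actions.float().unsqueeze(-1)
    kept = (x * mask).sum(0) / mask.sum().clamp(min=1.0)  # stand-in for the structured representation
    log_p_y = F.log_softmax(cnet(kept), dim=-1)[label]    # log P(y|X) from CNet

    reward = log_p_y.detach()                             # delayed reward for PNet
    loss = -reward * log_pi - log_p_y                     # REINFORCE term + supervised CNet loss
    optim.zero_grad(); loss.backward(); optim.step()
    return reward.item()

# usage: one gradient step on a toy "sentence"
print(train_step(torch.randint(0, VOCAB, (8,)), torch.tensor(1)))
```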
Results
- Classification results as listed in Table 2 show that the models perform competitively across different datasets and different tasks.
- Compared with pre-specified parsing structures, the automatically discovered structures appear to be better suited for classification.
- These results demonstrate the effectiveness of learning structured representations by discovering task-relevant structures.
- Table 3 contrasts the original texts with the structures produced by ID-LSTM and HS-LSTM; examples include "Cho continues her exploration of the outer limits of raunch with considerable brio." and "Offers an interesting look at the rapidly changing face of Beijing."
Conclusion
- This paper has presented a reinforcement learning method which learns sentence representation by discovering task-relevant structures.
- In the framework of RL, the authors adopted two representation models: ID-LSTM, which distills task-relevant words to form a purified sentence representation, and HS-LSTM, which discovers phrase structures to form a hierarchical sentence representation.
- Extensive experiments show that the method has state-of-the-art performance and is able to discover interesting task-relevant structures without explicit structure annotations.
- The authors will apply the method to other types of sequences, since the idea of structure discovery can be generalized to other tasks and domains.
Summary
- Representation learning is a fundamental problem in AI, and important for natural language processing (NLP) (Bengio, Courville, and Vincent 2013; Le and Mikolov 2014).
- In our RL method, we design two structured representation models: Information Distilled LSTM (ID-LSTM) which selects important, task-relevant words to build sentence representation, and Hierarchical Structured LSTM (HS-LSTM) which discovers phrase structures and builds sentence representation with a two-level LSTM.
- We propose a reinforcement learning method which discovers task-relevant structures to build structured sentence representations for text classification problems.
- The goal of this paper is to learn structured representation for text classification by discovering important, task-relevant structures.
- The model consists of three components: Policy Network (PNet), structured representation models, and Classification Network (CNet).
- Once all the actions are decided, the representation models produce a structured representation of the sentence, which CNet uses to compute P(y|X).
- Reward: after all the actions are sampled by the policy network, the structured representation of the sentence is determined by the representation models and passed to CNet to obtain P(y|X), where y is the class label; this probability is then used to compute the reward for PNet.
- ID-LSTM translates the actions obtained from PNet into a structured representation of the sentence (a minimal sketch is given after this list).
- HS-LSTM translates the actions into a hierarchically structured representation of the sentence.
- If action a_{t-1} is End, the word at position t starts a new phrase and the word-level LSTM begins from a zero-initialized state (see the HS-LSTM sketch after this list).
- The classification network produces a probability distribution over class labels based on the structured representation obtained from ID-LSTM or HS-LSTM.
- Taking sentiment classification as an example, we observed that the words retained by ID-LSTM are mostly sentiment and negation words, indicating that the model can distill important, task-relevant words.
- The most and least deleted words by ID-LSTM in the SST dataset are listed in Table 5, ordered by deletion percentage (Deleted/Count).
- The qualitative and quantitative results demonstrate that ID-LSTM is able to remove irrelevant words and distill task-relevant ones in a sentence.
- Quantitative analysis: First of all, we compared HS-LSTM with other structured models to investigate whether classification tasks can benefit from the discovered structure.
- The results in Table 8 show that HS-LSTM outperforms the other structured models, indicating that the discovered structures may be more task-relevant and advantageous than those given by a parser.
- Our HS-LSTM is able to discover task-relevant structures and build better structured sentence representations.
- In the framework of RL, we adopted two representation models: ID-LSTM, which distills task-relevant words to form a purified sentence representation, and HS-LSTM, which discovers phrase structures to form a hierarchical sentence representation.
- We will apply the method to other types of sequences, since the idea of structure discovery can be generalized to other tasks and domains.
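The following is a minimal sketch of how ID-LSTM could translate the Retain/Delete actions into a distilled sentence representation: the LSTM state is updated only for retained words and carried over unchanged for deleted ones. The function name `id_lstm`, the use of `nn.LSTMCell`, and the dimensions are illustrative assumptions.

```python
# Sketch: ID-LSTM-style distillation of a word sequence given per-word actions.
import torch
import torch.nn as nn

EMB, HID = 50, 64
cell = nn.LSTMCell(EMB, HID)

def id_lstm(word_embs, actions):
    """word_embs: (T, EMB); actions[t] = 1 (retain) or 0 (delete)."""
    h = torch.zeros(1, HID)
    c = torch.zeros(1, HID)
    for x_t, a_t in zip(word_embs, actions):
        if a_t == 1:                              # Retain: normal LSTM update
            h, c = cell(x_t.unsqueeze(0), (h, c))
        # Delete: keep (h, c) unchanged, i.e. skip the word
    return h.squeeze(0)                           # final hidden state = distilled sentence representation

rep = id_lstm(torch.randn(6, EMB), [1, 0, 1, 1, 0, 1])
print(rep.shape)   # torch.Size([64])
```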
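And a minimal sketch of an HS-LSTM-style two-level composition, assuming Inside/End actions segment the sentence into phrases: the word-level LSTM restarts from a zero-initialized state after an End action, as described above, and the phrase-level LSTM consumes the word-level state emitted at each phrase boundary. Module names, sizes, and the exact wiring are illustrative assumptions.

```python
# Sketch: HS-LSTM-style hierarchical composition over Inside/End actions.
import torch
import torch.nn as nn

EMB, HID = 50, 64
word_cell = nn.LSTMCell(EMB, HID)
phrase_cell = nn.LSTMCell(HID, HID)

def hs_lstm(word_embs, actions):
    """word_embs: (T, EMB); actions[t] = 'Inside' or 'End' (phrase boundary at position t)."""
    hw = cw = torch.zeros(1, HID)            # word-level state
    hp = cp = torch.zeros(1, HID)            # phrase-level state
    for x_t, a_t in zip(word_embs, actions):
        hw, cw = word_cell(x_t.unsqueeze(0), (hw, cw))
        if a_t == "End":                     # phrase boundary: feed the phrase into the upper LSTM
            hp, cp = phrase_cell(hw, (hp, cp))
            hw = cw = torch.zeros(1, HID)    # next word starts a new phrase from a zero state
    return hp.squeeze(0)                     # phrase-level state = hierarchical sentence representation

rep = hs_lstm(torch.randn(7, EMB),
              ["Inside", "Inside", "End", "Inside", "Inside", "Inside", "End"])
print(rep.shape)   # torch.Size([64])
```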
Tables
- Table 1: The behavior of HS-LSTM according to actions a_{t-1} and a_t
- Table 2: Classification accuracy on different datasets. Results marked with * are re-printed from (Tai, Socher, and Manning 2015), (Kim 2014), and (Huang, Qian, and Zhu 2017). The rest are obtained by our own implementation
- Table 3: Examples of the structures distilled and discovered by ID-LSTM and HS-LSTM
- Table 4: The original average length and the distilled average length by ID-LSTM in the test set of each dataset
- Table 5: The most/least deleted words in the test set of SST
- Table 6: The comparison of the predefined structures and those discovered by HS-LSTM
- Table 7: Phrase examples discovered by HS-LSTM
- Table 8: Classification accuracy from structured models. The result marked with * is re-printed from (Yogatama et al. 2017)
- Table 9: Statistics of structures discovered by HS-LSTM in the test set of each dataset
Funding
- This work was partly supported by the National Science Foundation of China under grant No. 61272227/61332007
Reference
- Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35(8):1798–1828.
- Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In EMNLP, 632–642.
- Chung, J.; Ahn, S.; and Bengio, Y. 2017. Hierarchical multiscale recurrent neural networks. In ICLR.
- Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop.
- Ghosh, S.; Vinyals, O.; Strope, B.; Roy, S.; Dean, T.; and Heck, L. 2016. Contextual lstm (clstm) models for large scale nlp tasks. In SIGKDD Workshop (oral presentation).
- Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- Huang, M.; Qian, Q.; and Zhu, X. 2017. Encoding syntactic knowledge in neural networks for sentiment classification. ACM Transactions on Information Systems (TOIS) 35(3):26.
- Iyyer, M.; Manjunatha, V.; Boyd-Graber, J.; and Daume III, H. 2015. Deep unordered composition rivals syntactic methods for text classification. In ACL, 1681–1691.
- Joulin, A.; Grave, E.; Bojanowski, P.; and Mikolov, T. 2017. Bag of tricks for efficient text classification. In EACL, 427–431.
- Kalchbrenner, N.; Grefenstette, E.; and Blunsom, P. 2014. A convolutional neural network for modelling sentences. In ACL, 655–665.
- Kim, Y. 2014. Convolutional neural networks for sentence classification. In EMNLP, 1746–1751.
- Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
- Klein, D., and Manning, C. D. 2003. Accurate unlexicalized parsing. In ACL, 423–430.
- Le, Q., and Mikolov, T. 2014. Distributed representations of sentences and documents. In ICML, 1188–1196.
- Lei, T.; Barzilay, R.; and Jaakkola, T. 2015. Molding cnns for text: non-linear, non-consecutive convolutions. In EMNLP, 1565–1575.
- Lin, Z.; Feng, M.; Santos, C. N. d.; Yu, M.; Xiang, B.; Zhou, B.; and Bengio, Y. 2017. A structured self-attentive sentence embedding. In ICLR.
- Liu, B.; Huang, M.; Sun, J.; and Zhu, X. 2015. Incorporating domain and sentiment supervision in representation learning for domain adaptation. In IJCAI, 1277–1283.
- Pang, B., and Lee, L. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL, 271.
- Pang, B., and Lee, L. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, 115–124.
- Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In EMNLP, volume 14, 1532–1543.
- Qian, Q.; Tian, B.; Huang, M.; Liu, Y.; Zhu, X.; and Zhu, X. 2015. Learning tag embeddings and tag-specific composition functions in recursive neural network. In ACL, 1365–1374.
- Qian, Q.; Huang, M.; Lei, J.; and Zhu, X. 2017. Linguistically regularized lstm for sentiment classification. In ACL, 1679–1689.
- Socher, R.; Pennington, J.; Huang, E. H.; Ng, A. Y.; and Manning, C. D. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP, 151–161.
- Socher, R.; Perelygin, A.; Wu, J. Y.; Chuang, J.; Manning, C. D.; Ng, A. Y.; Potts, C.; et al. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 1631–1642.
- Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1057–1063.
- Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. In ACL, 1556–1566.
- Tang, D.; Qin, B.; and Liu, T. 2015. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, 1422–1432.
- Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.
- Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A. J.; and Hovy, E. H. 2016. Hierarchical attention networks for document classification. In NAACL-HLT, 1480–1489.
- Yogatama, D.; Blunsom, P.; Dyer, C.; Grefenstette, E.; and Ling, W. 2017. Learning to compose words into sentences with reinforcement learning. In ICLR.
- Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. In NIPS, 649–657.
- Zhou, X.; Wan, X.; and Xiao, J. 2016. Attention-based lstm network for cross-lingual sentiment classification. In EMNLP, 247–256.
- Zhu, X.; Guo, H.; Mohammad, S.; and Kiritchenko, S. 2014. An empirical study on the effect of negation words on sentiment. In ACL, 304–313.
- Zhu, X.; Sobihani, P.; and Guo, H. 2015. Long short-term memory over recursive structures. In ICML, 1604–1612.