MODE-LSTM: A Parameter-efficient Recurrent Network with Multi-Scale for Sentence Classification

EMNLP 2020, pp. 6705–6715.


Abstract

The central problem of sentence classification is to extract multi-scale n-gram features for understanding the semantic meaning of sentences. Most existing models tackle this problem by stacking CNN and RNN models, which easily leads to feature redundancy and overfitting because of relatively limited datasets. In this paper, we propose a parameter-efficient model, MODE-LSTM, which disentangles the hidden states of the LSTM and equips the structure with sliding windows of different sizes to extract multi-scale n-gram features. Extensive experiments demonstrate that our model achieves better or competitive performance against state-of-the-art baselines on eight benchmark datasets.

Introduction
  • Sentence classification (SC) is a fundamental and traditional task in natural language processing (NLP), which is widely used in many subareas, such as sentiment analysis (Wang et al, 2016a, 2018) and question classification (Shi et al, 2016).
  • The convolution operation itself is linear, which may not be sufficient to model the non-consecutive dependency of the phrase (Lei et al, 2015) and may lose the sequential information (Madasu and Anvesh Rao, 2019).
  • As shown in Figure 1, the weighted sum of the phrase “not almost as bad” does not capture the non-consecutive dependency of “not bad” very well and ignores the sequential information.
  • Some researchers (Zhao et al, 2018a; Zhou et al, 2018; Madasu and Anvesh Rao, 2019) attach an over-parameterized attention mechanism to enhance salient features and remove redundancy, but overfitting still occurs due to the increase in parameters for limited datasets
Highlights
  • Sentence classification (SC) is a fundamental and traditional task in natural language processing (NLP), which is widely used in many subareas, such as sentiment analysis (Wang et al, 2016a, 2018) and question classification (Shi et al, 2016)
  • CNNs excel at extracting n-gram features of sentences through a convolution operation followed by non-linear and pooling layers and have achieved impressive results in sentence classification (Kalchbrenner et al, 2014; Kim, 2014)
  • Although TextCNN has fewer parameters than our model, its parameter count grows with the size of the filter window, whereas the parameters of our model are independent of the window size (a back-of-the-envelope comparison follows this list)
  • This study presents a novel parameter-efficient model called MODE-LSTM that can capture multiscale n-gram features in sentences
  • Instead of the tradition of exploiting complicated operations by stacking CNNs and RNNs, or attaching overparameterized attention mechanisms, our work provides a lightweight method for improving the ability of neural models for sentence classification
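To make the window-size claim in the highlights concrete, the sketch below is a back-of-the-envelope comparison (illustrative only; the embedding size, filter count, and hidden size are assumed, not taken from the paper). A convolutional filter's weight count grows linearly with its window size k, while a recurrent cell's weight count depends only on the input and hidden dimensions, however wide a window it is slid over.

    # Illustrative parameter counting; d, m, and h are assumed dimensions.
    def cnn_filter_params(k, d=300, m=100):
        # one weight per (window position, embedding dimension) for each of m filters
        return k * d * m                  # grows linearly with the window size k

    def lstm_cell_params(d=300, h=100):
        # four gates, each with input weights, recurrent weights, and a bias
        return 4 * (d * h + h * h + h)    # independent of the window size

    for k in (3, 5, 7):
        print(k, cnn_filter_params(k), lstm_cell_params())

Sliding a recurrent cell over a wider window only increases the number of recurrence steps, not the number of weights, which is the property the highlight above refers to.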
Methods
  • Baseline methods: The authors compare MODE-LSTM with three types of strong baselines: 1) CNN/RNN-based models: TextCNN (Kim, 2014), LSTM (Tai et al, 2015) and HM-LSTM (Zhang et al, 2018); 2) hybrid models: C-LSTM (Zhou et al, 2015), which directly stacks CNN and LSTM, while DARLM (Zhou et al, 2018) and Self-attentive (Lin et al, 2017) include an attention mechanism for distilling important information.
  • The authors use the LSTM as the basic unit of DRNN, called DLSTM.
  • In addition to the above models, the authors use ODE-LSTM as a baseline.
  • For ODE-LSTM, the number of small hidden states is set to 6 and their size p to 50 to make the number of parameters consistent with MODE-LSTM
Results
  • MODE-LSTM significantly outperforms the compared models and is superior to DLSTM with an average accuracy gain of over 1.0%, because it disentangles the RNN hidden states and considers multi-scale features in sentences.
  • The authors' model achieves performance better than or comparable to the recent state-of-the-art model HAC.
  • The authors' model is simple yet effective, like the one-layer TextCNN.
  • ODE-LSTM outperforms LSTM with an average accuracy gain of 0.7%, which verifies the effectiveness of disentangling the hidden states
Conclusion
  • This study presents a novel parameter-efficient model called MODE-LSTM that can capture multiscale n-gram features in sentences.
  • Instead of the tradition of exploiting complicated operations by stacking CNNs and RNNs, or attaching overparameterized attention mechanisms, the work provides a lightweight method for improving the ability of neural models for sentence classification.
  • Through disentangling the hidden states of the LSTM and equipping the structure with multiple sliding windows of different scales, MODE-LSTM outperforms popular CNN/RNN-based methods and hybrid methods on various benchmark datasets (a rough sketch of this structure follows this list).
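The source page contains no code, so the following is a minimal sketch of the structure described in this conclusion, under two stated assumptions: that 'disentangling' the hidden state means splitting it into several small states updated independently (a block-diagonal recurrence, implemented here with separate small LSTM cells), and that multi-scale n-gram features come from running this recurrence over sliding windows of several sizes and max-pooling over window positions. Class names, window sizes, and dimensions are placeholders rather than the authors' implementation.

    import torch
    import torch.nn as nn

    class DisentangledLSTM(nn.Module):
        """Splits the hidden state into n small states of size p, each updated by
        its own small LSTM cell (block-diagonal recurrence over a window)."""
        def __init__(self, input_dim, n_chunks=3, chunk_size=50):
            super().__init__()
            self.cells = nn.ModuleList(
                [nn.LSTMCell(input_dim, chunk_size) for _ in range(n_chunks)]
            )
            self.chunk_size = chunk_size

        def forward(self, window):                      # window: (batch, k, input_dim)
            batch = window.size(0)
            states = [(torch.zeros(batch, self.chunk_size, device=window.device),
                       torch.zeros(batch, self.chunk_size, device=window.device))
                      for _ in self.cells]
            for t in range(window.size(1)):             # recurrence over the window
                states = [cell(window[:, t], s) for cell, s in zip(self.cells, states)]
            return torch.cat([h for h, _ in states], dim=-1)

    class MultiScaleEncoder(nn.Module):
        """Runs a DisentangledLSTM over sliding windows of several sizes and
        max-pools over window positions, like a multi-window 1D CNN."""
        def __init__(self, input_dim, window_sizes=(1, 3, 5),
                     n_chunks=3, chunk_size=50, n_classes=2):
            super().__init__()
            self.window_sizes = window_sizes
            self.encoders = nn.ModuleList(
                [DisentangledLSTM(input_dim, n_chunks, chunk_size) for _ in window_sizes]
            )
            self.classifier = nn.Linear(len(window_sizes) * n_chunks * chunk_size, n_classes)

        def forward(self, x):                           # x: (batch, seq_len, input_dim)
            feats = []
            for k, enc in zip(self.window_sizes, self.encoders):
                windows = x.unfold(1, k, 1)             # (batch, n_win, input_dim, k)
                windows = windows.permute(0, 1, 3, 2).contiguous()
                b, w, _, d = windows.shape
                h = enc(windows.view(b * w, k, d)).view(b, w, -1)
                feats.append(h.max(dim=1).values)       # max-pool over positions
            return self.classifier(torch.cat(feats, dim=-1))

    # Toy usage: 4 sentences of 20 tokens with 300-dimensional embeddings.
    logits = MultiScaleEncoder(input_dim=300)(torch.randn(4, 20, 300))
    print(logits.shape)                                 # torch.Size([4, 2])

Read this way, each window is encoded by a small recurrent network instead of a linear filter, which matches the analogy to a multi-window 1D CNN drawn later in the text.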
Tables
  • Table1: Statistics of eight datasets for sentence classification. c: Number of target classes. l: Average sentence length. ml: Maximum sentence length. Train/Dev/Test: Size of train/development/test set (CV means 10-fold cross-validation is used)
  • Table2: Experimental accuracy comparison of our model and baselines on eight sentence classification benchmarks. “#Params” represents the approximate number of parameters, excluding input embeddings, for each model. The results of models marked with * are obtained by our implementation. The input embeddings used in these baselines are the same as in our models. Other parameter settings of the models are consistent with their references. The remaining results are collected from the corresponding papers. A model marked with † (‡) means MODE-LSTM (with BERTbase) is significantly superior to the compared model by a paired t-test (Wilcoxon, 1945) at the p < 0.05 level
  • Table3: Ablation study on some datasets. “Pena.” denotes penalization loss. “Char.” denotes character embeddings
  • Table4: Case study of our model compared to TextCNN and DLSTM. “G.T.” is ground-truth. “N” and “P” represent Negative and Positive. Words with dotted lines, underlines, and wavy lines correspond to the important positions extracted by TextCNN, DLSTM, and MODE-LSTM respectively
Related work
  • CNN-based models: Kalchbrenner et al (2014) propose a deep CNN model with a dynamic k-max pooling operation for the semantic modeling of sentences. However, a simple one-layer CNN with fine-tuned word embeddings also achieves remarkable results (Kim, 2014). Some researchers also use multiple word embeddings as inputs to further improve performance (Yin and Schutze, 2015; Zhang et al, 2016b). Xiao et al (2018) propose a transformable CNN that can adaptively adjust the scope of the convolution filters. Although the above CNN-based methods perform excellently in extracting local semantic features, the linear convolution operation limits their ability to model non-consecutive dependencies and sequential information.

    RNN-based models: RNNs are suitable for processing text sequences and modeling long-term dependencies, so they are also used for sentence modeling. Recently, some works incorporate residual connections (Wang and Tian, 2016) or dense connections (Ding et al, 2018) into recurrent structures to avoid vanishing gradients. Dangovski et al (2019) introduce a rotational unit of memory into RNNs for recalling long-distance information. Zhang et al (2018) propose an HS-LSTM that can automatically discover structured representations in a sentence via reinforcement learning. However, these RNN-based models still display the bias problem where later words are more dominant than earlier words (Yin et al, 2017).
Funding
  • The work described in this paper was partially funded by the National Natural Science Foundation of China (Grant Nos. 61502174, 61872148), the Natural Science Foundation of Guangdong Province (Grant Nos. 2017A030313355, 2019A1515010768), the Guangzhou Science and Technology Planning Project (Grant Nos. 201704030051, 201902010020), and the Key R&D Program of Guangdong Province (Grant No. 2018B010107002).
Study subjects and analysis
benchmark datasets: 8
We then equip this structure with sliding windows of different sizes for extracting multi-scale n-gram features. Extensive experiments demonstrate that our model achieves better or competitive performance against state-of-the-art baselines on eight benchmark datasets. We also combine our model with BERT to further boost the generalization performance

sentence classification datasets: 8
MODE-LSTM is analogous to a 1D CNN using multiple filters with different window sizes, but it uses recurrent transitions instead of the convolution operation. We conduct experiments on eight sentence classification datasets. The experimental results show that our proposed model achieves comparable or better results on these datasets with fewer parameters than other models

widely-studied datasets: 8
4.1 Experimental Setup. Datasets: To evaluate the effectiveness of our model, we conduct experiments on eight widely-studied datasets (Kim, 2014; Liu et al, 2017) for sentence classification. Statistics of these datasets are listed in Table 1

training samples: 100
The results on MR are shown in Figure 5(a). MODE-LSTM outperforms the others with an accuracy gain of over 8% when only 100 training samples are available. As the size continues to increase, the gain gradually decreases, but our model is still superior to the others


Reference
  • Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.
  • Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.
  • Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • Oleksii Kuchaiev and Boris Ginsburg. 2017. Factorization tricks for lstm networks. arXiv preprint arXiv:1703.10722.
  • Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-ninth AAAI conference on artificial intelligence.
  • Ji Young Lee and Franck Dernoncourt. 2016. Sequential short-text classification with recurrent and convolutional neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 515–520.
  • Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2015. Molding cnns for text: non-linear, non-consecutive convolutions. arXiv preprint arXiv:1508.04112.
  • Rumen Dangovski, Li Jing, Preslav Nakov, Mico Tatalovic, and Marin Soljacic. 2019. Rotational unit of memory: A novel representation unit for RNNs with scalable applications. Transactions of the Association for Computational Linguistics, 7.
  • Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2015. Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066.
  • Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
  • Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Dynamic compositional neural networks over tree structure. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 4054–4060.
  • Avinash Madasu and Vijjini Anvesh Rao. 2019. Sequential learning of convolutional features for effective text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5662–5671.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Christian S Perone, Roberto Silveira, and Thomas S Paula. 2018. Evaluation of sentence embeddings in downstream and linguistic probing tasks. arXiv preprint arXiv:1806.06259.
  • Yangyang Shi, Kaisheng Yao, Le Tian, and Daxin Jiang. 2016. Deep lstm based feature mapping for query classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1501–1511.
  • Xingyi Song, Johann Petrak, and Angus Roberts. 2018. A deep neural network sentence level classification method with context information. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 900–904.
  • Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.
  • Baoxin Wang. 2018. Disconnected recurrent neural networks for text categorization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2311–2320.
  • Jin Wang, Liang-Chih Yu, K Robert Lai, and Xuejie Zhang. 2016a. Dimensional sentiment analysis using a regional cnn-lstm model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 225–230.
  • Jin Wang, Liang-Chih Yu, K Robert Lai, and Xuejie Zhang. 2019. Investigating dynamic routing in treestructured lstm for sentiment analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3423–3428.
  • Peng Wang, Jiaming Xu, Bo Xu, Chenglin Liu, Heng Zhang, Fangyuan Wang, and Hongwei Hao. 2015. Semantic clustering and convolutional neural network for short text categorization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 352–357.
  • Shuai Wang, Sahisnu Mazumder, Bing Liu, Mianwei Zhou, and Yi Chang. 2018. Target-sensitive memory networks for aspect sentiment classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 957–967.
  • Xingyou Wang, Weijie Jiang, and Zhiyong Luo. 2016b. Combination of convolutional and recurrent neural network for sentiment analysis of short texts. In Proceedings of COLING 2016, the 26th international conference on computational linguistics: Technical papers, pages 2428–2437.
  • Yiren Wang and Fei Tian. 2016. Recurrent residual learning for sequence classification. In Proceedings of the 2016 conference on empirical methods in natural language processing, pages 938–943.
  • Frank Wilcoxon. 1945. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83.
  • Liqiang Xiao, Honglun Zhang, Wenqing Chen, Yongkun Wang, and Yaohui Jin. 2018. Transformable convolutional neural network for text classification. In IJCAI, pages 4496–4502.
  • Mingzhou Xu, Derek F. Wong, Baosong Yang, Yue Zhang, and Lidia S. Chao. 2019. Leveraging local and global patterns for self-attention networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3069– 3075.
  • Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. 2018. Modeling localness for self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4449– 4458.
  • Baosong Yang, Longyue Wang, Derek F. Wong, Lidia S. Chao, and Zhaopeng Tu. 2019. Convolutional self-attention networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4040–4045.
  • Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schutze. 2017. Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923.
  • Wenpeng Yin and Hinrich Schutze. 2015. Multichannel variable-size convolution for sentence classification. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 204–214.
  • Kun Zhang, Guangyi Lv, Linyuan Wang, Le Wu, Enhong Chen, Fangzhao Wu, and Xing Xie. 2019. Drrnet: Dynamic re-read network for sentence semantic matching. In Thirty-Third AAAI Conference on Artificial Intelligence.
  • Rui Zhang, Honglak Lee, and Dragomir R. Radev. 2016a. Dependency sensitive convolutional neural networks for modeling sentences and documents. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1512–1521.
  • Tianyang Zhang, Minlie Huang, and Li Zhao. 2018. Learning structured representation for text classification via reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Ye Zhang, Stephen Roller, and Byron C. Wallace. 2016b. MGNC-CNN: A simple approach to exploiting multiple word embeddings for sentence classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1522–1527.
  • Jianyu Zhao, Zhiqiang Zhan, Qichuan Yang, Yang Zhang, Changjian Hu, Zhensheng Li, Liuxin Zhang, and Zhiqiang He. 2018a. Adaptive learning of local semantic and global structure representations for text classification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2033–2043.
  • Wei Zhao, Jianbo Ye, Min Yang, Zeyang Lei, Suofei Zhang, and Zhou Zhao. 2018b. Investigating capsule networks with dynamic routing for text classification. arXiv preprint arXiv:1804.00538.
  • Wanshan Zheng, Zibin Zheng, Hai Wan, and Chuan Chen. 2019. Dynamically route hierarchical structure representation to attentive capsule for text classification. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 5464–5470.
  • Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. 2015. A c-lstm neural network for text classification. arXiv preprint arXiv:1511.08630.
  • Qianrong Zhou, Xiaojie Wang, and Xuan Dong. 2018. Differentiated attentive representation learning for sentence classification. In IJCAI, pages 4630–4636.
  • Note on the BERT combination: the sentence is fed into the BERTbase model, and the hidden representation of the last layer of BERTbase is used as the input embeddings of MODE-LSTM rather than GloVe and character embeddings. The BERT representation is then fed into MODE-LSTM for extracting the multi-scale feature representation.
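As a rough illustration of the combination described in the note above, the snippet below obtains last-layer hidden states from a pretrained BERT model with the Hugging Face transformers library and treats them as the input embeddings of a downstream encoder. The checkpoint name and the downstream encoder are assumptions, not the authors' exact setup.

    import torch
    from transformers import AutoTokenizer, AutoModel

    # Hypothetical checkpoint; the note above only says "BERTbase".
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")

    sentences = ["the plot is not almost as bad as the reviews suggested"]
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state   # (batch, seq_len, 768)

    # `hidden` would replace GloVe/character embeddings as the input to the
    # multi-scale recurrent encoder (e.g., the MultiScaleEncoder sketched earlier).
    print(hidden.shape)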
Authors
Zhenxi Lin
Jiangyue Yan
Zipeng Chen
Liuhong Yu