Cold Start and Interpretability: Turning Regular Expressions into Trainable Recurrent Neural Networks

Chengyue Jiang, Yinggong Zhao, Shanbo Chu, Libin Shen

EMNLP 2020.

Keywords:
symbolic rule, finite-state automata, language processing, finite-automaton recurrent neural networks, low resource

Abstract:

Neural networks can achieve impressive performance on many natural language processing applications, but they typically need large labeled data for training and are not easily interpretable. On the other hand, symbolic rules such as regular expressions are interpretable, require no training, and often achieve decent accuracy; but rules ca…

Introduction
  • Over the past several years, neural network approaches have rapidly gained popularity in natural language processing (NLP) because of their impressive performance and flexible modeling capacity.
  • Regular expressions (REs) must be written by human experts and often have high precision but moderate to low recall; RE-based systems cannot evolve by training on labeled data when it becomes available and usually underperform neural networks in rich-resource scenarios.
  • An RE-based system matches each RE against the input sentence x and then aggregates the matching results into a final label using a set of propositional logic rules (a minimal sketch follows this list).
  • The whole procedure is shown in the top half of Figure 1.
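To make the RE-system procedure above concrete, the following minimal Python sketch matches a handful of REs against a sentence and aggregates the results into a label with a simple OR rule. The patterns, labels, and aggregation logic are illustrative assumptions, loosely modeled on the "distance" example in Table 1, not the rule set or propositional logic used in the paper.

    import re

    # Hypothetical label -> RE mapping; predict label y if ANY of its REs matches.
    # Illustrative only, not the paper's rule set or aggregation logic.
    RULES = {
        "distance": [r"how (far|long|many miles)", r"distance (from|between)"],
        "airfare": [r"(fare|cost|price)s?\b"],
    }

    def classify(sentence, default="other"):
        """Match every RE against the sentence, then aggregate matches into one label."""
        text = sentence.lower()
        for label, patterns in RULES.items():
            if any(re.search(p, text) for p in patterns):
                return label
        return default

    print(classify("how far is it from the airport to downtown"))  # -> "distance"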
Highlights
  • Over the past several years, neural network approaches have rapidly gained popularity in natural language processing (NLP) because of their impressive performance and flexible modeling capacity
  • Regular expressions (REs) must be written by human experts and often have high precision but moderate to low recall; RE-based systems cannot evolve by training on labeled data when it becomes available and usually underperform neural networks in rich-resource scenarios
  • We propose finite-automaton recurrent neural networks (FA-RNNs), a novel type of recurrent neural network designed based on the computation process of weighted finite-state automata
  • Our experiments find that FA-RNNs show clear advantages in both zero-shot and low-resource settings and remain very competitive in rich-resource settings
  • We evaluate the performance of our methods on three text classification datasets that have been used in previous work on integrating REs and neural networks: ATIS (Hemphill et al., 1990), Question Classification (QC) (Li and Roth, 2002) and SMS (Alberto et al., 2015)
  • We propose a type of recurrent neural network called the finite-automaton recurrent neural network (FA-RNN)
Methods
  • RE to FA: as mentioned in Sec. 2.3, the authors can convert an RE into an m-DFA.
  • In order to obtain a concise FA with better interpretability and faster computation, the authors treat the wildcard ‘$’ as a special word in the vocabulary and run the algorithms mentioned in Sec. 2.3 to obtain a “pseudo” m-DFA A.
  • The computation of the WFA forward score (Eq. 2) can be rewritten in a recurrent form (see the sketch after this list).
  • The authors can therefore view a WFA as a form of recurrent neural network (RNN) parameterized by Θ.
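As a concrete illustration of the recurrent view above, the NumPy sketch below computes a WFA forward score with the recurrence h_t = h_{t-1} · T[x_t], followed by a read-out against the final-state weights. The names and shapes (alpha_0, alpha_inf, T) are illustrative assumptions; the exact parameterization Θ in Eq. 2 and the decomposed FA-RNN cell of the paper differ in detail.

    import numpy as np

    # Minimal sketch of the recurrent view of a WFA forward score (not the paper's
    # exact parameterization).
    K, V = 4, 10                                    # number of states, vocabulary size
    alpha_0 = np.zeros(K); alpha_0[0] = 1.0         # initial state weights
    alpha_inf = np.zeros(K); alpha_inf[-1] = 1.0    # final state weights
    T = np.random.rand(V, K, K)                     # one K x K transition matrix per word

    def forward_score(word_ids):
        """Run h_t = h_{t-1} @ T[x_t], then read out the score against alpha_inf."""
        h = alpha_0                                 # hidden "state weight" vector
        for x in word_ids:                          # one RNN-style step per input word
            h = h @ T[x]
        return h @ alpha_inf

    print(forward_score([3, 1, 4, 1]))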
Results
  • The reconstructed RE systems achieve 73.6% accuracy for QC (+9.2% compared with the original REs) and 87.45% for ATIS (+0.45% compared with the original REs).
Conclusion
  • The authors propose a type of recurrent neural network called the FA-RNN.
  • It can be initialized from REs and can then learn from data, making it applicable to zero-shot, cold-start, low-resource, and rich-resource scenarios.
  • It is interpretable and can be converted back into REs. The authors' experiments on text classification show that it outperforms previous neural approaches in both zero-shot and low-resource settings and is very competitive in rich-resource settings.
  • RE rules and code are available at https://github.com/jeffchy/RE2RNN
Tables
  • Table 1: RE for matching sentences asking about distance, and a matched sentence. ‘$’ is the wildcard, ‘|’ is the OR operator, and ‘*’ is the Kleene star operator. The finite automaton converted from the RE is also shown; s2 is the final state
  • Table 2: Soft logic. A, B are proposition symbols with soft truth values a, b (see the sketch after this list)
  • Table 3: Dataset statistics and example REs. L is the label set, R is the RE set, and K is the number of states of the converted WFA. %Acc is the classification accuracy of the RE system. An example RE and its target label are given for each dataset
  • Table 4: Accuracy of zero-shot classification. The RE system and baselines trained on RE-labeled data are included for reference
  • Table 5: Classification accuracy with different amounts of training data
  • Table 6: Ablation study. -F denotes the default method using forward scoring; -V denotes Viterbi scoring; -O denotes the undecomposed version described in Sec. 3.1. Rand denotes random initialization; RandEw denotes using random word embeddings; -TrainER denotes training ER
  • Table 7: Formulas for the numbers of parameters
  • Table 8: Numbers of model parameters after tuning on different datasets
  • Table 9: Full results on the ATIS dataset
  • Table 10: Full results on the QC dataset
  • Table 11: Full results on the SMS dataset
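The soft-logic operators of Table 2 are not reproduced on this page. As a rough sketch, the connectives below follow the Łukasiewicz relaxation used in probabilistic soft logic (Kimmig et al., 2012), which the paper cites; the definitions in Table 2 may differ.

    # Soft-logic connectives over truth values in [0, 1] (Łukasiewicz relaxation,
    # as in probabilistic soft logic). Assumed for illustration; Table 2 may differ.
    def soft_and(a, b):
        return max(0.0, a + b - 1.0)

    def soft_or(a, b):
        return min(1.0, a + b)

    def soft_not(a):
        return 1.0 - a

    # With crisp 0/1 inputs the operators reduce to ordinary Boolean logic:
    print(soft_and(1.0, 1.0), soft_or(0.0, 0.7), soft_not(0.3))  # 1.0 0.7 0.7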
Related work
  • Neural Networks Enhanced by Rules: Hu et al. (2016) and Li and Rush (2020) use rules to constrain neural networks through knowledge distillation and posterior regularization. Awasthi et al. (2020) inject rule knowledge into neural networks via multi-task learning. Lin et al. (2020) train a trigger matching network using additional annotation and use the output of the trigger matching results as the attention of a sequence labeler. Rocktäschel et al. (2015), Xu et al. (2018), and Hsu et al. (2018) use parsed rule results to regularize neural network predictions through additional loss terms. Li and Srikumar (2019) and Luo et al. (2018) inject declarative knowledge, in the form of parsed RE results or first-order expressions, into neural networks by modifying the prediction logits or the attention scores. Hu et al. (2016) and Hsu et al. (2018) also use rules as additional input features.
Funding
  • This work was supported by the National Natural Science Foundation of China (61976139)
Study subjects and analysis
Text classification datasets: 3
4.1 Datasets. We evaluate the performance of our methods on three text classification datasets that have been used in previous work on integrating REs and neural networks: ATIS (Hemphill et al., 1990), Question Classification (QC) (Li and Roth, 2002), and SMS (Alberto et al., 2015). ATIS is a popular dataset consisting of queries about airline information and services.

References
  • Túlio C. Alberto, Johannes V. Lochter, and Tiago A. Almeida. 2015. TubeSpam: Comment spam filtering on YouTube. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pages 138–143. IEEE.
  • Abhijeet Awasthi, Sabyasachi Ghosh, Rasna Goyal, and Sunita Sarawagi. 2020. Learning from rules generalizing labeled exemplars. In ICLR.
  • Leonard E. Baum and Ted Petrie. 1966. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37(6):1554–1563.
  • Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
  • Jesse Dodge, Roy Schwartz, Hao Peng, and Noah A. Smith. 2019. RNN architecture learning with sparse regularization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1179–1184, Hong Kong, China. Association for Computational Linguistics.
  • Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.
  • C. L. Giles, C. W. Omlin, and K. K. Thornber. 1999. Equivalence in knowledge representation: automata, recurrent neural networks, and dynamical fuzzy systems. Proceedings of the IEEE, 87(9):1623–1640.
  • Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.
  • Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.
  • Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24–27, 1990.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • John Hopcroft. 1971. An n log n algorithm for minimizing states in a finite automaton. In Theory of Machines and Computations, pages 189–196. Elsevier.
  • John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. 2001. Introduction to automata theory, languages, and computation. ACM SIGACT News, 32(1):60–65.
  • Haruo Hosoya and Benjamin Pierce. 2001. Regular expression pattern matching for XML. In Proceedings of the 28th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’01), pages 67–80, New York, NY, USA. Association for Computing Machinery.
  • Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 132–141, Melbourne, Australia. Association for Computational Linguistics.
  • Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. 2016. Harnessing deep neural networks with logic rules. In ACL.
  • Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691.
  • Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • Angelika Kimmig, Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. 2012. A short introduction to probabilistic soft logic. In NIPS 2012.
  • Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Tao Li and Vivek Srikumar. 2019. Augmenting neural networks with first-order logic. arXiv preprint arXiv:1906.06298.
  • Xiang Lisa Li and Alexander Rush. 2020. Posterior control of blackbox generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2731–2743, Online. Association for Computational Linguistics.
  • Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (Volume 1), pages 1–7. Association for Computational Linguistics.
  • Bill Yuchen Lin, Dong-Ho Lee, Ming Shen, Ryan Moreno, Xiao Huang, Prashant Shiralkar, and Xiang Ren. 2020. TriggerNER: Learning with entity triggers as explanations for named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8503–8511, Online. Association for Computational Linguistics.
  • Chu-Cheng Lin, Hao Zhu, Matthew R. Gormley, and Jason Eisner. 2019. Neural finite-state transducers: Beyond rational relations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 272–283.
  • Bingfeng Luo, Yansong Feng, Zheng Wang, Songfang Huang, Rui Yan, and Dongyan Zhao. 2018. Marrying up regular expressions with neural networks: A case study for spoken language understanding. In ACL.
  • William Merrill. 2019. Sequential neural networks as automata. CoRR, abs/1906.01615.
  • Christian W. Omlin, Karvel K. Thornber, and C. Lee Giles. 1998. Fuzzy finite-state automata can be deterministically encoded into recurrent neural networks. IEEE Transactions on Fuzzy Systems, 6(1):76–89.
  • Hao Peng, Roy Schwartz, Sam Thomson, and Noah A. Smith. 2018. Rational recurrences. In EMNLP.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Michael O. Rabin and Dana Scott. 1959. Finite automata and their decision problems. IBM Journal of Research and Development, 3(2):114–125.
  • Tim Rocktäschel, Sameer Singh, and Sebastian Riedel. 2015. Injecting logical background knowledge into embeddings for relation extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1119–1129, Denver, Colorado. Association for Computational Linguistics.
  • Roy Schwartz, Sam Thomson, and Noah A. Smith. 2018. SoPa: Bridging CNNs, RNNs, and weighted finite-state machines. In ACL.
  • Ken Thompson. 1968. Programming techniques: Regular expression search algorithm. Communications of the ACM, 11(6):419–422.
  • A. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.
  • Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the practical computational power of finite precision RNNs for language recognition. arXiv preprint arXiv:1805.04908.
  • Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. 2018. A semantic loss function for deep learning with symbolic knowledge. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018).
  • Shanshan Zhang, Lihong He, Slobodan Vucetic, and Eduard Dragut. 2018. Regular expression guided entity mention mining from noisy web data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1991–2000, Brussels, Belgium. Association for Computational Linguistics.