KERMIT: Complementing Transformer Architectures with Encoders of Explicit Syntactic Interpretations
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 256–267, 2020.
Abstract:
Syntactic parsers have dominated natural language understanding for decades. Yet, their syntactic interpretations are losing centrality in downstream tasks due to the success of large-scale textual representation learners. In this paper, we propose KERMIT (Kernel-inspired Encoder with Recursive Mechanism for Interpretable Trees) to embed symbolic syntactic parse trees into artificial neural networks and to visualize how syntax is used in inference.
Introduction
- Universal sentence embeddings (Conneau et al., 2018), which are task-independent, distributed sentence representations, are redesigning the way linguistic models in natural language processing are defined.
- Socher et al. (2011) defined the notion of Recursive Neural Networks (RecNNs), which are recurrent neural networks applied to binary trees.
- These RecNNs have been used to parse sentences, not to include pre-existing syntax in a final task (Socher et al., 2011).
- Munkhdalai and Yu (2017) specialized LSTMs for binary and n-ary trees with their Neural Tree Indexers, and Strubell et al. (2018) encoded syntactic information by using multi-head attention within a transformer architecture.
Highlights
- Universal sentence embeddings (Conneau et al., 2018), which are task-independent, distributed sentence representations, are redesigning the way linguistic models in natural language processing are defined.
- We propose KERMIT (Kernel-inspired Encoder with Recursive Mechanism for Interpretable Trees) to embed symbolic syntactic parse trees into artificial neural networks and to visualize how syntax is used in inference (a hedged sketch of the general idea follows this list).
- We investigate whether explicit universal syntactic interpretations can be used to improve state-of-the-art universal sentence embeddings and to create neural network architectures where syntax decisions are less obscure and, thus, syntactically explainable.
- Results from the completely universal experimental setting suggest that universal syntactic interpretations complement syntax in universal sentence embeddings.
- Universal syntactic interpretations are valuable language interpretations, which have been developed over years of study.
- We introduced KERMIT to show that these interpretations can be effectively used in combination with universal sentence embeddings produced from scratch.
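To make the highlighted idea more concrete, the following is a minimal, hypothetical sketch of how a symbolic parse tree could be recursively mapped to a fixed-size vector, in the spirit of the kernel-inspired recursive mechanism and of the distributed tree kernels cited in the references. It is not the authors' implementation: the dimensionality, the decay factor, and the composition operator are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch only: NOT the authors' implementation. It illustrates
# the general idea of recursively mapping a symbolic parse tree to a
# fixed-size vector, in the spirit of kernel-inspired recursive tree
# encoders and the distributed tree kernels cited in the references.
DIM = 4096        # dimensionality of the tree embedding (illustrative)
LAMBDA = 0.4      # decay factor down-weighting larger subtrees (illustrative)
rng = np.random.default_rng(0)
label_vectors = {}  # one random vector per node label

def vec(label):
    """Return a cached random unit vector for a node label."""
    if label not in label_vectors:
        v = rng.standard_normal(DIM)
        label_vectors[label] = v / np.linalg.norm(v)
    return label_vectors[label]

def compose(a, b):
    """Stand-in composition operator (element-wise product, renormalised);
    the actual encoder uses a specific, more structured operator."""
    c = a * b
    n = np.linalg.norm(c)
    return c / n if n > 0 else c

def encode(tree):
    """Encode a tree given as (label, [children]) into a DIM-sized vector by
    summing weighted subtree representations computed recursively."""
    label, children = tree
    node_vec = vec(label)
    if not children:
        return np.sqrt(LAMBDA) * node_vec
    child_vecs = [encode(child) for child in children]
    subtree = node_vec
    for cv in child_vecs:
        subtree = compose(subtree, cv)
    return np.sqrt(LAMBDA) * subtree + sum(child_vecs)

# Toy constituency tree for "dogs bark"
tree = ("S", [("NP", [("NNS", [("dogs", [])])]),
              ("VP", [("VBP", [("bark", [])])])])
embedding = encode(tree)  # fixed-size vector usable as input to a network
print(embedding.shape)    # (4096,)
```

The resulting vector plays the role of the KERMIT output that is later concatenated with a transformer representation, as described in the Methods section.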
Methods
- The authors aim to investigate whether KERMIT can be used to create neural network architectures where universal syntactic interpretations are useful: (1) to improve state-of-the-art universal sentence embeddings, especially in computationally light environments, and (2) to syntactically explain decisions.
The rest of the section describes the experimental set-up and the quantitative experimental results of KERMIT, and discusses how KERMITviz can be used to explain inferences made by neural networks over examples.
4.1 Experimental Set-up
This section describes the general experimental set-up of the experiments, the specific configurations adopted in the completely universal and task-specific settings, the computational architecture used, and the datasets.
The general experimental settings are described hereafter.
- As the experiments are text classification tasks, the decoder layer of the KERMIT+Transformer architecture is a fully connected layer with the softmax activation function applied to the concatenation of the KERMIT output and the final [CLS] token representation of the selected transformer model (a minimal sketch of this classification head follows this list).
- The optimizer used to train the whole architecture is AdamW (Loshchilov and Hutter, 2019) with the learning rate set to 3e−5.
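The following is a minimal sketch of the classification head described in the bullets above: the KERMIT encoding is concatenated with the transformer's final [CLS] representation and passed through a fully connected softmax layer, trained with AdamW at a learning rate of 3e−5. Module names, dimensions, and the choice of a HuggingFace backbone are assumptions for illustration, not the authors' exact code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class KermitTransformerClassifier(nn.Module):
    """Sketch of the decoder described above: concatenate the KERMIT tree
    encoding with the transformer's final [CLS] representation and apply a
    fully connected softmax layer. Names and sizes are assumptions."""

    def __init__(self, kermit_dim=4000, num_classes=5,
                 model_name="bert-base-uncased"):
        super().__init__()
        self.transformer = AutoModel.from_pretrained(model_name)
        hidden = self.transformer.config.hidden_size  # 768 for BERT-base
        self.decoder = nn.Linear(kermit_dim + hidden, num_classes)

    def forward(self, kermit_encoding, input_ids, attention_mask):
        out = self.transformer(input_ids=input_ids,
                               attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0, :]               # final [CLS] token
        joint = torch.cat([kermit_encoding, cls], dim=-1)  # concatenation
        # softmax over class scores (in practice one trains on the logits
        # with cross-entropy, which is equivalent)
        return torch.softmax(self.decoder(joint), dim=-1)

# Optimizer as described in the text: AdamW with learning rate 3e-5.
model = KermitTransformerClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
```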
Results
- Results from the completely universal experimental setting suggest that universal syntactic interpretations complement syntax in universal sentence embeddings.
- This conclusion is derived from the following observations on Table 1, which reports the accuracy of the different models on the different datasets.
- Syntactic information in AGNews seems to be irrelevant, as there is only a small difference between BERTBASE at 82.88 (±0.09) and BERTBASE-Reverse at 79.72 (±0.11).
Conclusion
- Universal syntactic interpretations are valuable language interpretations, which have been developed over years of study.
- KERMITviz makes it possible to explain how syntactic information is used in classification decisions within networks combining KERMIT with BERT or XLNet.
- As KERMIT provides a clear description of the syntactic subtrees it uses and makes it possible to visualize how syntactic information is exploited during inference, it opens the possibility of devising models that include explicit syntactic inference rules in the training process.
Tables
- Table 1: Universal setting. Average accuracy and standard deviation on four text classification tasks. Results are averaged over 5 runs; markers in the table indicate a statistically significant difference between two results at the 95% confidence level according to the sign test.
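The caption above refers to significance testing with the sign test at the 95% confidence level. As a hedged illustration only, the snippet below shows one common way such a paired comparison can be computed, namely a sign test over the per-example correctness of two classifiers; whether the authors apply the test per example or per run is an assumption here, and the function names are illustrative.

```python
from scipy.stats import binomtest

def sign_test(preds_a, preds_b, gold, alpha=0.05):
    """Paired sign test restricted to examples where exactly one of the two
    models is correct; returns (significant, p-value). Illustrative only."""
    a_wins = sum(1 for pa, pb, g in zip(preds_a, preds_b, gold)
                 if pa == g and pb != g)
    b_wins = sum(1 for pa, pb, g in zip(preds_a, preds_b, gold)
                 if pb == g and pa != g)
    n = a_wins + b_wins
    if n == 0:
        return False, 1.0
    result = binomtest(a_wins, n, p=0.5, alternative="two-sided")
    return result.pvalue < alpha, result.pvalue

# Toy example with four test items
significant, p = sign_test([1, 0, 1, 1], [1, 1, 0, 0], [1, 0, 1, 1])
print(significant, p)  # False 0.25
```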
Study subjects and analysis
cases: 3
This may be justified as both XLNet and BERTBASE are trained on Wikipedia, so universal sentence embeddings are already adapted to the specific dataset. Thirdly, in the three cases where syntactic information is relevant (Yelp Review, Yelp Polarity and DBPedia), the complete KERMIT+Transformer outperforms the model based only on the related Transformer, and the difference is statistically significant: 53.72(±0.14) vs. 46.26(±0.13) in Yelp Review, 94.51(±0.05) vs. 92.46(±0.09) in DBPedia and 88.99(±0.17) vs. 81.99(±0.15) in Yelp Polarity for XLNet, and 52.02(±0.06) vs. 42.90(±0.05) in Yelp Review, 97.73(±0.16) vs. 97.11(±0.27) in DBPedia and 87.58(±0.17) vs. 79.21(±0.50) in Yelp Polarity for BERTBASE.
References
- Abien Fred Agarap. 2018. Deep Learning using Rectified Linear Units (ReLU). CoRR, abs/1803.0.
- Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):1–46.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, pages 1–15.
- Marco Baroni and Roberto Zamparelli. 2010. Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193, Cambridge, MA. Association for Computational Linguistics.
- Yonatan Belinkov and James Glass. 2019. Analysis Methods in Neural Language Processing: A Survey. Transactions of the Association for Computational Linguistics, 7:49–72.
- Daniel Cer, Yinfei Yang, Sheng yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder for English. EMNLP 2018 - Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Proceedings, pages 169–174.
- David J Chalmers. 1992. Syntactic Transformations on Distributed Representations. In Noel Sharkey, editor, Connectionist Natural Language Processing: Readings from Connection Science, pages 46–55. Springer Netherlands, Dordrecht.
- Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
- Jihun Choi, Kang Min Yoo, and Sang Goo Lee. 2018. Learning to compose task-specific tree structures. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pages 5094–5101.
- Stephen Clark and Stephen Pulman. 2007. Combining Symbolic and Distributional Models of Meaning. In Proceedings of the AAAI Spring Symposium on Quantum Interaction, Stanford, CA, 2007, pages 52– 55.
- Michael Collins and Nigel Duffy. 2002. New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron. In Proceedings of ACL 2002.
- Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings, pages 670–680.
- Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers), 1:2126–2136.
- Nello Cristianini and John Shawe-Taylor. 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.
- Danilo Croce, Daniele Rossini, and Roberto Basili. 2019a. Auditing Deep Learning processes through Kernel-based Explanatory Models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4037–4046, Hong Kong, China. Association for Computational Linguistics.
- Danilo Croce, Daniele Rossini, and Roberto Basili. 2019b. Neural embeddings: Accurate and readable inferences based on semantic kernels. Natural Language Engineering, 25(4):519–541.
- Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics - ACL ’04, pages 423–es, Morristown, NJ, USA. Association for Computational Linguistics.
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. CoRR, abs/1901.0.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.0.
- Allyson Ettinger. 2019. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models.
- Jerry A. Fodor and Zenon W. Pylyshyn. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71.
- Zachary S. L. Foster, Thomas J. Sharpton, and Niklaus J. Grünwald. 2017. Metacoder: An R package for visualization and manipulation of community taxonomic diversity data. PLoS Computational Biology, 13(2).
- Yoav Goldberg. 2019. Assessing BERT’s Syntactic Abilities.
- Christoph Goller and Andreas Kuechler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks - Conference Proceedings, volume 1, pages 347–352. IEEE.
- John Hewitt and Christopher D. Manning. 2019. A Structural Probe for Finding Syntax in Word Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.
- Alon Jacovi, Oren Sar Shalom, and Yoav Goldberg. 2018. Understanding Convolutional Neural Networks for Text Classification. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 56–65, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What Does BERT Learn about the Structure of Language? In Proceedings of the Conference of the Association for Computational Linguistics, pages 3651–3657. Association for Computational Linguistics (ACL).
- W Johnson and J Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math., 26:189–206.
- Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-Thought Vectors. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, page 3294–3302, Cambridge, MA, USA. MIT Press.
- Aran Komatsuzaki. 2019. One Epoch Is All You Need. pages 1–13.
- Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the Dark Secrets of BERT.
- Adhiguna Kuncoro, Lingpeng Kong, Daniel Fried, Dani Yogatama, Laura Rimell, Chris Dyer, and Phil Blunsom. 2020. Syntactic Structure Distillation Pretraining For Bidirectional Encoders.
- Zachary C. Lipton. 2016. The Mythos of Model Interpretability. ICML Workshop on Human Interpretability in Machine Learning, 61(Whi):36–43.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.1.
- Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. 7th International Conference on Learning Representations, ICLR 2019.
- Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics.
- David Mareček and Rudolf Rosa. 2019. Extracting Syntactic Trees from Transformer Encoder Self-Attentions. Pages 347–349.
- Jeff Mitchell and Mirella Lapata. 2008. Vector-based Models of Semantic Composition. In Proceedings of ACL-08: HLT, pages 236–244, Columbus, Ohio. Association for Computational Linguistics.
- Alessandro Moschitti. 2006. Making Tree Kernels practical for Natural Language Learning. In Proceedings of EACL’06. Trento, Italy.
- Tsendsuren Munkhdalai and Hong Yu. 2017. Neural Tree Indexers for Text Understanding. In Proceedings of the conference of the Association for Computational Linguistics, volume 1, pages 11–21. NIH Public Access.
- Daniele Pighin and Alessandro Moschitti. 2010. On Reverse Feature Engineering of Syntactic Tree Kernels. In Conference on Natural Language Learning (CoNLL-2010), Uppsala, Sweden.
- T A Plate. 1995. Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3):623– 641.
- Jordan B. Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46(1-2):77–105.
- Andrea Santilli and Fabio Massimo Zanzotto. 2018. SyntNN at SemEval-2018 Task 2: is Syntax Useful for Emoji Prediction? Embedding Syntactic Trees in Multi Layer Perceptrons. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 477–481, New Orleans, Louisiana. Association for Computational Linguistics (ACL).
- Richard Socher, Cliff Chiung Yu Lin, Andrew Y Ng, and Christopher D Manning. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, pages 129–136.
- Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP 2013 - 2013 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pages 1631–1642. Association for Computational Linguistics.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(56):1929–1958.
- Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5027–5038.
- Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal. 2018. Learning general purpose distributed sentence representations via large scale multitask learning. In 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings.
- Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1556–1566, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS.
- Jesse Vig. 2019. A multiscale visualization of attention in the transformer model. ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of System Demonstrations, pages 37–42.
- Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In EMNLP-CoNLL 2012 - 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Proceedings of the Conference, pages 1201–1211.
- Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not Explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv, abs/1910.0.
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), pages 5754–5764.
- Fabio Massimo Zanzotto. 2019. Viewpoint: Human-in-the-loop Artificial Intelligence. J. Artif. Intell. Res., 64:243–252.
- Fabio Massimo Zanzotto and Lorenzo Dell’Arciprete. 2012. Distributed tree kernels. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, volume 1, pages 193–200.
- Fabio Massimo Zanzotto and Lorenzo Ferrone. 2017. Can we explain natural language inference decisions taken with neural networks? Inference rules in distributed representations. In Proceedings of the International Joint Conference on Neural Networks, volume 2017-May, pages 3680–3687.
- Fabio Massimo Zanzotto, Ioannis Korkontzelos, Francesca Fallucchi, and Suresh Manandhar. 2010. Estimating linear models for compositional distributional semantics. In Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference, volume 2, pages 1263–1271.
- Richong Zhang, Zhiyuan Hu, Hongyu Guo, and Yongyi Mao. 2018. Syntax encoding with application in authorship attribution. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2742–2753. Association for Computational Linguistics.
- Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In C Cortes, N D Lawrence, D D Lee, M Sugiyama, and R Garnett, editors, Advances in Neural Information Processing Systems 28, pages 649–657. Curran Associates, Inc.
- Xingxing Zhang, Liang Lu, and Mirella Lapata. 2016. Top-down tree long short-term memory networks. In 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016 Proceedings of the Conference, pages 310–320.
- Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015. Long short-term memory over recursive structures. In 32nd International Conference on Machine Learning, ICML 2015, volume 2, pages 1604–1612.