Distilling Structured Knowledge for Text-Based Relational Reasoning

EMNLP 2020, pp. 6782–6791 (2020)

We investigate how the structured knowledge of a graph neural network can be distilled into various natural language processing models in order to improve their performance.

Abstract

There is an increasing interest in developing text-based relational reasoning systems, which are capable of systematically reasoning about the relationships between entities mentioned in a text. However, there remains a substantial performance gap between NLP models for relational reasoning and models based on graph neural networks (GNNs)...

Introduction
  • The task of text-based relational reasoning—where an agent must infer and compose relations between entities based on a passage of text—has received increasing attention in natural language processing (NLP) (Andreas, 2019)
  • This task has been especially prominent in the context of systematic generalization in NLP, with synthetic datasets, such as CLUTRR and SCAN, being used to probe the ability of NLP models to reason in a systematic and logical way (Lake and Baroni, 2018; Sinha et al., 2019).
  • CLUTRR includes relational reasoning problems that can be posed in both textual and symbolic form, and preliminary investigations using CLUTRR show that GNN-based models—which leverage the structured symbolic input—achieve higher accuracy and better generalization, and are more robust, than purely text-based systems (Sinha et al., 2019); a toy instance of this dual representation is sketched below
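For concreteness, here is a minimal sketch of what a CLUTRR-style instance looks like in its two equivalent forms; the names, relations, and Python layout are invented for illustration and are not taken from the dataset's actual schema.

```python
# A CLUTRR-style relational reasoning instance in both the textual
# form (consumed by NLP models) and the symbolic graph form
# (consumed by GNNs). Names and relations are invented.
text = (
    "Alice went shopping with her son Bob. "
    "Bob later called his sister Carol."
)

# Symbolic form: (head entity, relation, tail entity) edges.
graph_edges = [
    ("Alice", "son", "Bob"),
    ("Bob", "sister", "Carol"),
]

# Query: the relation between Alice and Carol, which requires
# composing the two given facts (Carol is Alice's daughter).
query = ("Alice", "Carol")
answer = "daughter"
```

A GNN teacher consumes graph_edges directly, while the NLP student must recover the same structure from the text; this asymmetry is what the distillation approach exploits.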
Highlights
  • Perhaps one of the biggest challenges is the persistent gap between the performance that can be achieved using NLP models and the performance of structured models—such as graph neural networks (GNNs)—which perform relational reasoning based on structured or symbolic inputs
  • We describe our approach for structured distillation, which involves improving the performance of an NLP model by distilling structured knowledge from a GNN (Fig. 1)
  • Our key experimental question is whether an NLP model can be improved by distilling structured knowledge from a GNN. We investigate this question using the GNN and NLP models defined in the previous section, and we follow the experimental protocol from Sinha et al. (2019)
  • The structured distillation approach significantly improved the performance of the NLP models in settings where noisy facts were added to the CLUTRR reasoning problems
  • We find that extending two state-of-the-art NLP models using our structured distillation approach significantly improves performance, and that the gains are especially pronounced in settings with noisy facts
  • Despite the improvements we observed, the performance of the NLP models is still substantially below the performance of the GNN teacher used for distillation, highlighting that significant work remains to close the gap between the reasoning performance of text-based and GNN-based models
Objectives
  • The authors' goal is to perform this knowledge distillation (Hinton et al., 2015) only during training, so that the NLP model can achieve higher performance at test time, when only unstructured textual inputs are available.
Methods
  • The authors describe their approach for structured distillation, which involves improving the performance of an NLP model by distilling structured knowledge from a GNN (Fig. 1). The setup pairs a graph encoder (the GNN teacher) with a text encoder (the NLP student); a sketch of the resulting training objective follows this list.
  • The authors experiment with the two top-performing NLP models from Sinha et al. (2019): (1) a variation of an LSTM model with attention (Bahdanau et al., 2015) and (2) an adapted version of the MAC architecture (Hudson and Manning, 2018).
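To make the distillation objective concrete, here is a minimal PyTorch sketch of soft-target knowledge distillation (Hinton et al., 2015) from a GNN teacher into a text student, assuming both models output logits over the same set of candidate relations; the temperature and mixing weight are illustrative assumptions rather than the paper's reported values.

```python
import torch.nn.functional as F

def structured_distillation_loss(student_logits, teacher_logits, labels,
                                 temperature=2.0, alpha=0.5):
    """Supervised cross-entropy mixed with a soft KD term.

    student_logits: output of the text-based model (LSTM or MAC).
    teacher_logits: output of the GNN run on the symbolic graph.
    temperature and alpha are illustrative, not the paper's values.
    """
    # Hard loss on the ground-truth relation labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft loss: KL divergence between the softened teacher and
    # student distributions, scaled by T^2 as in Hinton et al. (2015).
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd
```

Because the teacher's logits enter only the loss, the GNN can be dropped at test time, matching the objective of using unstructured text alone during inference.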
Results
  • The authors find that extending two state-of-the-art NLP models using the structured distillation approach significantly improves performance, and that the gains are especially pronounced in settings with noisy facts.
  • The authors see that structured distillation consistently and substantially improves the performance of both NLP models, providing an average 13.6% relative improvement in accuracy
Conclusion
  • The authors' structured distillation approach achieves promising results.
  • The structured distillation approach significantly improved the performance of the NLP models in settings where noisy facts were added to the CLUTRR reasoning problems.
  • Despite the improvements the authors observed, the performance of the NLP models is still substantially below the performance of the GNN teacher used for distillation, highlighting that significant work remains to close the gap between the reasoning performance of text-based and GNN-based models
Tables
  • Table1: Accuracy on test sets with different distractors. All results are averaged over 5 runs with different random seeds. The maximum standard deviation is less than 0.05
  • Table2: Accuracy on test sets with relation lengths of 2–10. KD denotes knowledge distillation; CL denotes the MI-based contrastive learning. All results are averaged over 5 runs with different random seeds. The maximum standard deviation is less than 0.05
  • Table3: Accuracy on test sets with different distractors. The distractor types in the training sets are given in the table. We augment the MAC network and the LSTM by incorporating graph knowledge from GNNs, via knowledge distillation (KD) and contrastive learning (CL). All results are averaged over 5 runs with different random seeds. The maximum standard deviation is less than 0.05
  • Table4: Ablation study on different learning objectives. MAC means a MAC network trained with only supervised signals. MAC+KD is a MAC network with knowledge distillation, where we can either use labels together with KD (w/ label) or use only the soft targets produced by the teacher model (w/o label). MAC+KD+CL is a MAC network trained with all three objectives: supervised loss, knowledge distillation loss, and contrastive learning loss (the CL term is sketched after these captions). We also tried a model trained with only the contrastive learning objective; its performance was far worse, so we did not include it in the comparison. A possible reason is that a purely contrastive model is usually trained in two separate stages, in which an encoder is first trained with contrastive learning and a decoder is then trained with labels for the evaluation task; in our setting, however, the encoder and decoder are trained together in an end-to-end manner. All results are averaged over 5 runs with different random seeds. The maximum standard deviation is less than 0.05
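The MI-based contrastive term (CL) referenced in these tables can be illustrated with a minimal InfoNCE-style sketch that pulls the text encoder's representation of a problem toward the graph encoder's representation of the same problem; the exact mutual-information estimator and any projection heads in the paper may differ, so this is an assumption-laden stand-in, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, graph_emb, temperature=0.1):
    """InfoNCE-style loss between paired text/graph representations.

    text_emb, graph_emb: (batch, dim) embeddings of the same batch of
    reasoning problems from the two encoders. Matching pairs act as
    positives; all other in-batch pairs act as negatives. The
    temperature is an illustrative assumption.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    graph_emb = F.normalize(graph_emb, dim=-1)
    # Cosine similarity between every text/graph pair in the batch.
    logits = text_emb @ graph_emb.t() / temperature
    # The i-th text embedding should match the i-th graph embedding.
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return F.cross_entropy(logits, targets)
```

The MAC+KD+CL configuration in Table 4 would then correspond to a weighted sum of the supervised, distillation, and contrastive terms.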
Related work
  • Our work is closely related to recent research on machine reading comprehension (MRC), question answering (QA), and relational reasoning in NLP.

    Prominent examples of large-scale QA benchmarks include datasets such as SQuAD (Rajpurkar et al, 2016) and TriviaQA (Joshi et al, 2017). However, these traditional datasets do not consider the reasoning aspect of MRC and only target extractive QA tasks. Usually, these tasks only require extracting a single fact (or span of text) and do not necessitate complex relational reasoning.

    To address this shortcoming, there has been a surge of work tackling relational reasoning and systematic generalization. Johnson et al. (2017) first proposed the CLEVR dataset, which focuses on the relational reasoning aspect of visual question answering (VQA). Similarly, Sinha et al. (2019) released CLUTRR, which involves both text and graphs. These relational reasoning datasets also share inspiration with multi-hop QA benchmarks such as HotpotQA (Yang et al., 2018). Generally, the key distinction in the multi-hop setting is that an agent must reason about the relationships between multiple entities in order to answer a query.
Funding
  • This research was funded in part by an academic grant from Microsoft Research, as well as a Canada CIFAR AI Chair, held by Prof
References
  • Patricia A. Alexander, Sophie Jablansky, Lauren M. Singer, and Denis Dumas. 2016. Relational reasoning: what we know and why it matters. Policy Insights from the Behavioral and Brain Sciences, 3(1):36–44.
  • Jacob Andreas. 2019. Measuring compositionality in representation learning. In 7th International Conference on Learning Representations (ICLR).
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR).
  • Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and Aaron Courville. 2019. Systematic generalization: what is required and can it be learned? In 7th International Conference on Learning Representations (ICLR).
  • Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
  • Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. 2019. Hierarchical graph network for multi-hop question answering. arXiv preprint arXiv:1911.03631.
  • William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, September 2017.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2019. Learning deep representations by mutual information estimation and maximization. In 7th International Conference on Learning Representations (ICLR).
  • Drew A. Hudson and Christopher D. Manning. 2018. Compositional attention networks for machine reasoning. In International Conference on Learning Representations (ICLR).
  • Justin Johnson, Li Fei-Fei, Bharath Hariharan, C. Lawrence Zitnick, Laurens van der Maaten, and Ross Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1988–1997.
  • Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada. Association for Computational Linguistics.
  • Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR).
  • Brenden M. Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In 35th International Conference on Machine Learning (ICML).
  • Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1506–1515, Copenhagen, Denmark. Association for Computational Linguistics.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. 2019. CLUTRR: A diagnostic benchmark for inductive reasoning from text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4505–4514. Association for Computational Linguistics.
  • Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
  • Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2020. Contrastive representation distillation. In 8th International Conference on Learning Representations (ICLR).
  • Keyulu Xu, Stefanie Jegelka, Weihua Hu, and Jure Leskovec. 2019. How powerful are graph neural networks? In 7th International Conference on Learning Representations (ICLR).
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
Experimental setup
  • For all experiments, we train the model for 50 epochs with a batch size of 100. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001.
  • In the encoder, we use 100-dimensional word embeddings and train them from scratch for all NLP models. For LSTM-based models, we use a 2-layer bidirectional LSTM with 100 hidden units. For the MAC network, we use 6 MAC cells (6 reasoning steps) and 0.2 dropout (Srivastava et al., 2014) on all updates in the three units to avoid overfitting. We use a two-layer MLP with 100 hidden units as the score function for all attention modules. For the GIN model, we use 2 GIN layers with 100-dimensional node embeddings and 20-dimensional edge embeddings. All node and edge embeddings are uniformly initialized; these hyperparameters are collected into a configuration sketch below.
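For reference, the hyperparameters above can be gathered into a single configuration object; this is a convenience sketch, and the class and field names are invented rather than taken from the paper's released code.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Values taken from the setup described above; the field names
    # are illustrative, not the paper's own identifiers.
    epochs: int = 50
    batch_size: int = 100
    learning_rate: float = 1e-3     # Adam (Kingma and Ba, 2015)
    word_embedding_dim: int = 100   # trained from scratch
    lstm_layers: int = 2            # bidirectional LSTM
    lstm_hidden_units: int = 100
    mac_cells: int = 6              # 6 reasoning steps
    dropout: float = 0.2            # Srivastava et al. (2014)
    gin_layers: int = 2
    gin_node_dim: int = 100
    gin_edge_dim: int = 20
```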
Discussion
  • NLP models still cannot match the strong generalization ability of GNNs, regardless of task difficulty. The improvement in reasoning ability, measured by accuracy, is most significant when the training set and test set have the same reasoning length. This is not surprising, as generalization is a known issue in modern NLP models and an ongoing research topic (Bahdanau et al., 2019; Andreas, 2019). However, generalization is complementary to our contribution, which is to improve the reasoning ability of NLP models. We refer readers to Bahdanau et al. (2019) and Andreas (2019) for a comprehensive overview of current progress on generalization in NLP models.
Authors
Jin Dong
Marc-Antoine Rondeau