QuASE: Question-Answer Driven Sentence Encoding

ACL 2020, pp. 8743–8758.


Abstract:

Question-answering (QA) data often encodes essential information in many facets. This paper studies a natural question: can we get supervision from QA data for other tasks (typically, non-QA ones)? For example, can we use QAMR (Michael et al., 2017) to improve…
Introduction
  • It is labor-intensive to acquire human annotations for NLP tasks that require research expertise.
  • For example, one needs to know thousands of semantic frames in order to provide semantic role labeling (SRL) annotations (Palmer et al., 2010).
  • It is therefore an important research direction to investigate how to get supervision signals from indirect data and improve a target task.
  • QA-SRL (He et al., 2015), which uses QA pairs to represent the predicate-argument structures of SRL, should intuitively be helpful for SRL parsing, but the significant difference in surface forms prevents the two tasks from sharing a single model.
Highlights
  • It is labor-intensive to acquire human annotations for NLP tasks that require research expertise
  • We argue that, for single-sentence target tasks, question-answer driven sentence encoding (QUASE) should restrict the interaction between the two sentence inputs when it is further pre-trained on QA data
  • We propose a new neural structure for this and name the resulting implementation s-QUASE, where "s" stands for "single"; in contrast, we name the straightforward implementation mentioned above p-QUASE, for "paired." Results show that s-QUASE significantly outperforms p-QUASE on three single-sentence tasks, namely semantic role labeling (SRL), named entity recognition (NER), and semantic dependency parsing (SDP), indicating the importance of this distinction
  • We investigate an important problem in NLP: can we make use of low-cost signals, such as QA data, to help related tasks? We retrieve signals from sentence-level QA pairs to help NLP tasks via two types of sentence encoding approaches
  • For tasks with a single-sentence input, such as SRL and NER, we propose s-QUASE, which provides latent sentence-level representations; for tasks with a sentence-pair input, such as textual entailment (TE) and machine reading comprehension (MRC), we propose p-QUASE, which generates latent representations that allow interaction between the two sentence inputs (see the sketch after this list)
  • Experiments on a wide range of tasks show that the distinction between s-QUASE and p-QUASE is highly effective, and QUASE-QAMR has the potential to improve many tasks, especially in the low-resource setting
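To make the distinction concrete, here is a minimal PyTorch sketch with hypothetical module and parameter names; it is not the authors' exact architecture, only an illustration under those assumptions. p-QUASE packs the question and sentence into one sequence so the two inputs attend to each other at every layer, while s-QUASE encodes the sentence alone and lets the question interact with that fixed encoding only through a late attention layer, which is what keeps the sentence representation reusable for single-sentence tasks.

```python
import torch
import torch.nn as nn

def make_encoder(vocab_size=30522, dim=256, n_layers=2):
    # Toy stand-in for a BERT-like encoder: embeddings + transformer layers.
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    return nn.Sequential(nn.Embedding(vocab_size, dim),
                         nn.TransformerEncoder(layer, n_layers))

class PQuASE(nn.Module):
    """p-QUASE (sketch): question and sentence form one packed sequence,
    so the two inputs attend to each other at every layer."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, question_ids, sentence_ids):
        pair = torch.cat([question_ids, sentence_ids], dim=1)
        return self.encoder(pair)

class SQuASE(nn.Module):
    """s-QUASE (sketch): the sentence is encoded on its own; the question
    touches the sentence only through a late attention layer, so the
    sentence encoding stays question-independent."""
    def __init__(self, encoder, dim=256):
        super().__init__()
        self.encoder = encoder
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def encode_sentence(self, sentence_ids):
        # The representation exported to single-sentence tasks (SRL, NER, SDP).
        return self.encoder(sentence_ids)

    def forward(self, question_ids, sentence_ids):
        sent = self.encode_sentence(sentence_ids)
        ques = self.encoder(question_ids)
        # Restricted interaction: the question attends to the fixed encoding.
        out, _ = self.attn(ques, sent, sent)
        return out

# Toy usage: one 8-token question and one 16-token sentence.
enc = make_encoder()
q = torch.randint(0, 30522, (1, 8))
s = torch.randint(0, 30522, (1, 16))
print(SQuASE(enc)(q, s).shape)  # torch.Size([1, 8, 256])
print(PQuASE(enc)(q, s).shape)  # torch.Size([1, 24, 256])
```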
Conclusion
  • The authors discuss a few issues pertaining to improving QUASE with additional QA datasets, and compare QUASE with related symbolic representations.

    4.1 Further Pre-training QUASE on Multiple QA Datasets

    The authors investigate whether adding the Large QA-SRL dataset (FitzGerald et al., 2018) or the QA-RE dataset into QAMR in the further pre-training stage can help SRL and RE.
  • It is noteworthy that QA-RE can help SRL (and, likewise, Large QA-SRL can help RE), though the improvement is minor compared to that from Large QA-SRL (respectively, QA-RE)
  • These results indicate that adding more QA signals related to the sentence generally helps produce a better sentence representation (a sketch of this dataset pooling follows this list). In this paper, the authors investigate an important problem in NLP: can we make use of low-cost signals, such as QA data, to help related tasks?
  • Experiments on a wide range of tasks show that the distinction between s-QUASE and p-QUASE is highly effective, and QUASE-QAMR has the potential to improve many tasks, especially in the low-resource setting
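As a concrete illustration of the pooling step, here is a minimal sketch; Table 5 describes the combination as a union of datasets with shuffling. The file names and JSON-lines layout below are assumptions for illustration, not the authors' release format.

```python
import json
import random

def load_qa_pairs(path):
    # Assumed JSON-lines layout, one QA pair per line:
    # {"sentence": ..., "question": ..., "answer": ...}
    with open(path) as f:
        return [json.loads(line) for line in f]

def union_with_shuffling(paths, seed=42):
    pool = []
    for p in paths:
        pool.extend(load_qa_pairs(p))
    # Shuffle so mini-batches interleave examples from all datasets.
    random.Random(seed).shuffle(pool)
    return pool

# e.g., QAMR + Large QA-SRL (hypothetical file names)
train_pool = union_with_shuffling(["qamr.jsonl", "large_qa_srl.jsonl"])
```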
Tables
  • Table 1: The naive way of training BERT on QAMR (BERT-QAMR) negatively impacts single-sentence tasks. We use only 10% of the training data for simplicity. We use BERT/BERT-QAMR to produce feature vectors for a BiLSTM model (SRL) and a CNN model (RE); for TE and MRC, we fine-tune BERT/BERT-QAMR
  • Table 2: Probing results of the sentence encoders from s-QUASE and p-QUASE. In all tasks, we fix QUASE and use its sentence encodings as input feature vectors for the model of each task (see the feature-extraction sketch after this list). To keep the model structure as simple as possible, we use a BiLSTM for SRL, NER, and TE, Biaffine for SDP, and BiDAF for MRC. We compare on 10% and 100% of the data in all tasks except TE, where we use 30% to save run-time
  • Table 3: Further pre-training QUASE on different QA datasets with the same number of QA pairs (51K). As we propose, s-QUASE is used as features for single-sentence tasks, and p-QUASE is further fine-tuned for the paired-sentence task. The specific models are all strong baselines except for SRL, where we use a simple BiLSTM model to save run-time. "Small" means 10% of the training examples for all tasks except NER, where "small" means the dev set (about 23% of the corresponding training set). We further show the results of QUASE with the best QA dataset, which are significantly better than those of BERT
  • Table 4: QUASE-QAMR (almost) universally improves on 5 single-sentence tasks and 2 paired-sentence tasks. Note that BERT is close to the state of the art for these tasks. Both absolute improvement (abs. imp.) and relative improvement (rel. imp.; error reduction rate) are reported. "Small/Full" refers to the size of the training data for each target task. For SDP, RE, TE, and MRC, "small" means 10% of the training set, while for NER, SRL, and Coref, "small" means the development set (about 10%-30% of each training set)
  • Table 5: The potential of further improving QUASE-QAMR by further pre-training it on more QA data. The "+" between datasets means union with shuffling. Both Large QA-SRL and QA-RE help achieve better results than QAMR alone. For simplicity, we use a simple BiLSTM model for SRL and a simple CNN model for RE. See more in Appendix B
  • Table 6: The results of five variants of s-QUASE-QAMR on the development set of QAMR. We use average exact match (EM) and average F1 as our evaluation metrics
  • Table 7: Comparison between s-QUASE-QAMR and other state-of-the-art embeddings. We use the same experimental settings as Section 3.4 for the three single-sentence tasks: SRL, Coref, and NER. We use ELMo embeddings for SRL and Coref, and Flair embeddings for NER, as our baselines
  • Table 8: Some examples of question-answer pairs in the QA-SRL and QAMR datasets. The first two examples are from the QA-SRL dataset, with predicates bolded. The last two examples are from the QAMR dataset. We show two phenomena that are not modeled by traditional symbolic representations of predicate-argument structure (e.g., SRL and AMR): inferred relations (INF) and implicit arguments (IMP)
  • Table9: Results of learning an SRL parser from question-answer pairs
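To make the probing setup of Table 2 concrete, the sketch below (PyTorch, hypothetical names, reusing the SQuASE sketch from the Highlights section) freezes the s-QUASE encoder and trains only a small task model on its fixed sentence encodings:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Toy BiLSTM tagger (as in the SRL/NER probes) over fixed features."""
    def __init__(self, in_dim, n_tags, hidden=200):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, feats):
        h, _ = self.lstm(feats)
        return self.out(h)  # per-token tag scores

def extract_features(s_quase, sentence_ids):
    # The encoder stays fixed: no gradients flow into s-QUASE.
    s_quase.eval()
    with torch.no_grad():
        return s_quase.encode_sentence(sentence_ids)

# Training loop (sketch): only the probe's parameters are optimized.
# feats = extract_features(s_quase_model, batch_ids)
# loss = criterion(tagger(feats).transpose(1, 2), gold_tags)
```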
Related work
  • Related Work on Sentence Encoding

    Modern LMs are essentially sentence encoders pre-trained on unlabeled data, and they outperform earlier sentence encoders such as skip-thoughts (Kiros et al., 2015). While an LM like BERT handles lexical and syntactic variations quite well, it still needs to learn from some annotations to acquire the "definition" of many tasks, especially those requiring complex semantics (Tenney et al., 2019). Although we use BERT extensively here, we think the specific choice of LM is orthogonal to our proposal of learning from QA data. Stronger LMs, e.g., RoBERTa (Liu et al., 2019) or XLNet (Yang et al., 2019), may only strengthen the proposal: a stronger LM represents unlabeled data better, while the proposed work is about representing labeled data better.

    CoVe (McCann et al., 2017) is another attempt to learn from indirect data, specifically translation data. However, it does not outperform ELMo or BERT in many NLP tasks (Peters et al., 2018) or in probing analyses (Tenney et al., 2019). In contrast, our QUASE shows stronger experimental results than BERT on multiple tasks. In addition, we think QA data is generally cheaper to collect than translation data.
Funding
  • This material is based upon work supported by the US Defense Advanced Research Projects Agency (DARPA) under contracts FA8750-19-2-0201, W911NF-15-1-0461, and FA8750-19-2-1004, a grant from the Army Research Office (ARO), and Google Cloud.
References
  • Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In COLING, pages 1638–1649.
  • M. Chang, L. Ratinov, and D. Roth. 2007. Guiding semi-supervision with constraint-driven learning. In ACL, pages 280–287.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186.
  • Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing. In ACL, pages 484–490.
  • Nicholas FitzGerald, Julian Michael, Luheng He, and Luke Zettlemoyer. 2018. Large-scale QA-SRL parsing. In ACL, pages 2051–2060.
  • Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform.
  • Sahil Garg, Aram Galstyan, Greg Ver Steeg, Irina Rish, Guillermo Cecchi, and Shuyang Gao. 2019. Kernelized hashcode representations for relation extraction. In AAAI, volume 33, pages 6431–6440.
  • Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In ACL, pages 473–483.
  • Luheng He, Mike Lewis, and Luke Zettlemoyer. 2015. Question-answer driven semantic role labeling: Using natural language to annotate natural language. In EMNLP, pages 643–653.
  • Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. SemEval-2010 Task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 94–99.
  • Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, pages 1601–1611.
  • Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2018. Question answering as global reasoning over semantic abstractions. In AAAI.
  • Paul Kingsbury and Martha Palmer. 2002. From TreeBank to PropBank. In LREC, pages 1989–1993.
  • Ryan Kiros, Yukun Zhu, Russ R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In NeurIPS, pages 3294–3302.
  • Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In EMNLP, pages 188–197.
  • Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In CoNLL, pages 333–342.
  • Elisabeth Lien and Milen Kouylekov. 2015. Semantic parsing for textual entailment. In Proceedings of the 14th International Conference on Parsing Technologies, pages 40–49.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In EMNLP, pages 1506–1515.
  • Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NeurIPS, pages 6294–6305.
  • Julian Michael, Gabriel Stanovsky, Luheng He, Ido Dagan, and Luke Zettlemoyer. 2017. Crowdsourcing question-answer meaning representations. In NAACL.
  • Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajič, and Zdeňka Urešová. 2015. SemEval 2015 Task 18: Broad-coverage semantic dependency parsing. In SemEval 2015, pages 915–926.
  • M. Palmer, D. Gildea, and N. Xue. 2010. Semantic Role Labeling, volume 3.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL, pages 2227–2237.
  • Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088.
  • Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In CoNLL, pages 143–152.
  • Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task, pages 1–40.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383–2392.
  • Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In EMNLP-IJCNLP, pages 3973–3983.
  • Drew Reisinger, Rachel Rudinger, Francis Ferraro, Craig Harman, Kyle Rawlins, and Benjamin Van Durme. 2015. Semantic proto-roles. Transactions of the Association for Computational Linguistics, 3:475–488.
  • Dan Roth. 2017. Incidental supervision: Moving beyond supervised learning. In AAAI.
  • Mrinmaya Sachan and Eric Xing. 2016. Machine comprehension using rich semantic representations. In ACL, pages 486–492.
  • Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.
  • Li Song, Yuan Wen, Sijia Ge, Bin Li, Junsheng Zhou, Weiguang Qu, and Nianwen Xue. 2018. An easier and efficient framework to annotate semantic roles: Evidence from the Chinese AMR corpus. In The 13th Workshop on Asian Language Resources, page 29.
  • Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In EMNLP, pages 5027–5038.
  • Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In NAACL, pages 2633–2643.
  • Alon Talmor and Jonathan Berant. 2019. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In ACL, pages 4911–4921.
  • Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, et al. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In ICLR.
  • Naftali Tishby, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proc. of the Annual Allerton Conference on Communication, Control and Computing.
  • Naftali Tishby and Noga Zaslavsky. 2015. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop (ITW).
  • Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In NAACL.
  • Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS, pages 5998–6008.
  • Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, et al. 2019. Can you tell me how to get past Sesame Street? Sentence-level pretraining beyond language modeling. In ACL, pages 4465–4476.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP, pages 353–355.
  • Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal decompositional semantics on universal dependencies. In EMNLP.
  • Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL, pages 1112–1122.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In ACL, pages 207–212.
Training Details
  • Our QUASE is based on the PyTorch re-implementation of BERT (Wolf et al., 2019). Although we occasionally adjust them to fit GPU memory, the common hyperparameters for further pre-training s-QUASE and p-QUASE are as follows.

    Further pre-training p-QUASE. For sentence-level QA datasets (QAMR, Large QA-SRL, and QA-RE), we further pre-train BERT for 4 epochs with a learning rate of 5e-5, a batch size of 32, and a maximum sequence length of 128. For paragraph-level QA datasets (SQuAD, TriviaQA, and NewsQA), we further pre-train BERT for 4 epochs with a learning rate of 5e-5, a batch size of 16, and a maximum sequence length of 384.
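As an illustration of this configuration, here is a minimal sketch using HuggingFace's current transformers API; the paper predates this exact interface, and the QAMR-to-span-prediction preprocessing (train_dataset, output path) is assumed rather than shown.

```python
from transformers import (
    BertForQuestionAnswering,
    BertTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

# Sentence-level QA (QAMR, Large QA-SRL, QA-RE): max_length=128, batch size 32.
# Paragraph-level QA (SQuAD, TriviaQA, NewsQA): max_length=384, batch size 16.
args = TrainingArguments(
    output_dir="p-quase-qamr",       # hypothetical output path
    num_train_epochs=4,              # 4 epochs, as stated above
    learning_rate=5e-5,              # learning rate 5e-5, as stated above
    per_device_train_batch_size=32,  # batch size for sentence-level QA
)

# Each QA pair would be tokenized along the lines of:
#   tokenizer(question, sentence, truncation=True, max_length=128)
# train_dataset (QA pairs in span-prediction format) is assumed here.
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```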