Span Selection Pre-training for Question Answering

Michael Glass
Alfio Gliozzo
Rishav Chakravarti
Anthony Ferritto
G P Shrivatsa Bhargav
Dinesh Garg

ACL, pp. 2773-2782, 2020.


Abstract:

BERT (Bidirectional Encoder Representations from Transformers) and related pre-trained Transformers have provided large gains across many language understanding tasks, achieving a new state-of-the-art (SOTA). BERT is pre-trained on two auxiliary tasks: Masked Language Model and Next Sentence Prediction. In this paper we introduce a new pre-training task inspired by reading comprehension and an effort to avoid encoding general knowledge in the transformer network itself.

Introduction
  • State-of-the-art approaches for NLP tasks are based on language models that are pre-trained on tasks which do not require labeled data [Peters et al, 2018, Howard and Ruder, 2018, Devlin et al, 2018, Yang et al, 2019, Liu et al, 2019, Sun et al, 2019].
  • Fine tuning language models to downstream tasks, such as question answering or sentence paraphrasing, has been shown to be a general and effective strategy.
  • The general BERT adaptation approach is to alter the model used for pre-training while retaining the transformer encoder layers.
  • The two text segments are packed into a single sequence [CLS], x1, ..., xM, [SEP], y1, ..., yN, [SEP] such that M + N < S, where S is the maximum sequence length allowed during training (see the input-packing sketch after this list).
  • The model is first pre-trained on a large amount of unlabeled data and then fine-tuned on downstream tasks that have labeled data.
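    As an illustration of this input encoding, the following is a minimal sketch in Python, assuming the question and passage have already been wordpiece-tokenized; the function and variable names and the value of S are illustrative, not taken from the authors' code.

    MAX_SEQ_LEN = 384  # S: maximum sequence length allowed during training (assumed value)

    def pack_pair(question_tokens, passage_tokens, max_seq_len=MAX_SEQ_LEN):
        """Build [CLS] x1..xM [SEP] y1..yN [SEP], truncating so the result fits in max_seq_len."""
        # Reserve positions for [CLS] and the two [SEP] markers.
        budget = max_seq_len - 3 - len(question_tokens)
        passage_tokens = passage_tokens[:budget]  # truncate the second segment only
        tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + passage_tokens + ["[SEP]"]
        # Segment ids distinguish the two text segments, as in BERT.
        segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(passage_tokens) + 1)
        return tokens, segment_ids

    tokens, segment_ids = pack_pair(
        "which ocean borders portugal ?".split(),
        "portugal is a country on the atlantic ocean in southern europe .".split(),
    )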
Highlights
  • State-of-the-art approaches for NLP tasks are based on language models that are pre-trained on tasks which do not require labeled data [Peters et al, 2018, Howard and Ruder, 2018, Devlin et al, 2018, Yang et al, 2019, Liu et al, 2019, Sun et al, 2019]
  • We provide an extensive evaluation of the span selection pre-training method across six tasks. The first four are reading comprehension based: the Stanford Question Answering Dataset (SQuAD) in both versions 1.1 and 2.0, the Google Natural Questions dataset [Kwiatkowski et al, 2019], and a multi-hop question answering dataset, HotpotQA [Yang et al, 2018]; the remaining two are the paraphrasing tasks MRPC and QQP.
  • We find the most substantial gains of almost 4 F1 points for answer selection, the QA task most similar to span selection pre-training
  • Span selection pre-training is effective in improving reading comprehension across four diverse datasets, including both generated and natural questions, and with provided contexts of passages, documents and even passage sets.
  • The span selection task is suitable for pre-training on any domain, since it makes no assumptions about document structure or availability of summary/article pairs (a sketch of how such instances can be constructed follows this list).
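    Below is a minimal sketch, in Python, of how a span selection pre-training instance can be constructed from raw text alone: an answer term is blanked out of a query sentence and a different, answer-bearing passage is retrieved for the model to select the span from. The paper uses an information retrieval step (BM25) to find the passage; the crude term-overlap scorer and the toy corpus here are simplified stand-ins, and all names are illustrative.

    def relevance(query_terms, passage):
        """Crude stand-in for BM25: count how many query terms the passage contains."""
        passage_terms = set(passage.lower().split())
        return sum(term in passage_terms for term in query_terms)

    def make_instance(sentence, answer, corpus):
        """Return a (query, passage, answer) triple, or None if no answer-bearing passage exists."""
        query = sentence.replace(answer, "[BLANK]")
        query_terms = set(query.lower().split()) - {"[blank]"}
        # Candidate passages must contain the answer term and must not be the query sentence itself.
        candidates = [p for p in corpus if answer.lower() in p.lower() and p != sentence]
        if not candidates:
            return None
        passage = max(candidates, key=lambda p: relevance(query_terms, p))
        return query, passage, answer

    corpus = [
        "Gustave Eiffel's company designed and built the Eiffel Tower.",
        "Construction of the tower began in 1887 and finished in 1889.",
        "The tower is 324 metres tall.",
    ]
    instance = make_instance(
        "The Eiffel Tower was completed in 1889 for the World's Fair.", "1889", corpus
    )
    # instance[0] == "The Eiffel Tower was completed in [BLANK] for the World's Fair."
    # instance[1] == "Construction of the tower began in 1887 and finished in 1889."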
Methods
  • The authors measure how much of the improvement is due to this final layer pre-training versus the extended pre-training for the transformer encoder layers by discarding the pre-trained pointer network and randomly re-initializing it.
  • This configuration is indicated as BERTBASE+SSPT-PN.
  • The pre-training of the pointer network is not a significant factor in the improved performance on reading comprehension, indicating that the improvement instead comes from better language understanding in the transformer encoder (a minimal sketch of this ablation follows this list).
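    The ablation can be pictured with the following sketch in Python (PyTorch and the Hugging Face transformers library). The single linear start/end layer is the standard extractive-QA head and only stands in for the authors' pointer network; the checkpoint path is hypothetical.

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class SpanSelector(nn.Module):
        def __init__(self, encoder_name="bert-base-uncased"):
            super().__init__()
            self.encoder = BertModel.from_pretrained(encoder_name)
            # Freshly (randomly) initialized head: start and end logits per token.
            self.span_head = nn.Linear(self.encoder.config.hidden_size, 2)

        def forward(self, input_ids, attention_mask, token_type_ids):
            hidden = self.encoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
            ).last_hidden_state                      # (batch, seq_len, hidden)
            start_logits, end_logits = self.span_head(hidden).split(1, dim=-1)
            return start_logits.squeeze(-1), end_logits.squeeze(-1)

    # Loading SSPT-extended encoder weights while leaving span_head at its random
    # initialization corresponds to the BERTBASE+SSPT-PN configuration:
    # model = SpanSelector()
    # model.encoder.load_state_dict(torch.load("sspt_encoder.pt"))  # hypothetical checkpoint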
Results
  • Unlike the efforts of XLNet or RoBERTa, which increased training by a factor of ten relative to BERT, the additional data in SSPT represents less than a 40% increase in the pre-training of the transformer.
Conclusion
  • Conclusion and Future Work: Span selection pre-training is effective in improving reading comprehension across four diverse datasets, including both generated and natural questions, and with provided contexts of passages, documents and even passage sets.
  • The span selection task is suitable for pre-training on any domain, since it makes no assumptions about document structure or availability of summary/article pairs.
  • This allows pre-training of language understanding models in a very generalizable way.
  • The authors hope to progress to a model of general-purpose language modeling that uses an indexed long-term memory to retrieve world knowledge, rather than holding it in the densely activated transformer encoder layers.
Tables
  • Table 1: Cloze instances of different types
  • Table 2: Span Selection instances of different types
  • Table 3: Comparison of QA Datasets
  • Table 4: Results on SQuAD
  • Table 5: Dev Set Results on Natural Questions
  • Table 6: Results on HotpotQA
  • Table 7: Results on MRPC and QQP
Related work
  • We review related work in three categories: other efforts to use automatically constructed tasks similar to extractive QA, research towards adding new pre-training tasks, and work that extends the pre-training with more data.

    Previous work has explored tasks similar to span selection pre-training. These are typically cast as approaches to augment the training data for question answering systems, rather than alleviating the need for encoding world knowledge in a language model or general pre-training for language understanding.

    Hermann et al. [2015] introduce a reading comprehension task constructed automatically from news articles with summaries. In this setting the constructed dataset is used both for training and testing. Also, entities were replaced with anonymized markers to limit the influence of world knowledge. Unlike our span selection pre-training task, this requires summaries paired with articles and focuses only on entities. A similar approach was taken by Dhingra et al. [2018] to augment training data for question answering: Wikipedia articles were divided into introduction and body, with sentences from the introduction used to construct queries for the body passage. Phrases and entities are used as possible answer terms.
Contributions
  • Introduces a new pre-training task inspired by reading comprehension and an effort to avoid encoding general knowledge in the transformer network itself
  • Finds significant and consistent improvements over both BERTBASE and BERTLARGE on multiple reading comprehension and paraphrasing datasets
  • Provides a relevant and answer-bearing passage to play the role of an instance specific information retrieval system
  • Provides an extensive evaluation of the span selection pre-training method across six tasks: the first four are reading comprehension based: the Stanford Question Answering Dataset in both version 1.1 and 2.0; followed by the Google Natural Questions dataset and a multi-hop Question Answering dataset, HotpotQA
  • Reports consistent improvements over both BERTBASE and BERTLARGE models in the reading comprehension benchmarks and some positive signal in paraphrasing
Reference
  • Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. Synthetic QA corpora generation with roundtrip consistency. CoRR, abs/1906.05416, 2019a. URL http://arxiv.org/abs/1906.05416.
  • Chris Alberti, Kenton Lee, and Michael Collins. A bert baseline for the natural questions, 2019b. URL https://arxiv.org/abs/1901.08634v2.
  • Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. URL https://arxiv.org/pdf/1810.04805.pdf.
  • Bhuwan Dhingra, Danish Danish, and Dheeraj Rajagopal. Simple and effective semi-supervised question answering. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), volume 2, pages 582–587, 2018. URL https://aclweb.org/anthology/N18-2092.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend.pdf.
  • Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/P18-1031.
  • Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. Spanbert: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529, 2019.
  • Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question answering research. TACL, 2019. URL https://tomkwiat.users.x20web.corp.google.com/papers/natural-questions/main-1455-kwiatkowski.pdf.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019. URL http://arxiv.org/abs/1907.11692.
  • Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. Who did what: A large-scale person-centered cloze dataset. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2230–2235, 2016. URL https://aclweb.org/anthology/D16-1241.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1202. URL http://aclweb.org/anthology/N18-1202.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Preprint, 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016. URL https://aclweb.org/anthology/D16-1264.
  • Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822, 2018.
  • Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: Bm25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, April 2009. ISSN 1554-0669. doi: 10.1561/1500000019. URL http://dx.doi.org/10.1561/1500000019.
  • Mrinmaya Sachan and Eric Xing. Self-training for jointly learning to ask and answer questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 629–640. Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1058. URL http://aclweb.org/anthology/N18-1058.
  • Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538, 2017. URL http://arxiv.org/abs/1701.06538.
  • Leslie N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay, 2018.
  • Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. Ernie 2.0: A continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412, 2019.
  • Wilson L Taylor. “Cloze procedure”: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433, 1953.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355. Association for Computational Linguistics, 2018. URL https://aclweb.org/anthology/W18-5446.
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. Xlnet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237, 2019. URL http://arxiv.org/abs/1906.08237.
  • Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27, 2015.