End to End Synthetic Data Generation for Domain Adaptation of Question Answering Systems

EMNLP 2020, pp. 5445–5460


Abstract

We propose an end-to-end approach for synthetic QA data generation. Our model comprises a single transformer-based encoder-decoder network that is trained end-to-end to generate both answers and questions. In a nutshell, we feed a passage to the encoder and ask the decoder to generate a question and an answer token-by-token. The likelihood...
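The generation interface described in the abstract can be pictured with a short sketch. This is an illustration only: the checkpoint (facebook/bart-base), the decoding settings, and the question/answer output format are assumptions, not the authors' released configuration.

```python
# Minimal sketch: a single encoder-decoder reads a passage and decodes a
# question-answer pair as one token sequence. Checkpoint and decoding values
# below are illustrative assumptions, not the paper's released setup.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/bart-base"  # hypothetical stand-in for the paper's pretrained encoder-decoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

passage = (
    "The Normans were the people who in the 10th and 11th centuries "
    "gave their name to Normandy, a region in France."
)
inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)

# Sample several candidate sequences; after fine-tuning, each decoded string is
# expected to contain a question followed by an answer (the exact separator
# format is an assumption here).
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=20,
    top_p=0.95,
    max_length=64,
    num_return_sequences=5,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```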

Introduction
  • Improving question answering (QA) systems through automatically generated synthetic data is a long-standing research goal (Mitkov and Ha, 2003; Rus et al, 2010).
  • Some recent approaches for synthetic QA data generation based on large pretrained language models (LM) have started to demonstrate success in improving the downstream Reading Comprehension (RC) task with automatically generated data (Alberti et al, 2019; Puri et al, 2020)
  • These approaches typically consist of multi-stage systems that use three modules: span/answer detector, question generator and question filtering.
  • Each module is expensive to compute because all of them use large transformer networks (Vaswani et al, 2017)
Highlights
  • Improving question answering (QA) systems through automatically generated synthetic data is a long-standing research goal (Mitkov and Ha, 2003; Rus et al, 2010)
  • The main contributions of this work can be summarized as follows: (1) we propose the first effective end-to-end approach for synthetic QA data generation; (2) our approach solves an important issue in previous methods for QA data generation: the detection of good spans
  • Each experiment was performed by training the Reading Comprehension (RC) model on the synthetic data generated on the target domain corpus
  • Our QAGen and QAGen2S models outperform by wide margins the baseline models trained on SQuAD 1.1 only, as well as the unsupervised domain adaptation (UDA) approaches suggested by Nishida et al (2019) and Lee et al (2020)
  • Comparing our proposed language model (LM) filtering-based models in Tab. 2, we offer the following explanations: (1) QAGen2S and QAGen outperform AQGen because generating answers conditioned on the question results in better spans, which is crucial in the training of the downstream RC task
  • Our experiments showed that by proper decoding, significant improvements in domain adaptation of RC models can be achieved
Methods
  • Experiments with Large QA Models

    The downstream RC models presented in previous sections were based on fine-tuning BERT-base model, which has 110 million parameters.
  • The authors assess the efficacy of the proposed domain adaptation approach with a higher-capacity transformer as the RC model
  • For these experiments, the authors chose the pretrained RoBERTa-large model (Liu et al, 2019) from the transformers library (Wolf et al, 2019), which has 355 million parameters (a fine-tuning sketch follows this list).
  • Gains of 1/0.5 in EM/F1 are observed on the SQuAD 1.1 dev set
  • These results demonstrate that the proposed end-to-end synthetic data generation approach is capable of achieving substantial gains even on state-of-the-art RC baselines such as RoBERTa-large
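The following is a minimal sketch of swapping in a higher-capacity RC model, assuming the synthetic question-answer pairs have already been converted into an extractive-QA training set; the hyperparameters and the build_rc_trainer helper are illustrative, not the authors' training script.

```python
# Sketch of fine-tuning roberta-large (355M parameters) as the downstream RC
# model on synthetic data. `train_dataset` is assumed to be a tokenized dataset
# with input_ids, attention_mask, start_positions, and end_positions.
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


def build_rc_trainer(train_dataset, output_dir="rc-roberta-large"):
    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    model = AutoModelForQuestionAnswering.from_pretrained("roberta-large")
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=8,  # illustrative hyperparameters
        learning_rate=3e-5,
        num_train_epochs=2,
    )
    return Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)


# trainer = build_rc_trainer(synthetic_qa_dataset)  # hypothetical dataset object
# trainer.train()
```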
Results
  • Each experiment was performed by training the RC model on the synthetic data generated on the target domain corpus.
  • The authors refer to the dataset to which the downstream model is being adapted as the target domain.
  • The authors' QAGen and QAGen2S models outperform by wide margins the baseline models trained on SQuAD 1.1 only, as well as the unsupervised domain adaptation (UDA) approaches suggested by Nishida et al (2019) and Lee et al (2020).
  • QAGen and QAGen2S significantly outperform QGen, the implementation of the three-stage pipeline of Puri et al (2020)
Conclusion
  • The authors presented a novel end-to-end approach to generate question-answer pairs by using a single transformer-based model.
  • The authors concluded that using LM filtering improves the quality of synthetic question-answer pairs, although there is still a gap compared to round-trip filtering on some of the target domains.
  • It would be interesting to explore how one can adapt the generative models to the type of target domain questions
Tables
  • Table1: Samples of generated question-answer pairs using QAGen2S model for four target domains. The generated answers are shown in bold. The paragraphs are truncated from their original sizes due to space limitations
  • Table2: Domain adaptation results for different methods. Bold cells indicate the best performing model on each of the target domain dev sets, excluding supervised target domain training results
  • Table3: Cross domain experiments using QAGen2S as the generative model. Underlined cells indicate best EM/F1 value for each of the target domain dev sets (column-wise) and individual target domain corpus
  • Table4: Beam search vs. Topk+Nucleus sampling with various sample sizes per passage. NQ is used as target domain and QAGen2S with LM filtering is used as generator. For N > 5, top 5 samples per passage were selected according to LM scores
  • Table5: Comparison of using LM filtering versus no filtering. Bold values indicate best performance on each target domain for each model (per rows separated by solid lines)
  • Table6: Source and target domain performance with RoBERTa-large as downstream RC model
  • Table7: Performance on SQuAD 1.1 development set when training with LM-filtered synthetically generated question-answer pairs on IMDB corpus. Bold values indicate best performance per each model (row-wise). Our baseline EM and F1 numbers (on SQuAD 1.1 training set) are 80.78 and 88.20, respectively
  • Table8: Table 8
  • Table9: Comparison of using average versus summation of LM scores when doing LM filtering. Bold values indicate the best performance on each target domain for each model (per rows separated by solid lines)
  • Table10: Samples of generated question-answer pairs using the QAGen2S model from Natural Questions passages with their LM scores. The sum of answer likelihood scores is used to sort the pairs decreasingly. The generated answers are shown in bold. Samples shown are from Beam Search with a beam size of 5, and Topk+Nucleus with a sample size of 10
  • Table11: Samples of generated question-answer pairs from a randomly selected passage from the CNN/Daily Mail corpus. Samples are sorted according to LM scores
  • Table12: Generated samples using the QAGen2S model from a Natural Questions passage consisting of a table. The sum of answer likelihood scores is used to sort the pairs decreasingly
Related work
  • Question generation (QG) has been extensively studied, from the early heuristic-based methods (Mitkov and Ha, 2003; Rus et al, 2010) to the recent neural-based approaches. However, most work (Du et al, 2017; Sun et al, 2018; Zhao et al, 2018; Kumar et al, 2019; Wang et al, 2020; Ma et al, 2020; Tuan et al, 2019; Chen et al, 2020) only takes QG as a stand-alone task, and evaluates the quality of generated questions with either automatic metrics such as BLEU, or human evaluation. Tang et al (2017), Duan et al (2017) and Sachan and Xing (2018) verified that generated questions can improve downstream answer sentence selection tasks. Song et al (2018) and Klein and Nabi (2019) leveraged QG to augment the training set for machine reading comprehension tasks. However, they only obtained improvements when only a small amount of human-labeled data is available. Recently, with the help of large pre-trained language models, Alberti et al (2019) and Puri et al (2020) have been able to improve the performance of RC models using generated questions. However, they need two extra BERT models to identify high-quality answer spans and to filter out low-quality question-answer pairs. Lee et al (2020) follow a similar approach while using InfoMax Hierarchical Conditional VAEs. Nishida et al (2019) showed improvements by fine-tuning the language model on the target domains.
Study subjects and analysis

datasets: 4
We used the default train and dev splits of SQuAD 1.1, which contain 87,599 and 10,570 (q, a) pairs, respectively. Similar to (Nishida et al, 2019), we selected the following four datasets as target domains: Natural Questions (Kwiatkowski et al, 2019), which consists of Google search questions and the annotated answers from Wikipedia. We used the MRQA Shared Task (Fisch et al, 2019) preprocessed training and dev sets, which consist of 104,071 and 12,836 (q, a) pairs, respectively

samples: 1504
Passages from the CNN/Daily Mail corpus of Hermann et al (2015) are used as the unlabeled target domain corpus. BioASQ (Tsatsaronis et al, 2015): we employed the MRQA shared task version of BioASQ, which consists of a dev set with 1,504 samples. We collected PubMed abstracts to use as target domain unlabeled passages

pairs: 13111
DuoRC (Saha et al, 2018) contains question-answer pairs from movie plots which are extracted from both Wikipedia and IMDB. The ParaphraseRC task of the DuoRC dataset was used in our evaluations, consisting of 13,111 pairs. We crawled IMDB movie plots to use as the unlabeled target domain corpus

samples: 10
Question-answer generation with AQGen, QAGen, and QAGen2S is performed using Topk+Nucleus sampling, as discussed in Sec. 2.3. For each passage, 10 samples are generated. Unless otherwise mentioned, LM filtering is applied by sorting the 10 samples of each passage according to LM scores, as detailed in Sec. 2.4, and the top 5 samples are selected (see the sketch below). The number of synthetically generated pairs is between 860k and 890k without filtering and between 480k and 500k after LM filtering
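A rough sketch of the sampling-and-filtering loop just described: draw 10 candidates per passage with Topk+Nucleus sampling, score each candidate with the generator's own log-likelihood, and keep the top 5. The helper below is an assumption about how such a loop could look with the transformers generate API, not the authors' code; the k, p, and summed log-probability choices are illustrative.

```python
import torch


def generate_and_filter(model, tokenizer, passage, n_samples=10, n_keep=5):
    """Sample question-answer candidates for one passage and keep the
    highest-scoring ones according to the generator's own likelihood."""
    inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)
    out = model.generate(
        **inputs,
        do_sample=True,
        top_k=20,                  # illustrative top-k truncation
        top_p=0.95,                # illustrative nucleus threshold
        max_length=64,
        num_return_sequences=n_samples,
        return_dict_in_generate=True,
        output_scores=True,
    )
    # Per-token log-probabilities of each generated candidate.
    scores = model.compute_transition_scores(out.sequences, out.scores, normalize_logits=True)
    # Zero out padded / -inf positions so they do not distort the total.
    scores = torch.where(torch.isfinite(scores), scores, torch.zeros_like(scores))
    lm_scores = scores.sum(dim=-1)  # summed log-likelihood per candidate
    keep = torch.argsort(lm_scores, descending=True)[:n_keep]
    return [tokenizer.decode(out.sequences[i], skip_special_tokens=True) for i in keep]
```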

pairs: 39144
We postulate this is due to two reasons: firstly, both the BioASQ and DuoRC domains are more dissimilar to the source domain, SQuAD, than NewsQA and Natural Questions; secondly, BioASQ and DuoRC are more difficult datasets. Comparing our results with supervised target domain training on DuoRC, we observe that using only synthetic data outperforms training on the DuoRC training set, which consists of 39,144 pairs. While our domain adaptation methods show substantial gains on the NewsQA and Natural Questions domains, there is still room for improvement to match the performance of supervised target domain training (last row in Tab. 2)

samples: 5
Appendix C.1 examines this issue. When training the RC model, we only used the top 5 samples per passage based on LM score

pairs: 10
We can observe that sampling 10 pairs per document leads to the best EM/F1 on the target domain. By sampling many QA pairs per passage, we increase the chance of generating good samples

pairs: 10
Tables 10 and 12 in the Appendix show examples of QA pairs and their LM scores. Fig. 4 shows experimental results when varying the number of (q, a) pairs selected from the 10 pairs sampled per passage. We chose the value of 5 as this configuration outperforms other values overall

samples: 200
We postulate that the LM score correlates with the F1 score used in round-trip filtering. To more thoroughly examine this, we devised an experiment where we sorted the generated samples by their answer LM scores, divided them into contiguous buckets each with 200 samples, and calculated the average F1 score of the samples in each bucket. Fig. 5 shows the results of this experiment
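A small sketch of the bucketing analysis above, assuming a list of (lm_score, f1) pairs has been computed elsewhere; the helper name and input format are hypothetical.

```python
def bucket_average_f1(samples, bucket_size=200):
    """Sort samples by LM score, split them into contiguous buckets of
    `bucket_size`, and return the average F1 of each bucket."""
    ordered = sorted(samples, key=lambda s: s[0], reverse=True)
    averages = []
    for start in range(0, len(ordered), bucket_size):
        bucket = ordered[start:start + bucket_size]
        averages.append(sum(f1 for _, f1 in bucket) / len(bucket))
    return averages  # one value per bucket, from highest to lowest LM score
```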

pairs: 5
Impact of Synthetic Dataset Size In Fig. 6, we present plots that correlate synthetic dataset size (in # of passages) and RC model performance (EM/F1). We can see that as the number of generated (q, a) pairs increases (5 pairs per passage), RC model performance improves. Such correlation is more evident when not using the SQuAD training data

Reference
  • Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a human-like open-domain chatbot.
  • Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. 2019. Synthetic QA corpora generation with roundtrip consistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6168–6173.
  • Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.
  • Yu Chen, Lingfei Wu, and Mohammed J. Zaki. 2020. Reinforcement learning based graph-to-sequence model for natural question generation. In International Conference on Learning Representations.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
  • Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1342–1352, Vancouver, Canada. Association for Computational Linguistics.
  • Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. 2017. Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 866–874, Copenhagen, Denmark. Association for Computational Linguistics.
  • Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 1–13, Hong Kong, China. Association for Computational Linguistics.
  • Karl Moritz Hermann, Tomáš Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, page 1693–1701, Cambridge, MA, USA. MIT Press.
  • Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. CoRR, abs/1904.09751.
  • Tassilo Klein and Moin Nabi. 2019. Learning to answer by learning to ask: Getting the best of gpt-2 and bert worlds. arXiv preprint arXiv:1911.02365.
  • Vishwajeet Kumar, Ganesh Ramakrishnan, and YuanFang Li. 2019. Putting the horse before the cart: A generator-evaluator framework for question generation from text. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 812–821.
  • Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
  • Dong Bok Lee, Seanie Lee, Woo Tae Jeong, Donghwan Kim, and Sung Ju Hwang. 2020. Generating diverse and consistent qa pairs from contexts with information-maximizing hierarchical conditional vaes.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  • Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101.
  • Xiyao Ma, Qile Zhu, Yanlin Zhou, and Xiaolin Li. 2020. Improving question generation with sentence-level semantic matching and answer position inferring. In AAAI 2020.
  • Ruslan Mitkov and Le An Ha. 2003. Computer-aided generation of multiple-choice tests. In Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing - Volume 2, HLT-NAACL-EDUC ’03, page 17–22, USA. Association for Computational Linguistics.
  • Kosuke Nishida, Kyosuke Nishida, Itsumi Saito, Hisako Asano, and Junji Tomita. 2019. Unsupervised domain adaptation of language models for reading comprehension.
  • Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.
  • Raul Puri, Ryan Spring, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. 2020. Training question answering models from synthetic data.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, abs/1910.10683.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250.
  • Vasile Rus, Brendan Wyse, Paul Piwek, Mihai Lintean, Svetlana Stoyanchev, and Christian Moldovan. 2010. The first question generation shared task evaluation challenge. In Proceedings of the 6th International Natural Language Generation Conference.
  • Mrinmaya Sachan and Eric Xing. 2018. Self-training for jointly learning to ask and answer questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 629–640, New Orleans, Louisiana. Association for Computational Linguistics.
  • Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension. In Meeting of the Association for Computational Linguistics (ACL).
  • Linfeng Song, Zhiguo Wang, Wael Hamza, Yue Zhang, and Daniel Gildea. 2018. Leveraging context information for natural question generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 569–574, New Orleans, Louisiana. Association for Computational Linguistics.
  • George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Artieres, Axel Ngonga, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos, and Georgios Paliouras. 2015. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16:138.
  • Luu Anh Tuan, Darsh J Shah, and Regina Barzilay. 2019. Capturing greater context for question generation.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Xiaochuan Wang, Bingning Wang, Ting Yao, Qi Zhang, and Jingfang Xu. 2020. Neural question generation with answer pivot. In AAAI 2020.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
  • Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. 2018. Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3901–3910.
  • Xingwu Sun, Jing Liu, Yajuan Lyu, Wei He, Yanjun Ma, and Shi Wang. 2018. Answer-focused and position-aware neural question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3930–3939, Brussels, Belgium. Association for Computational Linguistics.
  • Duyu Tang, Nan Duan, Tao Qin, Zhao Yan, and Ming Zhou. 2017. Question answering and question generation as dual tasks. arXiv preprint arXiv:1706.02027.