Multi-Stage Pre-training for Low-Resource Domain Adaptation

EMNLP 2020, pp. 5461–5468

Abstract

Transfer learning techniques are particularly useful for NLP tasks where a sizable amount of high-quality annotated data is difficult to obtain. Current approaches directly adapt a pretrained language model (LM) on in-domain text before fine-tuning to downstream tasks. We show that extending the vocabulary of the LM with domain-specific terms […]

Introduction
  • Pre-trained language models (Radford et al., 2019; Devlin et al., 2019; Liu et al., 2019) have pushed performance in many natural language processing tasks to new heights.
  • Directly fine-tuning on a task in a new domain may not be optimal when the domain is distant in content and terminology from the pre-training corpora.
  • Many specialized domains contain their own specific terms that are not part of the pre-trained LM vocabulary.
  • In many such domains, large enough corpora may not be available to support LM training from scratch.
  • To resolve this out-of-vocabulary issue, the authors extend the open-domain vocabulary with in-domain terms while adapting the LM, and show that this improves performance on downstream tasks (a minimal sketch of this step follows this list).
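A minimal sketch of what this vocabulary-extension step could look like, using the HuggingFace Transformers API; the paper itself runs LM training in Fairseq, so this is an illustration rather than the authors' pipeline, and the term list and model name are placeholders:

```python
# Sketch: extend the open-domain BPE vocabulary with in-domain terms before
# domain-adaptive masked-LM training. Term list and model are illustrative.
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaForMaskedLM.from_pretrained("roberta-large")

# Hypothetical domain-specific terms mined from the IT corpus, which the
# base vocabulary would otherwise split into many word pieces.
in_domain_terms = ["websphere", "db2diag", "fixpack"]

num_added = tokenizer.add_tokens(in_domain_terms)
model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")

# The extended model would then be adapted on unlabeled in-domain text with
# the usual masked-LM objective before fine-tuning on the downstream tasks.
```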
Highlights
  • Pre-trained language models (Radford et al., 2019; Devlin et al., 2019; Liu et al., 2019) have pushed performance in many natural language processing tasks to new heights.
  • In our experiments, we show considerable improvements in performance over directly fine-tuning an underlying RoBERTa-large language model (LM) (Liu et al., 2019) on multiple tasks in the IT domain: extractive reading comprehension (RC), document ranking (DR), and duplicate question detection (DQD).
  • For each of our approaches, we show the performance of the model when fine-tuned on the downstream tasks in the TechQA and AskUbuntu datasets.
  • We show that it is beneficial to extend the vocabulary of the LM while fine-tuning it on the target domain language.
  • We empirically demonstrate that structure in the unsupervised domain data can be used to formulate auxiliary pre-training tasks that help downstream low-resource tasks such as question answering and document ranking.
  • We empirically show considerable improvements in performance over a standard RoBERTa-large LM on multiple tasks.
Methods
  • For each synthetic positive example, ten other documents are sampled from the Technotes corpus as negatives to simulate unanswerable examples.
  • This auxiliary task trains an intermediate RC model that predicts the start and end positions of the solution section as the answer, given the document and the problem description (see the sketch after this list).
  • While the main goal here is to generate long-answer examples common in TechQA, the general idea of exploiting document structure is applicable in other scenarios, including scientific domains such as biomedical text (Tsatsaronis et al., 2015; Lee et al., 2019) where structured text is relatively common.
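A rough sketch of how such auxiliary RC examples could be assembled from structured Technotes, assuming each note can be split into a problem section and a solution section; `split_sections` is a hypothetical parser, not part of the paper's code, and the section markers are assumed:

```python
import random

def split_sections(text):
    """Hypothetical parser: assumes notes mark sections with 'PROBLEM:' and
    'SOLUTION:' headers (the real Technotes format may differ)."""
    sections = {}
    for name in ("problem", "solution"):
        marker = name.upper() + ":"
        if marker in text:
            sections[name] = text.split(marker, 1)[1].split("\n\n", 1)[0].strip()
    return sections

def build_synthetic_rc(technotes, num_negatives=10, seed=0):
    """Synthetic RC examples: the problem section acts as the question, the
    solution section's span as the answer, and sampled other notes as
    unanswerable negatives (mirroring TechQA's multi-document setting)."""
    rng = random.Random(seed)
    examples = []
    for note in technotes:
        sections = split_sections(note["text"])
        if "problem" not in sections or "solution" not in sections:
            continue
        start = note["text"].find(sections["solution"])
        examples.append({"question": sections["problem"], "document": note["text"],
                         "answer_span": (start, start + len(sections["solution"]))})
        others = [n for n in technotes if n is not note]
        for neg in rng.sample(others, min(num_negatives, len(others))):
            examples.append({"question": sections["problem"],
                             "document": neg["text"], "answer_span": None})
    return examples
```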
Results
  • For each of the approaches, the authors show the performance of the model when fine-tuned on the downstream tasks in the TechQA and AskUbuntu datasets.
  • All reported numbers are averages over 5 seeds, unless otherwise stated.
  • TechQA-RC: Table 2 describes the performance on the RC task in the TechQA dataset.
  • The BERT baseline numbers are from Castelli et al. (2019).
  • Model performance is compared on the dev set, and the authors report blind test set numbers for the single-best baseline and final models.
Conclusion
  • The authors show that it is beneficial to extend the vocabulary of the LM while fine-tuning it on the target domain language.
  • The authors show that extending the pre-training with task-specific synthetic data is an effective domain adaptation strategy.
  • The authors empirically demonstrate that structure in the unsupervised domain data can be used to formulate auxiliary pre-training tasks that help downstream low-resource tasks such as question answering and document ranking.
  • The authors aim to extend the approach to more domains and to explore more generalizable approaches for unsupervised domain adaptation.
Tables
  • Table 1: Size statistics for two IT domain datasets. Train/Dev/Test: # examples, Unlabeled: # tokens
  • Table 2: Results on the TechQA-RC task. Each row with a + adds a step to the previous row. HA F1 refers to F1 for answerable questions. Numbers in parentheses show standard deviation
  • Table 3: Experimental results on the TechQA-DR task. Each row with a + adds a step to the previous row. M@1 is short for Match@1 and M@5 for Match@5. Numbers in parentheses show standard deviation
  • Table 4: Experimental results on the AskUbuntu-DQD task. Each row with a + adds a step to the previous row. P@1 and P@5 refer to Precision@1 and Precision@5, respectively. Numbers in parentheses show standard deviation
  • Table 5: Hyperparameters for LM training
  • Table 6: Hyperparameters for the TechQA-RC task
  • Table 7: Hyperparameters for the TechQA-DR task
  • Table 8: Hyperparameters for the AskUbuntu-DQD task
  • Table 9: Coverage and BPE/TOK ratio vs. the number of word pieces added to the vocabulary for the Technotes collection
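For Table 9, a small sketch of how the two statistics could be computed for a candidate list of added word pieces; the definition of coverage used here (the fraction of corpus word tokens kept as a single piece by the extended vocabulary) and the whitespace pre-tokenization are assumptions, not the paper's exact procedure:

```python
from transformers import RobertaTokenizer

def vocab_stats(corpus_lines, added_pieces):
    """Coverage = fraction of corpus word tokens kept as a single piece;
    BPE/TOK = average number of word pieces per whitespace token.
    (Whitespace pre-tokenization is a simplification of RoBERTa's BPE.)"""
    tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
    tokenizer.add_tokens(list(added_pieces))
    n_words = n_pieces = n_whole = 0
    for line in corpus_lines:
        for word in line.split():
            pieces = tokenizer.tokenize(word)
            n_words += 1
            n_pieces += len(pieces)
            n_whole += int(len(pieces) == 1)
    return n_whole / max(n_words, 1), n_pieces / max(n_words, 1)

# Example usage (hypothetical corpus file):
# coverage, bpe_per_tok = vocab_stats(open("technotes.txt"), ["websphere"])
```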
Study subjects and analysis
Documents per question: 50
TechQA (Castelli et al., 2019) is an extractive reading comprehension (Rajpurkar et al., 2016) dataset developed from real user questions in the customer support domain. Each question is accompanied by 50 documents, at most one of which has the answer. A companion collection of 801K unlabeled Technotes is provided to support LM training.

Randomly selected negative documents: 10
Using the method described in Section 4, we use the 801K Technotes to construct a synthetic corpus for the TechQA tasks. The synthetic data contains 115K positive examples, each of which has 10 randomly selected documents as negatives. For the AskUbuntu-DQD task, a 210K-example synthetic corpus is constructed from the web dump data, with a positive:negative example ratio of 1:1.
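A rough sketch of how a 1:1 positive:negative synthetic DQD corpus could be assembled from unlabeled forum posts; pairing a question's title with its own body as a positive and with a randomly sampled other body as a negative is an assumption of this sketch, not a detail stated in this summary:

```python
import random

def build_synthetic_dqd(posts, seed=0):
    """1:1 positive:negative question-pair corpus from unlabeled posts.
    Positive = (title, own body); negative = (title, random other body)."""
    rng = random.Random(seed)
    pairs = []
    for i, post in enumerate(posts):
        pairs.append((post["title"], post["body"], 1))      # positive pair
        j = rng.randrange(len(posts) - 1)
        j = j + 1 if j >= i else j                           # skip the post itself
        pairs.append((post["title"], posts[j]["body"], 0))   # negative pair
    return pairs
```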

References
  • Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. DocBERT: BERT for document classification. arXiv 1904.08398.
  • Emily Alsentzer, John Murphy, William Boag, WeiHung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
  • Ankur Bapna and Orhan Firat. 2019. Simple, scalable adaptation for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1538–1548, Hong Kong, China. Association for Computational Linguistics.
  • Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.
  • Vittorio Castelli, Rishav Chakravarti, Saswati Dana, Anthony Ferritto, Radu Florian, Martin Franz, Dinesh Garg, Dinesh Khandelwal, Scott McCarley, Mike McCawley, Mohamed Nasr, Lin Pan, Cezar Pendus, John Pitrelli, Saurabh Pujar, Salim Roukos, Andrzej Sakrajda, Avirup Sil, Rosario Uceda-Sosa, Todd Ward, and Rong Zhang. 2019. The TechQA dataset. arXiv 1911.02984; To appear in Proc. ACL 2020.
  • Yu-An Chung, Hung-Yi Lee, and James Glass. 2018. Supervised and unsupervised transfer learning for question answering. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1585–1594, New Orleans, Louisiana. Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Bhuwan Dhingra, Danish Danish, and Dheeraj Rajagopal. 2018. Simple and effective semi-supervised question answering. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 582–587, New Orleans, Louisiana. Association for Computational Linguistics.
  • G. Tsatsaronis, G. Balikas, P. Malakasiotis, et al. 2015. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16(138).
  • David Golub, Po-Sen Huang, Xiaodong He, and Li Deng. 2017. Two-stage synthesis networks for transfer learning in machine comprehension. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 835–844, Copenhagen, Denmark. Association for Computational Linguistics.
  • Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv 2004.10964.
  • Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pretrained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
  • Tao Lei, Hrishikesh Joshi, Regina Barzilay, Tommi Jaakkola, Kateryna Tymoshenko, Alessandro Moschitti, and Lluís Màrquez. 2016. Semi-supervised question retrieval with gated convolutions. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1279–1289, San Diego, California. Association for Computational Linguistics.
  • Patrick Lewis, Ludovic Denoyer, and Sebastian Riedel. 2019. Unsupervised question answering by cloze translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4896–4910, Florence, Italy. Association for Computational Linguistics.
  • Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 1907.11692.
  • Sewon Min, Minjoon Seo, and Hannaneh Hajishirzi. 2017. Question answering through transfer learning from large fine-grained supervision data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 510–517, Vancouver, Canada. Association for Computational Linguistics.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
  • Xipeng Qiu and Xuanjing Huang. 2015. Convolutional neural tensor network architecture for community-based question answering. In IJCAI, pages 1305–1311. AAAI Press.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3:333–389.
  • Andreas Rücklé, Nafise Sadat Moosavi, and Iryna Gurevych. 2019. Neural duplicate question detection without labeled training data. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1607–1617, Hong Kong, China. Association for Computational Linguistics.
  • Georg Wiese, Dirk Weissenborn, and Mariana Neves. 2017. Neural domain adaptation for biomedical question answering. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 281–289, Vancouver, Canada. Association for Computational Linguistics.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv abs/1910.03771.
  • Zhilin Yang, Junjie Hu, Ruslan Salakhutdinov, and William Cohen. 2017. Semi-supervised QA with generative domain-adaptive nets. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1040–1050, Vancouver, Canada. Association for Computational Linguistics.
  • Implementation notes: In our experiments, we used the Fairseq toolkit (Ott et al., 2019) for language modelling and the Transformers library (Wolf et al., 2019) for downstream tasks. For all target models, when fine-tuning on a downstream task, we choose hyperparameters by grid search and pick the best model on the dev set according to the evaluation metric for that task: (HA F1 + F1) for TechQA-RC, Match@1 for TechQA-DR, and MAP for AskUbuntu-DQD. The best hyperparameters for each task are shown in Tables 5 to 8 (Table 5, for LM training, covers warmup updates, peak learning rate, tokens per sample, max positions, max sentences, update frequency, optimizer, dropout, attention dropout, weight decay, max epochs, and criterion).
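A schematic of that dev-set selection loop; `train_and_evaluate` and the grid values are placeholders rather than the authors' actual configuration (their chosen values are in Tables 5 to 8):

```python
import itertools

# Illustrative grid; the paper's actual ranges and best values are in Tables 5-8.
GRID = {"learning_rate": [1e-5, 2e-5, 3e-5], "batch_size": [16, 32], "epochs": [3, 5]}

def select_best(train_and_evaluate, metric="MAP"):
    """Fine-tune one model per configuration and keep the best dev score.
    `train_and_evaluate` is a placeholder that fine-tunes one configuration
    and returns a dict of dev metrics, e.g. {"MAP": ...} for AskUbuntu-DQD."""
    best_score, best_config = float("-inf"), None
    for values in itertools.product(*GRID.values()):
        config = dict(zip(GRID.keys(), values))
        scores = train_and_evaluate(config)
        if scores[metric] > best_score:
            best_score, best_config = scores[metric], config
    return best_config, best_score
```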