BioMegatron: Larger Biomedical Domain Language Model

EMNLP 2020, pp. 4700-4706 (2020)

Abstract

There has been an influx of biomedical domain-specific language models, showing language models pre-trained on biomedical text perform better on biomedical domain benchmarks than those trained on general domain text corpora such as Wikipedia and Books. Yet, most works do not study the factors affecting each domain language application deeply…

Introduction
  • Transferring the success of BERT (Devlin et al., 2018) to the biomedical domain, most notably Lee et al. (2019) (BioBERT) and Beltagy et al. (2019) (SciBERT), inspired a large number of similar works last year.
  • Gu et al. (2020) performed a comprehensive study on the pre-training corpus domain, language model masking method, and adversarial training, benchmarking on a number of different datasets for token classification, sequence classification, and sequence regression.
  • Compared to the previous works, the authors perform a more detailed study on (1) subword vocabulary, (2) labeling method, (3) model size, and (4) domain transfer, showing gains in token classification, sequence classification, and question answering.
Highlights
  • Transferring the success of BERT (Devlin et al., 2018) to the biomedical domain, most notably Lee et al. (2019) (BioBERT) and Beltagy et al. (2019) (SciBERT), inspired a large number of similar works last year
  • The evaluation results on Named Entity Recognition (NER) and Relation Extraction (RE) are shown in Table 2, and those on Question Answering (QA) in Table 3
  • We examine how well a general- or domain-specific Language Model (LM) generalizes across domains in relation to model size
  • We review and test several factors that can affect the performance of domain language models
  • We find that a language model targeted for a domain and application performs best
  • Model size is a secondary factor to the vocabulary set for the token classification task
Results
  • The evaluation results on NER and RE are shown in Table 2, and those on QA in Table 3.
  • The authors evaluate NER with entity-level F1, using the official CoNLL evaluation script translated into Python.
  • RE uses micro-level F1, and QA uses the BioASQ evaluation script.
  • Representing named entities as single terms is more helpful than breaking them into several subtokens.
  • Table 4 shows the rate at which named entities break into sub-tokens for each benchmark training set with different sub-word vocabularies (a minimal sketch of this rate appears after this list).
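The #tokens/#words rate in Table 4 can be estimated in a few lines. The following is a minimal sketch, not the authors' script: it assumes the HuggingFace transformers package, uses bert-base-cased as a stand-in vocabulary, and the entity list is illustrative rather than taken from the benchmark training sets.

```python
# Minimal sketch: estimate how often entity words fragment into sub-tokens,
# i.e. the #tokens/#words rate reported in Table 4.
# Assumes the HuggingFace `transformers` package; the checkpoint below is a
# stand-in for whichever sub-word vocabulary is being measured.
from transformers import AutoTokenizer

def subtoken_rate(entity_words, tokenizer):
    """Return #sub-tokens / #words over a list of entity words."""
    n_words = len(entity_words)
    n_tokens = sum(len(tokenizer.tokenize(w)) for w in entity_words)
    return n_tokens / n_words if n_words else 0.0

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder vocabulary
    # In practice the words would come from the entity mentions in an NER training set.
    entities = ["naproxen", "hepatotoxicity", "dopamine", "carcinoma"]
    print(f"#tokens/#words = {subtoken_rate(entities, tok):.2f}")
```

A vocabulary that keeps biomedical terms whole drives this rate toward 1.0, which is the property the authors argue helps token classification.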
Conclusion
  • The authors review and test several factors that can affect the performance of domain language models.
  • The authors find that a language model targeted for a domain and application performs best.
  • Larger model size does not necessarily translate to better performance on a cross-domain benchmark task.
  • This probably indicates that there is no master model that can “do it all”, at least not as well as a targeted one.
  • Model size is a secondary factor; a larger model can probably further improve the performance of a domain- and application-specific language model
Tables
  • Table 1: Model configurations
  • Table 2: Evaluation results on NER and RE after fine-tuning for 30 epochs with hyper-parameter settings of: num-fc-layers: {1, 2}; fc-hidden-size: {512, 1024}; fc-dropout: 0.5; max-seq-length: 128; learning-rate: 5e-5; cross-entropy loss, with the Adam optimizer. BioMegatron models are pre-trained from scratch on PubMed, except the 1.2b model, which is fine-tuned from a general-domain model checkpoint
  • Table 3: Evaluation results on QA after fine-tuning for 30 epochs on checkpoints fine-tuned on the SQuAD dataset, with fixed hyper-parameter settings of: num-fc-layers: 2; fc-hidden-size: 2048; fc-dropout: 0.1; max-seq-length: 512; learning-rate: 3e-5; cross-entropy loss, using the Adam optimizer. BioMegatron models are pre-trained from scratch on PubMed, except the 1.2b model, which is fine-tuned from a general-domain model checkpoint
  • Table 4: The rate of named entities breaking into sub-tokens (#tokens/#words) in the NER training sets
  • Table 5: Results on BioASQ-7b factoid, without fine-tuning on the SQuAD dataset first. The other models, including those using domain vocabularies, could not achieve comparable results. A consistent pattern of improvement with model size is noticeable, on par with findings for general-domain LMs on SQuAD
  • Table 6: Comparison of fine-tuning steps for the NER and RE benchmarks when pre-training the general-domain Megatron-1.2b model on PubMed. Cross-domain LMs should be trained sufficiently long on domain text to achieve comparable performance
  • Table 7: Fine-tuning and evaluating on BioASQ-7b using general-domain LMs not trained on the PubMed corpus. A larger model does not perform better
  • Table 8: Fine-tuning on SQuAD-v1.1/-v2.0 using BioMegatron and evaluating the F1-score on the dev set. BioMegatron models marked ‘-ft’ are pre-trained from general-domain checkpoints (fine-tuned). Results of other general-domain LMs are compared: RoBERTa (Liu et al., 2019), Megatron-LM (Shoeybi et al., 2019)
  • Table 9: Label bias in general and biomedical benchmark datasets. CoNLL-2003 (Sang and De Meulder, 2003), MRPC (Dolan et al., 2005), and SQuAD (Rajpurkar et al., 2016) are general-domain datasets for NER, CLS (RE), and QA, respectively, for comparison against the biomedical-domain datasets. Label bias is computed as [sum of the #samples of minority labels] / [#samples of the majority label] for NER and RE (CLS), and [#minimum repeat of the same answer] / [#maximum repeat of the same answer] for QA (a minimal sketch of these measures appears after this list)
  • Table 10: Pre-training text corpus of each biomedical LM. We pre-train on PubMed abstracts and the full-text commercial-collection (CC), which are free of copyright restrictions
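The label-bias measures quoted in the Table 9 caption are simple ratios over label counts. Below is a minimal sketch of those two formulas; the example counts are made-up placeholders, not values from the paper.

```python
# Minimal sketch of the label-bias measures described in the Table 9 caption.
# The example counts are illustrative placeholders, not numbers from the paper.
from collections import Counter

def label_bias_cls(label_counts):
    """NER / RE (CLS): [sum of #samples of minority labels] / [#samples of majority label]."""
    majority = label_counts.most_common(1)[0][1]
    return (sum(label_counts.values()) - majority) / majority

def label_bias_qa(answer_counts):
    """QA: [#minimum repeat of the same answer] / [#maximum repeat of the same answer]."""
    return min(answer_counts.values()) / max(answer_counts.values())

if __name__ == "__main__":
    ner_labels = Counter({"O": 100000, "B-Disease": 4000, "I-Disease": 3000})
    qa_answers = Counter({"aspirin": 12, "p53": 3, "BRCA1": 1})
    print(f"NER/RE label bias: {label_bias_cls(ner_labels):.3f}")
    print(f"QA label bias:     {label_bias_qa(qa_answers):.3f}")
```

A value close to 1 indicates a balanced label distribution; the smaller the ratio, the more a dataset is dominated by its majority label (or most frequent answer).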
Related work
  • A prime example of Language Models (LMs) in the biomedical domain is BioBERT (Lee et al., 2019). It is a transformer LM pre-trained on the PubMed (www.ncbi.nlm.nih.gov/pubmed) biomedical text corpus comprising biomedical literature abstracts. Their pre-training started from the checkpoint of Devlin et al. (2018) trained on Wikipedia and Books-Corpus. Independently, Beltagy et al. (2019) (SciBERT) pre-trained BERT from scratch using their own vocabulary set on scientific text corpora, including PubMed abstracts and computer science papers. Both demonstrated increased performance over the previous non-BERT SOTA on biomedical benchmarks, including Named Entity Recognition (NER), Relation Extraction (RE), and Question Answering (QA). BioBERT and SciBERT report similar results on NER and RE, while only BioBERT reports QA results.
Reference
  • Emily Alsentzer, John R Murphy, Willie Boag, WeiHung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323.
  • Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. Scibert: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676.
  • Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357.
  • Sheng Chen, Haibo He, and Edwardo A Garcia. 2010. Ramoboost: ranked minority oversampling in boosting. IEEE Transactions on Neural Networks, 21(10):1624–1642.
  • Dina Demner-Fushman, K Bretonnel Cohen, Sophia Ananiadou, and Jun’ichi Tsujii. 2019. Proceedings of the 18th bionlp workshop and shared task. In Proceedings of the 18th BioNLP Workshop and Shared Task.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Rezarta Islamaj Dogan, Robert Leaman, and Zhiyong Lu. 2014. Ncbi disease corpus: a resource for disease name recognition and concept normalization. Journal of biomedical informatics, 47:1–10.
  • Bill Dolan, Chris Brockett, and Chris Quirk. 2005. Microsoft research paraphrase corpus. Retrieved March, 29(2008):63.
  • Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2020. Domainspecific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779.
  • Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342.
  • Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. Mimic-iii, a freely accessible critical care database. Scientific Data, 3:160035.
  • Martin Krallinger, Obdulia Rabal, Florian Leitner, Miguel Vazquez, David Salgado, Zhiyong Lu, Robert Leaman, Yanan Lu, Donghong Ji, Daniel M Lowe, et al. 2015. The chemdner corpus of chemicals and drugs and its annotation principles. Journal of cheminformatics, 7(1):1–17.
  • J Lee, W Yoon, S Kim, D Kim, CH So, and J Kang. 2019. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (Oxford, England).
  • Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016. Biocreative v cdr task corpus: a resource for chemical disease relation extraction. Database, 2016.
  • Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li. 2019. Dice loss for data-imbalanced nlp tasks. arXiv preprint arXiv:1911.02855.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: An evaluation of bert and elmo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Better language models and their implications.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Lance A Ramshaw and Mitchell P Marcus. 1999. Text chunking using transformation-based learning. In Natural language processing using very large corpora, pages 157–176. Springer.
  • Erik F Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. Proceedings of CoNLL-2003.
  • Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053.
  • Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. CoRR, abs/1806.02847.
  • George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. 2015. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics, 16(1):138.
  • Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. CoRR, abs/1905.12616.
Author
Hoo-Chang Shin
Yang Zhang
Evelina Bakhturina
Raul Puri
Mostofa Patwary
Mohammad Shoeybi
Raghav Mani