OAG-BERT: Pre-train Heterogeneous Entity-augmented Academic Language Model
Enriching language models with domain knowledge is crucial but difficult. Based on the world's largest public academic graph, Open Academic Graph (OAG), we pre-train an academic language model, namely OAG-BERT, which integrates massive heterogeneous entities including paper, author, concept, venue, and affiliation. To better endow OAG-BE...
- Pre-trained language models such as BERT, GPT, and XLNet have substantially advanced natural language processing.
- Besides general-purpose pre-training, more and more language models target specific domains, such as BioBERT for the biomedical field and SciBERT for the academic field, which establish new state-of-the-art results on many domain-related benchmarks, including named entity recognition [10, 32] and topic classification [4, 19].
- Most of these models are pre-trained only on a domain corpus and do not integrate domain entity knowledge, which is crucial for many entity-related downstream tasks.
- We present OAG-BERT, an entity-knowledge-augmented academic language model that is pre-trained on the full text of 5 million papers, 110 million paper abstracts, and billions of academic entities and relations from the Open Academic Graph (OAG).
- We develop a simple extension to the Masked Language Model (MLM) to achieve this.
- Following SciBERT, we evaluate model performance on the same 12 NLP tasks, including Named Entity Recognition (NER), Dependency Parsing (DEP), Relation Extraction (REL), PICO Extraction (PICO), and Text Classification (CLS).
- The proposed OAG-BERT is a bidirectional transformer-based pretraining model.
- It can encode scientific texts and entity knowledge into high-dimensional embeddings, which can be used for downstream tasks such as predicting the publication venue of a paper.
- The authors build the OAG-BERT model on top of the conventional BERT model with 12 transformer encoder layers.
- While the original BERT model only focuses on natural language, the proposed OAG-BERT incorporates heterogeneous entity knowledge.
- In addition to learning from pure scientific texts such as paper title or abstract, the OAG-BERT model can comprehend other types of information, such as the published venues or the affiliations of paper authors.
- Although SciBERT was not pre-trained with entity knowledge, it still performs much better than random guessing, which suggests that the inference tasks are not independent of the paper content.
- The authors speculate that pre-training on paper content helps the model learn some generalized knowledge about other types of information, such as fields of study or venue names.
- Results table (values not recovered): rows OAG-BERT, +prompt, +abstract, +both; columns Hit@1 and MRR for FOS and for Venue.
- The authors observe that using the abstract consistently improves performance.
- Prompt words work well with SciBERT but provide only limited help for OAG-BERT.
- The affiliation inference task appears to be harder than the other two tasks.
- Further analysis is provided in Appendix A.1.
- Two extended experiments are also included, which reveal two findings:
- In Equation 2, the authors use the sum of the log probabilities of all tokens to calculate the entity log probability.
- For MLM-based models, the encoding process encodes “[MASK]” tokens and captures the length of the masked entity and each token’s position.
- If the pre-training corpus contains fewer long entities than short ones, then during decoding the tokens of a long entity will generally receive higher probability than those of a short entity.
- In the designed decoding process, the authors do not strictly follow the left-to-right order as used in classical decoder models.
- For venue and affiliation, out-of-order decoding generally performs much better than left-to-right decoding, except when OAG-BERT uses the abstract, in which case the differences are relatively small.
- The authors present the results for models using left-to-right decoding and prompt words in Table 9, which indicates that left-to-right decoding sometimes significantly undermines the effectiveness of prompt words, especially for OAG-BERT.
- Table 1: Summary of model performance for all tasks. "Title-only" reports performance when only paper titles are used as inputs; "mixed" reports the best performance when other features such as FOS or venue are also used as inputs.
- Table 2: Results for the zero-shot inference tasks.
- Table 3: The generated FOS for the GPT-3 paper. Gold FOS are bolded. FOS not in the original OAG FOS candidate list are underlined.
- Table 4: Results of the classification task.
- Table 5: Macro pairwise F1 scores for the name disambiguation task.
- Table 6: Results for the NLP tasks.
- Table 7: Results of the link prediction tasks.
- Table 8: Results for different averaging methods used when calculating entity log probabilities. Hit@1 and MRR are reported.
- Table 9: Results for left-to-right and out-of-order decoding. Hit@1 and MRR are reported. Results that differ by more than 1% in Hit@1 are bolded.
- Table 10: Full list of candidates used in the zero-shot inference tasks and the supervised classification tasks.
- Table 11: Sizes of the datasets used in the supervised classification tasks.
- Table 12: Details of the CS heterogeneous graph used in link prediction.
- Table 13: Performance of vanilla OAG-BERT with and without training on 512-token samples. All results in this table were produced by fine-tuning for 2 epochs with a learning rate of 2e-5.
- Our proposed OAG-BERT model is based on BERT, a self-supervised bidirectional language model. It employs multi-layer transformers as its encoder and uses masked token prediction as its objective, which allows using massive unlabeled text data as training corpus. The model architecture and training scheme have been shown to be effective on various natural language tasks, such as question answering or natural language inference.
BERT has many variants. Some focus on the robustness of the pre-training process, like RoBERTa. Others try to incorporate more knowledge into natural language pre-training. SpanBERT develops span-level masking, which benefits span selection tasks. ERNIE introduces explicit knowledge graph inputs to the BERT encoder and achieves significant improvements on knowledge-driven tasks.
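To make the difference between masked token prediction and span-level masking concrete, here is a minimal sketch in plain Python. It is a simplification written for illustration, not the procedure used by BERT, SpanBERT, or OAG-BERT: the function names `mask_tokens` and `mask_span` and the 15% default rate are assumptions, and the sketch omits details such as replacing some selected tokens with random tokens or sampling span lengths.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Token-level masking: hide each token independently with probability mask_prob."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)   # position is excluded from the loss
    return masked, labels

def mask_span(tokens, start, length):
    """Span-level masking: hide one contiguous span of tokens."""
    end = min(start + length, len(tokens))
    masked = tokens[:start] + [MASK] * (end - start) + tokens[end:]
    labels = [None] * start + tokens[start:end] + [None] * (len(tokens) - end)
    return masked, labels
```

For example, `mask_span(["graph", "neural", "network", "models"], 1, 2)` hides "neural" and "network" together, which is closer to how a multi-token entity name would need to be predicted than independent token masking.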
- S2ORC-BERT applies the same method as SciBERT to a larger scientific corpus and slightly improves performance on downstream tasks.
- We calculated the macro pairwise F1 score following previous work.
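The text does not spell out the macro pairwise F1 computation, so the following is a sketch of the commonly used definition, offered as an assumption rather than the authors' exact procedure. Every pair of papers under one ambiguous name is treated as a binary decision (same author or not), F1 is computed per name, and the per-name scores are averaged. The function names and the input layout are hypothetical.

```python
from itertools import combinations

def pairwise_f1(true_labels, pred_labels):
    """Pairwise F1 for one name: a pair is positive when both papers share a cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_pairwise_f1(per_name):
    """Average pairwise F1 over names; per_name is a list of (true_labels, pred_labels)."""
    scores = [pairwise_f1(t, p) for t, p in per_name]
    return sum(scores) / len(scores)
```

Averaging per name (macro) rather than pooling all pairs keeps names with many papers from dominating the score.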
Thus, we only use paper title and abstract as the paper text information. From this corpus, we picked all authors with at least 3 published papers. Then we filtered out all papers not linked to these selected authors.
First, we choose 19 top-level fields of study (FOS) such as “biology” and “computer science”. Then, from the papers that were not used in the pre-training process, we randomly select 1,000 papers for each FOS. The task is to predict which research field each paper belongs to.
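The result tables for the inference tasks report Hit@1 and MRR. As a minimal sketch of how these two metrics are usually computed, assuming each paper has exactly one gold label and the model returns a full ranking of the candidates (the function names and the toy rankings are illustrative, not model output):

```python
def hit_at_1(ranked_lists, gold):
    """Fraction of examples whose top-ranked candidate equals the gold label."""
    hits = sum(1 for ranking, g in zip(ranked_lists, gold) if ranking[0] == g)
    return hits / len(gold)

def mean_reciprocal_rank(ranked_lists, gold):
    """Mean of 1/rank of the gold label, with ranks starting at 1."""
    total = 0.0
    for ranking, g in zip(ranked_lists, gold):
        total += 1.0 / (ranking.index(g) + 1)
    return total / len(gold)

# Toy rankings for two papers (hypothetical values).
rankings = [
    ["biology", "computer science", "physics"],
    ["physics", "biology", "computer science"],
]
gold = ["biology", "computer science"]
```

Here the first paper is ranked correctly and the second has its gold label in third place, so Hit@1 is 0.5 and MRR is the mean of 1 and 1/3.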
Venue and Affiliation Inference. Similar to the FOS inference task, we create venue and affiliation inference tasks. From papers not used in pre-training, we choose the 30 most frequent arXiv categories and 30 affiliations as inference candidates, with 100 papers randomly selected for each candidate. Full lists of the candidates, including the FOS candidates, are enclosed in the appendix.
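The highlights describe scoring an entity by the sum of its token log probabilities (Equation 2) and note that entity length can affect the score, with Table 8 comparing averaging methods. The sketch below illustrates that effect for candidate ranking. It assumes per-token probabilities are already available from a model; the numbers are made up, the function names are hypothetical, and the "mean" option is only one possible averaging method, not necessarily one evaluated by the authors.

```python
import math

def entity_log_prob(token_probs, average="sum"):
    """Score an entity from the probabilities of its tokens.

    average="sum"  : sum of token log probabilities
    average="mean" : sum divided by the number of tokens (assumed variant)
    """
    log_probs = [math.log(p) for p in token_probs]
    total = sum(log_probs)
    if average == "sum":
        return total
    if average == "mean":
        return total / len(log_probs)
    raise ValueError(f"unknown average method: {average}")

def rank_candidates(candidate_token_probs, average="sum"):
    """Return candidate names sorted from highest to lowest score."""
    scores = {
        name: entity_log_prob(probs, average)
        for name, probs in candidate_token_probs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical token probabilities for a one-token and a two-token candidate.
candidates = {
    "biology": [0.6],
    "computer science": [0.7, 0.7],
}
```

With these toy values, the sum ranks "biology" first because the two-token candidate accumulates two negative terms, while the mean ranks "computer science" first. This is why the choice of averaging method matters when candidates differ in length.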
- Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, et al. 2018. Construction of the literature graph in semantic scholar. arXiv preprint arXiv:1805.02262 (2018).
- Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019).
- Bo Chen, Jing Zhang, Jie Tang, Lingfan Cai, Zhaoyu Wang, Shu Zhao, Hong Chen, and Cuiping Li. 2020. CONNA: Addressing Name Disambiguation on The Fly. TKDE (2020).
- Arman Cohan, Waleed Ammar, Madeleine Van Zuylen, and Field Cady. 2019. Structural scaffolds for citation intent classification in scientific publications. arXiv preprint arXiv:1904.01608 (2019).
- Arman Cohan, Iz Beltagy, Daniel King, Bhavana Dalvi, and Daniel S Weld. 2019. Pretrained language models for sequential sentence classification. arXiv preprint arXiv:1909.04054 (2019).
- David Cyranoski. 2019. Artificial intelligence is selecting grant reviewers in China. Nature 569, 7756 (2019).
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860 (2019).
- Franck Dernoncourt and Ji Young Lee. 2017. Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts. arXiv preprint arXiv:1710.06071 (2017).
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
- Rezarta Islamaj Doğan, Robert Leaman, and Zhiyong Lu. 2014. NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of biomedical informatics 47 (2014).
- Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In SIGKDD.
- Zhengxiao Du, Jie Tang, and Yuhui Ding. 2018. Polar: Attention-based cnn for one-shot personalized article recommendation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer.
- Zhengxiao Du, Jie Tang, and Yuhui Ding. 2019. POLAR++: Active One-shot Personalized Article Recommendation. IEEE Transactions on Knowledge and Data Engineering (2019).
- Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In SIGKDD.
- Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. 2020. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. arXiv preprint arXiv:2004.10964 (2020).
- Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. 2020. Gpt-gnn: Generative pre-training of graph neural networks. In SIGKDD.
- Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. 2020. Heterogeneous graph transformer. In WWW.
- Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2020. Spanbert: Improving pre-training by representing and predicting spans. TACL 8 (2020).
- David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. 2018. Measuring the evolution of a scientific field through citation frames. TACL 6 (2018).
- Anshul Kanakia, Zhihong Shen, Darrin Eide, and Kuansan Wang. 2019. A scalable hybrid research paper recommender system for microsoft academic. In WWW.
- J-D Kim, Tomoko Ohta, Yuka Tateisi, and Jun’ichi Tsujii. 2003. GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics 19, suppl_1 (2003).
- Jin-Dong Kim, Tomoko Ohta, Yoshimasa Tsuruoka, Yuka Tateisi, and Nigel Collier. 2004. Introduction to the bio-entity recognition task at JNLPBA. In JNLPBA. Citeseer.
- Su Nam Kim, David Martinez, Lawrence Cavedon, and Lars Yencken. 2011. Automatic classification of sentences to support evidence based medicine. In BMC bioinformatics, Vol. 12. Springer.
- Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
- Jens Kringelum, Sonny Kim Kjaerulff, Søren Brunak, Ole Lund, Tudor I Oprea, and Olivier Taboureau. 2016. ChemProt-3.0: a global chemical biology diseases mapping. Database 2016 (2016).
- Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020).
- Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016 (2016).
- Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020. K-bert: Enabling language representation with knowledge graph. In AAAI, Vol. 34.
- Xiao Liu, Fanjin Zhang, Zhenyu Hou, Zhaoyu Wang, Li Mian, Jing Zhang, and Jie Tang. 2020. Self-supervised learning: Generative or contrastive. arXiv preprint arXiv:2006.08218 1, 2 (2020).
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
- Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S Weld. 2019. S2orc: The semantic scholar open research corpus. arXiv preprint arXiv:1911.02782 (2019).
- Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. arXiv preprint arXiv:1808.09602 (2018).
- Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. Scispacy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019).
- Benjamin Nye, Junyi Jessy Li, Roma Patel, Yinfei Yang, Iain J Marshall, Ani Nenkova, and Byron C Wallace. 2018. A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. In ACL, Vol. 2018. NIH Public Access.
- Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066 (2019).
- Nina Poerner, Ulli Waltinger, and Hinrich Schütze. 2019. E-bert: Efficient-yet-effective entity embeddings for bert. arXiv preprint arXiv:1911.03681 (2019).
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019).
- Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In SIGKDD.
- Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June Hsu, and Kuansan Wang. 2015. An overview of microsoft academic service (mas) and applications. In WWW.
- Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru Hu, Xuanjing Huang, and Zheng Zhang. 2020. CoLAKE: Contextualized Language and Knowledge Embedding. arXiv preprint arXiv:2010.00309 (2020).
- Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. Arnetminer: extraction and mining of academic social networks. In SIGKDD.
- C. Tillmann and H. Ney. 2003. Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation. Computational Linguistics 29 (2003), 97–133.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017).
- Kan Wu, Jie Tang, and Chenhui Zhang. 2018. Where Have You Been? Inferring Career Trajectory from Academic Social Network. In IJCAI.
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237 (2019).
- Yukuo Cen, Zhenyu Hou, Yan Wang, Qibin Chen, Yizhen Luo, Xingcheng Yao, Aohan Zeng, Shiguang Guo, Peng Zhang, Guohao Dai, Yu Wang, Chang Zhou, Hongxia Yang, and Jie Tang. 2021. CogDL: An Extensive Toolkit for Deep Learning on Graphs. arXiv preprint arXiv:2103.00959 (2021).
- Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Bin Shao, Rui Li, et al. 2019. Oag: Toward linking large-scale heterogeneous entity graphs. In SIGKDD.
- Jie Zhang, Yuxiao Dong, Yan Wang, Jie Tang, and Ming Ding. 2019. ProNE: Fast and Scalable Network Representation Learning. In IJCAI, Vol. 19.
- Minjia Zhang and Yuxiong He. 2020. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. arXiv preprint arXiv:2010.13369 (2020).
- Yutao Zhang, Fanjin Zhang, Peiran Yao, and Jie Tang. 2018. Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop. In SIGKDD.
- Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129 (2019).