Recognizing software names in biomedical literature using machine learning.

Qiang Wei, Yaoyun Zhang, Muhammad Amith, Rebecca Lin, Jenay Lapeyrolerie, Cui Tao, Hua Xu

Health Informatics Journal (2020)

Abstract

Software tools are now essential to research and applications in the biomedical domain. However, existing software repositories are built mainly through manual curation, which is time-consuming and does not scale. In this study, we manually annotated software names in 1,120 MEDLINE abstracts and titles and used this corpus to develop and evaluate machine learning-based named entity recognition systems for biomedical software. Specifically, we proposed two strategies for feature engineering: (1) domain knowledge features and (2) unsupervised word representation features based on clustered and binarized word embeddings. Our best system achieved an F-measure of 91.79% for recognizing software from titles and an F-measure of 86.35% for recognizing software from both titles and abstracts using inexact matching criteria. We then used the developed system to create a biomedical software catalog with 19,557 entries. This study demonstrates the feasibility of using natural language processing methods to automatically build a high-quality software index from biomedical literature.
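
The abstract reports F-measures under inexact matching criteria but does not define them. As a minimal sketch, assuming the common convention that a predicted mention counts as correct when its character span overlaps a gold mention at all, the scoring could look like the following. The function names and the span representation are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Tuple

# A mention is a (start, end) character offset pair, end exclusive.
# This representation is an assumption made for illustration.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def inexact_prf(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure under overlap-based (inexact) matching."""
    # A prediction is correct if it overlaps any gold mention.
    matched_pred = sum(1 for p in pred if any(overlaps(p, g) for g in gold))
    # A gold mention is found if any prediction overlaps it.
    matched_gold = sum(1 for g in gold if any(overlaps(g, p) for p in pred))

    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    gold_spans = [(0, 5), (10, 20)]
    pred_spans = [(0, 4), (30, 35)]  # one partial hit, one spurious mention
    print(inexact_prf(gold_spans, pred_spans))
```

Under exact matching, the partial hit `(0, 4)` would count as an error, which is why inexact scores are typically higher than exact ones for the same system.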

Keywords

biomedical literature, biomedical software, biomedical software index, named entity recognition, natural language processing
