Learning Biological Sequence Types Using the Literature.

CIKM (2017)

Abstract
In this paper, we explore automatic sequence type classification for records in biological sequence databases. The sequence type attribute provides important information about the nature of the sequence represented in a record, and is often used in search to filter out irrelevant sequences. However, the sequence type attribute is generally a non-mandatory free-text field, and is thus subject to many errors, including typos, mis-assignment, and non-assignment. In GenBank, this problem affects roughly 18% of records, an alarming number that should worry the biocuration community. To address this problem, we propose using the literature associated with sequence records as an external source of knowledge that can be leveraged for the classification task. We define a set of literature-based features and train a machine learning algorithm to classify a record into one of six primary sequence types. The main intuition behind using the literature for this task is that sequences appear to be discussed differently in scientific articles depending on their type. The experiments we conducted on the PubMed Central collection show that the literature is indeed an effective resource for sequence type classification. Our classification method reached an accuracy of 92.7% and substantially outperformed two baseline approaches used for comparison.
Keywords
Data Analysis, Data Quality, Biological Databases, Data Cleansing