PubMed Phrases, an open set of coherent phrases for searching biomedical literature

SCIENTIFIC DATA(2018)

引用 18|浏览35
暂无评分
摘要
In biomedicine, key concepts are often expressed by multiple words (e.g., ‘zinc finger protein’). Previous work has shown treating a sequence of words as a meaningful unit, where applicable, is not only important for human understanding but also beneficial for automatic information seeking. Here we present a collection of PubMed ® Phrases that are beneficial for information retrieval and human comprehension. We define these phrases as coherent chunks that are logically connected. To collect the phrase set, we apply the hypergeometric test to detect segments of consecutive terms that are likely to appear together in PubMed. These text segments are then filtered using the BM25 ranking function to ensure that they are beneficial from an information retrieval perspective. Thus, we obtain a set of 705,915 PubMed Phrases . We evaluate the quality of the set by investigating PubMed user click data and manually annotating a sample of 500 randomly selected noun phrases. We also analyze and discuss the usage of these PubMed Phrases in literature search.
更多
查看译文
关键词
Data mining,Literature mining,Publication characteristics,Science,Humanities and Social Sciences,multidisciplinary
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要