Boosting drug named entity recognition using an aggregate classifier

Artificial Intelligence in Medicine(2015)

引用 55|浏览87
暂无评分
摘要
HighlightsWe investigate drug NER using limited or no manually annotated data.We propose an algorithm for combining methods based on annotations and dictionaries.We improved drug NER recall using suffix patterns evolved by genetic programming.We improved drug NER performance by aggregating heterogenous drug NER methods.We conclude that drug NER can be performed competitively without manual annotations. ObjectiveDrug named entity recognition (NER) is a critical step for complex biomedical NLP tasks such as the extraction of pharmacogenomic, pharmacodynamic and pharmacokinetic parameters. Large quantities of high quality training data are almost always a prerequisite for employing supervised machine-learning techniques to achieve high classification performance. However, the human labour needed to produce and maintain such resources is a significant limitation. In this study, we improve the performance of drug NER without relying exclusively on manual annotations. MethodsWe perform drug NER using either a small gold-standard corpus (120 abstracts) or no corpus at all. In our approach, we develop a voting system to combine a number of heterogeneous models, based on dictionary knowledge, gold-standard corpora and silver annotations, to enhance performance. To improve recall, we employed genetic programming to evolve 11 regular-expression patterns that capture common drug suffixes and used them as an extra means for recognition. MaterialsOur approach uses a dictionary of drug names, i.e. DrugBank, a small manually annotated corpus, i.e. the pharmacokinetic corpus, and a part of the UKPMC database, as raw biomedical text. Gold-standard and silver annotated data are used to train maximum entropy and multinomial logistic regression classifiers. ResultsAggregating drug NER methods, based on gold-standard annotations, dictionary knowledge and patterns, improved the performance on models trained on gold-standard annotations, only, achieving a maximum F-score of 95%. In addition, combining models trained on silver annotations, dictionary knowledge and patterns are shown to achieve comparable performance to models trained exclusively on gold-standard data. The main reason appears to be the morphological similarities shared among drug names. ConclusionWe conclude that gold-standard data are not a hard requirement for drug NER. Combining heterogeneous models build on dictionary knowledge can achieve similar or comparable classification performance with that of the best performing model trained on gold-standard annotations.
更多
查看译文
关键词
Named entity annotation sparsity,Gold-standard vs. silver-standard annotations,Named entity recogniser aggregation,Genetic-programming-evolved string-similarity patterns,Drug named entity recognition
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要