Improving Information Extraction from Pathology Reports using Named Entity Recognition.

Ken G Zeng, Tarun Dutt, Jan Witowski, Kranthi Kiran Gv, Frank Yeung, Michelle Kim, Jesi Kim, Mitchell Pleasure, Christopher Moczulski, L Julian Lechuga Lopez, Hao Zhang, Mariam Al Harbi, Farah E Shamout, Vincent J Major, Laura Heacock, Linda Moy, Freya Schnabel, Linda M Pak, Yiqiu Shen, Krzysztof J Geras

Research Square (2023)

Abstract
Pathology reports are considered the gold standard in medical research due to their comprehensive and accurate diagnostic information. Natural language processing (NLP) techniques have been developed to automate information extraction from pathology reports. However, existing studies suffer from two significant limitations. First, they typically frame their tasks as report classification, which restricts the granularity of extracted information. Second, they often fail to generalize to unseen reports due to variations in language, negation, and human error. To overcome these challenges, we propose a BERT (bidirectional encoder representations from transformers) named entity recognition (NER) system to extract key diagnostic elements from pathology reports. We also introduce four data augmentation methods to improve the robustness of our model. We train and evaluate on 1438 annotated breast pathology reports acquired from a large medical center in the United States. On an internal test set, our BERT model trained with data augmentation achieves an entity F1-score of 0.916, surpassing the BERT baseline (0.843). We further assess the model's generalizability using an external validation dataset from the United Arab Emirates, where our model maintains satisfactory performance (F1-score 0.860). Our findings demonstrate that our NER system can effectively extract fine-grained information from widely diverse medical reports, offering the potential for large-scale information extraction in a wide range of medical and AI research. We publish our code at https://github.com/nyukat/pathology_extraction.
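The abstract reports entity-level F1-scores but does not spell out how they are computed. As a minimal sketch, assuming BIO-style tags and exact-match span scoring (both assumptions, not confirmed by the abstract), an entity F1 could be computed as below. The labels `DX` and `SITE` and the function names are hypothetical and are not taken from the authors' repository.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence (assumed tagging scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        # An open span continues only on an I- tag carrying the same label.
        if label is not None and tag == f"I-{label}":
            continue
        if label is not None:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag is leniently treated as opening a new span.
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 where a prediction counts only on an exact span and type match."""
    tp = fp = fn = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: DX (diagnosis) and SITE (anatomical site).
gold = [["B-DX", "I-DX", "O", "B-SITE"]]
pred = [["B-DX", "I-DX", "O", "O"]]
score = entity_f1(gold, pred)  # one of two gold entities found, no false positives
```

Scoring whole spans rather than individual tokens penalizes partially extracted entities, which matters when the goal is fine-grained extraction rather than report-level classification.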
Keywords
information extraction, entity recognition, pathology reports