Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis
CoRR (2024)
Abstract
Large language models (LLMs) are a rapidly evolving, state-of-the-art
artificial intelligence technology that promises to aid in medical
diagnosis. However, the correctness and accuracy of their outputs
have not yet been properly evaluated. In this work, we propose an LLM
evaluation paradigm that incorporates two independent steps of a novel
methodology, namely (1) multimodal LLM evaluation via structured interactions
and (2) follow-up, domain-specific analysis based on data extracted via the
previous interactions. Using this paradigm, (1) we evaluate the correctness and
accuracy of LLM-generated medical diagnosis with publicly available multimodal
multiple-choice questions (MCQs) in the domain of Pathology and (2) proceed to a
systematic and comprehensive analysis of the extracted results. We used
GPT-4-Vision-Preview as the LLM to respond to complex medical questions
consisting of both images and text, and we explored a wide range of diseases,
conditions, chemical compounds, and related entity types that are included in
the vast knowledge domain of Pathology. GPT-4-Vision-Preview performed quite
well, achieving approximately 84% correct diagnoses. Next, we further
analyzed the findings of our work, following an analytical approach which
included Image Metadata Analysis, Named Entity Recognition and Knowledge
Graphs. Weaknesses of GPT-4-Vision-Preview were revealed on specific knowledge
paths, leading to a further understanding of its shortcomings in specific
areas. Our methodology and findings are not limited to
GPT-4-Vision-Preview; a similar approach can be followed to evaluate the
usefulness and accuracy of other LLMs and thus improve their use through further
optimization.
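
The first step of the paradigm, scoring MCQ answers and then breaking the results down by entity type, could be sketched roughly as follows. This is only an illustrative sketch, not the authors' code: the record fields (`predicted`, `answer`, `entities`), the function name `score_mcq`, and the toy records are all assumptions made for the example, and the numbers carry no relation to the reported results.

```python
from collections import defaultdict


def score_mcq(records):
    """Return (overall accuracy, per-entity accuracy) for graded MCQ answers.

    Each record is a dict with hypothetical keys:
      predicted - option letter returned by the model
      answer    - option letter from the answer key
      entities  - entity labels attached to the question (e.g. from NER)
    """
    total = 0
    correct = 0
    per_entity = defaultdict(lambda: [0, 0])  # entity -> [correct, total]

    for rec in records:
        # Compare option letters case-insensitively, ignoring stray whitespace.
        hit = rec["predicted"].strip().upper() == rec["answer"].strip().upper()
        total += 1
        correct += int(hit)
        for entity in rec["entities"]:
            per_entity[entity][0] += int(hit)
            per_entity[entity][1] += 1

    overall = correct / total if total else 0.0
    by_entity = {e: c / n for e, (c, n) in per_entity.items()}
    return overall, by_entity


# Toy, made-up records used purely for illustration.
records = [
    {"predicted": "B", "answer": "B", "entities": ["carcinoma"]},
    {"predicted": "a", "answer": "A", "entities": ["carcinoma", "amyloid"]},
    {"predicted": "C", "answer": "D", "entities": ["amyloid"]},
]

overall, by_entity = score_mcq(records)
print(f"overall accuracy: {overall:.2f}")
for entity, acc in sorted(by_entity.items()):
    print(f"{entity}: {acc:.2f}")
```

Grouping accuracy by entity in this way is one simple route to spotting the kind of weak knowledge paths the abstract mentions; a knowledge-graph analysis would go further by also relating the entities to one another.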