A Comparative Study of Responses to Retina Questions from either Experts, Expert-Edited Large Language Models (LLMs) or LLMs Alone

Ophthalmology Science (2024)

Abstract
Objective: To assess the quality, empathy, and safety of expert-edited large language model (LLM), human expert-created, and LLM responses to common retina patient questions.

Design: Randomized, masked, multicenter study.

Participants: Twenty-one common retina patient questions were randomly assigned among 13 retina specialists. Each expert created a response (Expert) and then edited an LLM (ChatGPT-4)-generated response to that question (Expert+AI), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, Bard) also generated responses to each question. The original question, along with the anonymized and randomized Expert+AI, Expert, and LLM responses, was evaluated by the other experts who did not write an expert response to that question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content).

Main Outcome Measures: Mean quality and empathy scores, and the proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content for each response type.

Results: A total of 4008 grades were collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (p<0.001, p<0.001) between the LLM, Expert, and Expert+AI groups. For quality, Expert+AI (3.86 ± 0.85) performed best overall, while GPT-3.5 (3.75 ± 0.79) was the top-performing LLM. For empathy, GPT-3.5 (3.75 ± 0.69) had the highest mean score, followed by Expert+AI (3.73 ± 0.63). By mean score, Expert placed fourth of seven for quality and sixth of seven for empathy. For both quality (p<0.001) and empathy (p<0.001), expert-edited LLM responses outperformed expert-created responses. Editing an LLM response saved time compared with creating a response from scratch (p=0.02). ChatGPT-4 performed similarly to Expert for inappropriate content (p=0.35), missing content (p=0.001), extent of possible harm (p=0.356), and likelihood of possible harm (p=0.129).

Conclusions and Relevance: In this randomized, masked, multicenter study, LLM responses were comparable to expert responses in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings.
Keywords
artificial intelligence, chatbot, ChatGPT, large language model, retina, physician advice