A Literature Review and Framework for Human Evaluation of Generative Large Language Models in Healthcare
arXiv (2024)
Abstract
As generative artificial intelligence (AI), particularly Large Language
Models (LLMs), continues to permeate healthcare, it remains crucial to
supplement traditional automated evaluations with human expert evaluation.
Understanding and evaluating the generated texts is vital for ensuring safety,
reliability, and effectiveness. However, the cumbersome, time-consuming, and
non-standardized nature of human evaluation presents significant obstacles to
the widespread adoption of LLMs in practice. This study reviews existing
literature on human evaluation methodologies for LLMs within healthcare. We
highlight a notable need for a standardized and consistent human evaluation
approach. Our extensive literature search, adhering to the Preferred Reporting
Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, spans
publications from January 2018 to February 2024. This review provides a
comprehensive overview of the human evaluation approaches used in diverse
healthcare applications. This analysis examines the human evaluation of LLMs
across various medical specialties, addressing factors such as evaluation
dimensions, sample types and sizes, the selection and recruitment of
evaluators, frameworks and metrics, the evaluation process, and statistical
analysis of the results. Drawing from diverse evaluation strategies highlighted
in these studies, we propose a comprehensive and practical framework for human
evaluation of generative LLMs, named QUEST: Quality of Information,
Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and
Trust and Confidence. This framework aims to improve the reliability,
generalizability, and applicability of human evaluation of generative LLMs in
different healthcare applications by defining clear evaluation dimensions and
offering detailed guidelines.