DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models
CoRR (2024)
Abstract
Although large language models (LLMs) have achieved significant success in recent
years, the hallucination issue remains a challenge, and numerous benchmarks have
been proposed to detect hallucination. Nevertheless, some of these benchmarks
are not naturally generated by LLMs but are intentionally induced. Also, many
focus solely on factuality hallucination while ignoring faithfulness
hallucination. Additionally, although the dialogue pattern is more widely used
in the era of LLMs, current benchmarks concentrate only on sentence-level and
passage-level hallucination. In this study, we propose DiaHalu, the first
dialogue-level hallucination evaluation benchmark to our knowledge. Initially,
we integrate the collected topics into system prompts and facilitate a dialogue
between two ChatGPT3.5 instances. Subsequently, we manually modify the contents
that do not adhere to human language conventions and then have the LLMs
re-generate, simulating authentic human-machine interaction scenarios. Finally,
professional scholars annotate all samples in the dataset. DiaHalu covers four
common multi-turn dialogue domains and five hallucination subtypes, extended from
factuality and faithfulness hallucination. Experiments with several well-known
LLMs and detection methods on the dataset show that DiaHalu is a challenging
benchmark, holding significant value for further research.
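The dialogue-construction step described in the abstract (two topic-seeded ChatGPT3.5 instances exchanging turns) can be sketched roughly as follows. This is an illustrative assumption, not the authors' code: `stub_llm`, `build_dialogue`, and the prompt wording are hypothetical placeholders, with the stub standing in for a real chat-completion API call.

```python
# Hypothetical sketch of DiaHalu-style dialogue generation: two "agents"
# share a topic via their system prompts and take alternating turns.
# stub_llm is a placeholder for a real LLM API call (e.g. ChatGPT3.5).

def stub_llm(system_prompt: str, history: list) -> str:
    # Placeholder: a real implementation would call a chat-completion API
    # with the system prompt and the dialogue history so far.
    return f"turn-{len(history)} reply on: {system_prompt}"

def build_dialogue(topic: str, turns: int = 4) -> list:
    """Alternate two topic-seeded agents for `turns` utterances."""
    system_prompts = [
        f"You are speaker A discussing: {topic}",  # agent A's seed
        f"You are speaker B discussing: {topic}",  # agent B's seed
    ]
    history = []
    for t in range(turns):
        speaker = t % 2  # alternate between A (0) and B (1)
        utterance = stub_llm(system_prompts[speaker], history)
        history.append((("A", "B")[speaker], utterance))
    return history

dialogue = build_dialogue("knowledge-grounded QA", turns=4)
```

In the benchmark itself, the raw machine-generated turns are then manually revised where they violate human language conventions, re-generated, and finally annotated by scholars; those later stages are not sketched here.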