ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
CoRR (2024)

Abstract
Research on Large Language Models (LLMs) has recently witnessed an increasing
interest in extending models' context size to better capture dependencies
within long documents. While benchmarks have been proposed to assess long-range
abilities, existing efforts have primarily considered generic tasks that are
not necessarily aligned with real-world applications. In contrast, our work
proposes a new benchmark for long-context LLMs focused on a practical meeting
assistant scenario. In this scenario, the long contexts consist of transcripts
obtained by automatic speech recognition, presenting unique challenges for LLMs
due to the inherent noisiness and oral nature of such data. Our benchmark,
named ELITR-Bench, augments the existing ELITR corpus's transcripts with 271
manually crafted questions and their ground-truth answers. Our experiments with
recent long-context LLMs on ELITR-Bench highlight a gap between open-source and
proprietary models, especially when questions are asked sequentially within a
conversation. We also provide a thorough analysis of our GPT-4-based evaluation
method, encompassing insights from a crowdsourcing study. Our findings suggest
that while GPT-4's evaluation scores are correlated with those of human
judges, its ability to differentiate among more than three score levels may be
limited.