Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries

arXiv (2020)

Abstract
Reference-based metrics such as ROUGE or BERTScore evaluate the content quality of a summary by comparing the summary to a reference. Ideally, this comparison should measure the summary's information quality by calculating how much information the two texts have in common. In this work, we analyze the token alignments used by ROUGE and BERTScore to compare summaries and argue that their scores largely cannot be interpreted as measuring information overlap; rather, they measure the extent to which the two summaries discuss the same topics. Further, we provide evidence that this result holds for many other summarization evaluation metrics. The consequence is that the summarization community has not yet found a reliable automatic metric that aligns with its research goal of generating summaries with high-quality information. We then propose a simple and interpretable method of evaluating summaries that directly measures information overlap, and demonstrate how it can be used to gain insights into model behavior that other methods alone could not provide.
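To make the topic-overlap versus information-overlap distinction concrete, the sketch below illustrates how a lexical-overlap metric behaves; it is not the paper's proposed method. It computes a ROUGE-1-style unigram F1 in plain Python, and the function name rouge1_f1 and the example sentences are hypothetical.

```python
from collections import Counter

def rouge1_f1(summary_tokens, reference_tokens):
    """ROUGE-1-style F1: clipped unigram overlap between a summary and a reference."""
    summary_counts = Counter(summary_tokens)
    reference_counts = Counter(reference_tokens)
    # Each summary token is credited at most as many times as it appears in the reference.
    overlap = sum(min(count, reference_counts[tok]) for tok, count in summary_counts.items())
    if overlap == 0:
        return 0.0
    precision = overlap / len(summary_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: same topic, different information.
reference = "the company reported record profits in the third quarter".split()
summary_ok = "the company reported record profits this quarter".split()       # consistent with reference
summary_bad = "the company reported record losses in the quarter".split()     # same topic, wrong fact

print(f"correct-information summary: {rouge1_f1(summary_ok, reference):.2f}")   # ~0.75
print(f"wrong-information summary:   {rouge1_f1(summary_bad, reference):.2f}")  # ~0.82
```

Under this purely lexical view, the factually wrong summary can score higher simply because it reuses more of the reference's surface tokens, which is the kind of gap between topic overlap and information overlap that the paper argues affects many reference-based metrics.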
Keywords
summarization evaluation metrics, summaries, information quality