NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism
CoRR(2024)
Abstract
We present NewsBench, a novel evaluation framework for systematically assessing
the editorial capabilities of Large Language Models (LLMs) in Chinese
journalism. The benchmark dataset covers four facets of writing proficiency
and six facets of safety adherence, and comprises 1,267 manually and carefully
designed test samples, in the form of multiple-choice questions and
short-answer questions, for five editorial tasks in 24 news domains. To
measure performance, we propose different GPT-4-based automatic evaluation
protocols that assess LLM generations for short-answer questions in terms of
writing proficiency and safety adherence; both are validated by high
correlations with human evaluations. Based on the systematic
evaluation framework, we conduct a comprehensive analysis of ten popular LLMs
that can handle Chinese. The experimental results highlight GPT-4 and ERNIE
Bot as top performers, yet reveal a relative deficiency in journalistic safety
adherence in creative writing tasks. Our findings also underscore the need for
enhanced ethical guidance in machine-generated journalistic content, marking a
step forward in aligning LLMs with journalistic standards and safety
considerations.