WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models
International Conference on Computational Linguistics (2024)
Abstract
The awareness of multi-cultural human values is critical to the ability of
language models (LMs) to generate safe and personalized responses. However,
this awareness has been insufficiently studied, because the computer
science community lacks access to large-scale real-world data on
multi-cultural values. In this paper, we present WorldValuesBench, a globally
diverse, large-scale benchmark dataset for the multi-cultural value prediction
task, which requires a model to generate a rating response to a value question
based on demographic contexts. Our dataset is derived from an influential
social science project, World Values Survey (WVS), that has collected answers
to hundreds of value questions (e.g., social, economic, ethical) from 94,728
participants worldwide. We have constructed more than 20 million examples of
the type "(demographic attributes, value question) → answer" from
the WVS responses. We perform a case study using our dataset and show that the
task is challenging for strong open- and closed-source models. Alpaca-7B,
Vicuna-7B-v1.5, Mixtral-8x7B-Instruct-v0.1, and GPT-3.5 Turbo achieve a
Wasserstein 1-distance below 0.2 from the human normalized answer
distributions on only 11.1%, 25.0%, 72.2%, and 75.0% of the questions,
respectively. WorldValuesBench opens up new research avenues in studying
limitations and opportunities in multi-cultural value awareness of LMs.
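
The evaluation criterion mentioned above (a Wasserstein 1-distance below 0.2 between model and human answer distributions) can be illustrated with a minimal sketch. The abstract does not spell out the normalization, so the code assumes ratings on an integer scale that are min-max rescaled to [0, 1]; the function names, the 1-to-4 scale, and the toy ratings are hypothetical and not taken from the paper or from WVS.

```python
from collections import Counter


def normalize_ratings(ratings, low, high):
    """Min-max rescale ratings from the scale [low, high] to [0, 1].

    Assumption: this is one plausible reading of "normalized answers";
    the paper's exact procedure may differ.
    """
    return [(r - low) / (high - low) for r in ratings]


def wasserstein_1(samples_a, samples_b):
    """Wasserstein 1-distance between two empirical 1-D distributions.

    Computed as the integral of |CDF_a - CDF_b| over the merged support.
    """
    support = sorted(set(samples_a) | set(samples_b))
    count_a, count_b = Counter(samples_a), Counter(samples_b)
    cdf_a = cdf_b = 0.0
    distance = 0.0
    for left, right in zip(support, support[1:]):
        cdf_a += count_a[left] / len(samples_a)
        cdf_b += count_b[left] / len(samples_b)
        distance += abs(cdf_a - cdf_b) * (right - left)
    return distance


# Toy data (hypothetical): answers to one question on a 1-to-4 scale.
human = normalize_ratings([1, 2, 2, 3, 4], low=1, high=4)
model = normalize_ratings([2, 2, 2, 2, 2], low=1, high=4)

d = wasserstein_1(human, model)
print("W1 distance:", d)
print("within 0.2 threshold:", d < 0.2)
```

In this toy case the model always answers 2 while humans spread across the scale, so the distance works out to 4/15 and the question would not count toward the reported percentages. In practice, `scipy.stats.wasserstein_distance` computes the same quantity for 1-D samples.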