Aligning Crowd Feedback via Distributional Preference Reward Modeling
CoRR (2024)
Abstract
Deep reinforcement learning is widely used for aligning Large Language Models (LLMs) with human preferences. However, conventional reward modelling has predominantly depended on human annotations provided by a select cohort of individuals. Such dependence may unintentionally result in models that are skewed to reflect the inclinations of these annotators, thereby failing to adequately represent the expectations of the wider population. In this paper, we introduce the Distributional Preference Reward Model (DPRM), a simple yet effective framework for aligning large language models with a diverse set of human preferences. To this end, we characterize the preferences by a beta distribution, which can dynamically adapt to fluctuations in preference trends. On top of that, we design an optimal-transport-based loss to calibrate DPRM to align with the preference distribution. Finally, the expected reward is utilized to fine-tune an LLM policy to generate responses favoured by the population. Our experiments show that DPRM significantly enhances the alignment of LLMs with population preferences, yielding more accurate, unbiased, and contextually appropriate responses.
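The following is a minimal sketch, not the authors' implementation, of the three ingredients named in the abstract: crowd preferences modelled as a beta distribution, an optimal-transport (here, 1-D Wasserstein) loss between the model's predicted preference distribution and the crowd's, and an expected reward used as the scalar signal for policy fine-tuning. The discretization into K preference levels, the Beta parameters, and the specific Wasserstein-1 formulation are all assumptions for illustration; the paper's actual loss and distribution handling may differ.

```python
# Sketch (assumed details): beta-distributed crowd preferences over a [0, 1]
# preference score, an OT-style calibration loss, and an expected reward.
import numpy as np
from scipy.stats import beta, wasserstein_distance

# Discretize the preference score into K levels (an assumed design choice).
K = 5
levels = np.linspace(0.0, 1.0, K)

def crowd_distribution(a: float, b: float) -> np.ndarray:
    """Crowd preference over the K levels, derived from a Beta(a, b) density."""
    probs = beta.pdf(levels, a, b)
    probs = np.clip(probs, 1e-8, None)   # avoid zero mass at the endpoints
    return probs / probs.sum()

def ot_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """1-D optimal-transport (Wasserstein-1) distance between two
    distributions supported on the same preference levels."""
    return wasserstein_distance(levels, levels, u_weights=pred, v_weights=target)

def expected_reward(pred: np.ndarray) -> float:
    """Scalar reward for RL fine-tuning: expectation of the preference level
    under the predicted distribution."""
    return float(np.dot(pred, levels))

# Example: annotators mostly favour the response (Beta skewed towards 1),
# while the reward model's current prediction is too pessimistic.
target = crowd_distribution(a=8.0, b=2.0)
pred = np.array([0.30, 0.25, 0.20, 0.15, 0.10])

print("OT calibration loss:", round(ot_loss(pred, target), 4))
print("Expected reward:", round(expected_reward(pred), 4))
```

In this reading, the OT loss calibrates the reward model toward the population-level distribution rather than a single annotator's label, and the expectation collapses that distribution into the scalar reward a policy-gradient method needs.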