Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning
arXiv (2024)
Abstract
While Reinforcement Learning (RL) has been proven essential for tuning large
language models (LLMs), it can lead to reward over-optimization (ROO). Existing
approaches address ROO by adding KL regularization, requiring computationally
expensive hyperparameter tuning. Additionally, KL regularization focuses solely
on regularizing the language policy, neglecting a potential source of
regularization: the reward function itself. Inspired by demonstration-guided
RL, we introduce Reward Calibration from Demonstration (RCfD), which
leverages human demonstrations and a reward model to recalibrate the reward
objective. Formally, given a prompt, the RCfD objective minimizes the distance
between the demonstrations' and LLM's rewards rather than directly maximizing
the reward function. This objective shift avoids incentivizing the LLM to
exploit the reward model and promotes more natural and diverse language
generation. We demonstrate the effectiveness of RCfD on three language tasks,
on which it achieves performance comparable to carefully tuned baselines while
mitigating ROO.
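
To make the objective shift concrete, here is one plausible formalization of the RCfD objective, a sketch rather than the paper's exact definition: it assumes a learned reward model r_\phi, an LLM policy \pi_\theta, a prompt distribution \mathcal{D}, a paired human demonstration y^\star_x for each prompt x, and a squared distance (the paper's choice of distance and expectation structure may differ):

\mathcal{L}_{\mathrm{RCfD}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \left[ \left( r_\phi(x, y) - r_\phi(x, y^\star_x) \right)^2 \right]

Under this objective, the optimum is reached when the policy's reward matches the demonstrations' reward rather than at the reward model's maximum, so the policy has no incentive to push into poorly calibrated, high-reward regions of the reward model.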