Decoding-time Realignment of Language Models
CoRR (2024)
Abstract
Aligning language models with human preferences is crucial for reducing
errors and biases in these models. Alignment techniques, such as reinforcement
learning from human feedback (RLHF), are typically cast as optimizing a
tradeoff between human preference rewards and a proximity regularization term
that encourages staying close to the unaligned model. Selecting an appropriate
level of regularization is critical: insufficient regularization can lead to
reduced model capabilities due to reward hacking, whereas excessive
regularization hinders alignment. Traditional methods for finding the optimal
regularization level require retraining multiple models with varying
regularization strengths. This process, however, is resource-intensive,
especially for large models. To address this challenge, we propose
decoding-time realignment (DeRa), a simple method to explore and evaluate
different regularization strengths in aligned models without retraining. DeRa
enables control over the degree of alignment, allowing users to smoothly
transition between unaligned and aligned models. It also enhances the
efficiency of hyperparameter tuning by enabling the identification of effective
regularization strengths using a validation dataset.
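The abstract does not spell out the mechanism, but its description of smoothly transitioning between the unaligned and aligned models suggests a per-token interpolation at decoding time. Below is a minimal sketch under that assumption: the next-token logits of the aligned model and the unaligned (reference) model are mixed linearly with a weight lambda, which amounts to a geometric mixture of the two token distributions. The function names (`dera_logits`, `sample_token`) and the toy logits are illustrative, not the paper's API.

```python
import numpy as np

def dera_logits(logits_ref, logits_aligned, lam):
    """Mix next-token logits of the reference (unaligned) and aligned models.

    Linear interpolation of logits corresponds to a geometric mixture of
    the two token distributions: lam = 0 recovers the reference model,
    lam = 1 the aligned model, and intermediate values interpolate.
    """
    return (1.0 - lam) * logits_ref + lam * logits_aligned

def sample_token(logits, rng):
    """Sample a token index from the softmax of the given logits."""
    probs = np.exp(logits - logits.max())  # subtract max for stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy example: a 5-token vocabulary with made-up logits for illustration.
rng = np.random.default_rng(0)
logits_ref = np.array([2.0, 1.0, 0.5, -1.0, 0.0])
logits_aligned = np.array([0.5, 2.5, 1.0, -2.0, 0.3])
for lam in (0.0, 0.5, 1.0):
    tok = sample_token(dera_logits(logits_ref, logits_aligned, lam), rng)
    print(f"lambda={lam}: sampled token {tok}")
```

In the paper's framing, the mixing weight is tied to the effective regularization strength: roughly, decoding with weight lambda emulates a model trained with regularization beta/lambda. This is what makes it possible to sweep many regularization strengths on a validation set using a single trained model, rather than retraining once per strength.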