More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
CoRR (2024)
Abstract
The surge in Large Language Model (LLM) development has led to improved
performance on cognitive tasks as well as an urgent need to align these models
with human values in order to safely exploit their power. Despite the
effectiveness of preference learning algorithms like Reinforcement Learning
From Human Feedback (RLHF) in aligning models with human preferences, their
assumed improvements to model trustworthiness have not been thoroughly
verified. To this end, this study investigates how models aligned with
general-purpose preference data on helpfulness and harmlessness perform across
five trustworthiness verticals: toxicity, stereotypical bias, machine ethics,
truthfulness, and privacy. For model alignment, we focus on three widely used
RLHF variants: Supervised Finetuning (SFT), Proximal Policy Optimization (PPO),
and Direct Preference Optimization (DPO). Through extensive empirical
investigation, we find that the improvement in trustworthiness from RLHF is
far from guaranteed, and that there exists a complex interplay between
preference data, alignment algorithms, and specific trustworthiness aspects.
Together, our results underscore the need for more nuanced approaches to model
alignment. By shedding light on the intricate dynamics of these components
within model alignment, we hope this research will guide the community toward
developing language models that are both capable and trustworthy.
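
As a point of reference for the alignment variants named above, a minimal sketch of the standard DPO objective (as introduced by Rafailov et al., 2023) is given below; the notation (policy $\pi_\theta$, frozen reference policy $\pi_{\mathrm{ref}}$, preferred and dispreferred responses $y_w, y_l$, temperature $\beta$) is assumed from the original DPO formulation and is not defined in this abstract.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

PPO, by contrast, optimizes the policy against a separately trained reward model with a KL penalty toward the reference policy, while SFT simply fine-tunes on the preferred responses; these are the three pipelines the abstract compares.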