RLHF and IIA: Perverse Incentives

Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy

CoRR (2023)


Abstract

Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA). The perverse incentives induced by IIA give rise to egregious behavior when innovating on query formats or learning algorithms.
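
To make the IIA assumption concrete, the following is a minimal sketch, assuming a multinomial logit (softmax) choice model, which is a standard example of a model that satisfies IIA. It is not taken from the paper; the function `choice_probs` and the reward scores are hypothetical and chosen only for illustration. Under such a model, the ratio of choice probabilities between two responses does not change when a third response is added to the query.

```python
import math


def choice_probs(scores):
    """Multinomial logit: P(i) = exp(s_i) / sum_j exp(s_j).

    `scores` maps each alternative to a scalar score (hypothetical values).
    """
    m = max(scores.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical reward scores for three candidate responses (illustrative only).
scores = {"a": 1.0, "b": 0.0, "c": 2.0}

# Choice between "a" and "b" alone, then with "c" added as a third option.
two_way = choice_probs({k: scores[k] for k in ("a", "b")})
three_way = choice_probs(scores)

# IIA: the odds of "a" versus "b" are unaffected by the presence of "c".
ratio_two = two_way["a"] / two_way["b"]
ratio_three = three_way["a"] / three_way["b"]
print(ratio_two, ratio_three)
```

In this sketch both ratios equal exp(1.0 - 0.0), regardless of the score assigned to "c". Human preferences need not behave this way, for example when "c" is a near-duplicate of "a".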
