Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
CoRR (2024)
Abstract
Current AI alignment methodologies rely on human-provided demonstrations or
judgments, so the learned capabilities of AI systems are upper-bounded by
human capabilities. This raises a challenging research question: how can we
keep improving AI systems once their capabilities surpass those of humans?
This paper answers this question in the context of tackling hard reasoning
tasks (e.g., level 4-5 MATH problems) by learning from human annotations on
easier tasks (e.g., level 1-3 MATH problems), which we term easy-to-hard
generalization. Our key insight is that an evaluator (reward model) trained
on supervision for easier tasks can be effectively used to score candidate
solutions to harder tasks, thereby facilitating easy-to-hard generalization
across task difficulty levels. Based on this insight, we propose a novel
approach to scalable alignment that first trains process-supervised reward
models on easy problems (e.g., level 1-3) and then uses them to evaluate the
performance of policy models on hard problems. We show that such easy-to-hard
generalization in evaluators can enable easy-to-hard generalization in
generators, either through re-ranking or reinforcement learning (RL). Notably,
our process-supervised 7B RL model achieves an accuracy of 34.0% on MATH500
despite only using human supervision on easy problems. Our approach suggests a
promising path toward AI systems that advance beyond the frontier of human
supervision.
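The following is a minimal sketch of the re-ranking idea described in the abstract: a process-supervised reward model trained only on easier problems scores each reasoning step of candidate solutions to a harder problem, and the highest-scoring candidate is selected. The function names (`generate_candidates`, `prm_score_steps`) and the min-aggregation of step scores are illustrative assumptions, not the authors' released implementation.

```python
from typing import Callable, List


def rerank_with_prm(
    problem: str,
    generate_candidates: Callable[[str, int], List[List[str]]],
    prm_score_steps: Callable[[str, List[str]], List[float]],
    n: int = 16,
) -> List[str]:
    """Pick the candidate solution whose weakest reasoning step scores highest.

    Assumptions (hypothetical interfaces, for illustration only):
    - generate_candidates(problem, n) returns n candidate solutions from the
      policy model, each as a list of reasoning steps.
    - prm_score_steps(problem, steps) returns one score per step from a
      process-supervised reward model trained on easier (level 1-3) problems.
    """
    candidates = generate_candidates(problem, n)
    # Aggregate per-step scores with min(), so one bad step sinks a candidate;
    # product or mean are common alternatives.
    scored = [(min(prm_score_steps(problem, steps)), steps) for steps in candidates]
    _, best_steps = max(scored, key=lambda pair: pair[0])
    return best_steps
```

The same evaluator could instead provide the reward signal for RL fine-tuning of the generator, which is the second route to easy-to-hard generalization mentioned in the abstract.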