Math-Shepherd: Verify and Reinforce LLMs Step-by-step Without Human Annotations
Annual Meeting of the Association for Computational Linguistics (2024)
Abstract
In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) Reinforcement Learning: Math-Shepherd is employed to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates exceptional performance. For instance, the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9%→84.1% on GSM8K and 28.6%→33.0% on MATH). The accuracy can be further enhanced to 89.1% and 43.5% on GSM8K and MATH, respectively, with the verification of Math-Shepherd. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.
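The verification scenario can be illustrated with a minimal sketch of best-of-N reranking driven by per-step reward scores. The abstract does not say how step scores are combined into a solution score, so taking the minimum step score here is an assumption. The names `solution_score`, `best_of_n`, and `toy_score_steps` are hypothetical, and the toy scorer is a lookup table standing in for a trained process reward model.

```python
from typing import Callable, Dict, List, Sequence

Steps = List[str]  # one candidate solution, split into reasoning steps


def solution_score(step_scores: Sequence[float]) -> float:
    """Collapse per-step rewards into one score.

    Assumption: use the minimum, so a single weak step sinks the solution.
    """
    if not step_scores:
        raise ValueError("a solution needs at least one step")
    return min(step_scores)


def best_of_n(
    candidates: Sequence[Steps],
    score_steps: Callable[[Steps], List[float]],
) -> Steps:
    """Return the candidate whose aggregated step score is highest."""
    return max(candidates, key=lambda steps: solution_score(score_steps(steps)))


# Hypothetical stand-in for a process reward model: fixed scores per step.
TOY_SCORES: Dict[str, float] = {
    "2 + 3 = 5": 0.9,
    "5 * 4 = 20": 0.8,
    "2 + 3 = 6": 0.1,
    "6 * 4 = 24": 0.9,
}


def toy_score_steps(steps: Steps) -> List[float]:
    """Look up a reward for each step of a candidate solution."""
    return [TOY_SCORES[step] for step in steps]


candidates = [
    ["2 + 3 = 6", "6 * 4 = 24"],  # first step is wrong
    ["2 + 3 = 5", "5 * 4 = 20"],  # both steps are right
]
best = best_of_n(candidates, toy_score_steps)
print(best)
```

Under the minimum aggregation, the first candidate is penalized by its faulty first step even though its second step scores well on its own, which is the behavior step-level scoring is meant to capture.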