
Parameter Efficient Reinforcement Learning from Human Feedback

Hakim Sidahmed, Samrat Phatale, Alex Hutcheson, Zhuonan Lin, Zhang Chen, Zac Yu, Jarvis Jin, Simral Chaudhary, Roman Komarytsia, Christiane Ahlheim, Yonghao Zhu, Bowen Li, Saravanan Ganesh, Bill Byrne, Jessica Hoffmann, Hassan Mansoor, Wei Li, Abhinav Rastogi, Lucas Dixon

CoRR (2024)

Abstract
While Reinforcement Learning from Human Feedback (RLHF) effectively aligns pretrained Large Language Models and Vision-Language Models (LLMs and VLMs) with human preferences, its computational cost and complexity hamper its wider adoption. To alleviate some of the computational burden of fine-tuning, parameter-efficient methods like LoRA were introduced. In this work, we empirically evaluate the setup of Parameter Efficient Reinforcement Learning from Human Feedback (PE-RLHF), which leverages LoRA fine-tuning for both reward modeling and reinforcement learning. We benchmark the PE-RLHF setup on six diverse datasets spanning summarization, harmless/helpful response generation, UI automation, and visual question answering, in terms of the effectiveness of the trained models and the training resources required. Our findings show, for the first time, that PE-RLHF achieves performance comparable to RLHF while significantly reducing training time (up to 90% faster for reward models, and 30% faster for RL) and memory footprint (up to 50% reduction for reward models, and 27% for RL). We provide detailed ablations across LoRA ranks and model sizes for both reward modeling and reinforcement learning. By mitigating the computational burden associated with RLHF, we push for a broader adoption of PE-RLHF as an alignment technique for LLMs and VLMs.
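To make the reward-modeling half of the setup concrete, below is a minimal sketch of LoRA-based reward-model fine-tuning of the kind the abstract describes. It is not the authors' code: it assumes the Hugging Face `transformers` and `peft` libraries, a small placeholder backbone (`gpt2`), and the standard Bradley-Terry pairwise preference loss; the paper's experiments use different, much larger models and infrastructure.

```python
# Illustrative sketch of parameter-efficient reward-model training:
# freeze the backbone, train only low-rank LoRA adapters, and optimize a
# pairwise preference loss over (chosen, rejected) response pairs.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_name = "gpt2"  # placeholder backbone, not one of the models used in the paper

tokenizer = AutoTokenizer.from_pretrained(base_name)
tokenizer.pad_token = tokenizer.eos_token

# Reward model = backbone + scalar head (num_labels=1).
model = AutoModelForSequenceClassification.from_pretrained(base_name, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# LoRA configuration: only the low-rank adapter weights are trainable.
lora_cfg = LoraConfig(
    r=8,                        # adapter rank (the paper ablates over ranks)
    lora_alpha=16,
    target_modules=["c_attn"],  # attention projection name for GPT-2; model-specific
    lora_dropout=0.05,
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights require grads

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def reward(texts):
    """Scalar reward for a batch of (prompt + response) strings."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    return model(**batch).logits.squeeze(-1)

# One illustrative preference pair; real training iterates over a preference dataset.
chosen = ["Question: ...\nAnswer: a helpful, harmless response."]
rejected = ["Question: ...\nAnswer: an unhelpful response."]

# Bradley-Terry pairwise loss: the chosen response should receive a higher reward.
loss = -F.logsigmoid(reward(chosen) - reward(rejected)).mean()
loss.backward()
optimizer.step()
```

The reinforcement-learning half of a PE-RLHF-style setup would apply the same idea to the policy: wrap the policy model with LoRA adapters so that only the adapter weights are updated during the RL phase, which is what drives the reported savings in training time and memory.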