FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback
arXiv (2024)
Abstract
Large Vision-Language Models (LVLMs) have demonstrated proficiency in
tackling a variety of vision-language tasks. However, current LVLMs suffer from
misalignment between the text and image modalities, which causes three kinds of
hallucination problems, i.e., object existence, object attribute, and object
relationship hallucinations. To tackle this issue, existing methods mainly
utilize Reinforcement Learning (RL) to align modalities in LVLMs. However, they
still suffer from three main limitations: (1) general feedback cannot indicate
the hallucination type contained in the response; (2) sparse rewards only give
a sequence-level reward for the whole response; and (3) annotation is
time-consuming and labor-intensive. To handle these limitations, we propose an
innovative method to align modalities in LVLMs through Fine-Grained Artificial
Intelligence Feedback (FGAIF), which mainly consists of three steps: AI-based
Feedback Collection, Fine-grained Reward Model Training, and Reinforcement
Learning with Fine-grained Reward. Specifically, we first utilize AI tools to
predict the type of hallucination for each segment in the response and obtain
a collection of fine-grained feedback. Then, based on the collected reward
data, three specialized reward models are trained to produce dense rewards.
Finally, a novel fine-grained feedback module is integrated into the Proximal
Policy Optimization (PPO) algorithm. Extensive experiments on hallucination
and general benchmarks demonstrate the superior performance of our proposed
method. Notably, compared with previous models trained with RL-based alignment
methods, our method is effective even with fewer parameters.
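To make the dense-reward idea concrete, the sketch below (Python/PyTorch) shows one way per-segment scores from three hallucination-specific reward models could be folded into a token-level reward vector for PPO. The function name, the weighted-sum combination, and placing each segment's reward on its final token are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def combine_fine_grained_rewards(
    segment_token_spans,   # list of (start, end) token indices, one per response segment
    existence_scores,      # per-segment scores from the object-existence reward model
    attribute_scores,      # per-segment scores from the object-attribute reward model
    relation_scores,       # per-segment scores from the object-relationship reward model
    num_tokens,            # total number of tokens in the response
    weights=(1.0, 1.0, 1.0),
):
    """Turn per-segment rewards into a dense token-level reward vector,
    instead of a single sequence-level scalar. Hypothetical sketch."""
    rewards = torch.zeros(num_tokens)
    w_e, w_a, w_r = weights
    for (start, end), r_e, r_a, r_r in zip(
        segment_token_spans, existence_scores, attribute_scores, relation_scores
    ):
        # Weighted sum of the three hallucination-type rewards for this segment
        # (uniform weights here; the true weighting scheme is an assumption).
        segment_reward = w_e * r_e + w_a * r_a + w_r * r_r
        # Credit the segment reward at the segment's final token, a common
        # convention for dense rewards in RLHF-style PPO.
        rewards[end - 1] = segment_reward
    return rewards

if __name__ == "__main__":
    # Toy example: a 12-token response split into three segments.
    spans = [(0, 4), (4, 9), (9, 12)]
    dense = combine_fine_grained_rewards(
        spans,
        existence_scores=[1.0, -1.0, 1.0],   # segment 2 hallucinates an object
        attribute_scores=[1.0, 1.0, -1.0],   # segment 3 gets an attribute wrong
        relation_scores=[1.0, 1.0, 1.0],
        num_tokens=12,
    )
    print(dense)
```

In this setup each segment receives its own penalty for the specific hallucination type it contains, which is what distinguishes the fine-grained signal from a single sparse reward over the whole response.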