Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study
CoRR (2024)
Abstract
Despite the impressive capabilities of Multimodal Large Language Models
(MLLMs) in integrating text and image modalities, challenges remain in
accurately interpreting detailed visual elements. This paper presents an
empirical study on enhancing MLLMs with state-of-the-art (SOTA) object
detection and optical character recognition (OCR) models to improve fine-grained
image understanding and reduce hallucination in responses. Our research
investigates the embedding-based infusion of detection information, the impact
of such infusion on the MLLMs' original abilities, and the interchangeability
of detection models. We conduct systematic experiments with models such as
LLaVA-1.5, DINO, and PaddleOCRv2, revealing that our approach not only refines
MLLMs' performance in specific visual tasks but also maintains their original
strengths. The resulting enhanced MLLMs outperform SOTA models on 9 out of 10
benchmarks, achieving an improvement of up to 12.99% on the normalized average
score, marking a notable advancement in multimodal understanding. We release
our code to facilitate further exploration into the fine-grained multimodal
dialogue capabilities of MLLMs.