VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools
arXiv (2023)
Abstract
Building models that comprehend videos and respond to specific user
instructions is a practical and challenging topic, as it requires mastery of
both vision understanding and knowledge reasoning. Compared to language and
image modalities, training efficiency remains a serious problem as existing
studies train models on massive sparse videos paired with brief descriptions.
In this paper, we introduce VidCoM, a fast adaptive framework that
leverages Large Language Models (LLMs) to reason about videos using lightweight
visual tools. Specifically, we reveal that the key to responding to specific
instructions is focusing on relevant video events, and utilize two visual
tools, structured scene graph generation and descriptive image caption
generation, to gather and represent the event information. An LLM enriched
with world knowledge then serves as the reasoning agent, producing responses
by performing multiple reasoning steps on specific video events. To address the
difficulty of LLMs identifying video events, we further propose an
Instruction-oriented Video Events Recognition (InsOVER) algorithm. This
algorithm locates the corresponding video events based on an efficient
Hungarian matching between decompositions of linguistic instructions and video
events, thereby enabling LLMs to interact effectively with extended videos.
Extensive experiments on two typical video comprehension tasks show that the
proposed tuning-free framework outperforms pre-trained models, including
Flamingo-80B, and achieves state-of-the-art performance. Our source code and
system will be publicly available.
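
The matching step behind InsOVER can be illustrated with a small sketch. The abstract does not say how instruction decompositions or video events are represented, so the code below assumes both are plain embedding vectors compared by cosine similarity; the function names `hungarian` and `match_instructions_to_events` are hypothetical and are not taken from the authors' code. It shows only the general idea of a one-to-one Hungarian assignment between sub-instructions and candidate events.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list where entry i is the column assigned to row i.
    Standard O(n^2 * m) potentials-based Hungarian algorithm.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (1-based)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip matches along the augmenting path back to the virtual column 0.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return assignment

def match_instructions_to_events(instr_vecs, event_vecs):
    """Pair each sub-instruction with one distinct event (assumed embeddings)."""
    cost = [[1.0 - cosine(a, b) for b in event_vecs] for a in instr_vecs]
    return list(enumerate(hungarian(cost)))

if __name__ == "__main__":
    # Toy vectors standing in for sub-instruction and event embeddings.
    instructions = [[1.0, 0.0], [0.0, 1.0]]
    events = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    print(match_instructions_to_events(instructions, events))
```

The sketch requires at least as many candidate events as sub-instructions; how the actual system handles the opposite case, or derives the features, is not stated in the abstract.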