LLaViLo: Boosting Video Moment Retrieval Via Adapter-Based Multimodal Modeling
2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2023
Abstract
Recent studies have explored the potential of large language models (LLMs) for understanding the semantic information in images. However, the use of LLMs to understand videos, which contain continuous contextual information, remains limited. In this paper, we propose LLaViLo (LLaMa-Video-Localizer), a video moment retrieval pipeline powered by a large language model. LLaViLo has two key features: 1) Instead of fine-tuning the entire LLM, we introduce and optimize only 1.7% additional parameters in adapter modules and keep the pre-trained LLM frozen, which enables efficient alignment of video and text. 2) A multi-objective optimization framework concurrently optimizes two objectives: a set prediction objective and a captioning objective. Jointly training these two objectives allows the proposed framework to produce high-quality time coordinates. Compared with other state-of-the-art methods, the proposed LLaViLo achieves significant performance improvements on the QVHighlights and Charades-STA datasets.
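To make the second feature more concrete, the following is a minimal sketch of how a set prediction term and a captioning term could be combined into one training loss. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss terms, so the span cost (L1 distance plus one minus temporal IoU), the brute-force matching, the normalized time coordinates, and the names `Span`, `span_cost`, `set_prediction_loss`, `joint_loss`, and `caption_weight` are all hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Sequence


@dataclass
class Span:
    """A video moment in normalized time, 0.0 <= start <= end <= 1.0 (assumed)."""
    start: float
    end: float


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection over union of two 1-D time spans."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def span_cost(pred: Span, target: Span) -> float:
    """Assumed per-pair cost: L1 distance on endpoints plus (1 - IoU)."""
    l1 = abs(pred.start - target.start) + abs(pred.end - target.end)
    return l1 + (1.0 - temporal_iou(pred, target))


def set_prediction_loss(preds: Sequence[Span], targets: Sequence[Span]) -> float:
    """Cheapest one-to-one assignment of targets to distinct predictions.

    Brute force over permutations; fine for a handful of spans only.
    Assumes len(preds) >= len(targets).
    """
    if not targets:
        return 0.0
    return min(
        sum(span_cost(preds[i], t) for i, t in zip(assignment, targets))
        for assignment in permutations(range(len(preds)), len(targets))
    )


def joint_loss(
    preds: Sequence[Span],
    targets: Sequence[Span],
    caption_nll: float,
    caption_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Weighted sum of the set prediction term and a captioning term.

    caption_nll stands in for the language model's negative log-likelihood
    on the caption tokens, computed elsewhere.
    """
    return set_prediction_loss(preds, targets) + caption_weight * caption_nll


if __name__ == "__main__":
    predictions = [Span(0.1, 0.3), Span(0.5, 0.9)]
    ground_truth = [Span(0.5, 0.8)]
    print(joint_loss(predictions, ground_truth, caption_nll=2.0, caption_weight=0.5))
```

In a setup like the one the abstract describes, gradients from such a combined loss would update only the adapter parameters while the LLM weights stay frozen; that part is omitted here.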