CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion
CoRR (2024)
Abstract
Despite impressive advancements, existing multimodal compositional reasoning approaches remain limited in flexibility and efficiency: they process a fixed set of modality inputs and require updating a large number of model parameters. This
paper tackles these critical challenges and proposes CREMA, an efficient and
modular modality-fusion framework for injecting any new modality into video
reasoning. We first augment multiple informative modalities (such as optical
flow, 3D point cloud, audio) from given videos without extra human annotation
by leveraging existing pre-trained models. Next, we introduce a query
transformer with multiple parameter-efficient modules associated with each
accessible modality. It projects diverse modality features to the LLM token
embedding space, allowing the model to integrate different data types for
response generation. Furthermore, we propose a fusion module designed to
compress multimodal queries, maintaining computational efficiency in the LLM
while combining additional modalities. We validate our method on video-3D,
video-audio, and video-language reasoning tasks and achieve better/equivalent
performance against strong multimodal LLMs, including BLIP-2, 3D-LLM, and
SeViLA, while using 96% fewer trainable parameters. We provide extensive analyses of CREMA, including the impact of each modality on reasoning domains,
the design of the fusion module, and example visualizations.