Understanding Long Videos in One Multimodal Language Model Pass
arXiv (2024)
Abstract
Large Language Models (LLMs), known to encode strong world knowledge, have
allowed recent approaches to achieve excellent performance on Long-Video
Understanding benchmarks, but at high inference costs. In this work, we first
propose Likelihood Selection, a simple technique that unlocks faster inference
in autoregressive LLMs for the multiple-choice tasks common in long-video
benchmarks. Beyond faster inference, we find that the resulting models yield
surprisingly good accuracy on long-video tasks, even with no video-specific
information. Building on this, we inject video-specific object-centric
information extracted from off-the-shelf pre-trained models and utilize natural
language as a medium for information fusion. Our resulting Multimodal Video
Understanding (MVU) framework demonstrates state-of-the-art performance across
long-video and fine-grained action recognition benchmarks. Code available at:
https://github.com/kahnchana/mvu
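The Likelihood Selection idea mentioned in the abstract can be illustrated with a short sketch: instead of autoregressively generating an answer token by token, each multiple-choice option is scored by its log-likelihood under the LLM in a single teacher-forced forward pass, and the highest-scoring option is selected. The snippet below is a minimal illustration of that general idea under stated assumptions, not the paper's actual implementation; the backbone (`gpt2`), the prompt format, and the `likelihood_selection` helper are placeholders chosen for the example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder backbone for illustration; the MVU framework's actual LLM may differ.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def likelihood_selection(prompt: str, candidates: list[str]) -> int:
    """Score each candidate answer by its summed log-likelihood under the LM,
    conditioned on the prompt, and return the index of the best candidate.
    Each candidate costs one forward pass (no token-by-token decoding)."""
    scores = []
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for cand in candidates:
        full_ids = tokenizer(prompt + cand, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits  # shape: (1, seq_len, vocab)
        # Log-probabilities assigned to each next token under teacher forcing.
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        targets = full_ids[:, 1:]
        token_lls = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        # Sum log-likelihood over the candidate tokens only (approximate:
        # tokenization at the prompt/candidate boundary may shift slightly).
        cand_start = prompt_ids.shape[1] - 1
        scores.append(token_lls[0, cand_start:].sum().item())
    return int(torch.tensor(scores).argmax())


# Toy usage with a hypothetical long-video question and answer options.
question = "Q: What is the person doing in the video?\nA: "
options = ["cooking a meal", "playing guitar", "reading a book"]
print(options[likelihood_selection(question, options)])
```

In a hypothetical multimodal setup, the prompt would additionally carry video-derived information expressed in natural language (e.g. object-centric descriptions from off-the-shelf models), with the same candidate-scoring step left unchanged.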