LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding
CoRR (2024)
Abstract
Despite progress in video-language modeling, the computational challenge of
interpreting long-form videos in response to task-specific linguistic queries
persists, largely due to the complexity of high-dimensional video data and the
misalignment between language and visual cues over space and time. To tackle
this issue, we introduce a novel approach called Language-guided
Spatial-Temporal Prompt Learning (LSTP). This approach features two key
components: a Temporal Prompt Sampler (TPS) with optical flow prior that
leverages temporal information to efficiently extract relevant video content,
and a Spatial Prompt Solver (SPS) that adeptly captures the intricate spatial
relationships between visual and textual elements. By harmonizing TPS and SPS
with a cohesive training strategy, our framework significantly enhances
computational efficiency, temporal understanding, and spatial-temporal
alignment. Empirical evaluations on two challenging tasks, video question
answering and temporal question grounding in videos, using a variety of
video-language pretraining models (VLPs) and large language models (LLMs),
demonstrate the superior performance, speed, and versatility of our proposed
LSTP paradigm.