Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering
arXiv (2024)
Abstract
Compositional spatio-temporal reasoning poses a significant challenge in the
field of video question answering (VideoQA). Existing approaches struggle to
establish effective symbolic reasoning structures, which are crucial for
answering compositional spatio-temporal questions. To address this challenge,
we propose a neural-symbolic framework called Neural-Symbolic VideoQA
(NS-VideoQA), specifically designed for real-world VideoQA tasks. The
novelty of NS-VideoQA is two-fold: 1) A Scene Parser Network (SPN) transforms
static-dynamic video scenes into a Symbolic Representation (SR) that structures
persons, objects, relations, and action chronologies. 2) A Symbolic Reasoning
Machine (SRM) performs top-down question decomposition and bottom-up
compositional reasoning. Specifically, a
polymorphic program executor is constructed for internally consistent reasoning
from SR to the final answer. As a result, NS-VideoQA not only improves
compositional spatio-temporal reasoning in real-world VideoQA tasks, but also
enables step-by-step error analysis by tracing the intermediate results.
Experimental evaluations on the AGQA Decomp benchmark demonstrate the
effectiveness of the proposed NS-VideoQA framework. Empirical studies further
confirm that NS-VideoQA exhibits internal consistency in answering
compositional questions and significantly improves the capability of
spatio-temporal and logical inference for VideoQA tasks.
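
To make the pipeline described in the abstract concrete, the following is a minimal Python sketch of the general idea only, not the authors' implementation. It assumes a hand-written symbolic scene (standing in for what a scene parser might output) and a tiny nested-tuple program format. All names here (`SymbolicScene`, `Relation`, `Action`, `execute`, and the operator names) are hypothetical and do not come from the paper. The `and` operator dispatches on operand type, which is one simple reading of "polymorphic", and the optional `trace` list records intermediate results for step-by-step error analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Relation:
    """A per-frame (static) fact, e.g. person holding cup at frame 2."""
    frame: int
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Action:
    """A dynamic fact: an action with an inclusive frame interval."""
    name: str
    start: int
    end: int


@dataclass
class SymbolicScene:
    """Hypothetical stand-in for a symbolic representation of a video."""
    relations: List[Relation]
    actions: List[Action]


def execute(program, scene, trace=None):
    """Evaluate a nested-tuple program bottom-up against the scene.

    Sub-programs are evaluated first; each step is appended to `trace`
    (if given) as an (operator, result) pair.
    """
    op, *args = program
    if op == "frames_of_relation":
        predicate, obj = args
        result = {r.frame for r in scene.relations
                  if r.predicate == predicate and r.obj == obj}
    elif op == "frames_of_action":
        (name,) = args
        result = set()
        for a in scene.actions:
            if a.name == name:
                result.update(range(a.start, a.end + 1))
    elif op == "before":
        # Frames of the first operand that precede every frame of the second.
        first = execute(args[0], scene, trace)
        second = execute(args[1], scene, trace)
        result = {f for f in first if f < min(second)} if second else set()
    elif op == "exists":
        result = bool(execute(args[0], scene, trace))
    elif op == "and":
        left = execute(args[0], scene, trace)
        right = execute(args[1], scene, trace)
        # Type-dependent behavior: intersect frame sets, or combine booleans.
        result = (left & right) if isinstance(left, set) else (left and right)
    else:
        raise ValueError(f"unknown operator: {op}")
    if trace is not None:
        trace.append((op, result))
    return result


# Usage: "Was the person holding a cup before sitting?"
scene = SymbolicScene(
    relations=[Relation(2, "person", "holding", "cup"),
               Relation(3, "person", "holding", "cup")],
    actions=[Action("sitting", 5, 8)],
)
program = ("exists",
           ("before",
            ("frames_of_relation", "holding", "cup"),
            ("frames_of_action", "sitting")))
trace = []
answer = execute(program, scene, trace)
for step, value in trace:
    print(step, value)
print("answer:", answer)
```

Because every sub-question is itself a program, the same executor answers both the composed question and each of its parts, so a wrong final answer can be traced back to the first intermediate step whose result is wrong.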