
Human-centric Spatio-Temporal Video Grounding Via the Combination of Mutual Matching Network and TubeDETR

Fan Yu, Zhixiang Zhao, Yuchen Wang, Yi Xu, Tongwei Ren, Gangshan Wu

PIC '22: Proceedings of the 4th on Person in Context Workshop (2022)

Abstract
In this technical report, we present our solution for the Human-centric Spatio-Temporal Video Grounding (HC-STVG) track of the 4th Person in Context (PIC) workshop and challenge. Our solution builds on TubeDETR and the Mutual Matching Network (MMN). Specifically, TubeDETR uses a video-text encoder and a space-time decoder to predict the start time, the end time, and the tube of the target person. MMN detects persons in images, links the detections into tubes, extracts features of the person tubes and of the text description, and predicts the similarity between them to select the most likely person tube as the grounding result. Finally, our solution refines the result by combining the spatial localization of MMN with the temporal localization of TubeDETR. In the HC-STVG track of the 4th PIC challenge, our solution achieved third place.
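The final combination step can be illustrated with a minimal sketch. The abstract does not give the exact fusion rule, so the code below assumes the simplest reading: pick the person tube with the highest text similarity score (standing in for the MMN output), then keep only the frames that fall inside the time span predicted by TubeDETR. The function names, the data layout (a tube as a mapping from frame index to box), and the inclusive frame range are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Tuple

# A box is (x1, y1, x2, y2); a tube maps frame index -> box.
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def select_tube(tubes: List[Tube], similarities: List[float]) -> Tube:
    """Return the tube with the highest text similarity score.

    Assumption: the scores stand in for MMN's tube-text similarities.
    """
    if not tubes or len(tubes) != len(similarities):
        raise ValueError("tubes and similarities must be non-empty and aligned")
    best_index = max(range(len(tubes)), key=lambda i: similarities[i])
    return tubes[best_index]


def combine_predictions(spatial_tube: Tube, segment: Tuple[int, int]) -> Tube:
    """Clip a spatial tube to a temporal segment (inclusive frame range).

    Assumption: the segment stands in for TubeDETR's predicted start and
    end frames. Frames without a box in the spatial tube are left out.
    """
    start, end = segment
    if start > end:
        raise ValueError("segment start must not exceed segment end")
    return {
        frame: box
        for frame, box in sorted(spatial_tube.items())
        if start <= frame <= end
    }


# Two candidate person tubes over frames 0..4, with made-up scores.
tube_a: Tube = {f: (10.0, 10.0, 50.0, 90.0) for f in range(5)}
tube_b: Tube = {f: (60.0, 12.0, 95.0, 88.0) for f in range(5)}

chosen = select_tube([tube_a, tube_b], [0.31, 0.77])
grounded = combine_predictions(chosen, (1, 3))
```

Under these assumptions, `grounded` holds the boxes of `tube_b` for frames 1 to 3 only. A real system would also need to handle frames inside the segment where the chosen tube has no detection, for example by interpolating boxes; the abstract does not say how this case is treated.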