Tracking with Human-Intent Reasoning
CoRR (2023)
Abstract
Advances in perception modeling have significantly improved the performance
of object tracking. However, current methods for specifying the target
object in the initial frame either 1) use a box or mask template, or
2) provide an explicit language description. These approaches are cumbersome
and do not give the tracker any self-reasoning ability. Therefore, this
work proposes a new tracking task -- Instruction Tracking, which involves
providing implicit tracking instructions that require the trackers to perform
tracking automatically in video frames. To achieve this, we investigate the
integration of knowledge and reasoning capabilities from a Large
Vision-Language Model (LVLM) for object tracking. Specifically, we propose a
tracker called TrackGPT, which is capable of performing complex reasoning-based
tracking. TrackGPT first uses LVLM to understand tracking instructions and
condense the cues of what target to track into referring embeddings. The
perception component then generates the tracking results based on the
embeddings. To evaluate the performance of TrackGPT, we construct an
instruction tracking benchmark called InsTrack, which contains over one
thousand instruction-video pairs for instruction tuning and evaluation.
Experiments show that TrackGPT achieves competitive performance on referring
video object segmentation benchmarks, achieving a new state-of-the-art
66.5 $\mathcal{J}\&\mathcal{F}$ on Refer-DAVIS. It also
demonstrates superior instruction-tracking performance under new
evaluation protocols. The code and models are available at
\href{https://github.com/jiawen-zhu/TrackGPT}{https://github.com/jiawen-zhu/TrackGPT}.
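The abstract describes a two-stage pipeline: an LVLM first condenses an implicit tracking instruction into a referring embedding, and a perception component then predicts the target per frame conditioned on that embedding. The toy sketch below illustrates only this control flow; every name (`lvlm_encode`, `perception_head`, `track`) and the trivial embedding are hypothetical stand-ins, not the authors' implementation.

```python
# Toy sketch of the two-stage pipeline described in the abstract.
# All function names and the hash-based "embedding" are illustrative
# placeholders, not TrackGPT's actual components.

def lvlm_encode(instruction: str, dim: int = 4) -> list[float]:
    """Stand-in for the LVLM: reasons over an implicit tracking
    instruction and condenses it into a fixed-size referring embedding
    (here, a trivial hash-derived vector)."""
    return [((hash(instruction) >> (8 * i)) & 0xFF) / 255.0 for i in range(dim)]

def perception_head(frame: str, referring_embedding: list[float]) -> dict:
    """Stand-in for the perception component: in the real system this
    would predict a box/mask for the referred target in the frame,
    conditioned on the referring embedding."""
    return {"frame": frame, "embedding_dim": len(referring_embedding)}

def track(instruction: str, frames: list[str]) -> list[dict]:
    emb = lvlm_encode(instruction)                      # stage 1: understand the instruction
    return [perception_head(f, emb) for f in frames]    # stage 2: per-frame prediction

results = track("track the player who scores the goal", ["f0", "f1", "f2"])
```

The key design point conveyed by the abstract is that the instruction is encoded once into referring embeddings, which then condition per-frame perception, rather than re-running language reasoning on every frame.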