Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for Vision-Language Tracking
CoRR (2023)
Abstract
Single object tracking aims to locate one specific target in video sequences,
given its initial state. Classical trackers rely solely on visual cues,
restricting their ability to handle challenges such as appearance variations,
ambiguity, and distractions. Hence, Vision-Language (VL) tracking has emerged
as a promising approach, incorporating language descriptions to directly
provide high-level semantics and enhance tracking performance. However, current
VL trackers have not fully exploited the power of VL learning, as they suffer
from limitations such as heavily relying on off-the-shelf backbones for feature
extraction, ineffective VL fusion designs, and the absence of VL-related loss
functions. To address these limitations, we present a novel tracker that progressively explores
target-centric semantics for VL tracking. Specifically, we propose the first
Synchronous Learning Backbone (SLB) for VL tracking, which consists of two
novel modules: the Target Enhance Module (TEM) and the Semantic Aware Module
(SAM). These modules enable the tracker to perceive target-related semantics
and comprehend the context of both visual and textual modalities at the same
pace, facilitating VL feature extraction and fusion at different semantic
levels. Moreover, we devise the dense matching loss to further strengthen
multi-modal representation learning. Extensive experiments on VL tracking
datasets demonstrate the superiority and effectiveness of our methods.