TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking
arXiv (2023)
Abstract
Video Object Segmentation (VOS) has emerged as an increasingly important
problem with the availability of larger datasets and more complex and realistic
settings, which involve long videos with global motion (e.g., in egocentric
settings), depicting small objects undergoing both rigid and non-rigid
(including state) deformations. While a number of recent approaches have been
explored for this task, these data characteristics still present challenges. In
this work we propose a novel, clip-based DETR-style encoder-decoder
architecture, which focuses on systematically analyzing and addressing the
aforementioned challenges. Specifically, we propose a novel
transformation-aware loss that focuses learning on portions of the video where
an object undergoes significant deformations, a form of "soft" hard example
mining. Further, we propose a multiplicative time-coded memory, beyond vanilla
additive positional encoding, which helps propagate context across long videos.
Finally, we incorporate these in our proposed holistic multi-scale video
transformer for tracking via multi-scale memory matching and decoding, to ensure
sensitivity and accuracy for long videos and small objects. Our model enables
online inference on long videos in a windowed fashion, by breaking the video
into clips and propagating context among them. We show that a short clip
length and a longer memory with learned time-coding are important design choices
for improved performance. Collectively, these technical contributions enable
our model to achieve new state-of-the-art (SoTA) performance on two complex
egocentric datasets, VISOR and VOST, while achieving results comparable to SoTA
on the conventional VOS benchmark, DAVIS'17. A series of detailed
ablations validates our design choices and provides insights into the
importance of parameter choices and their impact on performance.
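
To make the "soft" hard example mining idea concrete, below is a minimal PyTorch sketch of a transformation-aware loss. It up-weights per-frame segmentation loss on frames where the object deforms strongly; the specific deformation measure (1 minus the IoU of consecutive ground-truth masks) and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mask_iou(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Per-frame IoU between two binary mask tensors of shape (T, H, W)."""
    inter = (a * b).flatten(1).sum(-1)
    union = ((a + b) > 0).float().flatten(1).sum(-1)
    return inter / (union + eps)

def transformation_aware_loss(logits: torch.Tensor, gt_masks: torch.Tensor) -> torch.Tensor:
    """
    logits:   (T, H, W) predicted mask logits for one object track.
    gt_masks: (T, H, W) binary (float) ground-truth masks.
    Frames where the object deforms strongly -- low IoU between consecutive
    ground-truth masks -- receive a larger weight: a "soft" form of hard
    example mining, since no frame is ever discarded entirely.
    """
    per_frame = F.binary_cross_entropy_with_logits(
        logits, gt_masks, reduction="none"
    ).flatten(1).mean(-1)                        # (T,) mean loss per frame

    # Deformation score: 1 - IoU of consecutive GT masks; first frame gets 0.
    iou = mask_iou(gt_masks[1:], gt_masks[:-1])  # (T-1,)
    deform = torch.cat([iou.new_zeros(1), 1.0 - iou])

    weights = 1.0 + deform                       # soft up-weighting, never zero
    return (weights * per_frame).sum() / weights.sum()
```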
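The multiplicative time-coded memory can likewise be sketched briefly. The sketch below modulates memory features with a learned per-timestep code by element-wise multiplication, in contrast to vanilla additive positional encoding; the memory size, module name, and gating form are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MultiplicativeTimeCodedMemory(nn.Module):
    """
    Illustrative sketch: features stored from past frames are re-scaled by a
    learned *multiplicative* time code before memory matching, rather than
    having a positional encoding *added* to them. `max_mem` (the number of
    remembered time steps) is an assumed hyperparameter.
    """
    def __init__(self, dim: int, max_mem: int = 16):
        super().__init__()
        # One learned code per relative time step, initialized to identity (1.0).
        self.time_codes = nn.Parameter(torch.ones(max_mem, dim))

    def forward(self, memory: torch.Tensor) -> torch.Tensor:
        # memory: (T, N, C) -- T past frames, N tokens per frame, C channels.
        T = memory.shape[0]
        codes = self.time_codes[:T].unsqueeze(1)  # (T, 1, C)
        return memory * codes                      # channel-wise modulation
```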
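Finally, the windowed, clip-based online inference loop admits a simple sketch: the video is split into short clips, each clip is segmented conditioned on a memory of features from earlier clips, and the memory is then updated and truncated. The `model.segment_clip` interface and the default clip and memory sizes here are hypothetical, chosen only to show the propagation pattern.

```python
def windowed_inference(model, frames, clip_len: int = 3, mem_size: int = 16):
    """
    frames: list of video frames (e.g., tensors). Processes a long video
    online by breaking it into non-overlapping clips and propagating a
    bounded feature memory across them.
    """
    memory, all_masks = [], []
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        # Hypothetical API: returns per-frame masks and clip-level features.
        masks, feats = model.segment_clip(clip, memory)
        all_masks.extend(masks)
        memory = (memory + [feats])[-mem_size:]   # keep only recent context
    return all_masks
```

Keeping `clip_len` short while letting `mem_size` span a longer horizon mirrors the paper's finding that short clips plus a longer, time-coded memory are the stronger design choice.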