STAT: Towards Generalizable Temporal Action Localization
arXiv (2024)
Abstract
Weakly-supervised temporal action localization (WTAL) aims to recognize and
localize action instances with only video-level labels. Despite the significant
progress, existing methods suffer from severe performance degradation when
transferred to different distributions and thus can hardly adapt to real-world
scenarios. To address this problem, we propose the Generalizable Temporal
Action Localization task (GTAL), which focuses on improving the
generalizability of action localization methods. We observe that the
performance decline can be primarily attributed to a lack of generalizability
across different action scales. To tackle this, we propose STAT
(Self-supervised Temporal Adaptive Teacher), which leverages a teacher-student
structure for iterative refinement. Our STAT features a refinement module and
an alignment module. The former iteratively refines the model's output by
leveraging contextual information and helps adapt to the target scale. The
latter improves the refinement process by promoting a consensus between student
and teacher models. We conduct extensive experiments on three datasets,
THUMOS14, ActivityNet1.2, and HACS, and the results show that our method
significantly improves the baseline methods under the cross-distribution
evaluation setting, even approaching the same-distribution evaluation
performance.
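The abstract does not say how the teacher is updated or how the refinement and alignment modules are computed. Below is a minimal sketch under stated assumptions, not STAT's actual implementation: the teacher is assumed to be an exponential moving average (EMA) of the student, as in mean-teacher methods. "Refinement with contextual information" is approximated by temporal moving-average smoothing of snippet-level class scores, and "alignment" by a mean-squared consistency loss between student and teacher outputs. The names `refine_with_context`, `ema_update`, and `alignment_loss` are hypothetical.

```python
import numpy as np


def refine_with_context(scores: np.ndarray, kernel_size: int = 3) -> np.ndarray:
    """Smooth snippet-level class scores (T x C) along time with a moving average.

    Assumption: a hypothetical stand-in for STAT's refinement module, which
    the abstract only describes as leveraging contextual information.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    pad = kernel_size // 2
    # Edge padding keeps the output length equal to T.
    padded = np.pad(scores, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(kernel_size) / kernel_size
    # Convolve each class column independently along the temporal axis.
    return np.stack(
        [np.convolve(padded[:, c], kernel, mode="valid") for c in range(scores.shape[1])],
        axis=1,
    )


def ema_update(teacher: dict, student: dict, momentum: float = 0.99) -> dict:
    """Assumed mean-teacher update: teacher <- m * teacher + (1 - m) * student."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k] for k in teacher}


def alignment_loss(student_scores: np.ndarray, teacher_scores: np.ndarray) -> float:
    """Assumed consistency loss: mean squared error between the two predictions."""
    return float(np.mean((student_scores - teacher_scores) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    student_scores = rng.random((8, 4))  # 8 snippets, 4 action classes
    teacher_scores = refine_with_context(rng.random((8, 4)))
    print("alignment loss:", alignment_loss(student_scores, teacher_scores))
    teacher = ema_update({"w": np.zeros(3)}, {"w": np.ones(3)}, momentum=0.9)
    print("teacher weights:", teacher["w"])
```

Smoothing over neighboring snippets is one simple way to inject temporal context. An EMA teacher changes slowly and so gives the student a more stable target. Both choices are illustrative and should not be read as the paper's design.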