MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection

Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael S. Ryoo, Francois Bremond

IEEE Conference on Computer Vision and Pattern Recognition (2022)


Abstract

Action detection is a significant and challenging task, especially in densely-labelled datasets of untrimmed videos. Such data consist of complex temporal relations including composite or co-occurring actions. To detect actions in these complex settings, it is critical to capture both short-term and long-term temporal information efficiently. To this end, we propose a novel ‘ConvTransformer’ network for action detection: MS-TCT (code and models: https://github.com/dairui01/MS-TCT). This network comprises three main components: (1) a Temporal Encoder module which explores global and local temporal relations at multiple temporal resolutions, (2) a Temporal Scale Mixer module which effectively fuses multi-scale features, creating a unified feature representation, and (3) a Classification module which learns a center-relative position of each action instance in time, and predicts frame-level classification scores. Our experimental results on multiple challenging datasets, such as Charades, TSU and MultiTHUMOS, validate the effectiveness of the proposed method, which outperforms the state-of-the-art methods on all three datasets.
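
As a rough illustration of two ideas named in the abstract (fusing features from several temporal resolutions, and a center-relative response for each action instance), here is a minimal sketch. It is not the authors' implementation; the linked repository holds that. The sketch assumes one scalar feature per frame, average pooling with stride 2 to build coarser scales, nearest-neighbour upsampling, sum fusion, and a Gaussian response around the instance center. All function names and parameters (`downsample`, `upsample`, `multi_scale_fuse`, `center_heatmap`, `sigma_scale`) are hypothetical.

```python
import math


def downsample(seq, stride=2):
    """Average-pool a 1-D per-frame sequence with the given stride (assumed pooling)."""
    chunks = [seq[i:i + stride] for i in range(0, len(seq), stride)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


def upsample(seq, target_len):
    """Nearest-neighbour upsampling of a coarse sequence back to target_len frames."""
    n = len(seq)
    return [seq[min(n - 1, i * n // target_len)] for i in range(target_len)]


def multi_scale_fuse(seq, num_scales=3):
    """Build a temporal pyramid and sum all scales at the finest resolution.

    Sum fusion is an assumption made for this sketch only.
    """
    scales = [list(seq)]
    for _ in range(num_scales - 1):
        scales.append(downsample(scales[-1]))

    fused = [0.0] * len(seq)
    for scale in scales:
        restored = upsample(scale, len(seq))
        fused = [f + r for f, r in zip(fused, restored)]
    return fused


def center_heatmap(num_frames, start, end, sigma_scale=0.5):
    """Gaussian response that peaks at the temporal center of one action instance.

    sigma is tied to the instance half-length; this choice is an assumption.
    """
    center = (start + end) / 2.0
    sigma = max(1e-6, sigma_scale * (end - start) / 2.0)
    return [
        math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))
        for t in range(num_frames)
    ]


if __name__ == "__main__":
    frames = [1, 3, 5, 7]
    print(multi_scale_fuse(frames, num_scales=2))
    print(center_heatmap(5, start=1, end=3))
```

In this toy version, frames near the middle of an action receive a response close to 1 and frames far from it receive a response close to 0, which is one simple way to encode a position relative to the instance center alongside frame-level class scores.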


Keywords

Action and event recognition, Behavior analysis
