Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization
CoRR (2023)
Abstract
Action Localization is a challenging problem that combines detection and
recognition tasks, which are often addressed separately. State-of-the-art
methods rely on off-the-shelf bounding box detections pre-computed at high
resolution and propose transformer models that focus on the classification task
alone. Such two-stage solutions are prohibitive for real-time deployment. On
the other hand, single-stage methods target both tasks by devoting part of the
network (generally the backbone) to sharing the majority of the workload,
sacrificing performance for speed. These methods build on adding a DETR head
compromising performance for speed. These methods build on adding a DETR head
with learnable queries that, after cross- and self-attention, can be sent to
corresponding MLPs for detecting a person's bounding box and action. However,
DETR-like architectures are challenging to train and can incur high
computational complexity.
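As a minimal PyTorch sketch of the DETR-style head described above (hypothetical names such as DETRStyleHead; dimensions and layer counts are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class DETRStyleHead(nn.Module):
    """Sketch of a DETR-like head: learnable queries attend to backbone
    tokens via a transformer decoder, then two MLPs predict boxes and actions."""
    def __init__(self, dim=256, num_queries=100, num_actions=80):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)  # learnable queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)  # self- and cross-attention
        self.box_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))
        self.action_mlp = nn.Linear(dim, num_actions)

    def forward(self, memory):
        # memory: (B, N, dim) tokens from the shared backbone
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        h = self.decoder(q, memory)                    # (B, num_queries, dim)
        return self.box_mlp(h).sigmoid(), self.action_mlp(h)  # boxes in [0,1], action logits
```

It is this extra decoder and query machinery, on top of the backbone, that the paper's bipartite-matching formulation removes.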
In this paper, we observe that a bipartite matching loss can be applied
directly to the output tokens of a vision transformer. This results in a
backbone + MLP architecture that can perform both tasks without the need for an extra
encoder-decoder head and learnable queries. We show that a single MViT-S
architecture trained with bipartite matching to perform both tasks surpasses
the same MViT-S when trained with RoI align on pre-computed bounding boxes.
With a careful design of token pooling and the proposed training pipeline, our
MViTv2-S model achieves +3 mAP on AVA2.2 over its two-stage counterpart.
Code and models will be released after paper revision.
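Below is a minimal sketch of applying a bipartite matching loss directly to backbone output tokens, so a plain backbone + MLP predicts boxes and actions without a decoder head. It is an assumption-laden illustration, not the paper's released implementation: the cost terms, single-label cross-entropy (AVA actions are actually multi-label, where binary cross-entropy would apply), and function names are hypothetical.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment  # Hungarian matching

def bipartite_matching_loss(tokens, box_mlp, action_mlp, gt_boxes, gt_actions):
    # tokens:     (N, D) output tokens of the vision-transformer backbone, one clip
    # gt_boxes:   (M, 4) normalized ground-truth person boxes
    # gt_actions: (M,)   ground-truth action labels (single-label simplification)
    pred_boxes = box_mlp(tokens).sigmoid()   # (N, 4), boxes in [0, 1]
    logits = action_mlp(tokens)              # (N, C) action logits
    probs = logits.softmax(-1)
    # Matching cost: L1 box distance minus probability of the true action.
    cost = torch.cdist(pred_boxes, gt_boxes, p=1) - probs[:, gt_actions]  # (N, M)
    token_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    token_idx = torch.as_tensor(token_idx, device=tokens.device)
    gt_idx = torch.as_tensor(gt_idx, device=tokens.device)
    # Supervise only the matched token/ground-truth pairs.
    loss_box = F.l1_loss(pred_boxes[token_idx], gt_boxes[gt_idx])
    loss_cls = F.cross_entropy(logits[token_idx], gt_actions[gt_idx])
    return loss_box + loss_cls
```

A full pipeline would presumably also supervise unmatched tokens with a background/no-person target, as in DETR, and add a generalized-IoU term to the box cost; those details are omitted here for brevity.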