YOWOv3: A Lightweight Spatio-Temporal Joint Network for Video Action Detection

Anlei Zhu, Yinghui Wang, Jinlong Yang, Tao Yan, Haomiao Ma, Wei Li

IEEE Transactions on Circuits and Systems for Video Technology (2024)

Abstract

Spatio-temporal action detection networks must extract and fuse spatial and temporal features simultaneously, which often leaves existing models bloated, difficult to run in real time, and difficult to deploy on edge devices. This paper introduces YOWOv3, an efficient, real-time spatio-temporal action detection model. The model uses efficient 3D and 2D backbone networks to separately extract spatio-temporal and spatial features from sequential information. A lightweight spatio-temporal feature fusion module, designed by deeply integrating convolution and self-attention mechanisms, further enhances the extraction of spatio-temporal features. We refer to this module as the CFACM (Channel Fusion & Attention Convolution Mix) module. Compared with the latest efficient spatio-temporal action detection models, our approach reduces model size by 24% and improves mAP on the UCF101-24 dataset by 1.35%, while maintaining excellent speed, thus achieving a balance between accuracy and speed. Furthermore, existing models often use 3D convolutions to extract temporal information, and support for these operators may be limited on certain devices, such as Apple's M series processors. To mitigate the risk that 3D convolution operators are unsupported during edge deployment of spatio-temporal action detection models, we employ a spatio-temporal shift module containing only 2D convolutions. This module enables the model to acquire temporal information and injects the obtained temporal features into multi-level spatio-temporal feature extraction models. The design frees the model from the constraints of 3D convolution operations, improves its balance between accuracy and speed, and achieves state-of-the-art performance among lightweight networks that use only 2D convolutions.
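
The abstract does not describe the internals of the spatio-temporal shift module, so the following is only a minimal sketch of the general idea behind shift-based temporal modeling, assuming a TSM-style channel shift. The function name `temporal_shift`, the `fold_div` parameter, and the zero padding are illustrative assumptions and are not taken from the paper. A fraction of the channels is moved one step along the time axis, so that a subsequent per-frame 2D convolution sees values from neighboring frames without any 3D operator.

```python
def temporal_shift(frames, fold_div=4, pad=0.0):
    """Shift a fraction of channels along the time axis (TSM-style sketch).

    frames: list of T frames, each a list of C per-channel values.
            In a real network each entry would be an H x W feature map;
            scalars are used here to keep the example small.
    fold_div: hypothetical parameter; C // fold_div channels are shifted
              in each direction, the rest are left untouched.
    pad: value used where no neighboring frame exists.
    """
    num_frames = len(frames)
    if num_frames == 0:
        return []
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    # Copy every frame so the input is not modified.
    shifted = [list(frame) for frame in frames]

    for t in range(num_frames):
        # First channel group: take the value from the next frame.
        for c in range(fold):
            shifted[t][c] = frames[t + 1][c] if t + 1 < num_frames else pad
        # Second channel group: take the value from the previous frame.
        for c in range(fold, 2 * fold):
            shifted[t][c] = frames[t - 1][c] if t > 0 else pad
        # Remaining channels keep the current frame's values.
    return shifted


clip = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
shifted_clip = temporal_shift(clip, fold_div=4)
```

After the shift, each frame's channel vector mixes information from the previous, current, and next frame, so an ordinary 2D convolution applied frame by frame can pick up temporal context. How YOWOv3 actually places the shift and injects the result into its multi-level features may differ from this sketch.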

Keywords

Spatio-Temporal Action Detection, Spatio-Temporal Feature Fusion, Lightweight Network, Edge Devices
