Interaction-Aware Spatio-Temporal Pyramid Attention Networks for Action Classification

Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XVI (2022)

Abstract
For CNN-based visual action recognition, accuracy can be improved by focusing on local key action regions. Self-attention serves exactly this purpose: it emphasizes key features and suppresses irrelevant information, which makes it well suited to action recognition. However, current self-attention methods usually ignore correlations among the local feature vectors at different spatial positions of CNN feature maps. In this paper, we propose an effective interaction-aware self-attention model that exploits the interactions between feature vectors to learn attention maps. Since different layers of a network capture feature maps at different scales, we build a spatial pyramid from the feature maps of several layers for attention modeling; this multi-scale information yields more accurate attention scores. The attention scores are used to weight the local feature vectors of the feature maps and thereby produce attentional feature maps. Because the number of feature maps fed into the spatial pyramid attention layer is unrestricted, this attention layer extends easily to a spatio-temporal version. Our model can be embedded in any general CNN to form a video-level, end-to-end attention network for action recognition. We also investigate several ways of combining the RGB and flow streams to obtain accurate predictions of human actions. Experimental results show that our method achieves state-of-the-art performance on UCF101, HMDB51, Kinetics-400, and untrimmed Charades.
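The core idea described above, scoring each spatial position of a feature map by the interactions (here simplified to dot products) between its local feature vector and all others, then averaging scores across pyramid scales, can be sketched roughly as follows. This is an illustrative NumPy sketch under simplifying assumptions, not the paper's exact learned model; all function names and the choice of dot-product interactions and nearest-neighbor upsampling are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def interaction_attention(fmap):
    """Attention over the spatial positions of one CNN feature map.

    fmap: (C, H, W) array. Each of the N = H*W positions holds a C-dim
    local feature vector. Scores come from pairwise dot-product
    interactions between those vectors -- a simplified stand-in for the
    paper's learned interaction-aware model.
    """
    C, H, W = fmap.shape
    X = fmap.reshape(C, H * W).T          # (N, C) local feature vectors
    G = X @ X.T                           # (N, N) pairwise interactions
    scores = softmax(G.sum(axis=1))       # one attention score per position
    weighted = X * scores[:, None]        # emphasize key positions
    return weighted.T.reshape(C, H, W), scores

def pyramid_attention(fmaps):
    """Average per-position scores from feature maps at several scales.

    fmaps: list of (C, h, w) maps whose sizes divide the first (finest)
    map's size; coarse score maps are upsampled by nearest-neighbor
    repetition before averaging.
    """
    C, H, W = fmaps[0].shape
    acc = np.zeros(H * W)
    for f in fmaps:
        _, s = interaction_attention(f)
        h, w = f.shape[1:]
        up = np.repeat(np.repeat(s.reshape(h, w), H // h, 0), W // w, 1)
        acc += up.reshape(-1)
    scores = acc / len(fmaps)             # multi-scale attention scores
    X = fmaps[0].reshape(C, -1).T
    return (X * scores[:, None]).T.reshape(C, H, W)
```

Because `interaction_attention` only assumes a stack of (C, h, w) maps, feeding it per-frame maps from a video clip gives the spatio-temporal extension mentioned in the abstract with no structural change.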
Keywords
Algorithms, Humans, Learning