Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs

2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Cited by 1042 | Views 94
Abstract
We address temporal action localization in untrimmed long videos. This is important because videos in real applications are usually unconstrained and contain multiple action instances plus video content of background scenes or other activities. To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model that serves as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance. We propose a novel loss function for the localization network that explicitly considers temporal overlap and therefore achieves high temporal localization accuracy. Only the proposal and localization networks are used during prediction. On two large-scale benchmarks, our approach significantly outperforms other state-of-the-art systems: with the evaluation overlap threshold set to 0.5, mAP increases from 1.7% to 7.4% on MEXaction2 and from 15.0% to 19.0% on THUMOS 2014.
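
The abstract names an overlap-aware localization loss but does not spell it out. The following is a minimal PyTorch sketch of how such a loss could look, assuming a softmax-plus-overlap form L = L_softmax + lambda * L_overlap in which the true-class confidence of each action segment is tied to its temporal IoU with the ground truth. The function name localization_loss and the hyperparameters lam and alpha are illustrative, not taken from the paper.

import torch
import torch.nn.functional as F

def localization_loss(logits, labels, iou, lam=1.0, alpha=0.25):
    # logits: (N, K+1) class scores for N candidate segments, class 0 = background
    # labels: (N,) ground-truth class indices (LongTensor)
    # iou:    (N,) temporal IoU of each segment with its matched ground truth
    ce = F.cross_entropy(logits, labels)  # standard softmax cross-entropy term
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # confidence of the true class
    fg = (labels > 0).float()  # overlap term applies only to non-background segments
    v = iou.clamp(min=1e-6)    # guard against division by zero
    # Penalize high confidence on segments with low temporal overlap:
    # the smaller v is, the larger p_true**2 / v**alpha grows.
    l_overlap = (0.5 * (p_true ** 2 / v.pow(alpha) - 1.0) * fg).mean()
    return ce + lam * l_overlap

For example, localization_loss(torch.randn(8, 21), torch.randint(0, 21, (8,)), torch.rand(8)) returns a scalar loss. During training, a term of this kind pushes the network to rank well-overlapping segments above poorly overlapping ones, which is what yields accurate temporal boundaries after non-maximum suppression.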
Keywords
temporal localization accuracy, temporal overlap, loss function, classification network learning, one-vs-all action classification, classification network, candidate segments, segment-based 3D ConvNets, deep networks, background scenes, video content, action instances, untrimmed long videos, temporal action localization