Metal: Minimum Effort Temporal Activity Localization In Untrimmed Videos

2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)(2020)

引用 20|浏览157
暂无评分
摘要
Existing Temporal Activity Localization (TAL) methods largely adopt strong supervision for model training which requires (1) vast amounts of untrimmed videos per each activity category and (2) accurate segment-level boundary annotations (start time and end time) for every instance. This poses a critical restriction to the current methods in practical scenarios where not only segment-level annotations are expensive to obtain but many activity categories are also rare and unobserved during training. Therefore, Can we learn a TAL model under weak supervision that can localize unseen activity classes? To address this scenario, we define a novel example-based TAL problem called Minimum Effort Temporal Activity Localization (METAL): Given only a few examples, the goal is to find the occurrences of semantically-related segments in an untrimmed video sequence while model training is only supervised by the video-level annotation. Towards this objective, we propose a novel Similarity Pyramid Network (SPN) that adopts the few-shot learning technique of Relation Network and directly encodes hierarchical multi-scale correlations, which we learn by optimizing two complimentary loss functions in an end-to-end manner. We evaluate the SPN on the THUMOS' 14 and ActivityNet datasets, of which we rearrange the videos to fit the METAL setup. Results show that our SPN achieves performance superior or competitive to state-of-the-art approaches with stronger supervision.
更多
查看译文
关键词
hierarchical multiscale correlations,relation network,few-shot learning technique,similarity pyramid network,segment-level boundary annotations,minimum effort temporal activity localization,example-based TAL problem,video-level annotation,untrimmed video sequence,semantically-related segments,TAL model
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要