Fusion detection network with discriminative enhancement for weakly-supervised temporal action localization

EXPERT SYSTEMS WITH APPLICATIONS(2024)

Abstract
Weakly-supervised temporal action localization aims to identify and localize action instances in untrimmed videos using only video-level action labels. Because frame-level annotations are unavailable, correctly distinguishing foreground from background snippets is crucial for temporal action localization. However, alongside clear foreground and background snippets, videos contain many semantically similar snippets. These snippets share semantic information with either the foreground or the background, leading to less fine-grained boundary localization of action instances. Inspired by the success of multimodal learning, we extract high-quality semantic features from multimodal inputs and construct a contrastive loss to enhance the model's ability to distinguish semantically similar snippets. In this paper, we propose a fusion detection network with discriminative enhancement (De-FDN). Specifically, we design a fusion detection model (FDM) that fully leverages the complementarity and correlation among multimodal features to extract high-quality semantic features from videos. We then construct multimodal class activation sequences to accurately identify and localize action instances. Additionally, we design a discriminative enhancement mechanism (DEM) that widens the gap between semantically similar segments by computing a semantic contrast loss. Extensive experiments on the THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets demonstrate the effectiveness of our method.
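The abstract describes the DEM's semantic contrast loss only at a high level and does not give its formulation. A minimal sketch of the general idea, assuming L2-normalized snippet features and a hinge-style margin on cosine similarity (the function name, margin value, and hinge form are illustrative assumptions, not the authors' definition):

```python
import numpy as np

def semantic_contrast_loss(fg, bg, margin=0.5):
    """Illustrative contrast loss: push semantically similar foreground
    and background snippet features apart (not the paper's exact loss).

    fg: (N_fg, D) foreground snippet features
    bg: (N_bg, D) background snippet features
    """
    # L2-normalize so the dot product is cosine similarity
    fg = fg / np.linalg.norm(fg, axis=1, keepdims=True)
    bg = bg / np.linalg.norm(bg, axis=1, keepdims=True)
    sim = fg @ bg.T  # (N_fg, N_bg) pairwise cosine similarities
    # Hinge: penalize foreground/background pairs whose similarity
    # exceeds the margin, enlarging the gap between the two sets
    return np.maximum(sim - margin, 0.0).mean()
```

Minimizing such a loss drives cross-set similarities below the margin, which is one common way to sharpen the foreground/background boundary that the DEM targets.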
Keywords
Temporal action localization, Weakly-supervised, Fusion detection network, Discriminative enhancement