Multimodal active speaker detection using cross-attention and contextual information.

IEEE International Conference on Consumer Electronics (2024)

Abstract
An active speaker detection (ASD) framework aims to identify whether an on-screen person is speaking in each frame of a video. In this paper, we introduce a novel ASD system that carefully integrates audio and video cues through a cross-attention module, capturing inter-modal information while retaining the distinct intra-modal features. Furthermore, the system models the relations between speakers within the same scene. The experimental evaluation validates the effectiveness of the approach, which achieves an average mAP score of 94.8%.
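
The abstract does not give the architecture of the cross-attention module, so the following is only a minimal sketch of the general idea: each modality attends over the other (inter-modal information), and the original features are concatenated alongside the attended ones so the intra-modal features are retained. The function names (`cross_attend`, `fuse`), the absence of learned projections, and the concatenation-based fusion are all assumptions made for illustration, not the paper's method.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention where each query frame attends over
    the frames of the other modality (no learned projections: assumption)."""
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in keys_values]
        weights = softmax(scores)
        attended.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return attended


def fuse(audio, video):
    """Hypothetical fusion: keep the original (intra-modal) features and
    append the cross-attended (inter-modal) ones, frame by frame.
    Assumes both modalities have the same frame count and feature size."""
    audio_ctx = cross_attend(audio, video)  # audio queries, video keys/values
    video_ctx = cross_attend(video, audio)  # video queries, audio keys/values
    return [a + ac + v + vc
            for a, ac, v, vc in zip(audio, audio_ctx, video, video_ctx)]


# Toy input: 3 frames, 2-dimensional features per modality.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = fuse(audio, video)
print(len(fused), len(fused[0]))  # 3 frames, 8 features each
```

In a real system the queries, keys, and values would pass through learned linear projections, and the fused per-frame vector would feed a classifier that outputs a speaking/not-speaking score.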
Keywords
multimodal active speaker detection, cross-attention block, contextual speaker relations