Multimodal active speaker detection using cross-attention and contextual information.

IEEE International Conference on Consumer Electronics (2024)

Abstract

An active speaker detection (ASD) framework aims to identify whether an on-screen person is speaking in each frame of a video. In this paper, we introduce a novel ASD system that carefully integrates audio and video cues through a cross-attention module, capturing inter-modal information while retaining distinct intra-modal features. Furthermore, the system models the relations between speakers within the same scene. The experimental evaluation validates the effectiveness of the approach, achieving an average mAP score of 94.8%.
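The abstract does not give the architecture in detail, so the following is only a minimal sketch of what audio-video cross-attention can look like, not the authors' implementation. It assumes per-frame audio and video feature sequences of equal length and width, uses plain scaled dot-product attention, and adds a residual connection as one possible way to keep each modality's own (intra-modal) features. All names (`cross_attention`, `w_q`, `w_k`, `w_v`) and the random weights are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Queries come from one modality; keys and values come from the other."""
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (T_query, T_context)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    attended = weights @ v                   # inter-modal information
    # Assumption: a residual connection preserves the intra-modal features.
    return query_feats + attended, weights


rng = np.random.default_rng(0)
T, d = 8, 16                                 # hypothetical: 8 frames, 16-dim features
audio = rng.standard_normal((T, d))
video = rng.standard_normal((T, d))
# Random projection matrices stand in for learned weights.
w = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(6)]

audio_out, audio_to_video = cross_attention(audio, video, *w[:3])
video_out, video_to_audio = cross_attention(video, audio, *w[3:])

# Concatenate both attended streams into one per-frame representation.
fused = np.concatenate([audio_out, video_out], axis=-1)
```

In a trained system the projection matrices would be learned, and a classifier on `fused` would predict the per-frame speaking label; the inter-speaker context modelling mentioned in the abstract is not covered by this sketch.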

Keywords

multimodal active speaker detection, cross-attention block, contextual speaker relations
