Multimodal Active Speaker Detection And Virtual Cinematography For Video Conferencing

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (2020)

Abstract
Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the experience of a video conference by automatically panning, tilting, and zooming a camera: subjectively, users rate an expert video cinematographer significantly higher than the unedited video. We describe a new automated ASD and VC system that performs within 0.3 MOS of an expert cinematographer, based on subjective ratings on a 1-5 scale. The system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD classifier using AdaBoost machine learning, which is very efficient and runs in real time. A VC is similarly trained using machine learning. To avoid distracting the room participants, the system has no moving parts: the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned and evaluated using extensive crowdsourcing techniques on N=100 meetings, each 2-5 minutes in length.
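The pipeline the abstract describes (per-modality features fused and fed to an AdaBoost classifier) can be sketched as follows. This is a minimal illustration using scikit-learn, not the paper's implementation: the feature names, dimensions, and the use of random stand-in data are assumptions for demonstration only.

```python
# Hypothetical sketch of multimodal early fusion + AdaBoost for active
# speaker detection. Feature groups and sizes are illustrative, not the
# paper's actual feature set.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
n = 200  # stand-in number of candidate observations

# Stand-in per-modality features for each candidate speaker region:
vision = rng.normal(size=(n, 4))  # e.g. lip-motion cues from the 4K video
depth = rng.normal(size=(n, 2))   # e.g. distance/size cues from the depth camera
audio = rng.normal(size=(n, 3))   # e.g. sound-source-localization angles

# Simple early fusion: concatenate modality features into one vector.
X = np.hstack([vision, depth, audio])
y = rng.integers(0, 2, size=n)  # 1 = active speaker (random labels here)

# AdaBoost over shallow trees is cheap to evaluate, consistent with the
# abstract's real-time efficiency claim.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
pred = clf.predict(X)  # one speaking/not-speaking decision per candidate
```

At inference time, each video frame would yield one such feature vector per detected face, and the classifier's per-candidate scores would drive the virtual cinematographer's crop-and-zoom decision.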
Keywords
Active speaker detection, virtual cinematography, video conferencing, machine learning, computer vision, sound source localization, multimodal fusion, crowdsourcing