Multimodal Active Speaker Detection And Virtual Cinematography For Video Conferencing

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (2020)

Abstract
Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the experience of a video conference by automatically panning, tilting, and zooming a camera: subjectively, users rate an expert video cinematographer significantly higher than the unedited video. We describe a new automated ASD and VC system that performs within 0.3 MOS of an expert cinematographer, based on subjective ratings on a 1-5 scale. The system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD classifier using AdaBoost machine learning, which is very efficient and runs in real time. A VC is similarly trained using machine learning. To avoid distracting the room participants, the system has no moving parts: the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned and evaluated using extensive crowdsourcing techniques on N=100 meetings, each 2-5 minutes in length.
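The pipeline the abstract describes (per-modality features fused and fed to an AdaBoost classifier) can be sketched as follows. This is a minimal illustration using scikit-learn, not the paper's implementation: the feature names, dimensions, and the use of random stand-in data are assumptions for demonstration only.

```python
# Hypothetical sketch of multimodal early fusion + AdaBoost for active
# speaker detection. Feature groups and sizes are illustrative, not the
# paper's actual feature set.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
n = 200  # stand-in number of candidate observations

# Stand-in per-modality features for each candidate speaker region:
vision = rng.normal(size=(n, 4))  # e.g. lip-motion cues from the 4K video
depth = rng.normal(size=(n, 2))   # e.g. distance/size cues from the depth camera
audio = rng.normal(size=(n, 3))   # e.g. sound-source-localization angles

# Simple early fusion: concatenate modality features into one vector.
X = np.hstack([vision, depth, audio])
y = rng.integers(0, 2, size=n)  # 1 = active speaker (random labels here)

# AdaBoost over shallow trees is cheap to evaluate, consistent with the
# abstract's real-time efficiency claim.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
pred = clf.predict(X)  # one speaking/not-speaking decision per candidate
```

At inference time, each video frame would yield one such feature vector per detected face, and the classifier's per-candidate scores would drive the virtual cinematographer's crop-and-zoom decision.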
Keywords
Active speaker detection, virtual cinematography, video conferencing, machine learning, computer vision, sound source localization, multimodal fusion, crowdsourcing