See me Speaking? Differentiating on Whether Words are Spoken On Screen or Off to Optimize Machine Dubbing

Multimodal Interfaces and Machine Learning for Multimodal Interaction (2020)

Citations: 4 | Views: 20
Abstract
Dubbing is the art of finding a translation from a source into a target language that can be lip-synchronously revoiced, i.e., that makes the target-language speech appear as if it had been spoken by the original actors all along. Lip synchrony is essential for the full-fledged reception of foreign audiovisual media, such as movies and series, as violated constraints of synchrony between video (lips) and audio (speech) lead to cognitive dissonance and reduce the perceptual quality. Of course, synchrony constraints only apply to the translation when the speaker's lips are visible on screen. Therefore, deciding whether to apply synchrony constraints requires an automatic method for detecting whether an actor's lips are visible on screen for a given stretch of speech. In this paper, we attempt, for the first time, to classify on-screen from off-screen speech based on a corpus of real-world television material that has been annotated word-by-word for the visibility of talking lips on screen. We present classification experiments in which we classify …
Keywords
audiovisual machine translation, dubbing, multi-modal speech processing, activity recognition