AV-RIR: Audio-Visual Room Impulse Response Estimation
CVPR 2024(2023)
Abstract
Accurate estimation of Room Impulse Response (RIR), which captures an
environment's acoustic properties, is important for speech processing and AR/VR
applications. We propose AV-RIR, a novel multi-modal multi-task learning
approach to accurately estimate the RIR from a given reverberant speech signal
and the visual cues of its corresponding environment. AV-RIR builds on a novel
neural codec-based architecture that effectively captures environment geometry
and material properties and solves speech dereverberation as an auxiliary task
through multi-task learning. We also propose Geo-Mat features, which augment
visual cues with material information, and CRIP, which improves the late
reverberation components of the estimated RIR by 86% via image-to-RIR retrieval. Empirical
results show that AV-RIR quantitatively outperforms previous audio-only and
visual-only approaches, achieving a 36% to 63% improvement across various
acoustic metrics in RIR estimation. It also achieves higher
preference scores in human evaluation. As an auxiliary benefit, dereverberated
speech from AV-RIR performs competitively with the state of the art on
various spoken language processing tasks and outperforms it on the reverberation
time error score on the real-world AVSpeech dataset. Qualitative examples of both
synthesized reverberant speech and enhanced speech can be found at
https://www.youtube.com/watch?v=tTsKhviukAE.
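
As background for the terms used above, the following is a minimal sketch of how an RIR relates clean speech to reverberant speech, and of what a reverberation time measures. It is not the AV-RIR model or its evaluation code. The function names (`synthetic_rir`, `reverberate`, `estimate_rt60`), the toy exponentially decaying noise RIR, and the -5 dB to -25 dB fitting range are all assumptions chosen for illustration; only NumPy is used.

```python
import numpy as np


def synthetic_rir(rt60, sr=16000, length_s=1.0, seed=0):
    """Toy RIR (an assumption, not a measured room): white noise shaped by an
    exponential envelope whose amplitude has dropped 60 dB at t = rt60."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(sr * length_s)) / sr
    envelope = 10.0 ** (-3.0 * t / rt60)  # 10^-3 in amplitude = -60 dB
    return rng.standard_normal(t.size) * envelope


def reverberate(clean, rir):
    """Reverberant signal modeled as the clean signal convolved with the RIR."""
    return np.convolve(clean, rir)


def estimate_rt60(rir, sr=16000):
    """Estimate reverberation time from an RIR using Schroeder backward
    integration and a line fit between -5 dB and -25 dB of the decay curve."""
    # Energy decay curve: energy remaining from each sample to the end.
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(energy / energy[0])
    t = np.arange(rir.size) / sr
    mask = (edc_db <= -5.0) & (edc_db >= -25.0)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # dB per second
    return -60.0 / slope  # time for a 60 dB decay at the fitted rate


if __name__ == "__main__":
    rir = synthetic_rir(rt60=0.4)
    clean = np.random.default_rng(1).standard_normal(16000)  # stand-in for speech
    wet = reverberate(clean, rir)
    print("reverberant length:", wet.size)
    print("estimated RT60 (s):", estimate_rt60(rir))
```

A reverberation time error of the kind mentioned in the abstract could then be computed, under the same assumptions, as the absolute difference between `estimate_rt60` applied to an estimated RIR and to a reference RIR.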