Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependant
CoRR (2024)
Abstract
This paper explores whether considering alternative domain-specific
embeddings to calculate the Fréchet Audio Distance (FAD) metric can help the
FAD to correlate better with perceptual ratings of environmental sounds. We
used embeddings from VGGish, PANNs, MS-CLAP, L-CLAP, and MERT, which are
tailored for either music or environmental sound evaluation. The FAD scores
were calculated for sounds from the DCASE 2023 Task 7 dataset. Using perceptual
data from the same task, we find that PANNs-WGM-LogMel produces the best
correlation between FAD scores and perceptual ratings of both audio quality and
perceived fit, with a Spearman correlation higher than 0.5. We also find that
music-specific embeddings resulted in significantly lower correlations.
Interestingly, VGGish, the embedding used in the original FAD
formulation, yielded a correlation below 0.1. These results underscore the
critical importance of the choice of embedding for the FAD metric design.
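
For readers unfamiliar with the metric, the following is a minimal sketch of how a Fréchet distance between two sets of embeddings is commonly computed: each set is summarized by a Gaussian (mean and covariance), and the distance between the two Gaussians is evaluated in closed form. This is not the paper's code. The function names are hypothetical, and random arrays stand in for the embeddings that a model such as VGGish or PANNs would produce.

```python
import numpy as np


def gaussian_stats(embeddings):
    """Return the mean vector and covariance matrix of an (n_frames, dim) array."""
    mu = embeddings.mean(axis=0)
    sigma = np.cov(embeddings, rowvar=False)
    return mu, sigma


def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g).

    d^2 = ||mu_r - mu_g||^2 + Tr(sigma_r + sigma_g - 2 (sigma_r sigma_g)^(1/2))
    """
    diff = mu_r - mu_g
    # Tr((sigma_r sigma_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma_r @ sigma_g. Those eigenvalues are real and
    # non-negative in exact arithmetic; clip small numerical noise.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)


def frechet_audio_distance(ref_embeddings, gen_embeddings):
    """FAD between reference and generated embedding sets (hypothetical helper)."""
    mu_r, sigma_r = gaussian_stats(ref_embeddings)
    mu_g, sigma_g = gaussian_stats(gen_embeddings)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)


if __name__ == "__main__":
    # Assumption: random vectors stand in for real audio embeddings.
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(500, 4))
    generated = reference + 3.0  # same covariance, shifted mean
    print(frechet_audio_distance(reference, reference))
    print(frechet_audio_distance(reference, generated))
```

Because the statistics are computed in the embedding space, swapping the embedding model changes both the means and the covariances, which is why the resulting FAD scores can rank the same systems differently. Comparing such scores against listener ratings with a rank correlation such as Spearman's is the kind of analysis the abstract describes.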