Cross-Modal Audio-Text Retrieval via Sequential Feature Augmentation

2023 2nd Asia Conference on Algorithms, Computing and Machine Learning (CACML 2023)

Abstract
The goal of cross-modal audio-text retrieval is to retrieve the target audio clips (or textual descriptions) relevant to a given textual (or audio) query. It is a challenging task because it requires learning comprehensive feature representations for two different modalities and unifying them into a common embedding space. However, most existing cross-modal audio-text retrieval approaches do not explicitly learn sequential representations of the audio features. Moreover, directly employing a fully connected neural network to map the different modalities into a common space is detrimental to sequential features. In this paper, we introduce a sequential feature augmentation framework based on reinforcement learning and feature fusion that enhances the sequential components of cross-modal features. First, we adopt reinforcement learning to discover effective sequential features within the audio and textual representations. Then, a recurrent fusion module is applied as a feature enhancement component to project the heterogeneous features into a common space. Extensive experiments are conducted on two prevalent datasets, AudioCaps and Clotho. The results demonstrate that our method achieves a significant improvement over previous state-of-the-art methods.
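The abstract's core idea, projecting sequential audio and text features into a shared embedding space via a recurrent fusion step, can be illustrated with a minimal sketch. This is not the authors' implementation: the simple tanh RNN, the random weights, and all dimensions below are illustrative assumptions standing in for the paper's recurrent fusion module, and the reinforcement-learning feature-selection stage is omitted entirely.

```python
import numpy as np

def recurrent_fuse(seq, W_h, W_x, b):
    # Simple tanh RNN over the sequence; the final hidden state
    # serves as the fused sequential representation (a stand-in
    # for the paper's recurrent fusion module).
    h = np.zeros(W_h.shape[0])
    for x in seq:
        h = np.tanh(W_h @ h + W_x @ x + b)
    return h

def embed(seq, rnn_params, W_proj):
    # Fuse the sequence, project into the common space, and
    # L2-normalize so cosine similarity reduces to a dot product.
    h = recurrent_fuse(seq, *rnn_params)
    z = W_proj @ h
    return z / (np.linalg.norm(z) + 1e-8)

# Illustrative dimensions and untrained random weights.
rng = np.random.default_rng(0)
d_audio, d_text, d_hid, d_common = 64, 32, 48, 16

audio_params = (rng.normal(size=(d_hid, d_hid)) * 0.1,
                rng.normal(size=(d_hid, d_audio)) * 0.1,
                np.zeros(d_hid))
text_params = (rng.normal(size=(d_hid, d_hid)) * 0.1,
               rng.normal(size=(d_hid, d_text)) * 0.1,
               np.zeros(d_hid))
W_a = rng.normal(size=(d_common, d_hid)) * 0.1  # audio projection
W_t = rng.normal(size=(d_common, d_hid)) * 0.1  # text projection

audio_seq = rng.normal(size=(10, d_audio))  # e.g. 10 audio frames
text_seq = rng.normal(size=(7, d_text))     # e.g. 7 token embeddings

za = embed(audio_seq, audio_params, W_a)
zt = embed(text_seq, text_params, W_t)
sim = float(za @ zt)  # cosine similarity used to rank candidates
```

In a trained system, retrieval ranks all candidate audio clips by `sim` against a text query (or vice versa), with the weights learned under a cross-modal objective rather than sampled at random.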
Keywords
cross-modal task,audio-text retrieval,reinforcement learning