Cross-Modal Audio-Text Retrieval via Sequential Feature Augmentation
2023 2ND ASIA CONFERENCE ON ALGORITHMS, COMPUTING AND MACHINE LEARNING, CACML 2023(2023)
Abstract
The goal of cross-modal audio-text retrieval is to retrieve the target audio clips (textual descriptions) relevant to a given textual (audio) query. It is a challenging task because it requires learning comprehensive feature representations for two different modalities and unifying them in a common embedding space. However, most existing cross-modal audio-text retrieval approaches do not explicitly model the sequential structure of audio features. Moreover, directly employing a fully connected neural network to transform the different modalities into a common space is detrimental to sequential features. In this paper, we introduce a sequential feature augmentation framework based on reinforcement learning and feature fusion that enhances the sequential components of cross-modal features. First, we adopt reinforcement learning to discover effective sequential features within the audio and textual representations. Then, a recurrent fusion module is applied as a feature enhancement component to project the heterogeneous features into a common space. Extensive experiments are conducted on two prevalent datasets, AudioCaps and Clotho. The results demonstrate that our method achieves significant improvements over previous state-of-the-art methods.
Key words
cross-modal task, audio-text retrieval, reinforcement learning
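The core pipeline the abstract describes, encoding each modality's feature sequence with a recurrent module and projecting both into a shared space where retrieval reduces to similarity ranking, can be illustrated with a minimal sketch. The Elman-style recurrence, dimensions, and weight initialisation below are illustrative assumptions, not the paper's actual recurrent fusion module or its reinforcement-learning augmentation step:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_encode(seq, W_in, W_h):
    """Encode a (T, d_in) feature sequence into a d_common-dim embedding
    with a plain Elman recurrence (a stand-in for the paper's recurrent
    fusion module, whose exact form is not specified in the abstract)."""
    h = np.zeros(W_h.shape[0])
    for x in seq:
        h = np.tanh(W_in @ x + W_h @ h)
    # L2-normalise so a dot product in the common space is cosine similarity.
    return h / (np.linalg.norm(h) + 1e-8)

# Hypothetical feature sizes for the two modalities and the common space.
d_audio, d_text, d_common = 64, 32, 16
W_a_in = rng.normal(scale=0.1, size=(d_common, d_audio))
W_a_h  = rng.normal(scale=0.1, size=(d_common, d_common))
W_t_in = rng.normal(scale=0.1, size=(d_common, d_text))
W_t_h  = rng.normal(scale=0.1, size=(d_common, d_common))

# Toy batch: 3 audio clips (10 frames each) and 3 captions (5 tokens each).
audio = [rng.normal(size=(10, d_audio)) for _ in range(3)]
text  = [rng.normal(size=(5, d_text)) for _ in range(3)]

A = np.stack([rnn_encode(s, W_a_in, W_a_h) for s in audio])  # (3, d_common)
T = np.stack([rnn_encode(s, W_t_in, W_t_h) for s in text])   # (3, d_common)

sim = A @ T.T                      # cosine similarities in the common space
ranks = np.argsort(-sim, axis=1)   # caption candidates ranked per audio query
print(sim.shape)
```

In a trained system the encoder weights would be learned with a contrastive retrieval objective, and the sequential features fed to the encoders would first be refined by the reinforcement-learning exploration step; here the weights are random and serve only to show the data flow.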