Expressivity-aware Music Performance Retrieval Using Mid-level Perceptual Features and Emotion Word Embeddings
FIRE '23: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation (2024)
Abstract
This paper explores a specific sub-task of cross-modal music retrieval. We consider the delicate task of retrieving a performance or rendition of a musical piece based on a description of its style, expressive character, or emotion from a set of different performances of the same piece. We observe that a general-purpose cross-modal system trained to learn a common text-audio embedding space does not yield optimal results for this task. By introducing two changes, one each to the text encoder and the audio encoder, we demonstrate improved performance on a dataset of piano performances and associated free-text descriptions. On the text side, we use emotion-enriched word embeddings (EWE), and on the audio side, we extract mid-level perceptual features instead of generic audio embeddings. Our results highlight the effectiveness of mid-level perceptual features learnt from music and emotion-enriched word embeddings learnt from emotion-labelled text in capturing musical expression in a cross-modal setting. Additionally, our interpretable mid-level features provide a route for introducing explainability in the retrieval and downstream recommendation processes.
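To make the retrieval setup concrete, the sketch below scores several renditions of the same piece against a text query in a shared embedding space and ranks them by cosine similarity. This is a minimal illustration under stated assumptions, not the paper's actual architecture: the dimensions, random projections, and stand-in vectors are placeholders for the paper's trained encoders, where the text side would use emotion-enriched word embeddings (EWE) and the audio side would use learned mid-level perceptual features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper):
# EWE-style word embeddings -> 300-d text vectors;
# mid-level perceptual features -> 7-d audio vectors
# (e.g. melodiousness, articulation, rhythmic complexity, ...).
TEXT_DIM, AUDIO_DIM, SHARED_DIM = 300, 7, 64

# Stand-in projections into the shared embedding space.
# In a trained system these would come from the text/audio encoders.
W_text = rng.standard_normal((SHARED_DIM, TEXT_DIM))
W_audio = rng.standard_normal((SHARED_DIM, AUDIO_DIM))


def embed(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project a feature vector into the shared space and L2-normalise it."""
    z = W @ x
    return z / np.linalg.norm(z)


def rank_performances(query_vec: np.ndarray,
                      performance_feats: list[np.ndarray]) -> list[int]:
    """Return indices of performances sorted by cosine similarity
    to the text query (highest first). With unit-norm vectors,
    the dot product equals cosine similarity."""
    q = embed(query_vec, W_text)
    scores = [float(q @ embed(a, W_audio)) for a in performance_feats]
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Toy data: one text query and three renditions of the same piece.
query = rng.standard_normal(TEXT_DIM)           # e.g. mean of EWE word vectors
performances = [rng.standard_normal(AUDIO_DIM)  # mid-level feature vectors
                for _ in range(3)]
print(rank_performances(query, performances))   # e.g. [2, 0, 1]
```

Because the mid-level feature vector is low-dimensional and each dimension corresponds to a named perceptual quality, similarity scores can in principle be traced back to individual features, which is the explainability route the abstract points to.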