Listen As You Wish: Fusion of Audio and Text for Cross-Modal Event Detection in Smart Cities

Information Fusion (2024)

Abstract
In the era of smart cities, Internet of Things technology has driven a proliferation of multimodal sensor data, presenting new challenges in cross-modal event detection, particularly in audio event detection via textual queries. This paper focuses on the novel task of text-to-audio grounding (TAG), which aims to precisely localize the sound segments in an untrimmed audio recording that correspond to events described in a textual query. This challenging task requires fusing multimodal (acoustic and linguistic) information and reasoning about the cross-modal semantic match between the given audio and the textual query. Unlike conventional methods, which often overlook the nuanced interactions between and within modalities, we introduce the Cross-modal Graph Interaction (CGI) model. The model uses a language graph to capture complex semantic relationships between query words, improving its understanding of textual queries. A cross-modal attention mechanism then generates snippet-specific query representations, enabling fine-grained semantic matching between audio segments and textual descriptions. A cross-gating module further refines this process by emphasizing relevant features across modalities and suppressing irrelevant information, improving multimodal information fusion. Our comprehensive evaluation on the AudioGrounding benchmark dataset demonstrates the CGI model’s superior performance over existing methods and underscores the importance of sophisticated multimodal interaction for effective TAG in smart cities.
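The abstract names two fusion steps, snippet-specific query representations and cross-gating, without giving their equations. The following is a minimal sketch of how such steps are commonly built, assuming scaled dot-product attention and sigmoid gates; it is not the CGI model's actual formulation. All function names, the weight matrices `w_a` and `w_q`, and the random toy features are hypothetical, and the language graph is omitted.

```python
import math
import random


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(w, v):
    # w is a list of rows; returns w @ v.
    return [dot(row, v) for row in w]


def snippet_specific_queries(audio, words):
    # Assumption: each audio snippet attends over the query words with
    # scaled dot-product attention, giving one query vector per snippet.
    d = len(words[0])
    queries, all_weights = [], []
    for snippet in audio:
        weights = softmax([dot(snippet, w) / math.sqrt(d) for w in words])
        query = [sum(a * w[k] for a, w in zip(weights, words)) for k in range(d)]
        queries.append(query)
        all_weights.append(weights)
    return queries, all_weights


def cross_gate(snippet, query, w_a, w_q):
    # Assumption: each modality is scaled element-wise by a sigmoid gate
    # computed from the other modality.
    gate_for_audio = [sigmoid(z) for z in matvec(w_q, query)]
    gate_for_query = [sigmoid(z) for z in matvec(w_a, snippet)]
    gated_audio = [g * a for g, a in zip(gate_for_audio, snippet)]
    gated_query = [g * q for g, q in zip(gate_for_query, query)]
    return gated_audio, gated_query


# Toy data: 5 audio snippets, 3 query words, feature size 4 (all hypothetical).
random.seed(0)
T, N, d = 5, 3, 4
audio = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
words = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
w_a = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]
w_q = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]

queries, weights = snippet_specific_queries(audio, words)
gated = [cross_gate(a, q, w_a, w_q) for a, q in zip(audio, queries)]
# Fuse by concatenating the gated audio and gated query features per snippet.
fused = [ga + gq for ga, gq in gated]
```

In this sketch each row of `weights` is a distribution over query words for one snippet, and `fused` holds one concatenated feature vector per snippet that a downstream localization head could score.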
Key words
Smart city, Multimodal information fusion, Cross-modal learning, Text-to-audio grounding, Graph neural network