MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
arXiv (2024)
Abstract
In the realm of audio-language pre-training (ALP), the challenge of achieving
cross-modal alignment is significant. Moreover, the integration of audio inputs
with diverse distributions and task variations poses challenges in developing
generic audio-language models. In this study, we introduce MINT, a novel ALP
framework boosting audio-language models through multi-target pre-training and
instruction tuning. MINT leverages the strength of frozen pre-trained audio
encoders and large language models (LLMs) to improve audio-language
pre-training, enabling effective transferability to both audio-text
understanding and generation tasks. To address the modality gap, we propose
Bridge-Net, a lightweight trainable module that enhances cross-modality
alignment and the model's ability to follow instructions for a variety of
audio-text tasks. Bridge-Net is pivotal within MINT, initially enhancing
audio-language representation learning through a multi-target pre-training
approach. Subsequently, Bridge-Net further boosts audio-to-language generative
learning by integrating a frozen language model with instruction tuning. This
integration empowers MINT to extract features in a flexible and effective
manner, specifically tailored to the provided instructions for diverse tasks.
Experimental results demonstrate that MINT attains superior performance across
various audio-language understanding and generation tasks, highlighting its
robust generalization capabilities even in zero-shot scenarios.
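
The frozen-backbone setup described above can be illustrated with a toy sketch in which only a small bridge module is updated while the encoder stays fixed. Everything in it is an assumption made for illustration: the abstract does not specify Bridge-Net's architecture or the pre-training targets, so a random fixed matrix stands in for the frozen audio encoder, a single linear map stands in for Bridge-Net, and a squared-error alignment loss replaces the paper's multi-target objectives. The names `FROZEN_ENCODER`, `bridge`, and `train_step` are hypothetical.

```python
import copy
import random

random.seed(0)

AUDIO_DIM, ENC_DIM, LM_DIM = 6, 4, 3  # toy sizes, chosen arbitrarily


def rand_matrix(rows, cols):
    """Matrix with entries drawn uniformly from [-1, 1]."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Hypothetical stand-in for a frozen pre-trained audio encoder: never updated.
FROZEN_ENCODER = rand_matrix(ENC_DIM, AUDIO_DIM)
# Hypothetical stand-in for Bridge-Net: the only trainable parameters.
bridge = rand_matrix(LM_DIM, ENC_DIM)


def train_step(bridge, audio, target, lr=0.5):
    """One normalized gradient step on 0.5 * ||bridge(enc(audio)) - target||^2.

    Updates `bridge` in place and returns the loss measured before the update.
    The squared-error loss is a placeholder, not the paper's objective.
    """
    feats = matvec(FROZEN_ENCODER, audio)  # frozen path: no weights change
    pred = matvec(bridge, feats)
    err = [p - t for p, t in zip(pred, target)]
    # Dividing by the feature norm keeps the step stable for any input scale.
    step = lr / (1e-8 + sum(f * f for f in feats))
    for i in range(LM_DIM):
        for j in range(ENC_DIM):
            bridge[i][j] -= step * err[i] * feats[j]
    return 0.5 * sum(e * e for e in err)


# One toy (audio, text-embedding) pair; real training would use many pairs.
audio = [random.uniform(-1, 1) for _ in range(AUDIO_DIM)]
target = [random.uniform(-1, 1) for _ in range(LM_DIM)]

encoder_snapshot = copy.deepcopy(FROZEN_ENCODER)
losses = [train_step(bridge, audio, target) for _ in range(20)]
```

After the loop, `losses` shrinks toward zero while `FROZEN_ENCODER` still equals `encoder_snapshot`, which is the point of the design: the adaptation cost is confined to the lightweight bridge parameters.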