BLSP: Bootstrapping Language-Speech Pre-training Via Behavior Alignment of Continuation Writing
arXiv (Cornell University), 2023
Abstract
The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text remains an open problem. Current solutions can be categorized into two strategies. One is a cascaded approach where outputs (tokens or states) of a separately trained speech recognition system are used as inputs for LLMs, which limits their potential in modeling alignment between speech and text. The other is an end-to-end approach that relies on speech instruction data, which is very difficult to collect in large quantities. In this paper, we address these issues and propose the BLSP approach that Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM, ensuring that the LLM exhibits the same generation behavior regardless of the modality of input: a speech segment or its transcript. The training process can be divided into two steps. The first step prompts an LLM to generate texts with speech transcripts as prefixes, obtaining text continuations. In the second step, these continuations are used as supervised signals to train the modality adapter in an end-to-end manner. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
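The two-step recipe described in the abstract can be pictured with a small PyTorch sketch. Everything below is a hypothetical stand-in rather than the paper's actual implementation: the module names (`speech_encoder`, `adapter`, `llm_embed`, `llm_layer`, `llm_head`), sizes, and data are placeholders, and the real system pairs a frozen pretrained speech encoder with a full LLM while training only the lightweight adapter.

```python
# Hypothetical sketch of BLSP-style behavior alignment. All modules are tiny
# stand-ins for illustration; only the modality adapter is trainable, matching
# the setup described in the abstract (frozen speech encoder, frozen LLM).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 32000, 512

# Frozen stand-ins for the speech encoder and the LLM (embeddings, one layer, head).
speech_encoder = nn.Linear(80, DIM)                       # placeholder speech encoder
llm_embed = nn.Embedding(VOCAB, DIM)                      # placeholder LLM embedding table
llm_layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
llm_head = nn.Linear(DIM, VOCAB)                          # placeholder LLM output head
for module in (speech_encoder, llm_embed, llm_layer, llm_head):
    for p in module.parameters():
        p.requires_grad = False

# The only trainable piece: a lightweight modality adapter on top of the encoder.
adapter = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# Step 1 (offline): prompt the LLM with the speech transcript as a prefix and
# keep its text continuation as the supervision target. Faked here as random ids.
continuation_ids = torch.randint(0, VOCAB, (1, 16))

# Step 2: train the adapter so the frozen LLM, given the adapted speech prefix,
# reproduces the same continuation it wrote from the transcript.
speech_features = torch.randn(1, 100, 80)                 # (batch, frames, mel bins)
prefix = adapter(speech_encoder(speech_features))         # adapted speech prefix
cont_embeds = llm_embed(continuation_ids)                 # teacher-forced continuation
inputs = torch.cat([prefix, cont_embeds], dim=1)
mask = nn.Transformer.generate_square_subsequent_mask(inputs.size(1))
hidden = llm_layer(inputs, src_mask=mask)

# Next-token prediction: positions prefix_len-1 .. end-1 predict the continuation.
logits = llm_head(hidden[:, prefix.size(1) - 1:-1, :])
loss = F.cross_entropy(logits.reshape(-1, VOCAB), continuation_ids.reshape(-1))
loss.backward()
optimizer.step()
```

Because the loss depends on the continuation the LLM itself produced from the transcript, minimizing it pushes the adapted speech representation to elicit the same generation behavior as the text input, which is the behavior-alignment idea at the core of BLSP.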
Key words
Language Modeling, Topic Modeling, Statistical Language Modeling, Part-of-Speech Tagging, Word Representation