Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers
CoRR (2024)
Abstract
Recently, multiple architectures have been proposed to improve the efficiency
of Transformer language models by changing the design of the self-attention
block to achieve linear-cost inference (LCI). A notable approach in this realm
is the State Space Model (SSM) architecture, which has shown performance on par
with self-attention Transformers on language modeling tasks. However, such an
architectural change requires a full pretraining of the weights from scratch,
which incurs a huge cost for researchers and practitioners who want to use the
new architectures. In the more traditional linear-attention literature, it has
been proposed to approximate full attention with linear attention via a
swap-and-finetune framework. Motivated by this approach, we propose
Cross-Architecture Transfer Learning (XATL), in which the weights of the
components shared between LCI and self-attention-based Transformers, such as
layer norms, MLPs, and input/output embeddings, are transferred directly to the
new architecture from already pretrained model parameters. We evaluated the
efficacy of the method across model sizes and alternative attention
architectures and show that XATL reduces training time by up to 2.5x and
converges to a better minimum, improving LM benchmark scores by up to 2.6
within the same compute budget.
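
To illustrate the kind of weight transfer the abstract describes, the following is a minimal PyTorch-style sketch, not the authors' released code. It assumes the shared components (embeddings, layer norms, MLPs) keep the same parameter names and shapes in both architectures; the function name transfer_shared_weights and the models pretrained_attn_lm and lci_lm are hypothetical.

    import torch.nn as nn

    def transfer_shared_weights(src: nn.Module, dst: nn.Module) -> list[str]:
        # Copy every tensor whose name and shape match between the pretrained
        # self-attention model (src) and the new LCI model (dst). Tensors that
        # belong to the swapped-out token-mixing block have no matching
        # counterpart and therefore keep their fresh initialization.
        src_state = src.state_dict()
        dst_state = dst.state_dict()
        transferred = []
        for name, tensor in dst_state.items():
            if name in src_state and src_state[name].shape == tensor.shape:
                dst_state[name] = src_state[name].clone()
                transferred.append(name)
        dst.load_state_dict(dst_state)
        return transferred

    # Hypothetical usage: copy the shared weights, then train the LCI model
    # (only the new attention-replacement blocks start from scratch).
    # copied = transfer_shared_weights(pretrained_attn_lm, lci_lm)
    # print(f"Transferred {len(copied)} shared tensors")

Matching by parameter name and shape is just one simple way to select the shared components; any explicit mapping between the two architectures' modules would serve the same purpose.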