Align-to-Distill: Trainable Attention Alignment for Knowledge Distillation in Neural Machine Translation
arXiv (2024)
Abstract
The advent of scalable deep models and large datasets has improved the
performance of Neural Machine Translation. Knowledge Distillation (KD) enhances
efficiency by transferring knowledge from a teacher model to a more compact
student model. However, KD approaches to Transformer architecture often rely on
heuristics, particularly when deciding which teacher layers to distill from. In
this paper, we introduce the 'Align-to-Distill' (A2D) strategy, designed to
address the feature mapping problem by adaptively aligning student attention
heads with their teacher counterparts during training. The Attention Alignment
Module in A2D performs a dense head-by-head comparison between student and
teacher attention heads across layers, turning the combinatorial mapping
heuristics into a learning problem. Our experiments show the efficacy of A2D,
demonstrating gains of up to +3.61 and +0.63 BLEU points for WMT-2022 De->Dsb
and WMT-2014 En->De, respectively, compared to Transformer baselines.
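
The abstract describes the Attention Alignment Module as a dense, head-by-head comparison between student and teacher attention heads that is learned during training. The sketch below is a minimal, hypothetical illustration of that idea (not the authors' released code): student attention maps from all layers are stacked along a head axis and projected with a learnable pointwise convolution onto the teacher's heads, and the mismatch is penalized with a KL-divergence loss. The class name, tensor shapes, and the choice of KL divergence are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAlignmentModule(nn.Module):
    """Hypothetical sketch of a trainable head-by-head attention alignment.

    Student attention scores from all layers are stacked along a 'channel'
    (head) axis and projected with a 1x1 convolution, so every teacher head is
    compared against a learned combination of student heads. Shapes and loss
    are illustrative assumptions, not the paper's exact formulation.
    """

    def __init__(self, n_student_heads_total: int, n_teacher_heads_total: int):
        super().__init__()
        # Pointwise conv over the head dimension:
        # (B, S_heads, T, T) -> (B, T_heads, T, T)
        self.head_proj = nn.Conv2d(
            n_student_heads_total, n_teacher_heads_total, kernel_size=1
        )

    def forward(self, student_attn: torch.Tensor, teacher_attn: torch.Tensor) -> torch.Tensor:
        # student_attn: (B, S_heads, T, T) pre-softmax attention scores,
        #   stacked over all student layers and heads.
        # teacher_attn: (B, T_heads, T, T) pre-softmax attention scores,
        #   stacked over all teacher layers and heads.
        aligned = self.head_proj(student_attn)           # (B, T_heads, T, T)
        log_p = F.log_softmax(aligned, dim=-1)           # aligned student distributions
        q = F.softmax(teacher_attn, dim=-1)              # teacher distributions
        # KL divergence between teacher and aligned student attention
        return F.kl_div(log_p, q, reduction="batchmean")


# Toy usage (illustrative sizes): student with 8 total heads, teacher with 48,
# batch of 2 sentences of length 16.
aam = AttentionAlignmentModule(n_student_heads_total=8, n_teacher_heads_total=48)
student_scores = torch.randn(2, 8, 16, 16)
teacher_scores = torch.randn(2, 48, 16, 16)
alignment_loss = aam(student_scores, teacher_scores)
```

Because the projection is trained jointly with the student, which teacher heads each student head should imitate is learned rather than fixed by a hand-designed layer mapping, which is the "combinatorial heuristics into a learning problem" point made in the abstract.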