DistiLLM: Towards Streamlined Distillation for Large Language Models
CoRR (2024)
Abstract
Knowledge distillation (KD) is widely used for compressing a teacher model to
a smaller student model, reducing its inference cost and memory footprint while
preserving model capabilities. However, current KD methods for auto-regressive
sequence models (e.g., large language models) lack a
standardized objective function. Moreover, the recent use of student-generated
outputs to address training-inference mismatches has significantly escalated
computational costs. To tackle these issues, we introduce DistiLLM, a more
effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence
loss, whose theoretical properties we unveil and leverage, and (2) an
adaptive off-policy approach designed to make more efficient use of
student-generated outputs. Extensive experiments, including
instruction-following tasks, demonstrate the effectiveness of DistiLLM in
building high-performing student models while achieving up to 4.3×
speedup compared to recent KD methods.
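
The abstract does not state the loss formula, so the following is only a minimal sketch of the general idea behind a skew KL divergence, assuming the standard skew-divergence form KL(p || alpha * p + (1 - alpha) * q), in which the second argument is mixed with the first. The function names, the value of `alpha`, and the toy distributions are illustrative assumptions, not taken from the paper, and the paper's exact objective may differ.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # Terms with p_i == 0 contribute nothing, so they are skipped.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def skew_kl_divergence(p, q, alpha=0.1):
    """Skew KL (assumed form): KL(p || alpha * p + (1 - alpha) * q)."""
    # Mixing p into q keeps the second argument positive wherever p is positive.
    mixture = [alpha * pi + (1 - alpha) * qi for pi, qi in zip(p, q)]
    return kl_divergence(p, mixture)

# Toy next-token distributions (hypothetical values).
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.5, 0.0]  # the student assigns zero mass to the third token

print(skew_kl_divergence(teacher, student, alpha=0.1))
```

In this toy case the plain KL divergence is undefined because the student gives zero probability to a token the teacher supports, while the skewed version stays finite. That kind of numerical behavior is one commonly cited motivation for skewing.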