DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training
arXiv (2023)
Abstract
FlashAttention (Dao, 2023) effectively reduces the quadratic peak memory
usage to linear in training transformer-based large language models (LLMs) on a
single GPU. In this paper, we introduce DISTFLASHATTN, a distributed
memory-efficient attention mechanism optimized for long-context LLMs training.
We propose three key techniques: token-level workload balancing, overlapping
key-value communication, and a rematerialization-aware gradient checkpointing
algorithm. We evaluate DISTFLASHATTN on Llama-7B and variants with sequence
lengths from 32K to 512K. DISTFLASHATTN achieves 8x longer sequences and 4.45 -
5.64x speedup compared to Ring Self-Attention, and 2 - 8x longer sequences and
1.24 - 2.01x speedup compared to Megatron-LM with FlashAttention. It achieves
1.67x and 1.26 - 1.88x speedup compared to the recent Ring Attention and
DeepSpeed-Ulysses, respectively. Code is available at https://github.com/RulinShao/LightSeq.
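As context for the second technique named above, overlapping key-value communication, the following is a minimal sketch of how KV-block transfers can be overlapped with local attention computation in a ring-style schedule. It is not the DISTFLASHATTN implementation: the function local_attention, the buffer names, and the simple accumulation (which omits the online-softmax rescaling needed to combine blocks exactly) are illustrative assumptions. It uses standard torch.distributed asynchronous point-to-point calls and assumes an initialized process group (e.g. launched with torchrun).

    # Hedged sketch of overlapping KV communication with attention compute.
    # Assumptions: an initialized torch.distributed process group; a caller-
    # provided local_attention(q, kv) function; exact softmax combination
    # across blocks is omitted for brevity.
    import torch
    import torch.distributed as dist

    def ring_attention_step(q, kv_block, local_attention):
        """One pass over all workers' KV blocks, overlapping P2P transfers
        with attention computation on the block already held locally."""
        rank, world_size = dist.get_rank(), dist.get_world_size()
        send_to, recv_from = (rank + 1) % world_size, (rank - 1) % world_size

        out = torch.zeros_like(q)
        recv_buf = torch.empty_like(kv_block)

        for step in range(world_size):
            # 1. Kick off the asynchronous transfer of the next KV block.
            if step < world_size - 1:
                send_req = dist.isend(kv_block, dst=send_to)
                recv_req = dist.irecv(recv_buf, src=recv_from)
            # 2. Compute attention on the block already in hand, so the
            #    transfer proceeds in the background.
            out += local_attention(q, kv_block)
            # 3. Wait for the transfer to finish, then rotate buffers.
            if step < world_size - 1:
                send_req.wait()
                recv_req.wait()
                kv_block, recv_buf = recv_buf, kv_block
        return out

The key design point illustrated here is that communication of the next block is issued before, not after, the attention computation on the current block, so transfer latency is hidden behind compute rather than added to it.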