SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks
CoRR (2024)
Abstract
Large language models (LLMs) have proven to be highly effective across
various natural language processing tasks. However, their large number of
parameters poses significant challenges for practical deployment. Pruning, a
technique aimed at reducing the size and complexity of LLMs, offers a potential
solution by removing redundant components from the network. Despite the promise
of pruning, existing methods often struggle to achieve substantial end-to-end
LLM inference speedup. In this paper, we introduce SLEB, a novel approach
designed to streamline LLMs by eliminating redundant transformer blocks. We
choose the transformer block as the fundamental unit for pruning, because LLMs
exhibit block-level redundancy with high similarity between the outputs of
neighboring blocks. This choice allows us to effectively enhance the processing
speed of LLMs. Our experimental results demonstrate that SLEB successfully
accelerates LLM inference without compromising the linguistic capabilities of
these models, making it a promising technique for optimizing the efficiency of
LLMs. The code is available at: https://github.com/leapingjagg-dev/SLEB
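The block-redundancy intuition described above can be probed directly: if the hidden states entering and leaving a transformer block are nearly identical, that block contributes little and is a natural pruning candidate. Below is a minimal sketch of such a probe, assuming a Hugging Face-style decoder model (facebook/opt-125m is used purely for illustration) and using per-token cosine similarity as an illustrative proxy; this is not necessarily the exact scoring metric SLEB employs.

```python
# Sketch: measure how much each transformer block changes its input.
# A similarity close to 1 means the block's output closely mirrors its
# input, suggesting block-level redundancy as described in the abstract.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # small model chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Large language models exhibit block-level redundancy."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[i] is the input to block i; hidden_states[i + 1] is its output.
hidden = out.hidden_states
for i in range(len(hidden) - 1):
    sim = F.cosine_similarity(hidden[i], hidden[i + 1], dim=-1).mean().item()
    print(f"block {i:2d}: input/output cosine similarity = {sim:.4f}")
```

Blocks with the highest input/output similarity would be the first candidates for elimination under this kind of redundancy criterion.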