Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization
CoRR (2023)
Abstract
Large language models (LLMs) have demonstrated impressive abilities in various domains, but their inference cost is high. State-of-the-art methods apply 2-bit quantization to mainstream LLMs. However, challenges remain: (1) Non-negligible accuracy loss from 2-bit quantization. Weights are quantized by groups, but the ranges of weights are large in some groups, resulting in large quantization errors and non-negligible accuracy loss (e.g., >3% for Llama2-7b with 2-bit quantization in GPTQ and Greenbit). (2) Limited accuracy improvement from adding 4-bit weights. Adding 10% extra average bits with more 4-bit weights yields only a <0.5% accuracy improvement on a quantized Llama2-7b. (3) Time-consuming dequantization operations on GPUs. Dequantization operations account for >50% of execution time, hindering the potential to reduce LLM inference cost. To tackle these challenges, we propose the following techniques: (1) We quantize only the small fraction of groups with larger ranges using 4 bits, taking GPU memory alignment into consideration. (2) We point out that the distribution of sparse outliers with larger weights differs between 2-bit and 4-bit groups, and only a small fraction of outliers require 16 bits. This design yields a >0.5% accuracy improvement with a <3% average bit increase for Llama2-7b. (3) We design asynchronous dequantization on GPUs, leading to up to a 3.92X speedup. We conduct extensive experiments on different model families and model sizes. We achieve an average of 2.85 bits per weight, the end-to-end speedup for Llama2-7b is 1.74X over the original model, and we reduce both runtime cost and hardware cost by up to 2.70X and 2.81X, respectively, while requiring fewer GPUs.
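
To make the mixed-precision scheme concrete, below is a minimal NumPy sketch of the first two techniques: group-wise quantization where most groups use 2 bits, the groups with the largest value ranges fall back to 4 bits, and a small fraction of high-error weights is kept as sparse 16-bit outliers. The function name, group size, and fractions are illustrative assumptions, not the paper's actual implementation, which keeps weights packed in GPU memory with alignment-aware layouts.

    import numpy as np

    def mixed_precision_dequant(weights, group_size=64, frac_4bit=0.1, frac_outlier=0.01):
        # Hypothetical sketch: group-wise 2-bit quantization with a 4-bit
        # fallback for wide-range groups and sparse 16-bit outliers.
        assert weights.size % group_size == 0
        w = weights.reshape(-1, group_size).astype(np.float32)

        # Rank groups by value range; the widest-range groups get 4 bits.
        ranges = w.max(axis=1) - w.min(axis=1)
        n4 = max(1, int(frac_4bit * len(w)))
        four_bit = set(np.argsort(ranges)[-n4:].tolist())

        dq = np.empty_like(w)
        for g in range(len(w)):
            bits = 4 if g in four_bit else 2
            levels = 2**bits - 1
            lo, hi = w[g].min(), w[g].max()
            scale = max((hi - lo) / levels, 1e-8)   # avoid divide-by-zero
            q = np.clip(np.round((w[g] - lo) / scale), 0, levels)
            dq[g] = q * scale + lo                  # dequantized values

        # Keep the weights with the largest quantization error in 16 bits.
        err = np.abs(dq - w).ravel()
        n_out = max(1, int(frac_outlier * err.size))
        idx = np.argsort(err)[-n_out:]
        out = dq.ravel()
        out[idx] = w.ravel()[idx]                   # stored sparsely as fp16 in practice
        return out.reshape(weights.shape)

    # Example: quantize a random weight matrix and measure the error.
    W = np.random.randn(256, 256).astype(np.float32)
    W_hat = mixed_precision_dequant(W)
    print("mean abs error:", np.abs(W - W_hat).mean())

Note that this sketch only mirrors the bit-allocation logic; in the paper, the low-bit weights stay packed on the GPU and dequantization is overlapped with computation, which is the asynchronous dequantization that produces the reported speedup.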