A Comprehensive Evaluation of Quantization Strategies for Large Language Models
Findings of the Association for Computational Linguistics: ACL 2024 (2024)
Abstract
Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular due to the rise of LLMs. However, most quantization studies use pre-trained LLMs, and the impact of quantization on instruction-tuned LLMs and the relationship between perplexity and benchmark performance of quantized LLMs are not well understood. Evaluation of quantized LLMs is often limited to language modeling and a few classification tasks, leaving their performance on other benchmarks unclear. To address these gaps, we propose a structured evaluation framework consisting of three critical dimensions: (1) knowledge & capacity, (2) alignment, and (3) efficiency, and conduct extensive experiments across ten diverse benchmarks. Our experimental results indicate that LLMs with 4-bit quantization can retain performance comparable to their non-quantized counterparts, and perplexity can serve as a proxy metric for quantized LLMs on most benchmarks. Furthermore, quantized LLMs with larger parameter scales can outperform smaller LLMs. Despite the memory savings achieved through quantization, it can also slow down the inference speed of LLMs. Consequently, substantial engineering efforts and hardware support are imperative to achieve a balanced optimization of decoding speed and memory consumption in the context of quantized LLMs.
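As a point of reference for the quantization idea described in the abstract (reducing the bits used to store model weights), the sketch below shows plain round-to-nearest symmetric 4-bit weight quantization. It is not the method evaluated in the paper; the function names, the per-tensor scaling scheme, and the NumPy implementation are illustrative assumptions only.

```python
# Illustrative sketch: symmetric round-to-nearest 4-bit weight quantization.
# This is a minimal example of the general technique, not the paper's setup.
import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Map float weights to signed 4-bit integer codes with a single scale."""
    qmax = 7  # signed 4-bit integers cover [-8, 7]
    scale = np.abs(weights).max() / qmax
    codes = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate float weights from the 4-bit codes."""
    return codes.astype(np.float32) * scale

# Usage: the reconstruction error is the source of the (ideally minimal)
# performance loss, while storage drops from 32 bits to about 4 bits per weight.
w = np.random.randn(4, 4).astype(np.float32)
codes, scale = quantize_4bit(w)
w_hat = dequantize(codes, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```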