EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge
CoRR(2024)
摘要
Despite the remarkable strides of Large Language Models (LLMs) in various
fields, the wide applications of LLMs on edge devices are limited due to their
massive parameters and computations. To address this, quantization is commonly
adopted to generate lightweight LLMs with efficient computations and fast
inference. However, Post-Training Quantization (PTQ) methods dramatically
degrade in quality when quantizing weights, activations, and KV cache together
to below 8 bits. Besides, many Quantization-Aware Training (QAT) works quantize
model weights, leaving the activations untouched, which do not fully exploit
the potential of quantization for inference acceleration on the edge. In this
paper, we propose EdgeQAT, the Entropy and Distribution Guided QAT for the
optimization of lightweight LLMs to achieve inference acceleration on Edge
devices. We first identify that the performance drop of quantization primarily
stems from the information distortion in quantized attention maps, demonstrated
by the different distributions in quantized query and key of the self-attention
mechanism. Then, the entropy and distribution guided QAT is proposed to
mitigate the information distortion. Moreover, we design a token
importance-aware adaptive method to dynamically quantize the tokens with
different bit widths for further optimization and acceleration. Our extensive
experiments verify the substantial improvements with our framework across
various datasets. Furthermore, we achieve an on-device speedup of up to 2.37x
compared with its FP16 counterparts across multiple edge devices, signaling a
groundbreaking advancement.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要