Output-Directed Dynamic Quantization for DNN Acceleration

Proceedings of the 52nd International Conference on Parallel Processing (ICPP 2023)

Abstract
Quantization is an effective technique for reducing the number of computations in deep neural networks (DNNs) and improving their performance. Weight quantization is popular because weights are known after training and can be quantized offline. However, it targets only the kernel weights and ignores the sensitivity of input features, which can reduce accuracy. Fine-grained input quantization has therefore gained attention as a way to speed up DNNs while maintaining accuracy. Existing approaches determine computation precision from input sensitivity, but they neither effectively reduce computations for insensitive outputs nor retain the precision of sensitive outputs. These limitations motivate us to develop an output-directed dynamic quantization method, named ODQ, in this paper. ODQ is a two-stage DNN quantization scheme designed to improve performance, reduce energy consumption, and maintain (and often improve) accuracy compared with existing quantization methods. Specifically, inputs and weights pass through two stages: sensitivity prediction and result generation. The high-order 2 bits of each input and weight are used to predict output sensitivity, and result generation is performed only for outputs predicted to be sensitive. We design an FPGA accelerator to optimize the performance of ODQ-quantized DNNs. We implement a prototype of ODQ and evaluate it on several state-of-the-art DNNs. Compared with a state-of-the-art input-directed quantization approach, ODQ achieves a 67.6% performance speedup and 66.9% energy savings with minimal accuracy degradation (≤ 0.6%).
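The two-stage flow described above can be illustrated with a short sketch. The Python/NumPy code below is a minimal illustration under stated assumptions, not the paper's actual prediction logic or FPGA datapath: it assumes unsigned 8-bit operands, a hypothetical sensitivity threshold, and the simplification that a predicted-insensitive output simply reuses the cheap 2-bit estimate. The function names and the threshold value are illustrative.

```python
import numpy as np

def high_order_2bits(x, total_bits=8):
    # Keep only the top two bits of each fixed-point operand,
    # zeroing the low-order bits (assumed truncation scheme).
    shift = total_bits - 2
    return (x >> shift) << shift

def odq_output(inputs, weights, threshold, total_bits=8):
    # Stage 1: sensitivity prediction from the high-order 2 bits only.
    x2 = high_order_2bits(inputs, total_bits).astype(np.int64)
    w2 = high_order_2bits(weights, total_bits).astype(np.int64)
    estimate = int(np.dot(x2, w2))
    if estimate < threshold:
        # Predicted insensitive: skip full compute, reuse the cheap
        # estimate (an assumed skip policy for illustration).
        return estimate
    # Stage 2: full-precision result generation for sensitive outputs.
    return int(np.dot(inputs.astype(np.int64), weights.astype(np.int64)))

if __name__ == "__main__":
    x = np.random.randint(0, 256, size=128, dtype=np.uint8)
    w = np.random.randint(0, 256, size=128, dtype=np.uint8)
    # The threshold is a hypothetical tuning knob, not a value from the paper.
    print(odq_output(x, w, threshold=1 << 18))
```

In this sketch, the stage-1 estimate costs only 2-bit multiplies per operand pair, which is what lets an accelerator gate the expensive full-precision multiply-accumulate behind the prediction.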
Keywords
Deep Neural Network, Dynamic Quantization, Sensitivity Prediction, Performance Acceleration, FPGA