# A 127.8TOPS/W Arbitrarily Quantized 1-to-8b Scalable-Precision Accelerator for General-Purpose Deep Learning with Reduction of Storage, Logic and Latency Waste

ISSCC (2023)

## Abstract

Research on deep-learning accelerators has focused on inference tasks, improving performance by maximally exploiting sparsity and quantization. Unlike CNN-only networks, however, recent state-of-the-art (SOTA) models consist of multiple blocks of diverse layers whose sparsity and required precision vary layer by layer. This trend makes it challenging to build a general accelerator architecture that maximizes the benefits of sparsity and quantization while efficiently processing models ranging from traditional CNNs to future architectures.

First, several considerations arise, including the bottleneck in data bandwidth and the trade-off between sparsity and required precision: the required precision tends to increase as sparsity increases. This underpins the need for flexible, layer-by-layer quantization settings. In addition, storing data in a single unified format can prevent full utilization of hardware resources. Since recent models vary widely in sparsity [11], a major portion of data movement may be spent transferring zeros, severely wasting data bandwidth. We propose a sparsity-aware accelerator that adaptively changes the data format according to the detected sparsity of the given task: data is stored in raw format when the sparse rate is low and in a compressed format (run-length coding, RLC) when the sparse rate is high.

Second, the effective precision is correlated with the quantization policy. Arbitrary quantization has demonstrated a higher quality of result (QoR) than linear quantization (denoted INT). There have been two representative approaches to nonlinear quantization: 1) arbitrary basis (AB), where quantized values are given by linear combinations of $n$ independent bases, and 2) arbitrary quantization (AQ), which allows $2^n$ arbitrary quantized values.
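The sparsity-driven choice between raw and RLC storage described above can be sketched in software. This is a minimal Python sketch, not the paper's hardware implementation: the (run-length, value) encoding and the 50% sparsity threshold are illustrative assumptions.

```python
def rlc_encode(values):
    """Run-length encode a sequence: each nonzero value is stored as
    (number_of_preceding_zeros, value), eliminating explicit zeros."""
    encoded = []
    zero_run = 0
    for v in values:
        if v == 0:
            zero_run += 1
        else:
            encoded.append((zero_run, v))
            zero_run = 0
    if zero_run:
        encoded.append((zero_run, None))  # trailing zeros, no value follows
    return encoded

def choose_format(values, threshold=0.5):
    """Adaptive format selection: raw when sparsity is low, RLC when high.
    The 0.5 threshold is an illustrative assumption."""
    sparsity = sum(1 for v in values if v == 0) / len(values)
    if sparsity > threshold:
        return "rlc", rlc_encode(values)
    return "raw", list(values)
```

For a mostly-zero activation vector, the RLC form carries only the nonzero entries plus short run counts, which is the bandwidth saving the abstract refers to; for dense data, raw storage avoids the per-element run-length overhead.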
Although these quantization schemes achieve good accuracy, there has been no hardware implementation for efficient processing of AQ. Conventional INT multiplication quadruples in complexity when both input precisions double, and implementing AQ with scalable precision up to 8b using a look-up-table (LUT) approach would cause hardware complexity to explode. To resolve this problem, we propose a hierarchical decoding architecture for AQ with scalable precision up to 8b.

Finally, the required precisions for inputs and weights are not the same [4], [10]: good QoR is achieved by assigning more bits to inputs and fewer bits to weights. Previous accelerators process inputs and weights at a fixed, equal precision, wasting computational energy. This work employs dynamic-precision bit-serial multiplication for the weights to minimize that waste.

Putting these together, we propose a 1-to-8b scalable-precision general-purpose deep-learning accelerator that supports multiply-and-accumulate (MAC) operations with input and weight vectors quantized by AQ and AB, respectively. The accelerator includes three main features: 1) a zero-elimination scheme that works with two data formats, raw and RLC, to save storage cost and improve effective bandwidth; 2) extended-precision AQ computing hardware that avoids exploding logic complexity; and 3) bit-serial AB processing without unnecessary computations.
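The dynamic-precision bit-serial idea can be illustrated with a small model: the weight vector is processed one bit-plane per cycle, so the cycle count scales with the weight precision actually used rather than a fixed maximum. This is a behavioral Python sketch for unsigned weights, assuming a simple shift-and-accumulate scheme; it is not the paper's circuit.

```python
def bit_serial_mac(inputs, weights, weight_bits):
    """Bit-serial MAC: one weight bit-plane per cycle.
    A layer needing only 2b weights runs 2 cycles instead of 8,
    which is the energy saving dynamic precision targets."""
    acc = 0
    for b in range(weight_bits):
        # AND each input with bit b of its weight, then sum the plane
        plane = sum(x * ((w >> b) & 1) for x, w in zip(inputs, weights))
        acc += plane << b  # weight the partial sum by the bit's significance
    return acc
```

For unsigned weights representable in `weight_bits` bits, this matches the direct dot product, e.g. `bit_serial_mac([3, 5], [2, 3], 2)` equals `3*2 + 5*3 = 21`, but a 1b-weight layer would finish in a single pass.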


## Key words

arbitrary quantization, compressed format, conventional INT multiplication, data bandwidth, data format, data movement, dynamic-precision bit-serial multiplication, extended-precision AQ computing hardware, general accelerator architecture, inference tasks, layer-by-layer characteristics, layer-by-layer configuration, linear quantization, look-up-table, multiply-and-accumulate operations, nonlinear quantization, quantization schemes, run-length coding, scalable-precision general-purpose deep learning accelerators, sparsity-aware accelerator, word length 1 bit to 8 bit, zero elimination scheme
