# FracBits: Mixed Precision Quantization via Fractional Bit-Widths

Abstract:

Model quantization helps to reduce the model size and latency of deep neural networks. Mixed precision quantization is favorable on customized hardware that supports arithmetic operations at multiple bit-widths to achieve maximum efficiency. We propose a novel learning-based algorithm to derive mixed precision models end-to-end under target...

Introduction

- Neural network quantization [3,4,12,15,21,22,23,25,31,32] has attracted a large amount of attention due to the resource and latency constraints in real applications.
- Recent progress on neural network quantization has shown that quantized models can perform as well as full precision models at moderate target bit-widths such as 4 bits [12].
- To fully exploit the power of model quantization, mixed precision quantization strategies have been proposed to strike a better balance between computation cost and model accuracy.
- With more flexibility to distribute the computation budget across layers [4,12,25], or even weight kernels [15], quantized models with mixed precision usually achieve better performance than those with uniform precision.

Highlights

- Finding the best configuration for a mixed precision model can be achieved by preserving a single branch for each convolution layer and pruning all other branches, which is conceptually equivalent to some recent neural architecture search (NAS) algorithms that aim at searching sub-networks from a supergraph [2,20,24,26]
- We propose a new formulation named FracBits for mixed precision quantization
- We formulate the bit-width of each layer or kernel with a continuous learnable parameter that can be instantiated by interpolating quantized parameters of two neighboring bit-widths
- Our method facilitates differentiable optimization of layer-wise or kernel-wise bit-width in a single shot of training, which can further be combined with channel pruning by formulating a pruned channel with 0 bit quantization
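
The interpolation between two neighboring bit-widths described above can be illustrated with a minimal sketch. It assumes a plain uniform quantizer on values in [0, 1] and a linear interpolation weighted by the fractional part of the bit-width; the summary only states that quantized parameters of two neighboring bit-widths are interpolated, so the exact quantizer and weighting here are assumptions, and the names `quantize` and `frac_quantize` are hypothetical.

```python
import math


def quantize(x: float, bits: int) -> float:
    """Uniformly quantize x in [0, 1] to 2**bits levels.

    Assumption: 0 bits maps every value to 0, mirroring the idea of
    treating a pruned channel as 0-bit quantization.
    """
    if bits <= 0:
        return 0.0
    levels = 2 ** bits - 1
    x = min(max(x, 0.0), 1.0)  # clip to the quantization range
    return round(x * levels) / levels


def frac_quantize(x: float, frac_bits: float) -> float:
    """Interpolate between the two integer bit-widths around frac_bits.

    Assumption: linear interpolation, weighted by the fractional part.
    """
    lo = math.floor(frac_bits)
    t = frac_bits - lo  # fractional part in [0, 1)
    if t == 0.0:
        return quantize(x, lo)
    return (1.0 - t) * quantize(x, lo) + t * quantize(x, lo + 1)


if __name__ == "__main__":
    for b in (3.0, 3.25, 3.5, 3.75, 4.0):
        print(b, frac_quantize(0.4, b))
```

Under this sketch the output varies linearly with the fractional part of the bit-width between consecutive integers, which is what allows a gradient with respect to the bit-width to exist.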

Methods

- Table 1 compares HAQ [23], ReLeQ [4], AutoQ [15], DNAS [25], US [7], DQ [22], and FracBits on four properties: differentiable search, one-shot training, support for kernel-wise quantization, and support for channel pruning.

Current approaches for mixed precision quantization usually borrow ideas from the neural architecture search (NAS) literature.

- Finding the best configuration for a mixed precision model can be achieved by preserving a single branch for each convolution layer and pruning all other branches, which is conceptually equivalent to some recent NAS algorithms that aim at searching sub-networks from a supergraph [2,20,24,26].
- ReLeQ [4] and HAQ [23] follow this approach and employ reinforcement learning to choose layer-wise bit-width configurations for a neural network.
- Uniform Sampling (US) [7] samples subnetworks uniformly from the supergraph during training and searches for pruned or quantized models using an evolutionary algorithm.

Results

- Enhanced by the SAT quantization method, FracBits-SAT further improves over the SAT baseline, with only a 0.5% accuracy drop on 3-bit ResNet18 and a 0.2% accuracy gain on 4-bit ResNet18.
- FracBits-SAT-K significantly outperforms SAT, increasing top-1 accuracy by 1.2% and 0.6% on 3-bit and 4-bit MobileNet V2, respectively, and by 0.4% and 0.7% on 3-bit and 4-bit ResNet18, respectively.

Conclusion

- The authors formulate the bit-width of each layer or kernel with a continuous learnable parameter that can be instantiated by interpolating quantized parameters of two neighboring bit-widths.
- The authors' method facilitates differentiable optimization of layer-wise or kernel-wise bit-width in a single shot of training, which can further be combined with channel pruning by formulating a pruned channel with 0 bit quantization.
- With only a regularization term that penalizes extra computational resources during training, the method is able to discover proper bit-width configurations for different models, outperforming previous mixed precision and uniform precision approaches.
- The authors believe the method will motivate research on low-precision neural networks and low-cost computational models.
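
The resource penalty mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the BitOPs of a layer are taken to be its multiply-accumulate count times the weight and activation bit-widths, and the penalty is the absolute gap to a target budget scaled by a coefficient. The function names, the `kappa` coefficient, and the exact penalty form are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# Each layer is (macs, weight_bits, act_bits); bit-widths may be fractional.
Layer = Tuple[int, float, float]


def layer_bitops(macs: int, weight_bits: float, act_bits: float) -> float:
    """Assumed cost model: BitOPs = MACs * weight bits * activation bits."""
    return macs * weight_bits * act_bits


def resource_penalty(layers: List[Layer], target_bitops: float) -> float:
    """L1-style gap between the current BitOPs and the target budget."""
    total = sum(layer_bitops(m, wb, ab) for m, wb, ab in layers)
    return abs(total - target_bitops)


def total_loss(task_loss: float, layers: List[Layer],
               target_bitops: float, kappa: float) -> float:
    """Task loss plus the weighted resource penalty (kappa is hypothetical)."""
    return task_loss + kappa * resource_penalty(layers, target_bitops)


if __name__ == "__main__":
    layers = [(1000, 4.0, 4.0), (2000, 3.5, 4.0)]
    print(total_loss(1.0, layers, target_bitops=40000.0, kappa=1e-4))
```

Because the cost is a smooth function of the fractional bit-widths, a penalty of this kind can be minimized jointly with the task loss by gradient descent.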


- Table 1: A comparison of our approach and previous mixed quantization algorithms. Our method FracBits achieves one-shot differentiable search and supports kernel-wise quantization and pruning
- Table 2: Comparison of computation cost constrained layer-wise quantization of our method and previous approaches on ImageNet with MobileNet V1/V2. Note that accuracies are in % and bitops are in B (billion)
- Table 3: Comparison of computation cost constrained layer-wise quantization of our method and previous approaches on ImageNet with ResNet18. Note that the bitops of US [7] and DNAS [25] do not include the first and last layers in their papers, and
- Table 4: Comparison of model size constrained layer-wise quantization of our method and previous approaches on ImageNet with MobileNet V1/V2. Note that accuracies are in % and sizes are in MB
- Table 5: Comparison of computation cost constrained kernel-wise quantization of our method and previous approaches on MobileNet V2 and ResNet18. Note that accuracies are in % and bitops are in B (billion)
- Table 6: A comparative study of our method with different configurations and hyper-parameters on MobileNet V1 for computation cost constrained quantization

Related work

- Quantized Neural Networks. Previous quantization techniques can be categorized into two types. The first type, post-training quantization, directly quantizes the weights and activations of a pretrained full-precision model to lower bit-widths [13,18]. This type of method typically suffers from significant performance degradation, because the training process is unaware of the quantization procedure. The second type, quantization-aware training, incorporates quantization into the training stage. Early studies in this direction employ a single precision for the whole neural network. For example, DoReFa [32] proposes to transform the unbounded weights into a finite interval to reduce undesired quantization error introduced by infrequent large outliers. PACT [3] investigates the effect of clipping activations from different layers and finds that the optimal clipping levels are layer-dependent. SAT [12] investigates the gradient scales in training with quantized weights, and further improves model performance by adjusting weight scales.
- As another direction, some work assigns different bit-widths to different layers or kernels, enabling more flexible allocation of the computation budget. The first attempts employ reinforcement learning with rewards from memory and computational cost estimated by formulas [4] or simulators [23]. AutoQ [15] modifies the training procedure into a hierarchical strategy, resulting in fine-grained kernel-wise quantization. However, these RL strategies need to sample and train a large number of model variants, which is very resource-demanding. DNAS [25] resorts to a differentiable strategy by constructing a supernet in which each layer is a linear combination of outputs from different bit-widths. However, due to the discrepancy between the search process and the final configuration, it still needs to retrain the discovered model candidates.
- To further improve search efficiency, we propose a one-shot differentiable search method with fractional bit-widths. Due to the smooth transition between fractional bit-widths and final integer bit-widths, our method embeds the bit-width search and model finetuning stages in a single pass of model training. Meanwhile, our technique supports kernel-wise quantization with channel pruning in the same framework by assigning 0 bits to the pruned channels, similar to [15] but through a differentiable approach with a much lower search cost. It is also orthogonal to Uniform Sampling (US) [7] for joint quantization and pruning, which trains a supernet by uniform sampling and searches for good sub-architectures with an evolutionary algorithm.
- Network Pruning. Network pruning is an approach to speeding up neural network inference that is orthogonal to quantization. Early work [8] compresses bulky models by learning connections together with weights, which produces unstructured connections in the final network. Later, structured compression by kernel-wise [16] or channel-wise [5,9,14,27] pruning was proposed, where the learned architecture is friendlier to acceleration on modern hardware. As an example, [14] identifies and prunes insignificant channels in each layer by penalizing the scaling factor of the batch normalization layer. More recently, NAS algorithms have been leveraged to guide network pruning. [28] presents a one-shot search algorithm that greedily slims a pretrained slimmable neural network [29]. [17] proposes a one-shot resource-aware search algorithm using FLOPs as an L1 regularization term on the scaling factor of the batch normalization layer. We adopt a similar strategy and use BitOPs and model sizes, computed from the trainable fractional bit-widths in our framework, as L1 regularization.
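
As a rough illustration of how a model-size term and 0-bit pruning could fit together at kernel granularity, the sketch below computes the size implied by fractional per-kernel bit-widths and then converts them to integers. It assumes that the final integer bit-width is obtained by rounding to the nearest integer and that kernels landing on 0 bits are pruned; the rounding rule and the names `model_size_bytes` and `finalize` are assumptions for illustration, not the paper's procedure.

```python
from typing import List, Tuple


def model_size_bytes(kernel_params: List[int],
                     kernel_bits: List[float]) -> float:
    """Size implied by per-kernel (possibly fractional) bit-widths, in bytes."""
    return sum(p * b for p, b in zip(kernel_params, kernel_bits)) / 8.0


def finalize(kernel_bits: List[float]) -> Tuple[List[int], List[int]]:
    """Round bit-widths to integers and list the kernels that survive.

    Assumption: round half up; a kernel rounded to 0 bits is treated as pruned.
    """
    int_bits = [max(0, int(b + 0.5)) for b in kernel_bits]
    kept = [i for i, b in enumerate(int_bits) if b > 0]
    return int_bits, kept


if __name__ == "__main__":
    params = [144, 144, 144, 144]  # e.g. four 3x3 kernels over 16 input channels
    bits = [3.6, 0.2, 4.0, 2.4]
    print(model_size_bytes(params, bits))
    print(finalize(bits))
```

In this example the second kernel ends at 0 bits and is dropped, so pruning falls out of the same bit-width parameters rather than requiring a separate mechanism.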


Reference

- Bengio, Y., Leonard, N., Courville, A.: Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013)
- Cai, H., Zhu, L., Han, S.: Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332 (2018)
- Choi, J., Wang, Z., Venkataramani, S., Chuang, P.I.J., Srinivasan, V., Gopalakrishnan, K.: Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085 (2018)
- Elthakeb, A.T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., Esmaeilzadeh, H.: Releq: A reinforcement learning approach for deep quantization of neural networks. In: NeurIPS (2018)
- Gordon, A., Eban, E., Nachum, O., Chen, B., Wu, H., Yang, T.J., Choi, E.: Morphnet: Fast & simple resource-constrained structure learning of deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1586–1595 (2018)
- Goyal, P., Dollar, P., Girshick, R.B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR abs/1706.02677 (2017), http://arxiv.org/abs/1706.02677
- Guo, Z., Zhang, X., Mu, H., Heng, W., Liu, Z., Wei, Y., Sun, J.: Single path one-shot neural architecture search with uniform sampling. CoRR abs/1904.00420 (2019), http://arxiv.org/abs/1904.00420
- Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015)
- He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1389–1397 (2017)
- Jang, E., Gu, S., Poole, B.: Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016)
- Jin, Q., Yang, L., Liao, Z.: Adabits: Neural network quantization with adaptive bit-widths. arXiv preprint arXiv:1912.09666 (2019)
- Jin, Q., Yang, L., Liao, Z.: Towards efficient training for neural network quantization. arXiv preprint arXiv:1912.10207 (2019)
- Krishnamoorthi, R.: Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342 (2018)
- Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2736–2744 (2017)
- Lou, Q., Liu, L., Kim, M., Jiang, L.: Autoqb: Automl for network quantization and binarization on mobile devices. CoRR abs/1902.05690 (2019), http://arxiv.org/abs/1902.05690
- Luo, J.H., Wu, J., Lin, W.: Thinet: A filter level pruning method for deep neural network compression. In: Proceedings of the IEEE international conference on computer vision. pp. 5058–5066 (2017)
- Mei, J., Li, Y., Lian, X., Jin, X., Yang, L., Yuille, A., Yang, J.: Atomnas: Fine-grained end-to-end neural architecture search. arXiv preprint arXiv:1912.09640 (2019)
- Nagel, M., Baalen, M.v., Blankevoort, T., Welling, M.: Data-free quantization through weight equalization and bias correction. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1325–1334 (2019)
- Nikolic, M., Hacene, G.B., Bannon, C., Lascorz, A.D., Courbariaux, M., Bengio, Y., Gripon, V., Moshovos, A.: Bitpruning: Learning bitlengths for aggressive and accurate quantization. arXiv preprint arXiv:2002.03090 (2020)
- Pham, H., Guan, M.Y., Zoph, B., Le, Q.V., Dean, J.: Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268 (2018)
- Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet classification using binary convolutional neural networks. In: European conference on computer vision. pp. 525–542. Springer (2016)
- Uhlich, S., Mauch, L., Yoshiyama, K., Cardinaux, F., Garcia, J.A., Tiedemann, S., Kemp, T., Nakamura, A.: Mixed precision dnns: All you need is a good parametrization. ICLR (2020)
- Wang, K., Liu, Z., Lin, Y., Lin, J., Han, S.: Haq: Hardware-aware automated quantization with mixed precision. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8612–8620 (2019)
- Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., Keutzer, K.: Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10734–10742 (2019)
- Wu, B., Wang, Y., Zhang, P., Tian, Y., Vajda, P., Keutzer, K.: Mixed precision quantization of convnets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090 (2018)
- Xie, S., Zheng, H., Liu, C., Lin, L.: Snas: stochastic neural architecture search. arXiv preprint arXiv:1812.09926 (2018)
- Ye, J., Lu, X., Lin, Z., Wang, J.Z.: Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124 (2018)
- Yu, J., Huang, T.S.: Network slimming by slimmable networks: Towards one-shot architecture search for channel numbers. CoRR abs/1903.11728 (2019), http://arxiv.org/abs/1903.11728
- Yu, J., Yang, L., Xu, N., Yang, J., Huang, T.: Slimmable neural networks. arXiv preprint arXiv:1812.08928 (2018)
- Zhang, D., Yang, J., Ye, D., Hua, G.: Lq-nets: Learned quantization for highly accurate and compact deep neural networks. In: Proceedings of the European conference on computer vision (ECCV). pp. 365–382 (2018)
- Zhou, S.C., Wang, Y.Z., Wen, H., He, Q.Y., Zou, Y.H.: Balanced quantization: An effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology 32(4), 667–682 (2017)
- Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016)
