A Statistical Framework for Low-bitwidth Training of Deep Neural Networks

NeurIPS 2020


Abstract

Fully quantized training (FQT), which uses low-bitwidth hardware by quantizing the activations, weights, and gradients of a neural network model, is a promising approach to accelerate the training of deep neural networks. One major challenge with FQT is the lack of theoretical understanding, in particular of how gradient quantization impacts convergence.
Introduction
  • Deep neural networks (DNNs) have a high computational cost and memory footprint that slow down their training and inference.
  • By taking advantage of low-bitwidth computational units in hardware, neural network quantization methods provide promising approaches for reducing the time, memory, and energy costs of both training and inference.
  • Quantization-aware training (QAT) computes the gradients in full precision, so the training phase is not accelerated (a sketch of this forward/backward asymmetry follows this list).
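As a rough illustration of this asymmetry, here is a minimal sketch of my own (not the paper's code): a toy linear layer that quantizes its weights in the forward pass but keeps the backward pass in full precision via a straight-through estimator. The names `quantize` and `qat_linear` are hypothetical.

```python
# Hypothetical sketch: QAT quantizes the forward pass but computes gradients in
# full precision, so the backward pass sees no low-bitwidth speedup.
import numpy as np

def quantize(x, bits=8):
    """Uniform symmetric round-to-nearest quantization to `bits` bits."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def qat_linear(x, w, grad_out, bits=8):
    """Toy QAT linear layer: quantized forward, full-precision (FP32) backward."""
    w_q = quantize(w, bits)        # forward uses quantized weights
    y = x @ w_q                    # this matmul could run on low-bitwidth hardware
    grad_x = grad_out @ w_q.T      # backward stays in full precision...
    grad_w = x.T @ grad_out        # ...and the straight-through estimator passes the
    return y, grad_x, grad_w       # gradient of w_q directly to the FP32 weight w

rng = np.random.default_rng(0)
x, w = rng.standard_normal((4, 8)), rng.standard_normal((8, 3))
y, grad_x, grad_w = qat_linear(x, w, grad_out=rng.standard_normal((4, 3)))
```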
Highlights
  • Deep neural networks (DNNs) have a high computational cost and memory footprint that slow down their training and inference
  • We present a framework for fully quantized training (FQT) and use it to show that the FQT gradient is an unbiased estimator of the quantization-aware training (QAT) gradient (a toy unbiasedness check follows this list)
  • The other L sources of randomness are due to one stochastic quantizer Qb(·) per layer, as illustrated in Fig. 2(a). Both QAT and FQT can be viewed as stochastic optimization algorithms that solve the learning problem (2) approximately
  • On the more challenging ResNet50, our per-sample quantizer (PSQ) and block Householder quantizer (BHQ) with an 8-bit gradient give results indistinguishable from QAT, while the per-tensor quantizer (PTQ) suffers from ∼1% accuracy degradation
  • We compare our results with existing 8-bit training works in Table 2 in an end-to-end fashion. These results demonstrate that BHQ establishes a new state of the art on this benchmark task
  • We provide theoretical bounds to guide practice, and we show how these theoretical results lead to improved performance in practice
  • We present a framework for FQT algorithms
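The unbiasedness claim can be illustrated with a stochastic-rounding quantizer. This is a minimal sketch of my own rather than the paper's Qb(·), which additionally involves per-sample and Householder scaling: rounding up or down at random with the right probabilities makes the quantized value an unbiased estimate of its input.

```python
# Minimal sketch of an unbiased stochastic-rounding quantizer (illustrative only).
import numpy as np

def stochastic_round_quantize(x, bits=4, rng=None):
    """Quantize x to a uniform grid; round up/down at random so E[Q(x)] = x."""
    rng = rng or np.random.default_rng()
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    t = x / scale
    low = np.floor(t)
    prob_up = t - low                                   # probability of rounding up
    return (low + (rng.random(x.shape) < prob_up)) * scale

rng = np.random.default_rng(0)
g = rng.standard_normal(1000)
# Monte Carlo check: the mean over many stochastic quantizations approaches g.
samples = np.stack([stochastic_round_quantize(g, bits=4, rng=rng) for _ in range(2000)])
print(np.max(np.abs(samples.mean(axis=0) - g)))         # small -> empirically unbiased
```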
Methods
  • Methods compared (existing 8-bit training works, see Table 2): FP8 [24], HBFP8_16 [26], HFP8 [25], WAGEUBN [23], Unified INT8 [22], and BHQ, with validation accuracy reported for each.
  • Figure panel (a), gradient variance (log-scale axis: 10−1, 10−2, 10−3); accompanying table entries: Exact 34.55, QAT 34.47.
Results
  • The authors view the FQT gradient ∇̂Θ as a stochastic estimator of the QAT gradient ∇Θ. The FQT gradient ∇̂Θ has L + 1 sources of randomness.
  • The other L sources of randomness are due to one stochastic quantizer Qb(·) per layer, as illustrated in Fig. 2(a)
  • Both QAT and FQT can be viewed as stochastic optimization algorithms to solve the learning problem (2) approximately.
  • On ResNet18, BHQ with a 5-bit gradient achieves ≤ 0.4% validation accuracy degradation, comparable with the baseline PTQ with a 7-bit gradient.
  • On the more challenging ResNet50, PSQ and BHQ with an 8-bit gradient give results indistinguishable from QAT, while PTQ suffers from ∼1% accuracy degradation (a per-tensor vs. per-sample sketch follows this list).
  • These results demonstrate that BHQ establishes a new state of the art on this benchmark task
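To make the per-tensor vs. per-sample distinction concrete, the sketch below (my own illustration with hypothetical shapes and helper names, not the paper's implementation) quantizes a gradient matrix with one shared scale (PTQ-style) and with one scale per row (PSQ-style); when row magnitudes are heterogeneous, the per-sample scales give a much smaller quantization error.

```python
# Illustrative comparison of per-tensor (PTQ) vs. per-sample (PSQ) gradient
# quantization with stochastic rounding.
import numpy as np

def stoch_quantize(x, scale, rng=None):
    rng = rng or np.random.default_rng()
    t = x / scale
    low = np.floor(t)
    return (low + (rng.random(x.shape) < (t - low))) * scale

rng = np.random.default_rng(0)
# Gradient matrix with heterogeneous per-sample magnitudes (a few "large" rows).
G = rng.standard_normal((128, 512)) * rng.lognormal(0.0, 2.0, size=(128, 1))

bits = 8
qmax = 2 ** (bits - 1) - 1
ptq_scale = np.abs(G).max() / qmax + 1e-12                       # one scale for the tensor
psq_scale = np.abs(G).max(axis=1, keepdims=True) / qmax + 1e-12  # one scale per sample/row

err_ptq = np.mean((stoch_quantize(G, ptq_scale, rng=rng) - G) ** 2)
err_psq = np.mean((stoch_quantize(G, psq_scale, rng=rng) - G) ** 2)
print(f"per-tensor (PTQ) MSE: {err_ptq:.3e}   per-sample (PSQ) MSE: {err_psq:.3e}")
```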
Conclusion
  • The authors present a framework for FQT algorithms.
  • The authors' framework assumes deterministic forward propagation and unbiased stochastic gradient quantizers.
  • The authors formulate the FQT gradient as a stochastic estimator of the QAT gradient, and they derive its bias and variance, which impact the convergence behavior of training (a toy bias/variance check follows this list).
  • Table 1 compares the Exact, QAT, and 8- down to 4-bit FQT settings for the PTQ, PSQ, and BHQ quantizers on ResNet18 and ResNet50 (one low-bitwidth entry is marked "diverge"); Table 2 reports 8-bit training results for ResNet50.
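As a toy check of that bias/variance view (a sketch under simplifying assumptions, not the paper's derivation), one can quantize the output gradient of a small linear layer with stochastic rounding, form the resulting weight gradient many times, and compare its mean and spread against the full-precision QAT gradient.

```python
# Toy Monte Carlo check of the bias/variance of an FQT-style gradient estimator.
# The "FQT" gradient quantizes the output gradient with stochastic rounding before
# the backward matmul; the "QAT" gradient uses it in full precision. Illustrative only.
import numpy as np

def stoch_quantize(x, bits=8, rng=None):
    rng = rng or np.random.default_rng()
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    t = x / scale
    low = np.floor(t)
    return (low + (rng.random(x.shape) < (t - low))) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 64))          # activations of a toy linear layer
grad_out = rng.standard_normal((32, 16))   # upstream (output) gradient

grad_qat = x.T @ grad_out                  # full-precision weight gradient
trials = np.stack([x.T @ stoch_quantize(grad_out, bits=4, rng=rng) for _ in range(5000)])

bias = trials.mean(axis=0) - grad_qat      # should be ~0: the estimator is unbiased
var = trials.var(axis=0).mean()            # variance grows as the bitwidth shrinks
print(f"max |bias|: {np.abs(bias).max():.3e}   mean variance: {var:.3e}")
```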
Tables
  • Table 1: ResNet18/50 validation accuracy (training loss) on ImageNet
  • Table 2: 8-bit training results for ResNet50
  • Table 3: Table of notations
Study subjects and analysis
data: a (N = 128, C = 64, H = W = 56) convolutional layer in INT8
The actual time cost is highly platform-specific, and a complete hardware-algorithm co-design is out of the scope of this paper, which mostly focuses on the theoretical properties of gradient quantization. As a representative example, we investigate the quantization overhead for a (N = 128, C = 64, H = W = 56) convolutional layer in INT8 on a single Intel CPU core, using a CPU version of TensorFlow [38] compiled with AVX support. In this case, the actual convolution takes 480 ms; a rough timing sketch of this kind of comparison follows.
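The sketch below is a hypothetical way to make such a comparison on the stated shape; it uses PyTorch instead of the paper's TensorFlow-on-CPU setup, the helpers `quantize_per_tensor` and `bench` are my own names, and its numbers will not reproduce the reported 480 ms, since the cost is entirely platform-dependent.

```python
# Hypothetical timing comparison: per-tensor INT8-style quantization vs. the
# convolution it feeds, on the (N=128, C=64, H=W=56) shape from the text.
import time
import torch

N, C, H, W = 128, 64, 56, 56
x = torch.randn(N, C, H, W)
weight = torch.randn(64, C, 3, 3)

def quantize_per_tensor(t, bits=8):
    scale = t.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(t / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

def bench(fn, reps=10):
    fn()                                   # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

t_quant = bench(lambda: quantize_per_tensor(x))
t_conv = bench(lambda: torch.nn.functional.conv2d(x, weight, padding=1))
print(f"quantization: {t_quant * 1e3:.1f} ms   convolution: {t_conv * 1e3:.1f} ms")
```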

Reference
  • Eli Kravchik, Fan Yang, Pavel Kisilev, and Yoni Choukroun. Low-bit quantization of neural networks for efficient inference. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
  • Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. ICCV, 2019.
  • Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, and Zhiru Zhang. Improving neural network quantization without retraining using outlier channel splitting. In Proceedings of the 36th International Conference on Machine Learning, pages 7543–7552, 2019.
  • Ron Banner, Yury Nahshan, Elad Hoffer, and Daniel Soudry. Post training 4-bit quantization of convolution networks for rapid-deployment. CoRR, abs/1810.05723, 1(2), 2018.
  • Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Zeroq: A novel zero shot quantization framework. arXiv preprint arXiv:2001.00281, 2020.
  • Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
  • Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542.
  • Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
  • Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. In The European Conference on Computer Vision (ECCV), September 2018.
  • Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. International Conference on Learning Representations, 2017.
  • Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713, 2018.
  • Zhen Dong, Zhewei Yao, Amir Gholami, Michael Mahoney, and Kurt Keutzer. Hawq: Hessian aware quantization of neural networks with mixed-precision. ICCV, 2019.
  • Zhen Dong, Zhewei Yao, Yaohui Cai, Daiyaan Arfeen, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq-v2: Hessian aware trace-weighted quantization of neural networks. arXiv preprint arXiv:1911.03852, 2019.
  • Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q-bert: Hessian based ultra low precision quantization of bert. arXiv preprint arXiv:1909.05840, 2019.
  • https://www.nvidia.com/en-us/data-center/a100/.
  • Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, 2019.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
  • Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit training of neural networks. In Advances in Neural Information Processing Systems, pages 5145–5153, 2018.
  • Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. In International Conference on Learning Representations, 2018.
  • Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, and Junjie Yan. Towards unified int8 training for convolutional neural network. In Conference on Computer Vision and Pattern Recognition, 2020.
  • Yukuan Yang, Lei Deng, Shuang Wu, Tianyi Yan, Yuan Xie, and Guoqi Li. Training high-performance and large-scale deep neural networks with full 8-bit integers. Neural Networks, 2020.
  • Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems, pages 7675–7684, 2018.
  • Xiao Sun, Jungwook Choi, Chia-Yu Chen, Naigang Wang, Swagath Venkataramani, Vijayalakshmi Viji Srinivasan, Xiaodong Cui, Wei Zhang, and Kailash Gopalakrishnan. Hybrid 8-bit floating point (hfp8) training and inference for deep neural networks. In Advances in Neural Information Processing Systems, pages 4901–4910, 2019.
  • Mario Drumond, LIN Tao, Martin Jaggi, and Babak Falsafi. Training dnns with hybrid block floating point. In Advances in Neural Information Processing Systems, pages 453–463, 2018.
  • Léopold Cambier, Anahita Bhiwandiwalla, Ting Gong, Mehran Nekuii, Oguz H Elibol, and Hanlin Tang. Shifted and squeezed 8-bit floating point format for low-precision training of deep neural networks. In International Conference on Learning Representations, 2020.
  • Charbel Sakr and Naresh R Shanbhag. Per-tensor fixed-point quantization of the back-propagation algorithm. In International Conference on Learning Representations, 2019.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  • https://github.com/google/gemmlowp.
  • https://github.com/pytorch/fbgemm.
  • Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
  • Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186.
  • Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
  • Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645.
  • Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
  • Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. 2020.
  • https://github.com/nvidia/deeplearningexamples/tree/master/pytorch/classification/convnets/resnet50v1.5.
Block Householder quantizer construction:
  • 1. Sort the row magnitudes Mi := …
  • 2. Loop over the number of groups G. Assuming that {Mi} is already sorted, consider the first G rows as "large" and all the other N − G rows as "small". The i-th group contains the i-th largest row and a number of small rows. Furthermore, the size of the i-th group is set heuristically to (N − G)… (a sketch of this grouping appears below).
  • 3. Use the grouping of rows described in Step 2 to construct the block Householder quantizer.
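A minimal sketch of this grouping heuristic under my own reading of the truncated steps above: since the exact group-size formula is cut off in this summary, the small rows are simply split evenly across groups here, and the per-group Householder transform and quantization are omitted. The helper name `group_rows_by_magnitude` is hypothetical.

```python
# Hypothetical sketch of the row-grouping step behind the block Householder
# quantizer (BHQ): each group pairs one "large" row with a share of "small" rows.
import numpy as np

def group_rows_by_magnitude(grad, num_groups):
    """Return a list of row-index groups for the gradient matrix `grad`."""
    order = np.argsort(-np.linalg.norm(grad, axis=1))   # rows by decreasing magnitude
    large, small = order[:num_groups], order[num_groups:]
    # Placeholder allocation: split the remaining small rows evenly across groups
    # (the paper's heuristic group-size formula is truncated in this summary).
    small_chunks = np.array_split(small, num_groups)
    return [np.concatenate(([large[i]], small_chunks[i])) for i in range(num_groups)]

rng = np.random.default_rng(0)
grad = rng.standard_normal((128, 512)) * rng.lognormal(0.0, 2.0, size=(128, 1))
groups = group_rows_by_magnitude(grad, num_groups=8)
print([len(g) for g in groups])   # each group: 1 large row + ~15 small rows
```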