Adaptive Gradient Quantization for Data-Parallel SGD

NeurIPS 2020


Abstract

Many communication-efficient variants of SGD use gradient quantization schemes. These schemes are often heuristic and fixed over the course of training. We empirically observe that the statistics of gradients of deep models change during training. Motivated by this observation, we introduce two adaptive quantization schemes, ALQ and AMQ. In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution.
Introduction
  • Many communication-efficient variants of SGD use gradient quantization schemes. These schemes are often heuristic and fixed over the course of training.
  • The authors empirically observe that the statistics of gradients of deep models change during training.
  • Motivated by this observation, the authors introduce two adaptive quantization schemes, ALQ and AMQ.
  • In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution.
  • We want distributed optimization methods that match the performance of SGD on a single hypothetical super machine while paying a negligible communication cost. (Figure 1: changes in the average variance of normalized gradient coordinates in a ResNet model during training.)
Highlights
  • Many communication-efficient variants of stochastic gradient descent (SGD) use gradient quantization schemes
  • We propose two adaptive methods for quantizing the gradients in data-parallel SGD
  • We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups
  • We provide theoretical guarantees for adaptively quantized SGD (AQSGD) algorithm, obtain variance and code-length bounds, and convergence guarantees for convex, nonconvex, and momentum-based variants of AQSGD
  • To reduce communication costs of data-parallel SGD, we introduce two adaptively quantized methods, Adaptive Level Quantization (ALQ) and Adaptive Multiplier Quantization (AMQ), to learn and adapt the gradient quantization scheme on the fly (a minimal sketch of the underlying quantization primitive follows this list)
  • We demonstrate the superiority of ALQ and AMQ over nonadaptive methods empirically on deep models and large datasets
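Both ALQ and AMQ build on the same primitive: unbiased stochastic quantization of normalized gradient coordinates onto a small set of levels in [0, 1]; ALQ adapts the levels themselves, while AMQ adapts a multiplier that parameterizes them. The sketch below is a minimal NumPy illustration of that primitive only, not the authors' implementation; `stochastic_quantize` and the uniform initial `levels` are illustrative, and the adaptive update of the levels is omitted here.

```python
import numpy as np

def stochastic_quantize(grad, levels, rng):
    """Unbiased stochastic quantization of a gradient onto a set of levels in [0, 1].

    Coordinates are normalized by the vector's L-infinity norm, rounded randomly
    to one of the two surrounding levels so that the expectation is preserved,
    and the sign and norm are kept in full precision.
    """
    norm = np.linalg.norm(grad, ord=np.inf)
    if norm == 0:
        return np.zeros_like(grad)
    r = np.abs(grad) / norm                          # normalized magnitudes in [0, 1]
    idx = np.searchsorted(levels, r, side="right") - 1
    idx = np.clip(idx, 0, len(levels) - 2)
    lo, hi = levels[idx], levels[idx + 1]
    p_up = (r - lo) / (hi - lo)                      # chosen so that E[q] = r
    q = np.where(rng.random(r.shape) < p_up, hi, lo)
    return np.sign(grad) * norm * q

# Example: 3-bit quantization with uniform initial levels; an adaptive scheme
# would update `levels` during training from gradient statistics.
levels = np.linspace(0.0, 1.0, 2 ** 3)
rng = np.random.default_rng(0)
g = rng.standard_normal(10)
print(stochastic_quantize(g, levels, rng))
```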
Methods
  • (Figures/tables: training loss vs. training iteration for (a) ResNet-32 and (b) ResNet-110 on CIFAR-10, and ResNet-18 on ImageNet across bucket sizes, comparing SuperSGD, NUQSGD [21, 22], QSGDinf [20], TRN [15], ALQ, ALQ-N, AMQ, and AMQ-N.)
  • For bucket size 100 and 3 bits, NUQSGD performs nearly as well as the adaptive methods but quickly loses accuracy as the bucket size grows or shrinks.
  • QSGDinf stays competitive for a wider range of bucket sizes but still loses accuracy faster than other methods.
  • This shows the impact of bucketing, an understudied trick in evaluating quantization methods (a minimal sketch of the bucketing step follows this list)
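Bucketing splits the flattened gradient into fixed-size chunks and gives each chunk its own normalization constant, so the bucket size controls how locally the scale is estimated and how many norms must be communicated. The sketch below is a self-contained, illustrative version of the bucketing step (not the authors' code); it uses round-to-nearest for brevity, whereas an unbiased stochastic rounding such as the one sketched earlier would be used in practice.

```python
import numpy as np

def quantize_bucketed(grad, bucket_size, num_levels=2 ** 3):
    """Quantize a flattened gradient bucket by bucket.

    Each bucket is normalized by its own L-infinity norm and mapped to the
    nearest of `num_levels` uniform levels, so the scale is shared only within
    a bucket and one norm per bucket is transmitted alongside the level indices.
    """
    flat = grad.ravel()
    out = np.empty_like(flat)
    levels = np.linspace(0.0, 1.0, num_levels)
    for start in range(0, flat.size, bucket_size):
        bucket = flat[start:start + bucket_size]
        norm = np.abs(bucket).max()
        if norm == 0:
            out[start:start + bucket_size] = 0.0
            continue
        r = np.abs(bucket) / norm
        nearest = np.argmin(np.abs(r[:, None] - levels[None, :]), axis=1)
        out[start:start + bucket_size] = np.sign(bucket) * norm * levels[nearest]
    return out.reshape(grad.shape)

# Smaller buckets track the local gradient scale more closely, at the cost of
# communicating one norm per bucket.
g = np.random.default_rng(0).standard_normal(1000)
print(np.abs(quantize_bucketed(g, bucket_size=100) - g).mean())
```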
Results
  • The authors improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups.
  • ALQ achieves the best overall performance on ImageNet, and the gap to ALQ-N on CIFAR-10 is less than 0.3%
Conclusion
  • To reduce communication costs of data-parallel SGD, the authors introduce two adaptively quantized methods, ALQ and AMQ, which learn and adapt the gradient quantization scheme on the fly.
  • In addition to the quantization scheme, processors in both methods learn and adapt their coding schemes in parallel by efficiently computing sufficient statistics of a parametric distribution (a sketch of such a sufficient-statistics update follows this list).
  • The authors establish a number of convergence guarantees for the adaptive methods.
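Sufficient statistics make the parallel adaptation cheap: each processor accumulates a few sums over its observed gradients, the sums are additive and can be all-reduced, and every processor then recovers the same fitted distribution and the same updated levels. The sketch below assumes, purely for illustration, a lognormal model of normalized gradient magnitudes and a simple quantile-based level placement; both the parametric family and the placement rule here are assumptions, not the paper's exact choices.

```python
import numpy as np
from statistics import NormalDist

class AdaptiveLevels:
    """Running sufficient statistics for an (assumed) lognormal model of
    normalized gradient magnitudes. The sums are additive, so processors can
    all-reduce them and recover identical quantization levels in parallel."""

    def __init__(self, num_levels=8):
        self.num_levels = num_levels
        self.count = 0
        self.sum_log = 0.0
        self.sum_log_sq = 0.0

    def observe(self, grad, eps=1e-12):
        # Normalize by the L-infinity norm, as in the quantizers sketched above.
        r = np.abs(grad) / (np.abs(grad).max() + eps)
        log_r = np.log(r + eps)
        self.count += log_r.size
        self.sum_log += float(log_r.sum())
        self.sum_log_sq += float(np.square(log_r).sum())

    def levels(self):
        # Fit mu, sigma of log r from the sufficient statistics, then place the
        # interior levels at evenly spaced quantiles of the fitted distribution
        # (an illustrative rule; the paper instead minimizes quantization variance).
        mu = self.sum_log / max(self.count, 1)
        var = max(self.sum_log_sq / max(self.count, 1) - mu ** 2, 1e-12)
        fitted = NormalDist(mu=mu, sigma=var ** 0.5)
        qs = np.linspace(0.05, 0.95, self.num_levels - 2)
        inner = np.exp([fitted.inv_cdf(q) for q in qs])
        return np.unique(np.concatenate(([0.0], np.clip(inner, 0.0, 1.0), [1.0])))

# Example: accumulate statistics from a few simulated gradients, then read off levels.
rng = np.random.default_rng(0)
adapt = AdaptiveLevels(num_levels=8)
for _ in range(10):
    adapt.observe(rng.standard_normal(1000))
print(adapt.levels())
```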
Tables
  • Table1: Validation accuracy on CIFAR-10 and ImageNet using 3 bits (except for SuperSGD and TRN) with 4 GPUs
  • Table2: Validation accuracy of ResNet32 on CIFAR-10 using 3 quantization bits (except for SuperSGD and TRN) and bucket size 16384
  • Table3: Training Hyper-parameters for CIFAR-10 and ImageNet
  • Table4: Validation Accuracy on Full ImageNet Run
  • Table5: Training ResNet50 on ImageNet with mini-batch size 512. Time per step for training with 32-bit full precision is 1.2s and with 16-bit full precision is 0.61s
  • Table6: Training ResNet18 on ImageNet with mini-batch size 512. Time per step for training with 32-bit full precision is 0.57s and with 16-bit full precision is 0.28s
  • Table7: Additional overhead of the proposed methods for training ResNet18 on ImageNet (Table 6). We also show the cost of performing 3 updates relative to the total cost of training for 60 epochs. Full-precision training for 60 epochs takes 95 hours with 32 bits and 46 hours with 16 bits
Related work
  • Adaptive quantization has been used for speech communication and storage [18]. In machine learning, several biased and unbiased schemes have been proposed to compress networks and gradients. Recently, lattice-based quantization has been studied for distributed mean estimation and variance reduction [19]. In this work, we focus on unbiased and coordinate-wise schemes to compress gradients.

    Alistarh et al [20] proposed Quantized SGD (QSGD), focusing on uniform quantization of stochastic gradients normalized to have unit Euclidean norm. Their experiments illustrate that a similar quantization method, in which gradients are normalized to have unit L∞ norm, achieves better performance; we refer to this method as QSGDinf, or Qinf for short. Wen et al [15] proposed TernGrad, which can be viewed as a special case of QSGDinf with three quantization levels (a minimal ternary sketch follows this paragraph).
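For concreteness, a ternary scheme in the spirit of TernGrad can be written in a few lines: each coordinate keeps its sign with probability proportional to its magnitude and is scaled by the L∞ norm, which keeps the estimate unbiased. This is an illustrative sketch under those assumptions, not the implementation from [15].

```python
import numpy as np

def ternary_quantize(grad, rng):
    """Map each coordinate to {-1, 0, +1} times the L-infinity norm.

    A coordinate survives with probability |g_i| / max|g|, so
    E[output_i] = scale * sign(g_i) * |g_i| / scale = g_i (unbiased).
    """
    scale = np.abs(grad).max()
    if scale == 0:
        return np.zeros_like(grad)
    keep = rng.random(grad.shape) < np.abs(grad) / scale
    return scale * np.sign(grad) * keep

g = np.random.default_rng(0).standard_normal(8)
print(g)
print(ternary_quantize(g, np.random.default_rng(1)))
```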
Funding
  • FF was supported by an OGS Scholarship
  • DA and IM were supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 805223 ScaleML)
  • DMR was supported by an NSERC Discovery Grant
  • ARK was supported by an NSERC Postdoctoral Fellowship
  • Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute
References
  • M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Proc. Advances in Neural Information Processing Systems (NIPS), 2010.
  • R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.
  • B. Recht, C. Ré, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proc. Advances in Neural Information Processing Systems (NIPS), 2011.
  • J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In Proc. Advances in Neural Information Processing Systems (NIPS), 2012.
  • A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and A. Ng. Deep learning with COTS HPC systems. In Proc. International Conference on Machine Learning (ICML), 2013.
  • T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In Proc. USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
  • M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In Proc. USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
  • J. C. Duchi, S. Chaturapruek, and C. Ré. Asynchronous stochastic convex optimization. In Proc. Advances in Neural Information Processing Systems (NIPS), 2015.
  • E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67, 2015.
  • S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Proc. Advances in Neural Information Processing Systems (NIPS), 2015.
  • F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proc. INTERSPEECH, 2014.
  • S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In Proc. International Conference on Machine Learning (ICML), 2015.
  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, and M. Devin. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
  • S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160, 2016.
  • W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Proc. Advances in Neural Information Processing Systems (NIPS), 2017.
  • J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Proc. International Conference on Machine Learning (ICML), 2018.
  • S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–358, 2015.
  • P. Cummiskey, N. S. Jayant, and J. L. Flanagan. Adaptive quantization in differential PCM coding of speech. Bell System Technical Journal, 52(7):1105–1118, 1973.
  • D. Alistarh, S. Ashkboos, and P. Davies. Distributed mean estimation with optimal error bounds. arXiv:2002.09268v2, 2020.
  • D. Alistarh, D. Grubic, J. Z. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proc. Advances in Neural Information Processing Systems (NIPS), 2017.
  • A. Ramezani-Kebrya, F. Faghri, and D. M. Roy. NUQSGD: Improved communication efficiency for data-parallel SGD via nonuniform quantization. arXiv:1908.06077v1, 2019.
  • S. Horváth, C.-Y. Ho, L. Horváth, A. N. Sahu, M. Canini, and P. Richtárik. Natural compression for distributed deep learning. arXiv:1905.10988v1, 2019.
  • H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Proc. International Conference on Machine Learning (ICML), 2017.
  • D. Zhang, J. Yang, D. Ye, and G. Hua. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. In Proc. European Conference on Computer Vision (ECCV), 2018.
  • F. Fu, Y. Hu, Y. He, J. Jiang, Y. Shao, C. Zhang, and B. Cui. Don’t waste your bits! Squeeze activations and gradients for deep neural networks via TinyScript. In Proc. International Conference on Machine Learning (ICML), 2020.
  • M. H. Protter and C. B. Morrey. Intermediate Calculus. Springer, 1985.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • D. Choi, C. J. Shallue, Z. Nado, J. Lee, C. J. Maddison, and G. E. Dahl. On empirical comparisons of optimizers for deep learning. arXiv:1910.05446, 2019.
  • T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 2006.
  • S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • T. Yang, Q. Lin, and Z. Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv:1604.03257v2, 2016.
  • B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
Authors
Fartash Faghri
Iman Tabrizian
Ilia Markov
Daniel Roy
Ali Ramezani-Kebrya