Double Quantization for Communication-Efficient Distributed Optimization

Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 4440–4451, 2019.

Abstract:

Modern distributed training of machine learning models often suffers from high communication overhead for synchronizing stochastic gradients and model parameters. In this paper, to reduce the communication complexity, we propose double quantization, a general scheme for quantizing both model parameters and gradients. Three communication-efficient ...
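To make the scheme concrete, here is a minimal sketch of the idea, assuming a generic unbiased stochastic uniform quantizer; the function names (quantize, worker_step) and the specific quantizer are illustrative, not the paper's implementation. The master broadcasts quantized model parameters, and each worker returns a quantized gradient evaluated at those low-precision parameters, so both directions of communication carry a shared scaling factor plus a few bits per coordinate instead of full-precision floats.

```python
import numpy as np

def quantize(x, num_levels=16, rng=None):
    """Unbiased stochastic uniform quantization onto num_levels points in [-scale, scale]."""
    rng = np.random.default_rng() if rng is None else rng
    scale = np.max(np.abs(x)) + 1e-12        # shared scaling factor (one float per message)
    step = 2.0 * scale / (num_levels - 1)    # spacing between quantization points
    normalized = (x + scale) / step          # map coordinates onto the grid [0, num_levels - 1]
    lower = np.floor(normalized)
    prob_up = normalized - lower             # rounding up with this probability keeps E[Q(x)] = x
    rounded = lower + (rng.random(x.shape) < prob_up)
    return rounded * step - scale            # value recovered by the receiver

def worker_step(params, grad_fn, num_levels=16):
    """Double quantization: low-precision parameters in, low-precision gradient out."""
    low_prec_params = quantize(params, num_levels)   # parameter quantization (master -> worker)
    grad = grad_fn(low_prec_params)                  # gradient computed at the low-precision point
    return quantize(grad, num_levels)                # gradient quantization (worker -> master)
```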

Introduction
  • The data parallel mechanism is a widely used architecture for distributed optimization, which has received much recent attention due to the explosion of data and increasing model complexity.
  • It decomposes the time-consuming gradient computation into sub-tasks and assigns them to separate worker machines for execution, as sketched below.
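As a schematic of this architecture, each of the M workers computes a stochastic gradient on its own shard of the data, and the master averages the sub-results and updates the shared parameters. This is a generic synchronous data-parallel loop for illustration only, not the paper's asynchronous protocol.

```python
import numpy as np

def data_parallel_sgd(shards, grad_fn, x0, lr=0.1, steps=100):
    """Generic synchronous data-parallel SGD (illustration only).

    shards:  list of M per-worker datasets
    grad_fn: grad_fn(x, shard) -> stochastic gradient computed on one shard
    """
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        # Each worker computes a gradient on its own shard (in parallel in practice).
        worker_grads = [grad_fn(x, shard) for shard in shards]
        # The master aggregates the sub-results and updates the shared parameters.
        x -= lr * np.mean(worker_grads, axis=0)
    return x
```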
Highlights
  • The data parallel mechanism is a widely used architecture for distributed optimization, which has received much recent attention due to data explosion and increasing model complexity
  • The training data is distributed among M workers and each worker maintains a local copy of model parameters
  • We show that AsyLPG achieves the same asymptotic convergence rate as its unquantized serial counterpart, but with a significantly lower communication cost
  • We combine gradient sparsification with double quantization and propose Sparse-AsyLPG to further reduce communication overhead; our analysis shows that the convergence rate scales with d/φ for a sparsity budget φ (a generic sparsification sketch follows this list)
  • We propose accelerated AsyLPG and mathematically prove that double quantization can be accelerated by the momentum technique [19, 26]
  • We conduct experiments on a multi-server distributed test-bed
  • We propose three communication-efficient algorithms for distributed training with asynchronous parallelism
  • The evaluations on logistic regression and neural networks with real-world datasets validate that our algorithms can significantly reduce communication cost
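The sketch below illustrates the sparsification step referenced above: under a sparsity budget φ, only the φ largest-magnitude gradient coordinates are kept and (optionally) quantized, so a worker transmits φ index/value pairs instead of all d coordinates. This is a generic top-φ sparsifier for illustration, not the paper's exact Sparse-AsyLPG procedure; the quantizer argument can be the stochastic quantizer from the sketch after the abstract.

```python
import numpy as np

def sparsify_then_quantize(grad, phi, quantizer=lambda v: v):
    """Keep the phi largest-magnitude coordinates of grad, then quantize them.

    `quantizer` can be the stochastic quantizer from the earlier sketch;
    the default (identity) keeps the selected values at full precision.
    Returns (indices, values): only phi index/value pairs need to be sent.
    """
    idx = np.argpartition(np.abs(grad), -phi)[-phi:]   # indices of the phi largest entries
    return idx, quantizer(grad[idx])

def decode(idx, values, d):
    """Rebuild a length-d vector with zeros outside the transmitted support."""
    out = np.zeros(d)
    out[idx] = values
    return out
```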
Methods
  • The authors conduct experiments to validate the efficiency of the algorithms.
  • The authors start with the logistic regression problem and then evaluate the performance of the algorithms on neural network models.
  • The authors further study the relationship between the hyperparameter μ and the number of transmitted bits.
  • [Figure residue: plots of the loss function's value versus time (s) and versus # of transmitted bits for Sparse-AsyLPG, Acc-AsyLPG, AsyFPG, QSVRG and Acc-AsyFPG; x-axis: # of epochs (20–100); labels: Computation, Encoding.]
Results
  • Evaluations on logistic regression and neural networks with real-world datasets validate that the algorithms can significantly reduce communication cost
Conclusion
  • The authors propose three communication-efficient algorithms for distributed training with asynchronous parallelism.
  • The authors analyze the variance of low-precision gradients and show that the algorithms achieve the same asymptotic convergence rate as their full-precision counterparts, while transmitting far fewer bits per iteration (a generic version of this variance calculation is sketched after this list).
  • The authors incorporate gradient sparsification into double quantization and establish the relation between the convergence rate and the sparsity budget.
  • Evaluations on logistic regression and neural networks with real-world datasets validate that the algorithms can significantly reduce communication cost
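The variance bullet above refers to an analysis in the paper; as a generic illustration (not the paper's exact bound), the standard calculation below applies to an unbiased stochastic uniform quantizer with s levels and spacing Δ, as in the earlier sketches. If a coordinate x lies between grid points a and a + Δ and is rounded up with probability p = (x − a)/Δ, then

\[
\mathbb{E}[Q(x)] = a + p\Delta = x,
\qquad
\mathrm{Var}[Q(x)] = p(1-p)\Delta^2 \le \tfrac{\Delta^2}{4},
\]

so for a d-dimensional gradient g quantized coordinate-wise with \(\Delta = 2\|g\|_\infty/(s-1)\),

\[
\mathbb{E}\,\|Q(g) - g\|^2 \;\le\; \frac{d\,\Delta^2}{4} \;=\; \frac{d\,\|g\|_\infty^2}{(s-1)^2}.
\]

An unbiased quantized gradient thus behaves like a stochastic gradient with an additional bounded variance term, which is why the asymptotic rate can be preserved while transmitting far fewer bits.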
Tables
  • Table 1: Evaluation on the MNIST dataset. Left: # of transmitted bits until the training loss first drops below 0.05. Right
Related work
  • Designing large-scale distributed algorithms for machine learning has been receiving increasing attention, and many algorithms, both synchronous and asynchronous, have been proposed, e.g., [22, 4, 17, 12]. To reduce the communication cost, researchers have also started to focus on cutting down the number of bits transmitted per iteration, based mainly on two schemes: quantization and sparsification.

    Quantization. Algorithms based on quantization store a floating-point number using a limited number of bits. For example, [25] quantized gradients to a representation in {−1, 1} and empirically showed the communication efficiency in training deep neural networks. [5, 6] considered bi-directional communication of gradients between the master and the workers: each worker transmitted the sign of its gradient to the master, and the master aggregated the signs by majority vote. [2, 34, 35] adopted an unbiased gradient quantization with multiple levels. [13] provided a convergence rate of O(1/√K) for SGD with an unbiased gradient quantizer on nonconvex objectives, where K is the number of iterations. The error-feedback method was applied in [25, 35, 29] to integrate the quantization error of previous iterations into the current stage. Specifically, [29] compressed transmitted gradients with error compensation in both directions between the master and the workers, and showed a linear speedup in the nonconvex case. [15] constructed several examples where simply transmitting the gradient sign fails to converge; they combined the error-feedback method to fix the divergence and proved a convergence rate for nonconvex smooth objectives. [40] also studied bi-directional compression with error feedback: they partitioned gradients into several blocks, which were compressed separately using different 1-bit quantizers, and analyzed the convergence rate when momentum is integrated. [9] proposed a low-precision framework for SVRG [14], which quantized model parameters for single-machine computation. [38] proposed an end-to-end low-precision scheme, which quantized the data, model and gradients with synchronous parallelism. A biased quantization with gradient clipping was analyzed in [37]. [8] empirically studied asynchronous and low-precision SGD on logistic regression. [28] considered decentralized training and proposed an extrapolation compression method to obtain a higher compression level. [36] proposed a two-phase parameter quantization method, where the parameters in the first phase were a linear combination of full-precision and low-precision parameters; in the second phase, the weight of the full-precision value was set to zero to obtain full compression.
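For illustration, the snippet below shows a generic error-feedback loop of the kind discussed in [15, 25, 29, 35]: the compression error from each round is stored and added back before compressing the next gradient. The scaled-sign compressor and the function names are placeholders, not any specific algorithm from those papers.

```python
import numpy as np

def sign_compress(v):
    """1-bit style compressor: transmit the sign, scaled to preserve the mean magnitude."""
    return np.sign(v) * np.mean(np.abs(v))

def train_with_error_feedback(grad_fn, x0, lr=0.1, steps=100, compress=sign_compress):
    x = np.array(x0, dtype=float)
    error = np.zeros_like(x)                 # accumulated compression error (memory)
    for _ in range(steps):
        g = grad_fn(x)
        corrected = g + error                # add back what was lost in earlier rounds
        compressed = compress(corrected)     # what would actually be transmitted
        error = corrected - compressed       # remember the new compression error
        x -= lr * compressed                 # update with the compressed message
    return x

# Example: minimize the quadratic f(x) = 0.5 * ||x - target||^2.
target = np.array([1.0, -2.0, 3.0])
x_star = train_with_error_feedback(lambda x: x - target, x0=np.zeros(3))
print(x_star)   # approaches `target` despite the 1-bit style compression
```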
Funding
  • The work of Yue Yu and Longbo Huang was supported in part by the National Natural Science Foundation of China Grant 61672316
References
  • [1] A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
  • [2] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
  • [3] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5977–5987, 2018.
  • [4] R. Bekkerman, M. Bilenko, and J. Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2011.
  • [5] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning (ICML), 2018.
  • [6] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. In International Conference on Learning Representations (ICLR), 2019.
  • [7] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
  • [8] C. De Sa, M. Feldman, C. Ré, and K. Olukotun. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 561–574. ACM, 2017.
  • [9] C. De Sa, M. Leszczynski, J. Zhang, A. Marzoev, C. R. Aberger, K. Olukotun, and C. Ré. High-accuracy low-precision training. arXiv preprint arXiv:1803.03383, 2018.
  • [10] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [12] Z. Huo and H. Huang. Asynchronous stochastic gradient descent with variance reduction for non-convex optimization. arXiv preprint arXiv:1604.03584, 2016.
  • [13] P. Jiang and G. Agrawal. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Advances in Neural Information Processing Systems, pages 2525–2536. Curran Associates, Inc., 2018.
  • [14] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
  • [15] S. P. Karimireddy, Q. Rebjock, S. U. Stich, and M. Jaggi. Error feedback fixes signSGD and other gradient compression schemes. In International Conference on Machine Learning (ICML), 2019.
  • [16] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009.
  • [17] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.
  • [18] MNIST. http://yann.lecun.com/exdb/mnist/.
  • [19] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
  • [20] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2003.
  • [21] OpenMPI. https://www.open-mpi.org/.
  • [22] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
  • [23] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. J. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, pages 2647–2655, 2015.
  • [24] S. J. Reddi, S. Sra, B. Poczos, and A. Smola. Fast stochastic methods for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems, 2016.
  • [25] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • [26] F. Shang, Y. Liu, J. Cheng, and J. Zhuo. Fast stochastic variance reduced gradient method with momentum acceleration for machine learning. arXiv preprint arXiv:1703.07948, 2017.
  • [27] S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pages 4452–4463, 2018.
  • [28] H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu. Communication compression for decentralized training. In Advances in Neural Information Processing Systems, pages 7652–7662, 2018.
  • [29] H. Tang, C. Yu, X. Lian, T. Zhang, and J. Liu. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning (ICML), 2019.
  • [30] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright. ATOMO: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, 2018.
  • [31] J. Wang, M. Kolar, N. Srebro, and T. Zhang. Efficient distributed learning with sparsity. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3636–3645, 2017.
  • [32] J. Wang, W. Wang, and N. Srebro. Memory and communication efficient distributed stochastic optimization with minibatch-prox. In Conference on Learning Theory (COLT), 2017.
  • [33] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, 2018.
  • [34] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.
  • [35] J. Wu, W. Huang, J. Huang, and T. Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. In International Conference on Machine Learning (ICML), 2018.
  • [36] P. Yin, S. Zhang, J. Lyu, S. Osher, Y. Qi, and J. Xin. BinaryRelax: A relaxation approach for training deep neural networks with quantized weights. SIAM Journal on Imaging Sciences, 11(4):2205–2223, 2018.
  • [37] Y. Yu, J. Wu, and J. Huang. Exploring fast and communication-efficient algorithms in large-scale distributed networks. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • [38] H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning (ICML), pages 4035–4043, 2017.
  • [39] X. Zhang, J. Liu, Z. Zhu, and E. S. Bentley. Compressed distributed gradient descent: Communication-efficient consensus over networks. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications. IEEE, 2019.
  • [40] S. Zheng, Z. Huang, and J. T. Kwok. Communication-efficient distributed blockwise momentum SGD with error-feedback. In Advances in Neural Information Processing Systems, 2019.