Distributed Training with Heterogeneous Data: Bridging Median- and Mean-Based Algorithms

NeurIPS 2020.


Abstract:

Recently, there has been growing interest in the study of median-based algorithms for distributed non-convex optimization. Two prominent such algorithms include signSGD with majority vote, an effective approach for communication reduction via 1-bit compression on the local gradients, and medianSGD, an algorithm recently proposed to ensure robustness…

Introduction
  • In the past few years, deep neural networks have achieved great successes in many tasks including computer vision and natural language processing.
  • The paper studies distributed optimization, where each node i can only access information about its local function f_i(·), defined by its local data.
  • Such a local objective takes the form of either an expected loss over the local data distribution, or an empirical average of loss functions evaluated on a finite number of data points, as sketched below.
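The bullets above describe the standard distributed setup; as a reference, here is a minimal sketch of that formulation. The symbols M (number of nodes), D_i (node i's data distribution), n_i (its number of local samples), and ℓ (the per-sample loss) are notation introduced here for illustration, not taken from the page.

```latex
\min_{x \in \mathbb{R}^d} \; f(x) \;=\; \frac{1}{M} \sum_{i=1}^{M} f_i(x),
\qquad
f_i(x) \;=\; \mathbb{E}_{\xi \sim \mathcal{D}_i}\big[\ell(x;\xi)\big]
\quad \text{or} \quad
f_i(x) \;=\; \frac{1}{n_i} \sum_{j=1}^{n_i} \ell(x;\xi_{i,j}).
```

Heterogeneity means the D_i (and hence the f_i) may differ across nodes, which is exactly the regime in which median-based and mean-based aggregation disagree.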
Highlights
  • In the past few years, deep neural networks have achieved great successes in many tasks including computer vision and natural language processing.
  • We show that when the data at different nodes come from different distributions, the class of median-based algorithms suffers from non-convergence caused by using the median to approximate the mean.
  • To fix the non-convergence issue, we provide a perturbation mechanism that shrinks the gap between the expected median and the mean (illustrated numerically after this list).
  • After incorporating the perturbation mechanism into signSGD and medianSGD, we show that both algorithms can guarantee convergence to stationary points at a rate of O(d^{3/4}/T^{1/4}).
  • The perturbation mechanism can be approximately realized by sub-sampling the data during gradient evaluation, which partly supports the use of sub-sampling in practice.
  • We conducted experiments on training neural nets to show the necessity of the perturbation mechanism and sub-sampling.
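As a small numerical illustration of the two highlights about non-convergence and the perturbation mechanism, the sketch below uses made-up scalar gradients at three nodes and uniform perturbations of scale b; it is not the authors' code, but it shows how the plain median can even have the wrong sign, and how the expected median of the perturbed gradients moves toward the mean as b grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical gradients reported by 3 heterogeneous nodes for one coordinate.
local_grads = np.array([-3.0, 1.0, 1.0])
mean_grad = local_grads.mean()        # -1/3: the direction mean-based SGD would follow
median_grad = np.median(local_grads)  # +1.0: wrong sign -> risk of non-convergence

def expected_perturbed_median(grads, b, num_samples=500_000):
    """Monte-Carlo estimate of E[median(g_i + u_i)] with u_i ~ Uniform(-b, b)."""
    noise = rng.uniform(-b, b, size=(num_samples, grads.size))
    return np.median(grads + noise, axis=1).mean()

for b in [0.5, 2.0, 8.0, 32.0]:
    est = expected_perturbed_median(local_grads, b)
    print(f"b={b:5.1f}  E[median] ~ {est:+.3f}  "
          f"(mean = {mean_grad:+.3f}, plain median = {median_grad:+.3f})")
```

As the noise scale b increases, the estimated expected median drifts away from the plain median and toward the mean, which is the gap-shrinking effect the perturbation mechanism relies on.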
Methods
  • The authors show how noise helps the practical behavior of the algorithm. Since signSGD is better studied empirically and medianSGD is so far mainly of theoretical interest, the authors use signSGD to demonstrate the benefit of injecting noise.
  • The authors first study the asymptotic performance of different algorithms, using a subset of MNIST and training neural networks until convergence.
  • The authors compare Noisy signSGD (Algorithm 3) with different values of the noise scale b, signSGD with sub-sampling of the data, and signSGD without any noise (a simplified sketch of one noisy sign-and-vote step is given after this list).
  • It should be noted that signSGD without noise converges to solutions where the gradients are quite large compared with the amount of noise added by Noisy signSGD or signSGD with sub-sampling.
  • Since the added noise is not strong enough to bridge the gap on its own, the exploration effect of the noise may be what contributes to making the final gradient small.
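For concreteness, here is a minimal single-round sketch of the kind of noisy sign-and-majority-vote update compared in these experiments. The function names, the toy quadratic local objectives, the step size, and the noise scale b are illustrative assumptions; this is not the authors' Algorithm 3 or their experimental code.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_signsgd_step(x, local_grad_fns, lr=0.005, b=4.0):
    """One server round of sign-compressed SGD with majority vote, with uniform
    noise of scale b added to each local gradient before taking its sign
    (a sketch of the noisy variant discussed above, not reference code)."""
    local_signs = []
    for grad_fn in local_grad_fns:
        g = grad_fn(x)                           # node's local (stochastic) gradient
        u = rng.uniform(-b, b, size=g.shape)     # perturbation added before the sign
        local_signs.append(np.sign(g + u))       # each node sends 1 bit per coordinate
    vote = np.sign(np.sum(local_signs, axis=0))  # server-side majority vote
    return x - lr * vote

# Hypothetical heterogeneous nodes: f_i(x) = 0.5 * ||x - c_i||^2 with different centers c_i.
centers = [np.array([3.0, -1.0]), np.array([-1.0, 2.0]), np.array([-1.0, 2.0])]
local_grad_fns = [lambda x, c=c: x - c for c in centers]

x = np.zeros(2)
for _ in range(2000):
    x = noisy_signsgd_step(x, local_grad_fns)
print(x)  # hovers roughly near the average minimizer (1/3, 1), not the coordinate-wise median (-1, 2)
```

Setting b = 0 recovers plain signSGD with majority vote; in this toy example the iterates are then pulled toward the coordinate-wise median of the c_i instead, which is the non-convergence behavior the injected noise is meant to counteract.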
Conclusion
  • The authors uncover the connection between signSGD and medianSGD by showing that signSGD is a median-based algorithm.
  • The authors show that when the data at different nodes come from different distributions, the class of median-based algorithms suffers from non-convergence caused by using the median to approximate the mean.
  • After incorporating the perturbation mechanism into signSGD and medianSGD, the authors show that both algorithms can guarantee convergence to stationary points at a rate of O(d^{3/4}/T^{1/4}).
  • To the best of the authors' knowledge, this is the first time that median-based methods, including signSGD and medianSGD, converge with a provable rate for distributed problems with heterogeneous data.
  • The authors conducted experiments on training neural nets to show the necessity of the perturbation mechanism and sub-sampling (a sketch of the perturbed medianSGD update follows this list).
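As a companion to the sketch after the Methods list, the snippet below shows the corresponding perturbed medianSGD-style update, in which the server aggregates the perturbed local gradients by a coordinate-wise median rather than a mean. It is a sketch of the idea under the same illustrative assumptions (uniform noise of scale b), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def perturbed_mediansgd_step(x, local_grad_fns, lr=0.05, b=4.0):
    """One server round: add uniform noise of scale b to each local gradient,
    then descend along the coordinate-wise median of the perturbed gradients."""
    perturbed = [g(x) + rng.uniform(-b, b, size=x.shape) for g in local_grad_fns]
    return x - lr * np.median(perturbed, axis=0)
```

Without the perturbation (b = 0), the coordinate-wise median of heterogeneous local gradients can stay bounded away from the mean gradient, which is the source of the non-convergence described above.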
Related work
  • Distributed training and communication efficiency. Distributed training of neural nets has become popular since the work of Dean et al. [2012], in which distributed SGD was shown to achieve significant acceleration compared with SGD [Robbins and Monro, 1951]. As an example, Goyal et al. [2017] showed that distributed training of ResNet-50 [He et al., 2016] can finish within an hour. A recent line of work provides methods for communication reduction in distributed training, including stochastic quantization [Alistarh et al., 2017, Wen et al., 2017] and 1-bit gradient compression such as signSGD [Bernstein et al., 2018a,b].

    Byzantine-robust optimization. Byzantine-robust optimization has drawn increasing attention in the past few years. Its goal is to ensure the performance of optimization algorithms in the presence of Byzantine failures. Alistarh et al. [2018] developed a variant of SGD based on detecting Byzantine nodes. Yin et al. [2018] proposed medianGD, which is shown to converge at the optimal statistical rate. Blanchard et al. [2017] proposed a robust aggregation rule called Krum. It is shown in Bernstein et al. [2018b] that signSGD is also robust against certain failures. Most existing works assume homogeneous data. In addition, Bagdasaryan et al. [2018] showed that many existing Byzantine-robust methods are vulnerable to adversarial attacks.
Reference
  • Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
  • Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 4613–4623, 2018.
  • Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. How to backdoor federated learning. arXiv preprint arXiv:1807.00459, 2018.
  • Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Proceedings of the International Conference on Machine Learning (ICML), pages 559–568, 2018a.
  • Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. In Proceedings of the International Conference on Learning Representations (ICLR), 2018b.
  • Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 119–129, 2017.
  • Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konecny, Stefano Mazzocchi, H Brendan McMahan, et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.
  • Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
  • Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U Stich, and Martin Jaggi. Error feedback fixes signSGD and other gradient compression schemes. arXiv preprint arXiv:1901.09847, 2019.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Jakub Konečny, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
  • Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI'14, pages 583–598, Berkeley, CA, USA, 2014. USENIX Association. ISBN 978-1-931971-16-4. URL http://dl.acm.org/citation.cfm?id=2685048.2685095.
  • Brendan McMahan and Daniel Ramage. Federated learning: Collaborative machine learning without centralized training data. Google Research Blog, 3, 2017.
  • Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, 2017.
  • Steven J Miller. The Probability Lifesaver: Order Statistics and the Median Theorem. Princeton University Press, 2017.
  • Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237, 2019.
  • Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, pages 4424–4434, 2017.
  • Robert A Wannamaker, Stanley P Lipshitz, John Vanderkooy, and J Nelson Wright. A theory of nonsubtractive dither. IEEE Transactions on Signal Processing, 48(2):499–516, 2000.
  • Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.
  • Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In Proceedings of the International Conference on Machine Learning (ICML), pages 5636–5645, 2018.