Elastic Consistency: A Practical Consistency Model for Distributed Stochastic Gradient Descent
AAAI, pp. 9037–9045 (2021)
We introduce a general consistency condition covering communication-reduced and asynchronous distributed stochastic gradient descent implementations.
One key element behind the recent progress of machine learning has been the ability to train machine learning models in large-scale distributed shared-memory and message-passing environments. Most of these models are trained employing variants of stochastic gradient descent (SGD) based optimization, but most methods involve some type of ...
- Machine learning models can match or surpass humans on specialized tasks such as image classification (Krizhevsky, Sutskever, and Hinton 2012; He et al 2016), speech recognition (Seide et al 2014), or complex games (Silver et al 2016).
- For synchronous message-passing with communication compression, the framework implies the first general bounds for the parallel, multi-node case: references (Stich, Cordonnier, and Jaggi 2018; Karimireddy et al 2019) derive tight rates for such methods, but only in the sequential case, where a single node applies the compressed gradient to its own model, whereas (Alistarh et al 2018) considers the multi-node case, but requires an additional analytic assumption.
- We introduce a convergence criterion for stochastic gradient descent (SGD)-based optimization called elastic consistency, which is independent of the system model but can be specialized to cover various model consistency relaxations (a sketch of the condition is given after this list)
- Under standard smoothness assumptions on the loss, elastic consistency is sufficient to guarantee convergence rates for inconsistent SGD iterations, for both convex and non-convex objectives. The condition is also necessary for SGD convergence: we provide simple worst-case instances in which the convergence rate depends linearly on the elastic consistency parameter, and the iterations diverge if elastic consistency is regularly broken
- We show that elastic consistency is satisfied by both asynchronous message-passing and shared-memory models, centralized or decentralized, with or without faults, and by communication-reduced methods
- Please note that, in the above, time t counts each time step at which a stochastic gradient is generated at a node, in sequential order
- Assuming that the parameters are constant, non-convex objectives converge at a rate of O(1/√T) for the SGD iterations defined in (10), and at a rate of O(1/√(Tp)) for the SGD iterations defined in (11)
- In the crash-prone case, elastic consistency implies new convergence bounds for crash or message-omission faults.
- The authors will allow processes to start their forward pass before all layers are synchronized, as long as enough gradient norm has been received to ensure a small elastic consistency constant.
- The elastic scheduling rule will allow the processor to start its forward pass before this point, on the inconsistent view, as long as the norm of the received update is at least a β-fraction of the norm of its own gradient at that step.
- The authors can show that the elastic consistency constant B is upper bounded by O(M), since a processor cannot miss more than one gradient.
- The variance-bounded scheduler suggests a way of improving the elastic consistency bounds for crash and message-drop faults and for asynchrony with delay τmax: instead of proceeding without the dropped messages, each node can replace the corresponding missing gradient with its own (see the scheduling sketch after this list).
- The authors' framework improves in three respects: (i) it does not require stringent gradient sparsity assumptions; (ii) it is able to analyze the case where the updates are not unbiased estimators of the gradient, which allows extensions to error-feedback communication-reduction; and (iii) it tackles convergence for general non-convex objectives.
- Qiao et al (Qiao et al 2019) model asynchrony and communication reduction as perturbations of the SGD iteration, and introduce a metric called “rework cost,” which can be subsumed into the elastic consistency bound.
- Karimireddy et al (Karimireddy et al 2019) analyze communication-compression with error feedback, and present a general notion of δ-compressor to model communication-reduced consistency relaxations; later, the framework was extended to include asynchronous iterations (Stich and Karimireddy 2019).
- The authors' framework generalizes in one important practical aspect, as it allows the analysis in distributed settings: (Karimireddy et al 2019; Stich and Karimireddy 2019) assume that the iterations are performed at a single processor, which may compress gradients or view inconsistent information only with respect to its own earlier iterations.
- Table 1: Summary of elastic consistency bounds
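For concreteness, the following is a minimal sketch of how the elastic consistency condition can be stated. The notation is our own shorthand for the setup described in the bullets (x_t for the global iterate, v_t^p for the possibly stale view used by node p at step t, η_t for the step size, and B for the elastic consistency constant); the precise definition appears in the full paper.

```latex
% Sketch of the elastic consistency condition (notation assumed, not verbatim):
% the view a node computes its gradient on may drift from the global iterate,
% but only by an amount proportional to the current step size.
\[
  \mathbb{E}\left\| x_t - v_t^{\,p} \right\| \;\le\; \eta_t \, B
  \qquad \text{for every step } t \text{ and every node } p .
\]
```

Read this way, the bullets above say that any system model keeping this drift proportional to the step size (asynchronous shared memory, message passing with compression, crash or message-omission faults) inherits the stated convergence rates, with B entering the constants.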
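Similarly, the elastic scheduling rule and the missing-gradient substitution described in the bullets can be illustrated with a short sketch. This is not the authors' implementation: the function names, the β threshold check, and the averaging scheme below are illustrative assumptions reconstructed from the prose.

```python
import numpy as np

def should_start_forward_pass(received_updates, own_gradient, beta=0.5):
    """Elastic scheduling check (illustrative sketch, not the paper's code).

    Allow a processor to start its forward pass on an inconsistent view as
    soon as the norm of the update mass it has received is at least a
    beta-fraction of the norm of its own gradient at this step.
    """
    if not received_updates:
        return False
    received_norm = np.linalg.norm(sum(received_updates))
    return received_norm >= beta * np.linalg.norm(own_gradient)

def aggregate_with_substitution(gradients_by_node, own_gradient, num_nodes):
    """Variance-bounded substitution (illustrative sketch).

    Instead of proceeding without gradients lost to crashes, message drops,
    or excessive delay, substitute each missing gradient with the node's own,
    keeping the averaged update close to the fully consistent one.
    """
    total = np.zeros_like(own_gradient)
    for node in range(num_nodes):
        g = gradients_by_node.get(node)
        total += g if g is not None else own_gradient  # substitute missing gradient
    return total / num_nodes
```

In both cases the point is the same: acting early, or substituting a missing gradient, perturbs the node's view only by an amount comparable to a single step, so the elastic consistency constant stays bounded.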
Related Work and Discussion
Distributed machine learning has recently gained significant practical adoption, e.g. (Dean et al 2012; Ho et al 2013; Chilimbi et al 2014; Zhang, Choromanska, and LeCun 2015; Xing et al 2015; Jayarajan et al 2019; Peng et al 2019). Consequently, there has been significant work on introducing and analyzing distributed relaxations of SGD (Recht et al 2011; Ho et al 2013; Sa et al 2015; Lian et al 2015; Chaturapruek, Duchi, and Ré 2015; Leblond, Pedregosa, and Lacoste-Julien 2017; Alistarh, De Sa, and Konstantinov 2018; Wang and Joshi 2018; Woodworth et al 2018; Karimireddy et al 2019; Stich and Karimireddy 2019; Lu, Nash, and De Sa 2020). Due to space constraints, we cover in detail only work that is technically close to ours.
Specifically, De Sa et al (Sa et al 2015) were the first to consider a unified analysis framework for asynchronous and communication-compressed iterations. Relative to it, our framework improves in three respects: (i) it does not require stringent gradient sparsity assumptions; (ii) it is also able to analyze the case where the updates are not unbiased estimators of the gradient, which allows extensions to error-feedback communication-reduction; and (iii) it also tackles convergence for general non-convex objectives. Reference (Lian et al 2015) presented the first general analysis of asynchronous non-convex SGD, without communication reduction. Qiao et al (Qiao et al 2019) model asynchrony and communication reduction as perturbations of the SGD iteration, and introduce a metric called “rework cost,” which can be subsumed into the elastic consistency bound.
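For readers unfamiliar with the error-feedback mechanism mentioned above, the following is a minimal single-node sketch of compressed SGD with error feedback. The top-k compressor and all names below are illustrative assumptions, not taken from any of the cited implementations.

```python
import numpy as np

def top_k_compress(vector, k):
    """Keep only the k largest-magnitude entries (a simple biased compressor)."""
    compressed = np.zeros_like(vector)
    idx = np.argsort(np.abs(vector))[-k:]
    compressed[idx] = vector[idx]
    return compressed

def error_feedback_sgd_step(params, gradient, error, lr=0.01, k=10):
    """One SGD step with error feedback (illustrative sketch).

    The compression error accumulated in previous steps is added back to the
    current update before compressing, so no gradient information is lost
    permanently; only the compressed part is applied to the model.
    """
    corrected = lr * gradient + error       # add back the accumulated error
    update = top_k_compress(corrected, k)   # the part that is actually applied
    new_error = corrected - update          # remember what was dropped
    return params - update, new_error
```

The accumulated error is roughly the gap between the compressed and uncompressed iterates, which is the kind of quantity an elastic consistency bound controls; in the multi-node setting analyzed here, each node would keep its own error accumulator while the compressed updates are exchanged.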
- This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 805223 ScaleML)
- Bapi Chatterjee was supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 754411 (ISTPlus)
- Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, volume 16, 265–283.
- Aji, A. F.; and Heafield, K. 2017. Sparse Communication for Distributed Gradient Descent. In EMNLP, 440–445.
- Alistarh, D.; De Sa, C.; and Konstantinov, N. 2018. The convergence of stochastic gradient descent in asynchronous shared memory. In PODC, 169–178.
- Alistarh, D.; Grubic, D.; Li, J.; Tomioka, R.; and Vojnovic, M. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. In NIPS, 1709–1720.
- Alistarh, D.; Hoefler, T.; Johansson, M.; Konstantinov, N.; Khirirat, S.; and Renggli, C. 2018. The convergence of sparsified gradient methods. In NIPS, 5977–5987.
- Attiya, H.; and Welch, J. 2004. Distributed computing: fundamentals, simulations, and advanced topics, volume 19. J. W. & Sons.
- Bertsekas, D. P.; and Tsitsiklis, J. N. 1989. Parallel and distributed computation: numerical methods, volume 23. Prentice Hall, Englewood Cliffs, NJ.
- Bubeck, S.; et al. 2015. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning 8(3-4): 231–357.
- Chaturapruek, S.; Duchi, J. C.; and Ré, C. 2015. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don’t care. In NIPS, 1531–1539.
- Chilimbi, T. M.; Suzue, Y.; Apacible, J.; and Kalyanaraman, K. 2014. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In OSDI, volume 14, 571–582.
- Dean, J.; Corrado, G.; Monga, R.; Chen, K.; Devin, M.; Mao, M.; Senior, A.; Tucker, P.; Yang, K.; Le, Q. V.; et al. 2012. Large scale distributed deep networks. In NIPS, 1223–1231.
- Ghadimi, S.; and Lan, G. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4): 2341–2368.
- Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; and He, K. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677.
- He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
- Ho, Q.; Cipar, J.; Cui, H.; Lee, S.; Kim, J. K.; Gibbons, P. B.; Gibson, G. A.; Ganger, G.; and Xing, E. P. 2013. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, 1223–1231.
- Jayarajan, A.; Wei, J.; Gibson, G.; Fedorova, A.; and Pekhimenko, G. 2019. Priority-based parameter propagation for distributed DNN training. arXiv preprint arXiv:1905.03960.
- Karimireddy, S. P.; Rebjock, Q.; Stich, S. U.; and Jaggi, M. 2019. Error Feedback Fixes SignSGD and other Gradient Compression Schemes. In ICML, 3252–3261.
- Ketkar, N. 2017. Introduction to PyTorch. In Deep Learning with Python, 195–208. Springer.
- Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images.
- Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS, 1097–1105.
- Leblond, R.; Pedregosa, F.; and Lacoste-Julien, S. 2017. ASAGA: Asynchronous Parallel SAGA. In AISTATS, 46–54.
- Li, M.; Andersen, D. G.; Park, J. W.; Smola, A. J.; Ahmed, A.; Josifovski, V.; Long, J.; Shekita, E. J.; and Su, B.-Y. 2014. Scaling Distributed Machine Learning with the Parameter Server. In OSDI, volume 1, 3.
- Lian, X.; Huang, Y.; Li, Y.; and Liu, J. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. In NIPS, 2737–2745.
- Lian, X.; Zhang, C.; Zhang, H.; Hsieh, C.; Zhang, W.; and Liu, J. 2017. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. In NIPS, 5330–5340.
- Lin, T.; Stich, S. U.; Patel, K. K.; and Jaggi, M. 2018a. Don’t Use Large Mini-Batches, Use Local SGD. arXiv preprint arXiv:1808.07217.
- Lin, Y.; Han, S.; Mao, H.; Wang, Y.; and Dally, B. 2018b. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In ICLR, Poster.
- Lu, Y.; Li, Z.; and Sa, C. D. 2020. Towards Optimal Convergence Rate in Decentralized Stochastic Training. ArXiv abs/2006.08085.
- Lu, Y.; Nash, J.; and De Sa, C. 2020. MixML: A Unified Analysis of Weakly Consistent Parallel Learning. arXiv preprint arXiv:2005.06706.
- Nadiradze, G.; Markov, I.; Chatterjee, B.; Kungurtsev, V.; and Alistarh, D. 2020. Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent. arXiv preprint arXiv:2001.05918.
- Nguyen, L. M.; Nguyen, P. H.; van Dijk, M.; Richtárik, P.; Scheinberg, K.; and Takác, M. 2018. SGD and Hogwild! Convergence Without the Bounded Gradients Assumption. In ICML, 3747–3755.
- Peng, Y.; Zhu, Y.; Chen, Y.; Bao, Y.; Yi, B.; Lan, C.; Wu, C.; and Guo, C. 2019. A generic communication scheduler for distributed DNN training acceleration. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, 16–29.
- Qiao, A.; Aragam, B.; Zhang, B.; and Xing, E. P. 2019. Fault Tolerance in Iterative-Convergent Machine Learning. In ICML, 5220–5230.
- Recht, B.; Re, C.; Wright, S.; and Niu, F. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 693–701.
- Robbins, H.; and Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics 22(3): 400–407.
- Sa, C. D.; Zhang, C.; Olukotun, K.; and Ré, C. 2015. Taming the Wild: A Unified Analysis of Hogwild-Style Algorithms. In NIPS, 2674–2682.
- Seide, F.; Fu, H.; Droppo, J.; Li, G.; and Yu, D. 2014. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In INTERSPEECH, 1058–1062.
- Sergeev, A.; and Del Balso, M. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799.
- Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587): 484–489.
- Stich, S. U. 2018. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767.
- Stich, S. U.; Cordonnier, J.; and Jaggi, M. 2018. Sparsified SGD with Memory. In NIPS, 4452–4463.
- Stich, S. U.; and Karimireddy, S. P. 2019. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350.
- Strom, N. 2015. Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association.
- Wang, J.; and Joshi, G. 2018. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576.
- Wangni, J.; Wang, J.; Liu, J.; and Zhang, T. 2018. Gradient sparsification for communication-efficient distributed optimization. In NIPS, 1306–1316.
- Woodworth, B. E.; Wang, J.; Smith, A.; McMahan, B.; and Srebro, N. 2018. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In Advances in neural information processing systems, 8496–8506.
- Xing, E. P.; Ho, Q.; Dai, W.; Kim, J. K.; Wei, J.; Lee, S.; Zheng, X.; Xie, P.; Kumar, A.; and Yu, Y. 2015. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data 1(2): 49–67.
- Zagoruyko, S.; and Komodakis, N. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146.
- Zhang, S.; Choromanska, A. E.; and LeCun, Y. 2015. Deep learning with elastic averaging SGD. In Advances in neural information processing systems, 685–693.