TL;DR
We introduce a general consistency condition covering communication-reduced and asynchronous distributed stochastic gradient descent implementations

Elastic Consistency: A Practical Consistency Model for Distributed Stochastic Gradient Descent

AAAI 2021, pp. 9037–9045

Abstract

One key element behind the recent progress of machine learning has been the ability to train machine learning models in large-scale distributed shared-memory and message-passing environments. Most of these models are trained employing variants of stochastic gradient descent (SGD) based optimization, but most methods involve some type of …

Introduction
  • Machine learning models can match or surpass humans on specialized tasks such as image classification (Krizhevsky, Sutskever, and Hinton 2012; He et al 2016), speech recognition (Seide et al 2014), or complex games (Silver et al 2016).
  • For synchronous message-passing with communication compression, the framework implies the first general bounds for the parallel, multi-node case: (Stich, Cordonnier, and Jaggi 2018; Karimireddy et al 2019) derive tight rates for such methods, but only in the sequential case, where a single node applies the compressed gradient to its own model, whereas (Alistarh et al 2018) considers the multi-node case but requires an additional analytic assumption.
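
To make the multi-node setting concrete, the following is a minimal schematic of data-parallel SGD with gradient compression, in which every node computes a local stochastic gradient, compresses it, and applies the averaged compressed update to its own model copy. This is a generic sketch of the setting rather than the paper's algorithm; the names `random_sparsify` and `multi_node_step`, and the choice of an unbiased rescaled sparsifier, are illustrative assumptions.

```python
import numpy as np

def random_sparsify(g, p, rng):
    """Illustrative unbiased compressor: keep each coordinate with probability p
    and rescale the kept coordinates by 1/p so the result is unbiased."""
    mask = rng.random(g.shape) < p
    return np.where(mask, g / p, 0.0)

def multi_node_step(models, stochastic_gradient, lr, p, rng):
    """One synchronous step of multi-node compressed-gradient SGD (schematic):
    each node contributes a compressed gradient computed on its own model copy,
    and every node applies the same averaged update."""
    compressed = [random_sparsify(stochastic_gradient(x), p, rng) for x in models]
    avg_update = sum(compressed) / len(models)
    return [x - lr * avg_update for x in models]
```

In this fully synchronous form all copies stay identical; the consistency relaxations analyzed in the paper arise when the aggregation, delivery, or timing of these updates is weakened.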
Highlights
  • We introduce a convergence criterion for stochastic gradient descent (SGD)-based optimization called elastic consistency, which is independent of the system model, but can be specialized to cover various model consistency relaxations
  • Under standard smoothness assumptions on the loss, elastic consistency is sufficient to guarantee convergence rates for inconsistent SGD iterations for both convex and nonconvex objectives. This condition is necessary for SGD convergence: we provide simple worst-case instances where SGD convergence is linear in the elastic consistency parameter, showing that the iterations will diverge if elastic consistency is regularly broken
  • We show that elastic consistency is satisfied by both asynchronous message-passing and shared-memory models, centralized or decentralized, with or without faults, and by communication-reduced methods
  • Please note that, in the above, time t counts each time step at which a stochastic gradient is generated at a node, in sequential order
  • Assuming that the parameters are constant, the convergence for non-convex objectives is at a rate of O(1/√T) for the SGD iterations defined in (10) and O(1/√(Tp)) for the SGD iterations defined in (11)
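
To make the criterion and the rates above concrete, here is a schematic rendering of the inconsistent SGD iteration and the elastic consistency condition, reconstructed from the description in the highlights. The exact norms, constants, and the precise forms of iterations (10) and (11) are as defined in the paper; in particular, the scaling of the bound with the step size is an assumption of this sketch.

```latex
% Inconsistent SGD step (schematic): node p computes its stochastic gradient on a
% possibly stale or perturbed view v_t^p of the true parameter vector x_t.
x_{t+1} = x_t - \eta_t \, \widetilde{\nabla} f\!\left(v_t^{p}\right)

% Elastic consistency (schematic): the expected distance between the true
% parameters and any node's view is bounded by a constant B times the step size.
\mathbb{E}\!\left[\, \bigl\| x_t - v_t^{p} \bigr\| \,\right] \;\le\; B \, \eta_t

% Under such a condition, the highlights state O(1/\sqrt{T}) convergence for the
% non-convex iterations in (10) and O(1/\sqrt{Tp}) for those in (11).
```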
Results
  • In the crash-prone case, elastic consistency implies new convergence bounds for crash or message-omission faults.
  • The authors will allow processes to start their forward pass before all layers are synchronized, as long as enough gradient norm has been received to ensure a small elastic consistency constant.
  • The elastic scheduling rule will allow the processor to start its forward pass before this point, on the inconsistent view, as long as the norm of the received update is at least a β-fraction of its own gradient at the step.
  • The authors can show that the elastic consistency constant B is upper bounded by O(M), since a processor cannot miss more than one gradient.
  • The variance-bounded scheduler suggests a way of improving the elastic consistency bounds for crash and message-drop faults and for asynchrony with delay τmax: instead of proceeding without the dropped messages, each node can replace the corresponding missing gradient with its own (both scheduling rules are sketched after this list).
  • The authors' framework improves in three respects: (i) it does not require stringent gradient sparsity assumptions; (ii) it is able to analyze the case where the updates are not unbiased estimators of the gradient, which allows extensions to error-feedback communication-reduction; and (iii) it tackles convergence for general non-convex objectives.
  • Qiao et al. (2019) model asynchrony and communication reduction as perturbations of the SGD iteration, and introduce a metric called “rework cost,” which can be subsumed into the elastic consistency bound.
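
As a concrete illustration of the two rules discussed above — the elastic scheduling rule and the variance-bounded fallback — here is a minimal sketch. It follows only the verbal description in the results; the function names, the `beta` threshold parameter, and the aggregation details are illustrative rather than taken from the authors' implementation.

```python
import numpy as np

def can_start_forward(received_updates, own_gradient, beta):
    """Elastic scheduling rule (sketch): allow the forward pass to start on a
    possibly inconsistent view once the norm of the updates received so far is
    at least a beta-fraction of the norm of this node's own current gradient."""
    if not received_updates:
        return False
    aggregated = np.sum(received_updates, axis=0)
    return np.linalg.norm(aggregated) >= beta * np.linalg.norm(own_gradient)

def average_with_self_substitution(received, own_gradient, num_nodes):
    """Variance-bounded fallback (sketch): instead of proceeding without dropped
    or delayed messages, substitute this node's own gradient for each missing
    peer gradient before averaging."""
    filled = dict(received)  # peer id -> gradient actually received this step
    for peer in range(num_nodes):
        filled.setdefault(peer, own_gradient)
    return np.mean(list(filled.values()), axis=0)
```

Both rules keep the deviation between a node's view and the true parameters small, which is what the elastic consistency constant measures.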
Conclusion
  • Karimireddy et al. (2019) analyze communication compression with error feedback, and present a general notion of δ-compressor to model communication-reduced consistency relaxations; the framework was later extended to include asynchronous iterations (Stich and Karimireddy 2019).
  • The authors' framework generalizes these results in one important practical respect, as it allows the analysis to be carried out in distributed settings: (Karimireddy et al 2019; Stich and Karimireddy 2019) assume that the iterations are performed at a single processor, which may compress gradients or view inconsistent information only with respect to its own earlier iterations (the error-feedback pattern is sketched below).
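
For context on the error-feedback framework cited above, the following is a minimal single-node sketch of error-feedback SGD with a top-k compressor (a standard instance of the δ-compressor idea). It reproduces the generic pattern from the literature rather than any algorithm specific to this paper, and all names are illustrative.

```python
import numpy as np

def top_k(v, k):
    """Keep only the k largest-magnitude coordinates (a common delta-compressor)."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def error_feedback_step(x, grad, error, lr, k):
    """One error-feedback step: compress the step-size-scaled gradient plus the
    carried-over compression error, apply the compressed update, and keep the
    residual so that no gradient information is permanently discarded."""
    corrected = lr * grad + error
    update = top_k(corrected, k)
    new_error = corrected - update
    return x - update, new_error
```

In the distributed setting discussed in the conclusion, each node would keep its own error buffer and apply aggregated, possibly inconsistent updates, which is where the elastic consistency bound enters.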
Tables
  • Table 1: Summary of elastic consistency bounds
Funding
  • This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 805223 ScaleML)
  • Bapi Chatterjee was supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 754411 (ISTPlus)
References
  • Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, volume 16, 265–283.
  • Aji, A. F.; and Heafield, K. 2017. Sparse Communication for Distributed Gradient Descent. In EMNLP, 440–445.
  • Alistarh, D.; De Sa, C.; and Konstantinov, N. 2018. The convergence of stochastic gradient descent in asynchronous shared memory. In PODC, 169–178.
  • Alistarh, D.; Grubic, D.; Li, J.; Tomioka, R.; and Vojnovic, M. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. In NIPS, 1709–1720.
  • Alistarh, D.; Hoefler, T.; Johansson, M.; Konstantinov, N.; Khirirat, S.; and Renggli, C. 2018. The convergence of sparsified gradient methods. In NIPS, 5977–5987.
  • Attiya, H.; and Welch, J. 2004. Distributed computing: fundamentals, simulations, and advanced topics, volume 19. John Wiley & Sons.
  • Bertsekas, D. P.; and Tsitsiklis, J. N. 1989. Parallel and distributed computation: numerical methods, volume 23. Prentice Hall, Englewood Cliffs, NJ.
  • Bubeck, S.; et al. 2015. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning 8(3–4): 231–357.
  • Chaturapruek, S.; Duchi, J. C.; and Ré, C. 2015. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don’t care. In NIPS, 1531–1539.
  • Chilimbi, T. M.; Suzue, Y.; Apacible, J.; and Kalyanaraman, K. 2014. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In OSDI, volume 14, 571–582.
  • Dean, J.; Corrado, G.; Monga, R.; Chen, K.; Devin, M.; Mao, M.; Senior, A.; Tucker, P.; Yang, K.; Le, Q. V.; et al. 2012. Large scale distributed deep networks. In NIPS, 1223–1231.
  • Ghadimi, S.; and Lan, G. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4): 2341–2368.
  • Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; and He, K. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677.
  • He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
  • Ho, Q.; Cipar, J.; Cui, H.; Lee, S.; Kim, J. K.; Gibbons, P. B.; Gibson, G. A.; Ganger, G.; and Xing, E. P. 2013. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, 1223–1231.
  • Jayarajan, A.; Wei, J.; Gibson, G.; Fedorova, A.; and Pekhimenko, G. 2019. Priority-based parameter propagation for distributed DNN training. arXiv preprint arXiv:1905.03960.
  • Karimireddy, S. P.; Rebjock, Q.; Stich, S. U.; and Jaggi, M. 2019. Error Feedback Fixes SignSGD and other Gradient Compression Schemes. In ICML, 3252–3261.
  • Ketkar, N. 2017. Introduction to PyTorch. In Deep Learning with Python, 195–208. Springer.
  • Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, University of Toronto.
  • Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, 1097–1105.
  • Leblond, R.; Pedregosa, F.; and Lacoste-Julien, S. 2017. ASAGA: Asynchronous Parallel SAGA. In AISTATS, 46–54.
  • Li, M.; Andersen, D. G.; Park, J. W.; Smola, A. J.; Ahmed, A.; Josifovski, V.; Long, J.; Shekita, E. J.; and Su, B.-Y. 2014. Scaling Distributed Machine Learning with the Parameter Server. In OSDI, volume 1, 3.
  • Lian, X.; Huang, Y.; Li, Y.; and Liu, J. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. In NIPS, 2737–2745.
  • Lian, X.; Zhang, C.; Zhang, H.; Hsieh, C.; Zhang, W.; and Liu, J. 2017. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. In NIPS, 5330–5340.
  • Lin, T.; Stich, S. U.; Patel, K. K.; and Jaggi, M. 2018a. Don’t Use Large Mini-Batches, Use Local SGD. arXiv preprint arXiv:1808.07217.
  • Lin, Y.; Han, S.; Mao, H.; Wang, Y.; and Dally, B. 2018b. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In ICLR.
  • Lu, Y.; Li, Z.; and De Sa, C. 2020. Towards Optimal Convergence Rate in Decentralized Stochastic Training. arXiv preprint arXiv:2006.08085.
  • Lu, Y.; Nash, J.; and De Sa, C. 2020. MixML: A Unified Analysis of Weakly Consistent Parallel Learning. arXiv preprint arXiv:2005.06706.
  • Nadiradze, G.; Markov, I.; Chatterjee, B.; Kungurtsev, V.; and Alistarh, D. 2020. Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent. arXiv preprint arXiv:2001.05918.
  • Nguyen, L. M.; Nguyen, P. H.; van Dijk, M.; Richtárik, P.; Scheinberg, K.; and Takác, M. 2018. SGD and Hogwild! Convergence Without the Bounded Gradients Assumption. In ICML, 3747–3755.
  • Peng, Y.; Zhu, Y.; Chen, Y.; Bao, Y.; Yi, B.; Lan, C.; Wu, C.; and Guo, C. 2019. A generic communication scheduler for distributed DNN training acceleration. In SOSP, 16–29.
  • Qiao, A.; Aragam, B.; Zhang, B.; and Xing, E. P. 2019. Fault Tolerance in Iterative-Convergent Machine Learning. In ICML, 5220–5230.
  • Recht, B.; Re, C.; Wright, S.; and Niu, F. 2011. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 693–701.
  • Robbins, H.; and Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics 22(3): 400–407.
  • De Sa, C.; Zhang, C.; Olukotun, K.; and Ré, C. 2015. Taming the Wild: A Unified Analysis of Hogwild-Style Algorithms. In NIPS, 2674–2682.
  • Seide, F.; Fu, H.; Droppo, J.; Li, G.; and Yu, D. 2014. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In INTERSPEECH, 1058–1062.
  • Sergeev, A.; and Del Balso, M. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799.
  • Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587): 484–489.
  • Stich, S. U. 2018. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767.
  • Stich, S. U.; Cordonnier, J.; and Jaggi, M. 2018. Sparsified SGD with Memory. In NIPS, 4452–4463.
  • Stich, S. U.; and Karimireddy, S. P. 2019. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350.
  • Strom, N. 2015. Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH.
  • Wang, J.; and Joshi, G. 2018. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576.
  • Wangni, J.; Wang, J.; Liu, J.; and Zhang, T. 2018. Gradient sparsification for communication-efficient distributed optimization. In NIPS, 1306–1316.
  • Woodworth, B. E.; Wang, J.; Smith, A.; McMahan, B.; and Srebro, N. 2018. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In NIPS, 8496–8506.
  • Xing, E. P.; Ho, Q.; Dai, W.; Kim, J. K.; Wei, J.; Lee, S.; Zheng, X.; Xie, P.; Kumar, A.; and Yu, Y. 2015. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data 1(2): 49–67.
  • Zagoruyko, S.; and Komodakis, N. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146.
  • Zhang, S.; Choromanska, A. E.; and LeCun, Y. 2015. Deep learning with elastic averaging SGD. In NIPS, 685–693.
Authors
Giorgi Nadiradze
Ilia Markov
Bapi Chatterjee
Vyacheslav Kungurtsev