# Linear Convergent Decentralized Optimization with Compression

International Conference on Learning Representations (ICLR), 2021.

Abstract:

Communication compression has become a key strategy to speed up distributed optimization. However, existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms. They are unsatisfactory in terms of convergence rate, stability, and the capability to handle heterogeneous data. Motivated by primal-dual al...

Introduction

- In the decentralized optimization literature, it has been proved that primal-dual algorithms can achieve faster convergence rates and better support heterogeneous data (Ling et al, 2015; Shi et al, 2015; Li et al, 2019; Yuan et al, 2020).
- To the best of the authors' knowledge, LEAD is the first linearly convergent decentralized algorithm with compression.
- Combining CHOCO-Gossip and D-PSGD leads to a decentralized algorithm with compression, CHOCO-SGD, which converges sublinearly under strong convexity and bounded-gradient assumptions.

Highlights

- Distributed optimization solves the following optimization problem with $n$ computing agents and a communication network:

  $$x^* := \arg\min_{x \in \mathbb{R}^d} \Big[ f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x) \Big] \qquad (1)$$

- We delineate two key challenges in the algorithm design for communication compression in decentralized optimization, i.e., data heterogeneity and compression error, and motivated by primal-dual algorithms, we propose a novel decentralized algorithm with compression, LEAD
- We prove that for LEAD, a constant stepsize in the range (0, 2/(μ + L)] is sufficient to ensure linear convergence for strongly convex and smooth objective functions
- We propose to carefully control the compression error by difference compression and error compensation such that the inexact dual update (Line 6) and primal update (Line 7) can still guarantee the convergence as proved in Section 4
- We investigate the communication compression in decentralized optimization
- The nontrivial analyses on the coupled dynamics of inexact primal and dual updates as well as compression error establish the linear convergence of LEAD when full gradient is used and the linear convergence to the $O(\sigma^2)$ neighborhood of the optimum when stochastic gradient is used
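
The difference compression idea named in the highlights can be illustrated with a minimal sketch. This is not the LEAD algorithm itself: the `top_k` compressor, the running estimate `h`, and all helper names are hypothetical, and top-k is chosen only because it makes the example deterministic. The sketch shows that compressing the difference between a vector and a running estimate lets the compression error shrink as the estimate catches up, whereas compressing the vector directly leaves the same error every round.

```python
def top_k(vec, k):
    """Hypothetical compressor: keep the k largest-magnitude entries, zero the rest."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return [vec[i] if i in keep else 0.0 for i in range(len(vec))]


def sq_error(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def difference_compression(x, rounds, k=1):
    """Repeatedly send a compressed difference x - h and add it to the estimate h."""
    h = [0.0] * len(x)  # estimate shared by sender and receiver
    errors = []
    for _ in range(rounds):
        q = top_k([xi - hi for xi, hi in zip(x, h)], k)  # the only thing transmitted
        h = [hi + qi for hi, qi in zip(h, q)]            # both sides update the estimate
        errors.append(sq_error(x, h))
    return h, errors


x = [4.0, -3.0, 2.0, 1.0]

# Direct compression: the same entries are dropped every round.
direct_error = sq_error(x, top_k(x, 1))

# Difference compression: the residual shrinks round by round.
h, diff_errors = difference_compression(x, rounds=4)

print(direct_error)  # 14.0
print(diff_errors)   # [14.0, 5.0, 1.0, 0.0]
```

In this toy setting the vector being tracked is fixed; in an optimization loop it would be the slowly changing iterate, which is why the difference tends to become small near convergence.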

Results

- With the establishment of how consensus leads to convergence, the obstacle becomes how to achieve consensus under local communication and compression challenges.
- Primal-dual algorithms or gradient tracking algorithms can handle the data heterogeneity issue to achieve a much faster convergence rate than DGD-type algorithms, as introduced in Section 2.
- Even if the compression error exists, LEAD essentially compensates for the error in the inexact dual update (Line 6), making the algorithm more stable and robust.
- The analysis of CHOCO-SGD relies on the bounded gradient assumption, i.e., $\|\nabla f_i(x)\|^2 \le G$, which is restrictive because it conflicts with strong convexity, while LEAD does not need this assumption.
- The proposed LEAD is compared with QDGD (Reisizadeh et al, 2019a), DeepSqueeze (Tang et al, 2019a), CHOCO-SGD (Koloskova et al, 2019), and two non-compressed algorithms DGD (Yuan et al, 2016) and NIDS (Li et al, 2019).
- CHOCO-SGD, DeepSqueeze, and LEAD outperform the non-compressed variants in terms of communication efficiency, but CHOCO-SGD and DeepSqueeze require more parameter tuning because their convergence is sensitive to the parameter setting.
- Note that in this setting, sufficient information exchange is more important for convergence because models from different agents are moving to significantly diverse directions.
- DGD only converges with a smaller stepsize, and its communication-compressed variants, including QDGD, DeepSqueeze, and CHOCO-SGD, diverge in all parameter settings the authors tried.

Conclusion

- Motivated by primal-dual algorithms, a novel decentralized algorithm with compression, LEAD, is proposed to achieve a faster convergence rate and to better handle heterogeneous data while enjoying the benefit of efficient communication.
- The nontrivial analyses on the coupled dynamics of inexact primal and dual updates as well as compression error establish the linear convergence of LEAD when full gradient is used and the linear convergence to the $O(\sigma^2)$ neighborhood of the optimum when stochastic gradient is used.
- LEAD is applicable to non-convex problems, as empirically verified in the neural network experiments, but the authors leave the non-convex analysis as future work.
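
To make the phrase "linear convergence to a neighborhood of the optimum" concrete, here is a small sketch using plain single-agent SGD on a scalar quadratic, not LEAD. The function name `sgd_tail_error`, the objective $f(x) = \tfrac{1}{2}x^2$, and all parameter values are assumptions made for illustration. With a constant stepsize, the iterate contracts geometrically until gradient noise dominates, and the size of the remaining error floor scales with the noise variance $\sigma^2$.

```python
import random


def sgd_tail_error(sigma, step=0.1, iters=5000, tail=1000, seed=0):
    """Run constant-stepsize SGD on f(x) = 0.5 * x^2 with Gaussian gradient noise.

    Returns the mean squared distance to the optimum over the last `tail` iterates.
    """
    rng = random.Random(seed)  # fixed seed so runs with different sigma share noise draws
    x, x_star = 10.0, 0.0
    sq_errors = []
    for _ in range(iters):
        noisy_grad = (x - x_star) + rng.gauss(0.0, sigma)  # exact gradient plus noise
        x -= step * noisy_grad
        sq_errors.append((x - x_star) ** 2)
    return sum(sq_errors[-tail:]) / tail


floor_exact = sgd_tail_error(sigma=0.0)  # full gradient: error decays to (numerically) zero
floor_small = sgd_tail_error(sigma=0.1)
floor_large = sgd_tail_error(sigma=1.0)

# Increasing sigma by 10x raises the error floor by about 100x, i.e. it scales with sigma^2.
ratio = floor_large / floor_small
print(floor_exact, floor_small, floor_large, ratio)
```

The same qualitative picture, exact convergence with full gradients and a variance-sized floor with stochastic gradients, is what the conclusion describes for LEAD, although the paper's constants and analysis are not reproduced here.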

Related work

- Decentralized optimization can be traced back to the work by Tsitsiklis et al (1986). DGD (Nedic & Ozdaglar, 2009) is the most classical decentralized algorithm. It is intuitive and simple but converges slowly due to the diminishing stepsize that is needed to obtain the optimal solution (Yuan et al, 2016). Its stochastic version D-PSGD (Lian et al, 2017) has been shown effective for training nonconvex deep learning models. Algorithms based on primal-dual formulations or gradient tracking are proposed to eliminate the convergence bias in DGD-type algorithms and improve the convergence rate, such as D-ADMM (Mota et al, 2013), DLM (Ling et al, 2015), EXTRA (Shi et al, 2015), NIDS (Li et al, 2019), D² (Tang et al, 2018b), Exact Diffusion (Yuan et al, 2018), OPTRA (Xu et al, 2020), DIGing (Nedic et al, 2017), GSGT (Pu & Nedic, 2020), etc.
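
The convergence bias of DGD with a constant stepsize, and its removal by a bias-correcting method, can be seen in a two-agent toy problem. The sketch below is only an illustration under assumed settings: the local objectives $f_i(x) = \tfrac{1}{2}(x - b_i)^2$, the mixing matrix `W`, the stepsize, and the function names are all hypothetical choices. The EXTRA recursion follows the form given by Shi et al (2015), with $\tilde{W} = (I + W)/2$.

```python
def mix(W, x):
    """One gossip step: agent i forms a weighted average of all agents' values."""
    n = len(x)
    return [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]


def grads(x, b):
    """Local gradients of f_i(x) = 0.5 * (x - b_i)^2 at each agent's own copy."""
    return [xi - bi for xi, bi in zip(x, b)]


def dgd(W, b, step, iters):
    """DGD with a constant stepsize: x <- W x - step * grad."""
    x = [0.0] * len(b)
    for _ in range(iters):
        g = grads(x, b)
        x = [wxi - step * gi for wxi, gi in zip(mix(W, x), g)]
    return x


def extra(W, b, step, iters):
    """EXTRA: x2 = (I + W) x1 - W_tilde x0 - step * (grad(x1) - grad(x0))."""
    n = len(b)
    W_tilde = [[(W[i][j] + (1.0 if i == j else 0.0)) / 2 for j in range(n)]
               for i in range(n)]
    x_prev = [0.0] * n
    g_prev = grads(x_prev, b)
    # The first EXTRA step coincides with a DGD step.
    x = [wxi - step * gi for wxi, gi in zip(mix(W, x_prev), g_prev)]
    for _ in range(iters - 1):
        g = grads(x, b)
        wx, wtx = mix(W, x), mix(W_tilde, x_prev)
        x_next = [xi + wxi - wtxi - step * (gi - gpi)
                  for xi, wxi, wtxi, gi, gpi in zip(x, wx, wtx, g, g_prev)]
        x_prev, g_prev, x = x, g, x_next
    return x


W = [[0.5, 0.5], [0.5, 0.5]]  # two fully connected agents
b = [0.0, 10.0]               # heterogeneous local minimizers
x_star = sum(b) / len(b)      # minimizer of the average objective: 5.0

x_dgd = dgd(W, b, step=0.1, iters=300)
x_extra = extra(W, b, step=0.1, iters=300)
print(x_dgd)    # agents settle near 4.55 and 5.45, not at 5.0
print(x_extra)  # both agents are close to 5.0
```

With identical local data (`b = [5.0, 5.0]`) the DGD bias disappears, which is why heterogeneity is the case that separates the two families of methods.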

Recently, communication compression was applied to decentralized settings by Tang et al (2018a), who propose two algorithms, DCD-SGD and ECD-SGD, which require high-accuracy compression and are not stable under aggressive compression. Reisizadeh et al (2019a;b) introduce QDGD and QuanTimed-DSGD, which achieve exact convergence with a small stepsize, but the convergence is slow. DeepSqueeze (Tang et al, 2019a) carries the compression error over into the compression step of the next iteration. Motivated by quantized average consensus algorithms such as that of Carli et al (2010), the quantized gossip algorithm CHOCO-Gossip (Koloskova et al, 2019) converges linearly to the consensual solution. Combining CHOCO-Gossip and D-PSGD leads to a decentralized algorithm with compression, CHOCO-SGD, which converges sublinearly under strong convexity and bounded-gradient assumptions. Its nonconvex variant is further provided in Koloskova et al (2020). A new compression scheme using the modulo operation is introduced by Lu & De Sa (2020) for decentralized optimization.
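
Error compensation, where the error made by the compressor in one iteration is added back before compressing in the next, can be sketched generically as follows. This is not DeepSqueeze's actual update rule; the `top_k` compressor, the function `error_feedback_stream`, and the sample `updates` are hypothetical and serve only to show the bookkeeping.

```python
def top_k(vec, k):
    """Hypothetical compressor: keep the k largest-magnitude entries, zero the rest."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return [vec[i] if i in keep else 0.0 for i in range(len(vec))]


def error_feedback_stream(vectors, k=1):
    """Compress a sequence of vectors, carrying each round's error into the next round."""
    error = [0.0] * len(vectors[0])
    sent = []
    for v in vectors:
        corrected = [vi + ei for vi, ei in zip(v, error)]     # add back what was dropped
        q = top_k(corrected, k)                               # transmitted message
        error = [ci - qi for ci, qi in zip(corrected, q)]     # what is still owed
        sent.append(q)
    return sent, error


updates = [[3.0, 1.0, 0.0], [0.0, 2.0, 1.0], [1.0, 0.0, 2.0]]
sent, residual = error_feedback_stream(updates)

total_sent = [sum(col) for col in zip(*sent)]
total_true = [sum(col) for col in zip(*updates)]
print(total_sent)  # [3.0, 3.0, 3.0]
print(residual)    # [1.0, 0.0, 0.0]
print(total_true)  # [4.0, 3.0, 3.0]
```

Nothing is lost permanently: the sum of the transmitted messages plus the final residual equals the sum of the original updates, so dropped information is only delayed.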

Reference

- Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.
- Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. SIGNSGD: compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning, pp. 559–568, 2018.
- Ruggero Carli, Fabio Fagnani, Paolo Frasca, and Sandro Zampieri. Gossip consensus algorithms via quantized communication. Automatica, 46(1):70–80, 2010.
- Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Urban Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In Proceedings of the 36th International Conference on Machine Learning, pp. 3252–3261. PMLR, 2019.
- Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In Proceedings of the 36th International Conference on Machine Learning, pp. 3479–3487. PMLR, 2019.
- Anastasia Koloskova, Tao Lin, Sebastian U Stich, and Martin Jaggi. Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations, 2020.
- Yao Li and Ming Yan. On linear convergence of two decentralized algorithms. arXiv preprint arXiv:1906.07225, 2019.
- Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17):4494–4506, 2019.
- Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5330–5340, 2017.
- Qing Ling, Wei Shi, Gang Wu, and Alejandro Ribeiro. DLM: Decentralized linearized alternating direction method of multipliers. IEEE Transactions on Signal Processing, 63(15):4051–4064, 2015.
- Xiaorui Liu, Yao Li, Jiliang Tang, and Ming Yan. A double residual compression algorithm for efficient distributed learning. The 23rd International Conference on Artificial Intelligence and Statistics, 2020.
- Yucheng Lu and Christopher De Sa. Moniqua: Modulo quantized communication in decentralized SGD. In Proceedings of the 37th International Conference on Machine Learning, 2020.
- Konstantin Mishchenko, Eduard Gorbunov, Martin Takac, and Peter Richtarik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
- Joao FC Mota, Joao MF Xavier, Pedro MQ Aguiar, and Markus Puschel. D-ADMM: A communication-efficient distributed algorithm for separable optimization. IEEE Transactions on Signal Processing, 61(10):2718–2723, 2013.
- Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
- Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
- Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
- Shi Pu and Angelia Nedic. Distributed stochastic gradient tracking methods. Mathematical Programming, pp. 1–49, 2020.
- Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. An exact quantized decentralized gradient descent algorithm. IEEE Transactions on Signal Processing, 67(19): 4934–4947, 2019a.
- Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. Robust and communication-efficient collaborative learning. In Advances in Neural Information Processing Systems, pp. 8388–8399, 2019b.
- Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014, September 2014.
- Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
- Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 4452–4463, 2018.
- Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. In Advances in Neural Information Processing Systems, pp. 7652–7662, 2018a.
- Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D2: Decentralized training over decentralized data. In Proceedings of the 35th International Conference on Machine Learning, pp. 4848–4856, 2018b.
- Hanlin Tang, Xiangru Lian, Shuang Qiu, Lei Yuan, Ce Zhang, Tong Zhang, and Ji Liu. DeepSqueeze: Decentralization meets error-compensated compression. CoRR, abs/1907.07346, 2019a. URL http://arxiv.org/abs/1907.07346.
- Hanlin Tang, Chen Yu, Xiangru Lian, Tong Zhang, and Ji Liu. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In Proceedings of the 36th International Conference on Machine Learning, pp. 6155–6165, 2019b.
- John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.
- Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. In Proceedings of the 35th International Conference on Machine Learning, pp. 5325–5333, 2018.
- Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004.
- Jinming Xu, Ye Tian, Ying Sun, and Gesualdo Scutari. Accelerated primal-dual algorithms for distributed smooth convex optimization over networks. In International Conference on Artificial Intelligence and Statistics, pp. 2381–2391. PMLR, 2020.
- Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
- Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H Sayed. Exact diffusion for distributed optimization and learning—part i: Algorithm development. IEEE Transactions on Signal Processing, 67(3):708–723, 2018.
- Kun Yuan, Wei Xu, and Qing Ling. Can primal methods outperform primal-dual methods in decentralized dynamic optimization? arXiv preprint arXiv:2003.00816, 2020.
