# New Bounds For Distributed Mean Estimation and Variance Reduction

ICLR, (2021)

Abstract

We consider the problem of distributed mean estimation (DME), in which $n$ machines are each given a local $d$-dimensional vector $x_v \in \mathbb{R}^d$, and must cooperate to estimate the mean of their inputs $\mu = \frac{1}{n}\sum_{v=1}^{n} x_v$, while minimizing total communication cost. DME is a fundamental construct in distributed machine learning, and there has been considerable...

Introduction

- Several problems in distributed machine learning and optimization can be reduced to variants of the distributed mean estimation problem, in which $n$ machines must cooperate to jointly estimate the mean of their $d$-dimensional inputs, $\mu = \frac{1}{n}\sum_{v=1}^{n} x_v$, as closely as possible while minimizing communication.
- The ideal output would be the mean of all machines' inputs.
- While variants of these fundamental problems have been considered since seminal work by Tsitsiklis & Luo (1987), the task has seen renewed attention recently in the context of distributed machine learning.
- A trade-off arises between the number of bits sent and the variance added by quantization.
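
The bits-versus-variance trade-off can be illustrated with a minimal sketch. It assumes a simple unbiased stochastic rounding onto a one-dimensional grid, which is a generic illustration and not the scheme from the paper; the helper names `stochastic_round` and `mean_and_variance` are hypothetical.

```python
import math
import random

def stochastic_round(x, step, rng):
    """Round x to a multiple of `step` so that the expectation equals x."""
    low = math.floor(x / step) * step
    p_up = (x - low) / step  # chance of rounding up to low + step
    return low + step if rng.random() < p_up else low

def mean_and_variance(x, step, trials, rng):
    """Empirical mean and variance of repeated stochastic rounding of x."""
    samples = [stochastic_round(x, step, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    variance = sum((s - mean) ** 2 for s in samples) / trials
    return mean, variance

rng = random.Random(0)
coarse = mean_and_variance(0.3, 1.0, 20000, rng)   # coarse grid: fewer bits
fine = mean_and_variance(0.3, 0.25, 20000, rng)    # 4x finer grid: 2 more bits
print("coarse grid (mean, variance):", coarse)
print("fine grid   (mean, variance):", fine)
```

Both grids give an estimate whose mean is close to the input, but the finer grid, which costs two extra bits per coordinate, adds noticeably less variance.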

Highlights

- Several problems in distributed machine learning and optimization can be reduced to variants of the distributed mean estimation problem, in which $n$ machines must cooperate to jointly estimate the mean of their $d$-dimensional inputs, $\mu = \frac{1}{n}\sum_{v=1}^{n} x_v$, as closely as possible while minimizing communication.
- This construct is often used for distributed variance reduction: here, each machine receives as input an independent probabilistic estimate of a $d$-dimensional vector $\nabla$, and the aim is for all machines to output a common estimate of $\nabla$ with lower variance than the individual inputs, while minimizing communication.
- Variance reduction is a key component in data-parallel distributed stochastic gradient descent (SGD), the standard way to parallelize the training of deep neural networks, e.g. Bottou (2010); Abadi et al (2016), where it is used to estimate the average of gradient updates obtained in parallel at the nodes
- We provide the first bounds for distributed mean estimation and variance reduction which are still tight when inputs are not centered around the origin
- We have argued in this work that for the problems of distributed mean estimation and variance reduction, one should measure the output variance in terms of the input variance, rather than the input norm as used by previous works
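
As a rough illustration of why averaging is the target in variance reduction, the sketch below simulates machines that each hold an independent noisy estimate of the same vector and compares the error of a single estimate with the error of the exact average. The Gaussian noise model, the parameter values, and the function names are assumptions made for the example, and no communication constraint is modelled.

```python
import random

def average(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def simulate(true_grad, n_machines, sigma, trials, rng):
    """Mean squared error of one machine's estimate and of the exact average."""
    single, averaged = 0.0, 0.0
    for _ in range(trials):
        estimates = [[g + rng.gauss(0.0, sigma) for g in true_grad]
                     for _ in range(n_machines)]
        single += squared_distance(estimates[0], true_grad)
        averaged += squared_distance(average(estimates), true_grad)
    return single / trials, averaged / trials

rng = random.Random(0)
single_err, averaged_err = simulate([5.0, -3.0, 2.0, 0.5], 8, 1.0, 2000, rng)
print(single_err, averaged_err)  # averaged error is roughly single_err / 8
```

With independent estimates, the exact average reduces the variance by a factor of about the number of machines; a communication-efficient scheme aims to get close to this while sending few bits.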

Results

- The authors argue that it is both stronger and more natural to bound output variance in terms of input variance, rather than squared norm.
- Lattices are subgroups of $\mathbb{R}^d$ consisting of the integer combinations of a set of basis vectors.
- It is well-known (Minkowski, 1911) that certain lattices have desirable properties for covering and packing Euclidean space, and lattices have been previously used for some other applications of quantization (see, e.g., Gibson & Sayood (1988)), though mostly only in low dimension.
- By choosing an appropriate family of lattices, the authors show that any vector in $\mathbb{R}^d$ can be rounded to a nearby lattice point, and that there are not too many nearby lattice points, so the correct one can be specified using few bits.
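
The idea of naming a nearby lattice point with few bits can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: it uses the cubic lattice $\epsilon\mathbb{Z}^d$, deterministic nearest-point rounding (so it is not unbiased), and a per-coordinate "index mod $q$" message; the names `encode`, `decode`, `eps`, and `q` are hypothetical. Decoding succeeds when the receiver's own vector lies within roughly $q\epsilon/2$ of the sender's lattice point in every coordinate.

```python
def encode(x, eps, q):
    """Round each coordinate to the lattice eps*Z and keep only its index mod q."""
    return [round(xi / eps) % q for xi in x]

def decode(colors, y, eps, q):
    """For each coordinate, pick the lattice index congruent to the received
    value mod q that lies closest to the receiver's own coordinate of y."""
    decoded = []
    for c, yi in zip(colors, y):
        t = yi / eps                      # receiver's value in lattice units
        k = c + q * round((t - c) / q)    # nearest index with k % q == c
        decoded.append(k * eps)
    return decoded

x = [100.37, -42.10, 7.93]   # sender's input
y = [100.90, -41.50, 8.60]   # receiver's input: close to x, far from the origin
eps, q = 0.25, 8             # log2(q) = 3 bits per coordinate
message = encode(x, eps, q)
print(message)
print(decode(message, y, eps, q))
```

The message size depends only on `q`, that is, on how far apart the two inputs can be relative to the lattice spacing, and not on how far the inputs are from the origin.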

Conclusion

- The authors have argued in this work that for the problems of distributed mean estimation and variance reduction, one should measure the output variance in terms of the input variance, rather than the input norm as used by previous works
- Through this change in perspective, the authors have shown optimal algorithms, and matching lower bounds, for both problems, independently of the norms of the input vectors.
- The authors plan to explore practical applications for variants of the schemes, for instance in the context of federated or decentralized distributed learning

Related work

- Several recent works consider efficient compression schemes for stochastic gradients, e.g. Seide et al (2014b); Wang et al (2018); Alistarh et al (2017; 2018); Stich et al (2018); Wen et al (2017); Wangni et al (2018); Lu & De Sa (2020). We emphasize that these works consider a related, but different problem: they usually rely on assumptions on the input structure, such as second-moment bounds on the gradients, and are evaluated primarily on the practical performance of SGD, rather than isolating the variance-reduction step. (In some cases, these schemes also rely on history/error-correction (Aji & Heafield, 2017; Dryden et al, 2016; Alistarh et al, 2018; Stich et al, 2018).) As a result, they do not provide theoretical bounds on the problems we consider. In this sense, our work is closer to Suresh et al (2017); Konečný & Richtárik (2018); Gandikota et al (2019), which focus primarily on the distributed mean estimation problem, and only use SGD as one of many potential applications.

For example, QSGD (Alistarh et al, 2017) considers a similar problem to VarianceReduction; the major difference is that coordinates of the input vectors are assumed to be specified by 32-bit floats, rather than arbitrary real values. Hence, transmitting input vectors exactly already requires only O(d) bits. They therefore focus on reducing the constant factor (and thereby improving practical performance for SGD), rather than providing asymptotic results on communication cost. They show that the expected number of bits per entry can be reduced from 32 to 2.8, at the expense of having an output variance bound in terms of input norm rather than input variance.
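
To make the "variance in terms of input norm" point concrete, here is a minimal sketch of a QSGD-style stochastic quantizer: each coordinate is rounded to one of `s` evenly spaced levels between 0 and the vector's 2-norm. It omits QSGD's encoding step, and the function names and test vectors are assumptions for illustration. Shifting a vector away from the origin keeps its spread but inflates its norm, and with it the quantization error.

```python
import math
import random

def qsgd_quantize(v, s, rng):
    """QSGD-style stochastic quantization with s levels per sign.
    Each coordinate becomes norm * sign * (level / s), unbiased in expectation."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    if norm == 0.0:
        return [0.0] * len(v)
    out = []
    for vi in v:
        scaled = min(s * abs(vi) / norm, s)   # position in [0, s]
        level = math.floor(scaled)
        if rng.random() < scaled - level:     # round up with the leftover mass
            level += 1
        out.append(math.copysign(norm * level / s, vi))
    return out

def mean_squared_error(v, s, trials, rng):
    """Average squared 2-norm error of the quantizer on v."""
    total = 0.0
    for _ in range(trials):
        q = qsgd_quantize(v, s, rng)
        total += sum((qi - vi) ** 2 for qi, vi in zip(q, v))
    return total / trials

rng = random.Random(0)
centered = [0.5, -0.2, 0.1]
shifted = [vi + 100.0 for vi in centered]   # same spread, much larger norm
print(mean_squared_error(centered, 4, 500, rng))
print(mean_squared_error(shifted, 4, 500, rng))
```

The second error is far larger than the first even though the two inputs differ only by a constant shift, which is the behaviour that input-variance bounds are meant to avoid.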

Study subjects and analysis

- Experiments compare RLQSGD (cubic lattice) with QSGD on 8 parallel workers; the authors report that the results are similar when this parameter is varied.
- Example 2, Local SGD: convergence under different quantization schemes, and quantization error, where the authors report that their method provides significantly lower quantization error across iterations.
- Distributed power iteration on 8 parallel workers: input norms (left), convergence (center), and quantization error (right).
- Figure 4

Reference

- Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
- Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009. doi: 10.1137/060673096.
- Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 440–445, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1045. URL https://www.aclweb.org/anthology/D17-1045.
- Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.
- Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pp. 5973–5983, 2018.
- Tal Ben-Nun and Torsten Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys (CSUR), 52(4):1–43, 2019.
- Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186.
- Mark Braverman, Ankit Garg, Tengyu Ma, Huy L Nguyen, and David P Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the 48th Annual ACM symposium on Theory of Computing (STOC 2016), pp. 1011–1020, 2016.
- Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- Nikoli Dryden, Tim Moon, Sam Ade Jacobs, and Brian Van Essen. Communication quantization for data-parallel training of deep neural networks. In 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC), pp. 1–8. IEEE, 2016.
- Venkata Gandikota, Raj Kumar Maity, and Arya Mazumdar. vqsgd: Vector quantized stochastic gradient descent. arXiv preprint arXiv:1911.07971, 2019.
- Jerry D. Gibson and Khalid Sayood. Lattice quantization. In Advances in Electronics and Electron Physics, volume 72, pp. 259 – 330. Academic Press, 1988. doi: https://doi.org/10.1016/S0065-2539(08)60560-0.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Martin Henk. A note on lattice packings via lattice refinements. Experimental Mathematics, 27:1–9, 09 2016. doi: 10.1080/10586458.2016.1208595.
- K. J. Horadam. Hadamard Matrices and Their Applications. Princeton University Press, 2007. ISBN 9780691119212. URL http://www.jstor.org/stable/j.ctt7t6pw.
- Jeremy Howard. imagenette. URL https://github.com/fastai/imagenette/.
- Emilien Joly, Gábor Lugosi, Roberto Imbuzeiro Oliveira, et al. On the estimation of the mean of a random vector. Electronic Journal of Statistics, 11(1):440–451, 2017.
- Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
- S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In Proc. International Conference on Machine Learning (ICML), 2019.
- Frederik Künstner. Fully quantized distributed gradient descent. 2017. URL http://infoscience.epfl.ch/record/234548.
- Jakub Konečný and Peter Richtárik. Randomized distributed mean estimation: Accuracy vs. communication. Frontiers in Applied Mathematics and Statistics, 4:62, 2018. ISSN 22974687. doi: 10.3389/fams.2018.00062. URL https://www.frontiersin.org/article/10.3389/fams.2018.00062.
- A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. http://www.cs.toronto.edu/~kriz/cifar.html, 2014.
- Yucheng Lu and Christopher De Sa. Moniqua: Modulo quantized communication in decentralized SGD. In Proc. International Conference on Machine Learning (ICML), 2020.
- Prathamesh Mayekar and Himanshu Tyagi. Ratq: A universal fixed-length quantizer for stochastic optimization. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
- Hermann Minkowski. Gesammelte Abhandlungen. Teubner, 1911.
- Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
- Ali Ramezani-Kebrya, Fartash Faghri, and Daniel M Roy. Nuqsgd: Improved communication efficiency for data-parallel sgd via nonuniform quantization. arXiv preprint arXiv:1908.06077, 2019.
- C. A. Rogers. A note on coverings and packings. Journal of the London Mathematical Society, s1-25(4):327–331, 1950. doi: 10.1112/jlms/s1-25.4.327. URL https://londmathsoc.onlinelibrary.wiley.com/doi/abs/10.1112/jlms/s1-25.4.327.
- F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proc. INTERSPEECH, 2014a.
- Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech dnns. In Interspeech 2014, September 2014b.
- Sebastian U Stich. Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
- Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory. In Advances in Neural Information Processing Systems, pp. 4447–4458, 2018.
- Ananda Theertha Suresh, Felix X Yu, Sanjiv Kumar, and H Brendan McMahan. Distributed mean estimation with limited communication. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3329–3337, 2017.
- John N Tsitsiklis and Zhi-Quan Luo. Communication complexity of convex optimization. Journal of Complexity, 3(3):231–243, 1987.
- Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. Powersgd: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems, pp. 14236–14245, 2019.
- Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright. Atomo: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, pp. 9850–9861, 2018.
- Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pp. 1299–1309, 2018.
- Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 1508–1518, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
- $\mathrm{Vol}(B_{\delta + r_p}) / \mathrm{Vol}(B_{r_p})$
- $\mathrm{Vol}(B_{r_1}) / \mathrm{Vol}(B_{r_2})$
- $\mathrm{Vol}(B_{\delta - r_c}) / \mathrm{Vol}(B_{r_c})$
- Proof. For the $\ell_1$ or $\ell_2$ norm (or indeed any $\ell_p$ norm, $p \ge 1$), see Rogers (1950), or Proposition 1.1 of Henk (2016). Under the $\ell_\infty$ norm, the standard cubic lattice (i.e. with the standard basis as lattice basis) clearly has this property (in fact, $r_c = r_p$).
- 1. Therefore, we have a positive probability that c satisfies the criteria of the lemma, so by the probabilistic method, such a good coloring must exist. Vol(B7 ) V ol(Bq2 )
- In this case, $y$ is either $z$, or is outside $B_{r_3}(\lambda_v)$. If $x_u - x_v$
- Using the cubic lattice, though, need not sacrifice too much by way of theoretical guarantees under the $\ell_2$ norm, since, as noted in Suresh et al. (2017), a random rotation using the Walsh-Hadamard transform can ensure good bounds on the ratio between the $\ell_2$ norm and $\ell_\infty$ norm of vectors.
- Let $H$ be the $d \times d$ normalized Hadamard matrix $H_{i,j} = d^{-1/2}(-1)^{\langle i-1,\, j-1 \rangle}$, where $\langle i, j \rangle$ is the dot-product of the $\log_2 d$-dimensional $\{0,1\}$-valued vectors given by $i$, $j$ expressed in binary (we must assume here that $d$ is a power of two, but this does not affect asymptotic results). We use the following well-known properties of $H$ (see e.g. Horadam (2007)):
- Before applying our MeanEstimation or VarianceReduction algorithms, we apply the transformation $HD$ to all inputs; we then invert the transform (i.e. apply $(HD)^{-1} = D^{-1}H$) before final output. As shown in Ailon & Chazelle (2009), both the forward and inverse transform require only $O(d \log d)$ computation.
- Proof. We follow a similar argument to Ailon & Chazelle (2009). Fix some vector x ∈ S
- (2009)), we therefore obtain $\Pr\left[|(HDx)_j| \ge s\|x\|_2\right] \le 2e^{-s^2 d/2}$. Plugging in $s = 2\sqrt{\frac{\ln nd}{d}}$
- , when $B_v$ is the string received. For any $B_v$, $\mathrm{OUT}_{B_v}$ has volume at most $\mathrm{Vol}(B_{2\delta})$, since the $\delta$-balls of any two points in $\mathrm{OUT}_{B_v}$ must intersect (as the probabilities of EST falling within the balls sum to more than 1). We further denote $\mathrm{OUT}_v$ to be the union of these sets over all $B_v$. Then: $\mathrm{Vol}(\mathrm{OUT}_v) \le$
- Vol(OU TBv ) < 2b · V ol(B2δ) ≤ 2b
- Vol(B y ).
- , and see that: Vol(OU Tv) < 2b
- Vol(B y ) = 2b · 2−b · V ol(B y ) = V ol(B y ).
- $\mathrm{Vol}(\mathrm{OUT}_v) < 2^{-d}\,\mathrm{Vol}(B_\sigma)$.
- Vol(OU Tv)
- F Sublinear Communication: $\mathrm{Vol}(\mathrm{Vor}^+(0)) / \mathrm{Vol}(\mathrm{Vor}(0))$
- Therefore, by Markov's inequality, the probability of falling within at least $(1+2q)^{2d}$ expanded Voronoi regions is at most $(1+2q)^{-d}$.
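
The random rotation $HD$ mentioned in the fragments above can be sketched with a fast Walsh-Hadamard transform. This is a minimal illustration, assuming $D$ is a diagonal matrix of random $\pm 1$ signs (as in Ailon & Chazelle (2009)) and that $d$ is a power of two; the function names `fwht`, `rotate`, and `unrotate` are hypothetical. Because the normalized $H$ is symmetric and orthogonal and $D$ is its own inverse, the inverse rotation applies $H$ and then $D$.

```python
import math
import random

def fwht(vec):
    """Normalized Walsh-Hadamard transform in O(d log d).
    The length of vec must be a power of two."""
    a = list(vec)
    d = len(a)
    h = 1
    while h < d:
        for start in range(0, d, 2 * h):
            for i in range(start, start + h):
                u, w = a[i], a[i + h]
                a[i], a[i + h] = u + w, u - w   # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(d)
    return [v * scale for v in a]

def rotate(x, signs):
    """Apply HD: flip signs with D, then apply H."""
    return fwht([s * v for s, v in zip(signs, x)])

def unrotate(z, signs):
    """Apply the inverse of HD: apply H, then flip signs with D."""
    return [s * v for s, v in zip(signs, fwht(z))]

rng = random.Random(0)
d = 8
signs = [rng.choice([-1, 1]) for _ in range(d)]
x = [8.0] + [0.0] * (d - 1)           # all mass on one coordinate
z = rotate(x, signs)
print(max(abs(v) for v in z))         # largest coordinate after rotation
print(unrotate(z, signs))             # recovers x up to rounding error
```

The rotation preserves the 2-norm while spreading the mass of a spiky vector across all coordinates, which is what makes the cubic lattice a reasonable choice under the $\ell_\infty$ norm.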
