# Distributed Stochastic Variance Reduced Gradient Methods by Sampling Extra Data with Replacement

Journal of Machine Learning Research (2017): 122:1–122:43

Abstract

We study the round complexity of minimizing the average of convex functions under a new setting of distributed optimization where each machine can receive two subsets of functions. The first subset is from a random partition and the second subset is randomly sampled with replacement. Under this setting, we define a broad class of distributed…

Introduction

- The authors consider the distributed optimization problem of minimizing the average of N convex functions f_i : R^d → R for i = 1, …, N.
- The norm ‖·‖ denotes the Euclidean norm and ⟨·, ·⟩ the inner product in R^d. Throughout the paper, the authors make the following standard assumptions on problem (1).
- The average function f is μ-convex with μ ≥ 0, i.e., f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (μ/2)‖x − y‖² for all x, y ∈ R^d; when μ > 0, f is μ-strongly convex.
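The finite-sum objective and the μ-convexity inequality above can be checked numerically on a toy instance. A minimal sketch, assuming hypothetical ridge-regularized squared losses (illustrative only, not the paper's data):

```python
import numpy as np

# Hypothetical components: f_i(x) = 0.5*(a_i @ x - b_i)^2 + 0.5*mu*||x||^2,
# so the average f is mu-strongly convex by construction.
rng = np.random.default_rng(0)
N, d, mu = 100, 5, 0.1
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)

def f_i(x, i):
    return 0.5 * (A[i] @ x - b[i]) ** 2 + 0.5 * mu * x @ x

def f(x):
    """Objective (1): the average of the N component functions."""
    return np.mean([f_i(x, i) for i in range(N)])

def grad_f(x):
    return A.T @ (A @ x - b) / N + mu * x

# mu-convexity: f(x) >= f(y) + <grad f(y), x - y> + (mu/2)||x - y||^2.
x, y = rng.standard_normal(d), rng.standard_normal(d)
lower = f(y) + grad_f(y) @ (x - y) + 0.5 * mu * np.sum((x - y) ** 2)
assert f(x) >= lower
```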

Highlights

- In this paper, we consider the distributed optimization problem of minimizing the average of N convex functions f_i : R^d → R for i = 1, …, N, using m machines, i.e., min_{x∈R^d} f(x) := (1/N) Σ_{i=1}^N f_i(x). (1)
- Because our lower bound on the round complexity is proved for algorithms in F_α in which {S_j}_{j∈[m]} is a random partition, we present both distributed stochastic variance reduced gradient and distributed accelerated stochastic variance reduced gradient in the same setting, showing that the theoretical lower bound is almost attainable and that the accelerated variant is nearly optimal within F_α.
- We study the round complexity of minimizing the average of N convex functions by distributed optimization with m machines, under a new setting in which each machine receives a subset of the N functions through both random partitioning and random sampling with replacement.
- Distributed stochastic variance reduced gradient (DSVRG) uses the local functions sampled with replacement to construct an unbiased stochastic gradient in each iterative update.
- We provide a theoretical analysis of the rounds of communication DSVRG needs to find an ε-optimal solution, showing that DSVRG is optimal in terms of runtime, amount of communication, and rounds of communication when the condition number is small.
- The rounds of communication needed by distributed accelerated stochastic variance reduced gradient (DASVRG) match this lower bound up to logarithmic terms, and DASVRG is nearly optimal within this family.
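The unbiased variance-reduced gradient built from with-replacement samples is, at its core, the SVRG estimator of Johnson and Zhang (2013). A single-machine sketch on a hypothetical least-squares problem (the distributed bookkeeping is omitted; data and step size are illustrative):

```python
import numpy as np

# SVRG-style variance-reduced update: one full gradient per round, then
# cheap inner steps using an index drawn WITH replacement.
rng = np.random.default_rng(1)
N, d = 200, 10
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)

def grad_i(x, i):          # gradient of f_i(x) = 0.5*(a_i @ x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):          # gradient of f(x) = (1/N) sum_i f_i(x)
    return A.T @ (A @ x - b) / N

x, eta = np.zeros(d), 0.01
for epoch in range(20):
    x_ref = x.copy()
    g_ref = full_grad(x_ref)     # computed once per round
    for _ in range(N):
        i = rng.integers(N)      # sampled with replacement
        # Unbiased: E_i[g] = full_grad(x) for any fixed x and x_ref.
        g = grad_i(x, i) - grad_i(x_ref, i) + g_ref
        x -= eta * g
```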

Methods

**Experiments with Simulated Data**

- The authors conduct numerical experiments with simulated data to compare their DSVRG algorithm with DiSCO (Zhang and Lin, 2015) and a distributed implementation of the gradient descent (GD) method.
- The authors conduct numerical experiments with real data to compare the DSVRG and DASVRG algorithms with DisDCA by Yang (2013) and a distributed implementation of the accelerated gradient method (Accel Grad) by Nesterov (2013).
- The authors apply these four algorithms to the ERM problem (2) with three data sets: Covtype, Million Song, and Epsilon.
- To compare these methods in a challenging setting, the authors conduct experiments using random Fourier features.
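Random Fourier features (Rahimi and Recht, 2008) replace a Gaussian kernel with an explicit finite-dimensional map, which makes the resulting ERM a finite sum amenable to these methods. A minimal sketch with hypothetical data; the feature dimension and bandwidth are illustrative, not the paper's settings:

```python
import numpy as np

def random_fourier_features(X, D, gamma, seed=0):
    """Map X (n, d) to D random features whose inner products approximate
    the Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Spectral samples of the Gaussian kernel: w ~ N(0, 2*gamma*I).
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    phase = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + phase)

X = np.random.default_rng(2).standard_normal((50, 3))
Z = random_fourier_features(X, D=2000, gamma=0.5)
K_approx = Z @ Z.T                                   # approximate kernel matrix
K_true = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
```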

Conclusion

- The authors study the round complexity of minimizing the average of N convex functions by distributed optimization with m machines, under a new setting in which each machine receives a subset of the N functions through both random partitioning and random sampling with replacement.
- When the condition number is large, using the acceleration strategy of Frostig et al. (2015) and Lin et al. (2015), the authors propose the DASVRG algorithm, which requires even fewer rounds of communication than DSVRG and many existing methods that only store randomly partitioned data on machines, showing the advantage of the new distributed setting.
- The authors provide the minimum number of rounds of communication this family of algorithms needs to find an ε-optimal solution.
- The rounds of communication needed by DASVRG match this lower bound up to logarithmic terms, and DASVRG is nearly optimal within this family.

- Table 1: Rounds and settings of different distributed optimization algorithms. Except for DSVRG and DASVRG, all algorithms in this table require only α = 0 (i.e., they do not require a subset R_j sampled with replacement).

Related work

- The work most closely related to our paper is Arjevani and Shamir (2015), where a lower bound on the rounds of communication was established for solving min_{x∈R^d} (1/m) Σ_{j=1}^m f_j(x) (3) using a broad class of distributed algorithms when machine j has access only to the local function f_j(x) for j = 1, 2, …, m.

To connect (3) to (1), we can define the local function as f_j(x) := (1/|S_j|) Σ_{i∈S_j} f_i(x) (4) for a given partition {S_j}_{j∈[m]} of {f_i}_{i∈[N]}. Arjevani and Shamir (2015) proved that, if the local functions {f_j}_{j∈[m]} are δ-related (see Arjevani and Shamir (2015) for the definition) and f is strongly convex, the class of algorithms they considered needs at least Ω(√(δκ) log(1/ε)) rounds of communication to find an ε-optimal solution for (3). When δ = Ω(1), their lower bound can be achieved by a straightforward centralized distributed implementation of accelerated gradient methods. In a specific context of linear regression with the functions in S_j being i.i.d., …
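The two data-access patterns the setting combines, a random partition {S_j} defining the local functions in (4) plus an extra per-machine sample R_j drawn with replacement, can be sketched as follows (toy scalar components, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, n_extra = 12, 3, 4

# Random partition: shuffle the N indices, then split into m disjoint blocks.
perm = rng.permutation(N)
S = np.array_split(perm, m)
# Extra subset per machine, sampled WITH replacement (duplicates allowed).
R = [rng.integers(N, size=n_extra) for _ in range(m)]

# Local function of machine j, as in (4): the average over its block.
def f_i(x, i):                      # toy scalar component
    return 0.5 * (x - i) ** 2

def f_j(x, j):
    return np.mean([f_i(x, i) for i in S[j]])
```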

Funding

- Tianbao Yang is partially supported by the National Science Foundation (IIS-1463988, IIS-1545995).

Reference

- Alekh Agarwal and Leon Bottou. A lower bound for the optimization of finite sums. In International Conference on Machine Learning (ICML), 2015.
- Alekh Agarwal, Sahand Negahban, and Martin Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40 (5):2452–2482, 2012.
- Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.
- Necdet Aybat, Zi Wang, and Garud Iyengar. An asynchronous distributed proximal gradient method for composite convex optimization. In International Conference on Machine Learning (ICML), 2015.
- Amir Beck, Angelia Nedic, Asuman Ozdaglar, and Marc Teboulle. An O(1/k) gradient method for network resource allocation problems. Transactions on Control of Network Systems, 1:64–73, 2014.
- Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
- Mark Braverman, Ankit Garg, Tengyu Ma, Huy L Nguyen, and David P Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In The ACM Symposium on Theory of Computing, pages 1011–1020, 2016.
- Tsung-Hui Chang, Mingyi Hong, and Xiangfeng Wang. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing, 63:482 – 497, 2015.
- Annie I. Chen and Asuman E. Ozdaglar. A fast distributed proximal-gradient method. In Annual Allerton Conference on Communication, Control, and Computing, pages 601–608, 2012.
- Zihao Chen, Luo Luo, and Zhihua Zhang. Communication lower bounds for distributed convex optimization: Partition data on features. In AAAI, pages 1812–1818, 2017.
- Vaclav Chvatal. The tail of the hypergeometric distribution. Discrete Mathematics, 25: 285–287, 1979.
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
- Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a.
- Aaron Defazio, Justin Domke, and Tiberio S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning (ICML), 2014b.
- Wei Deng and Wotao Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66:889–916, 2016.
- John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
- Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning (ICML), pages 2540–2548, 2015.
- William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high-performance, portable implementation of the mpi message passing interface standard. Parallel Computing, 22(6):789–828, 1996.
- Martin Jaggi, Virginia Smith, Martin Takac, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems (NIPS), 2014.
- Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Distributed Nesterov-like gradient algorithms. In IEEE Annual Conference on Decision and Control (CDC), 2012.
- Dusan Jakovetic, Joao Xavier, and Jose M. F. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59:1131–1146, 2014.
- Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), pages 315–323, 2013.
- Jakub Konecny and Peter Richtarik. Semi-stochastic gradient descent methods. Frontiers in Applied Mathematics and Statistics, 3:9, 2017.
- Jakub Konecny, H. Brendan McMahan, Daniel Ramage, and Peter Richtarik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
- Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 12:1–49, 2017.
- Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.
- Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I Jordan, Peter Richtarik, and Martin Takac. Adding vs. averaging in distributed primal-dual optimization. In International Conference on Machine Learning (ICML), 2015.
- Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems (NIPS), 2013.
- Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25:829–855, 2015.
- Ali Makhdoumi and Asuman Ozdaglar. Convergence rate of distributed ADMM over networks. IEEE Transactions on Automatic Control, 2017.
- Aryan Mokhtari, Wei Shi, Qing Ling, and Alejandro Ribeiro. Dqm: Decentralized quadratically approximated alternating direction method of multipliers. IEEE Transactions on Signal Processing, 64:5158 –5173, 2016.
- Joao F. C. Mota, Joao M. F. Xavier, Pedro M. Q. Aguiar, and Markus Puschel. D-ADMM: A communication-efficient distributed algorithm for separable optimization. IEEE Transaction on Signal Processing, 61(10):2718–2723, 2013.
- Angelia Nedic and Asuman Ozdaglar. On the rate of convergence of distributed subgradient methods for multi-agent optimization. In IEEE Annual Conference on Decision and Control (CDC), 2007.
- Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54:49–61, 2009.
- Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
- Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 1177–1184, 2008.
- Sashank J Reddi, Jakub Konecny, Peter Richtarik, Barnabas Poczos, and Alex Smola. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.
- Nicolas L Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems (NIPS), pages 2663–2671, 2012.
- Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
- Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.
- Shai Shalev-Shwartz, Ohad Shamir, Karthik Sridharan, and Nathan Srebro. Stochastic convex optimization. In International Conference of Machine Learning (ICML), 2009.
- Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems (NIPS), 2016.
- Ohad Shamir and Nathan Srebro. On distributed stochastic optimization and learning. In Annual Allerton Conference on Communication, Control, and Computing, 2014.
- Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning (ICML), pages 1000–1008, 2014.
- Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62:1750–1761, 2014.
- Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63:6013–6023, 2015a.
- Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. Extra: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25:944–966, 2015b.
- J. V. Uspensky. Introduction to Mathematical Probability. McGraw-Hill, 1937.
- Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In IEEE Annual Conference on Decision and Control (CDC), pages 5445–5450, 2012.
- Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
- Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems (NIPS), 2013.
- Yuchen Zhang and Xiao Lin. DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning (ICML), pages 362–370, 2015.
