Distributed Stochastic Variance Reduced Gradient Methods by Sampling Extra Data with Replacement.

Journal of Machine Learning Research, (2017): 122:1–122:43


Abstract

We study the round complexity of minimizing the average of convex functions under a new setting of distributed optimization where each machine can receive two subsets of functions. The first subset is from a random partition and the second subset is randomly sampled with replacement. Under this setting, we define a broad class of distributed algorithms…
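As a concrete illustration of the data-distribution setting described in the abstract, the sketch below shows one way the two subsets could be generated. It is a minimal sketch only; the function name make_local_subsets and the parameter n_extra are illustrative choices and do not come from the paper.

```python
import numpy as np

def make_local_subsets(N, m, n_extra, seed=0):
    """Illustrative sketch of the setting: machine j receives S_j from a random
    partition of the N function indices and R_j drawn uniformly with replacement."""
    rng = np.random.default_rng(seed)
    S = np.array_split(rng.permutation(N), m)                  # random partition
    R = [rng.integers(0, N, size=n_extra) for _ in range(m)]   # sampled with replacement
    return S, R

# Example: 1000 functions spread over 4 machines, each with 50 extra samples.
S, R = make_local_subsets(N=1000, m=4, n_extra=50)
```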

Introduction
  • The authors consider the distributed optimization problem of minimizing the average of N convex functions $f_i : \mathbb{R}^d \to \mathbb{R}$ for $i = 1, \ldots, N$.
  • The norm $\|\cdot\|$ denotes the Euclidean norm and $\langle \cdot, \cdot \rangle$ denotes the inner product in $\mathbb{R}^d$. Throughout the paper, the authors make the following standard assumptions on problem (1).
  • The average function f is μ-convex with μ ≥ 0, i.e., $f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2$ for all $x, y \in \mathbb{R}^d$. When μ > 0, f is strongly convex.
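To make the μ-convexity assumption concrete, the following check (an illustrative example, not taken from the paper) verifies the inequality above for a ridge-regularized least-squares objective, which is μ-strongly convex by construction; the data and the value of μ are arbitrary.

```python
import numpy as np

# Hypothetical example: f(x) = (1/N) * sum_i 0.5*(a_i^T x - b_i)^2 + (mu/2)*||x||^2
# is mu-strongly convex, so the inequality above must hold for any pair x, y.
rng = np.random.default_rng(0)
N, d, mu = 200, 5, 0.1
A, b = rng.standard_normal((N, d)), rng.standard_normal(N)

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2) + 0.5 * mu * (x @ x)
grad = lambda x: A.T @ (A @ x - b) / N + mu * x

x, y = rng.standard_normal(d), rng.standard_normal(d)
lhs = f(x)
rhs = f(y) + grad(y) @ (x - y) + 0.5 * mu * np.linalg.norm(x - y) ** 2
assert lhs >= rhs - 1e-12
```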
Highlights
  • In this paper, we consider the distributed optimization problem of minimizing the average of N convex functions $f_i : \mathbb{R}^d \to \mathbb{R}$ for $i = 1, \ldots, N$, i.e., $\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{N}\sum_{i=1}^{N} f_i(x)$ (1), using m machines.
  • Because our lower bound on the round complexity is established for the algorithms in $F_\alpha$ in which $\{S_j\}_{j \in [m]}$ is a random partition, we present both distributed stochastic variance reduced gradient (DSVRG) and distributed accelerated stochastic variance reduced gradient (DASVRG) in the same setting, to show that the theoretical lower bound is almost reachable and that DASVRG is nearly optimal within $F_\alpha$.
  • We study the round complexity of minimizing the average of N convex functions by distributed optimization with m machines under a new setting in which each machine receives two subsets of the N functions, one through a random partition and one through random sampling with replacement.
  • DSVRG uses the local functions sampled with replacement to construct an unbiased stochastic gradient in each iterative update (a minimal sketch of this variance-reduced update follows this list).
  • We provide a theoretical analysis of the number of rounds of communication DSVRG needs to find an ε-optimal solution, showing that DSVRG is optimal in terms of runtime, the amount of communication, and the number of rounds of communication when the condition number is small.
  • The number of rounds of communication needed by DASVRG matches this lower bound up to logarithmic terms, so DASVRG is nearly optimal within this family.
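The bullet on DSVRG's unbiased gradient refers to the SVRG-style variance-reduced update of Johnson and Zhang (2013), which DSVRG builds on. The sketch below is a simplified single-machine version with assumed defaults for the step size and loop lengths; the with-replacement inner-loop samples mirror the role of the extra subsets R_j, but the distributed round and communication structure of DSVRG itself is omitted, so this is not the authors' exact algorithm.

```python
import numpy as np

def svrg_sketch(grad_i, N, x0, step, n_stages=10, n_inner=100, seed=0):
    """SVRG-style variance-reduced update (Johnson and Zhang, 2013), single machine.
    In DSVRG, the full gradient at the reference point would be aggregated across
    machines over the partitioned subsets S_j, and the inner-loop indices would be
    read from the locally stored with-replacement sample R_j."""
    rng = np.random.default_rng(seed)
    x_ref = x0.copy()
    for _ in range(n_stages):
        full_grad = sum(grad_i(i, x_ref) for i in range(N)) / N   # gradient of f at x_ref
        x = x_ref.copy()
        for _ in range(n_inner):
            i = rng.integers(N)                                   # sampled with replacement
            g = grad_i(i, x) - grad_i(i, x_ref) + full_grad       # unbiased estimate of grad f(x)
            x = x - step * g
        x_ref = x
    return x_ref

# Hypothetical ridge-regression components: f_i(x) = 0.5*(a_i^T x - b_i)^2 + (mu/2)*||x||^2.
rng = np.random.default_rng(1)
N, d, mu = 200, 5, 0.1
A, b = rng.standard_normal((N, d)), rng.standard_normal(N)
grad_i = lambda i, x: A[i] * (A[i] @ x - b[i]) + mu * x
x_hat = svrg_sketch(grad_i, N, np.zeros(d), step=0.01)
```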
Methods
  • Experiments with simulated data: the authors conduct numerical experiments with simulated data to compare the DSVRG algorithm with DiSCO (Zhang and Lin, 2015) and a distributed implementation of the gradient descent (GD) method.
  • The authors conduct numerical experiments with real data to compare the DSVRG and DASVRG algorithms with DisDCA by Yang (2013) and a distributed implementation of the accelerated gradient method (Accel Grad) by Nesterov (2013).
  • The authors apply these four algorithms to the ERM problem (2) with three data sets: Covtype, Million Song, and Epsilon.
  • To compare these methods in a challenging setting, the authors conduct experiments using random Fourier features (see the sketch below).
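For context, random Fourier features (Rahimi and Recht, 2008) map inputs into a randomized feature space whose inner products approximate a Gaussian kernel, which is what makes the resulting ERM problem higher-dimensional and more challenging. The sketch below is a generic implementation; the feature dimension D and bandwidth sigma are placeholders, not the paper's experimental configuration.

```python
import numpy as np

def random_fourier_features(X, D=2000, sigma=1.0, seed=0):
    """Map rows of X to D random Fourier features approximating a Gaussian kernel
    of bandwidth sigma (Rahimi and Recht, 2008)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))  # frequencies ~ N(0, sigma^-2 I)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + phase)          # E[z(x)^T z(y)] = k(x, y)

# Example: lift 100 ten-dimensional points to 500 random features.
Z = random_fourier_features(np.random.default_rng(1).standard_normal((100, 10)), D=500)
```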
Conclusion
  • The authors study the round complexity of minimizing the average of N convex functions by distributed optimization with m machines under a new setting in which each machine receives two subsets of the N functions, one through a random partition and one through random sampling with replacement.
  • When the condition number is large, using an acceleration strategy of Frostig et al. (2015) and Lin et al. (2015), the authors propose the DASVRG algorithm, which requires even fewer rounds of communication than DSVRG and than many existing methods that only store randomly partitioned data on the machines, showing the advantage of the new distributed setting.
  • The authors provide the minimum number of rounds of communication needed by this family of algorithms to find an ε-optimal solution.
  • The number of rounds of communication needed by DASVRG matches this lower bound up to logarithmic terms, so DASVRG is nearly optimal within this family.
Tables
  • Table 1: Rounds and settings of different distributed optimization algorithms. Except for DSVRG and DASVRG, all algorithms in this table only require α = 0 (i.e., they do not require a subset $R_j$ sampled with replacement).
Related work
  • The work most closely related to our paper is Arjevani and Shamir (2015), where a lower bound on the number of rounds of communication was established for solving $\min_{x \in \mathbb{R}^d} \frac{1}{m}\sum_{j=1}^{m} f_j(x)$ (3) using a broad class of distributed algorithms when machine j only has access to the local function $f_j(x)$ for $j = 1, 2, \ldots, m$. To connect (3) to (1), we can define the local function as $f_j(x) := \frac{1}{|S_j|}\sum_{i \in S_j} f_i(x)$ (4) for a given partition $\{S_j\}_{j \in [m]}$ of $\{f_i\}_{i \in [N]}$; a small numerical check of this reduction follows this section. Arjevani and Shamir (2015) proved that, if the local functions $\{f_j\}_{j \in [m]}$ are δ-related (see Arjevani and Shamir (2015) for the definition) and f is strongly convex, then the class of algorithms they considered needs at least $\Omega\!\left(\sqrt{\delta\kappa}\,\log\frac{1}{\epsilon}\right)$ rounds of communication to find an ε-optimal solution for (3). When δ = Ω(1), their lower bound can be achieved by a straightforward centralized distributed implementation of accelerated gradient methods. In a specific context of linear regression with the functions in $S_j$ being i.i.d. …
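The following small check (illustrative, not from the paper) confirms the reduction described above: with the local functions defined as in (4) and an equal-size random partition, the objective in (3) coincides with the objective in (1).

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, d = 120, 4, 3                                        # N divisible by m, so |S_j| = N/m
A, b = rng.standard_normal((N, d)), rng.standard_normal(N)
f_i = lambda i, x: 0.5 * (A[i] @ x - b[i]) ** 2            # hypothetical component functions
x = rng.standard_normal(d)

S = np.array_split(rng.permutation(N), m)                  # equal-size random partition
f_local = [np.mean([f_i(i, x) for i in Sj]) for Sj in S]   # f_j(x) as defined in (4)
assert np.isclose(np.mean(f_local), np.mean([f_i(i, x) for i in range(N)]))  # (3) == (1)
```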
Funding
  • Tianbao Yang is partially supported by the National Science Foundation (IIS-1463988, IIS-1545995).
References
  • Alekh Agarwal and Leon Bottou. A lower bound for the optimization of finite sums. In International Conference on Machine Learning (ICML), 2015.
  • Alekh Agarwal, Sahand Negahban, and Martin Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40(5):2452–2482, 2012.
  • Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • Necdet Aybat, Zi Wang, and Garud Iyengar. An asynchronous distributed proximal gradient method for composite convex optimization. In International Conference on Machine Learning (ICML), 2015.
  • Amir Beck, Angelia Nedic, Asuman Ozdaglar, and Marc Teboulle. An O(1/k) gradient method for network resource allocation problems. Transactions on Control of Network Systems, 1:64–73, 2014.
  • Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
  • Mark Braverman, Ankit Garg, Tengyu Ma, Huy L. Nguyen, and David P. Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In The ACM Symposium on Theory of Computing, pages 1011–1020, 2016.
  • Tsung-Hui Chang, Mingyi Hong, and Xiangfeng Wang. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing, 63:482–497, 2015.
  • Annie I. Chen and Asuman E. Ozdaglar. A fast distributed proximal-gradient method. In Annual Allerton Conference on Communication, Control, and Computing, pages 601–608, 2012.
  • Zihao Chen, Luo Luo, and Zhihua Zhang. Communication lower bounds for distributed convex optimization: Partition data on features. In AAAI, pages 1812–1818, 2017.
  • Vaclav Chvatal. The tail of the hypergeometric distribution. Discrete Mathematics, 25:285–287, 1979.
  • Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
  • Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a.
  • Aaron Defazio, Justin Domke, and Tiberio S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning (ICML), 2014b.
  • Wei Deng and Wotao Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66:889–916, 2016.
  • John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
  • Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning (ICML), pages 2540–2548, 2015.
  • William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789–828, 1996.
  • Martin Jaggi, Virginia Smith, Martin Takac, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems (NIPS), 2014.
  • Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Distributed Nesterov-like gradient algorithms. In IEEE Annual Conference on Decision and Control (CDC), 2012.
  • Dusan Jakovetic, Joao Xavier, and Jose M. F. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59:1131–1146, 2014.
  • Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), pages 315–323, 2013.
  • Jakub Konecny and Peter Richtarik. Semi-stochastic gradient descent methods. Frontiers in Applied Mathematics and Statistics, 3:9, 2017.
  • Jakub Konecny, H. Brendan McMahan, Daniel Ramage, and Peter Richtarik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
  • Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 12:1–49, 2017.
  • Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtarik, and Martin Takac. Adding vs. averaging in distributed primal-dual optimization. In International Conference on Machine Learning (ICML), 2015.
  • Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems (NIPS), 2013.
  • Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25:829–855, 2015.
  • Ali Makhdoumi and Asuman Ozdaglar. Convergence rate of distributed ADMM over networks. IEEE Transactions on Automatic Control, 2017.
  • Aryan Mokhtari, Wei Shi, Qing Ling, and Alejandro Ribeiro. DQM: Decentralized quadratically approximated alternating direction method of multipliers. IEEE Transactions on Signal Processing, 64:5158–5173, 2016.
  • Joao F. C. Mota, Joao M. F. Xavier, Pedro M. Q. Aguiar, and Markus Puschel. D-ADMM: A communication-efficient distributed algorithm for separable optimization. IEEE Transactions on Signal Processing, 61(10):2718–2723, 2013.
  • Angelia Nedic and Asuman Ozdaglar. On the rate of convergence of distributed subgradient methods for multi-agent optimization. In IEEE Annual Conference on Decision and Control (CDC), 2007.
  • Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54:49–61, 2009.
  • Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
  • Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 1177–1184, 2008.
  • Sashank J. Reddi, Jakub Konecny, Peter Richtarik, Barnabas Poczos, and Alex Smola. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.
  • Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems (NIPS), pages 2663–2671, 2012.
  • Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
  • Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.
  • Shai Shalev-Shwartz, Ohad Shamir, Karthik Sridharan, and Nathan Srebro. Stochastic convex optimization. In International Conference on Machine Learning (ICML), 2009.
  • Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems (NIPS), 2016.
  • Ohad Shamir and Nathan Srebro. On distributed stochastic optimization and learning. In Annual Allerton Conference on Communication, Control, and Computing, 2014.
  • Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning (ICML), pages 1000–1008, 2014.
  • Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62:1750–1761, 2014.
  • Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63:6013–6023, 2015a.
  • Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25:944–966, 2015b.
  • J. V. Uspensky. Introduction to Mathematical Probability. McGraw-Hill, 1937.
  • Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In IEEE Annual Conference on Decision and Control (CDC), pages 5445–5450, 2012.
  • L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
  • Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems (NIPS), 2013.
  • Yuchen Zhang and Xiao Lin. DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning (ICML), pages 362–370, 2015.