
# Distributed Stochastic Variance Reduced Gradient Methods by Sampling Extra Data with Replacement.

JOURNAL OF MACHINE LEARNING RESEARCH, (2017): 122:1-122:43


Abstract

We study the round complexity of minimizing the average of convex functions under a new setting of distributed optimization where each machine can receive two subsets of functions. The first subset is from a random partition and the second subset is randomly sampled with replacement. Under this setting, we define a broad class of distributed algorithms …

Introduction
• The authors consider the distributed optimization problem of minimizing the average of N convex functions fi : Rd → R for i = 1, . . . , N.
• The norm ‖·‖ denotes the Euclidean norm and ⟨·, ·⟩ the inner product in Rd. Throughout the paper, the authors make the following standard assumptions on problem (1).
• The average function f is μ-convex with μ ≥ 0, i.e., f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (μ/2)‖x − y‖² for all x, y ∈ Rd. When μ > 0, f is μ-strongly convex.
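The μ-convexity inequality above can be checked numerically. The sketch below is an illustration, not from the paper: the quadratic objective and all names are assumptions. It verifies f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (μ/2)‖x − y‖² for a random positive definite quadratic, taking μ as the smallest Hessian eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
# Random symmetric positive definite A; f(x) = 0.5 * x^T A x, so Hessian = A
M = rng.standard_normal((d, d))
A = M.T @ M + np.eye(d)
mu = np.linalg.eigvalsh(A).min()   # strong convexity parameter of f

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x             # ∇f(x)

x, y = rng.standard_normal(d), rng.standard_normal(d)
lhs = f(x)
rhs = f(y) + grad(y) @ (x - y) + 0.5 * mu * np.linalg.norm(x - y) ** 2
```

For a quadratic, the gap lhs − rhs equals (1/2)(x−y)ᵀA(x−y) − (μ/2)‖x−y‖², which is nonnegative by the definition of the smallest eigenvalue.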
Highlights
• In this paper, we consider the distributed optimization problem of minimizing the average of N convex functions fi : Rd → R for i = 1, . . . , N using m machines, i.e., min_{x ∈ Rd} f(x) := (1/N) Σ_{i=1}^{N} fi(x) (1)
• Because our lower bound on the round complexity is proved for algorithms in Fα where {Sj}j∈[m] is a random partition, we present both distributed stochastic variance reduced gradient (DSVRG) and distributed accelerated stochastic variance reduced gradient (DASVRG) in the same setting, showing that the theoretical lower bound is almost reachable and that DASVRG is nearly optimal within Fα
• We study the round complexity for minimizing the average of N convex functions by distributed optimization with m machines under a new setting where each machine receives a subset of the N functions through both random partition and random sampling with replacement
• Distributed stochastic variance reduced gradient utilizes the local functions sampled with replacement to construct the unbiased stochastic gradient in each iterative update
• We provide a theoretical analysis of the number of communication rounds needed by distributed stochastic variance reduced gradient to find an ε-optimal solution, showing that it is optimal in terms of runtime, the amount of communication, and the number of communication rounds when the condition number is small
• The number of communication rounds needed by distributed accelerated stochastic variance reduced gradient matches this lower bound up to logarithmic factors, and is nearly optimal within this family
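The variance-reduced update behind DSVRG follows the standard SVRG estimator built from indices sampled with replacement. Below is a minimal single-machine sketch; it is illustrative only — the least-squares objective, step size, and iteration counts are assumptions, and the distributed partition and communication are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 10
A = rng.standard_normal((N, d))   # f_i(x) = 0.5 * (a_i @ x - b_i)^2
b = rng.standard_normal(N)

grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]   # ∇f_i(x)
full_grad = lambda x: A.T @ (A @ x - b) / N      # ∇f(x)

def svrg_epoch(x, eta=0.01, inner_iters=400):
    """One SVRG epoch: take a snapshot x_tilde, then run inner steps
    using indices sampled with replacement (the 'extra data' idea)."""
    x_tilde = x.copy()
    g_tilde = full_grad(x_tilde)
    for i in rng.integers(0, N, size=inner_iters):   # with replacement
        # unbiased, variance-reduced estimate of ∇f(x)
        v = grad_i(x, i) - grad_i(x_tilde, i) + g_tilde
        x = x - eta * v
    return x

x = np.zeros(d)
g0 = np.linalg.norm(full_grad(x))
for _ in range(20):
    x = svrg_epoch(x)
```

In the distributed setting, the snapshot gradient g_tilde would be aggregated across machines once per round, while the inner loop runs locally on one machine using its with-replacement sample.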
Methods
• Experiments with simulated data: the authors conduct numerical experiments with simulated data to compare their DSVRG algorithm with DISCO (Zhang and Lin, 2015) and a distributed implementation of the gradient descent (GD) method.
• The authors conduct numerical experiments with real data to compare the DSVRG and DASVRG algorithms with DisDSCA by Yang (2013) and a distributed implementation of the accelerated gradient method (Accel Grad) by Nesterov (2013).
• The authors apply these four algorithms to the ERM problem (2) with three data sets:6 Covtype, Million Song and Epsilon.
• To compare these methods in a challenging setting, the authors conduct experiments using random Fourier features
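Random Fourier features approximate a Gaussian kernel with an explicit finite-dimensional map, which turns kernel ERM into a linear problem of controllable dimension. A generic sketch follows; the bandwidth, feature count, and data here are assumptions, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_fourier_features(X, D=500, sigma=1.0):
    """Map X (n x d) to D random Fourier features approximating the
    Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    n, d = X.shape
    W = rng.standard_normal((d, D)) / sigma          # frequencies
    bias = rng.uniform(0.0, 2.0 * np.pi, size=D)     # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + bias)

X = rng.standard_normal((100, 8))
Z = random_fourier_features(X, D=2000, sigma=2.0)

# Compare the feature inner products against the exact kernel matrix
K_approx = Z @ Z.T
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq / (2.0 * 2.0 ** 2))
err = np.abs(K_approx - K_exact).max()
```

The approximation error shrinks at rate O(1/√D), so the feature count D trades accuracy against the dimension of the resulting linear ERM problem.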
Conclusion
• The authors study the round complexity for minimizing the average of N convex functions by distributed optimization with m machines under a new setting where each machine receives a subset of the N functions through both random partition and random sampling with replacement.
• When the condition number is large, using an acceleration strategy of Frostig et al. (2015) and Lin et al. (2015), the authors propose a DASVRG algorithm that requires even fewer rounds of communication than DSVRG and than many existing methods that only store randomly partitioned data on the machines, showing the advantage of the new distributed setting.
• The authors provide the minimum number of rounds of communication needed by this family of algorithms to find an ε-optimal solution.
• The number of communication rounds needed by DASVRG matches this lower bound up to logarithmic factors, and is nearly optimal within this family
Tables
• Table 1: Rounds and settings of different distributed optimization algorithms. Except for DSVRG and DASVRG, all algorithms in this table require only α = 0 (i.e., they do not need a subset Rj sampled with replacement)
Related work
• The work most closely related to this paper is Arjevani and Shamir (2015), where a lower bound on the rounds of communication was established for solving

min_{x ∈ Rd} (1/m) Σ_{j=1}^{m} fj(x) (3)

using a broad class of distributed algorithms when machine j only has access to the local function fj(x) for j = 1, 2, . . . , m. To connect (3) to (1), we can define the local function as

fj(x) := (1/|Sj|) Σ_{i ∈ Sj} fi(x) (4)

for a given partition {Sj}j∈[m] of {fi}i∈[N]. Arjevani and Shamir (2015) proved that, if the local functions {fj}j∈[m] are δ-related (see Arjevani and Shamir (2015) for the definition) and f is strongly convex, the class of algorithms they considered needs at least Ω(√(δκ) log(1/ε)) rounds of communication to find an ε-optimal solution for (3). When δ = Ω(1), their lower bound can be achieved by a straightforward centralized distributed implementation of accelerated gradient methods. In a specific context of linear regression with the functions in Sj being i.i.d. samples …
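Definition (4) simply averages the functions assigned to machine j, and for an equal-size random partition the average of the local gradients ∇fj recovers the full gradient ∇f exactly. A small check (with an illustrative least-squares f_i; all names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, m = 120, 6, 4
A = rng.standard_normal((N, d))   # f_i(x) = 0.5 * (a_i @ x - b_i)^2
b = rng.standard_normal(N)

# Random partition {S_j}_{j in [m]} of the N functions across m machines
perm = rng.permutation(N)
parts = np.array_split(perm, m)   # equal sizes since m divides N

def local_grad(x, S):
    """Gradient of f_j(x) = (1/|S_j|) * sum_{i in S_j} f_i(x)."""
    r = A[S] @ x - b[S]
    return A[S].T @ r / len(S)

x = rng.standard_normal(d)
avg_local = sum(local_grad(x, S) for S in parts) / m
full = A.T @ (A @ x - b) / N      # ∇f(x)
```

With unequal partition sizes the simple average of local gradients would be biased, which is why balanced random partitions are the natural setting here.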
Funding
• Tianbao Yang is partially supported by National Science Foundation (IIS-1463988, IIS-1545995)