Linearly Converging Error Compensated SGD

NeurIPS 2020

Abstract

In this paper, we propose a unified analysis of variants of distributed SGD with arbitrary compressions and delayed updates. Our framework is general enough to cover different variants of quantized SGD, Error-Compensated SGD (EC-SGD) and SGD with delayed updates (D-SGD). Via a single theorem, we derive the complexity results for all the methods that fit our framework.

Introduction
  • The information about function fi is stored on the i-th worker only
  • Problems of this form appear in the distributed or federated training of supervised machine learning models [42, 30].
  • In such applications, x ∈ Rd describes the parameters identifying a statistical model the authors wish to train, and fi is the loss of model x on the data accessible by worker i.
Highlights
  • We consider distributed optimization problems of the form $\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$ (1)

    where n is the number of workers/devices/clients/nodes
  • We summarize the key contributions of this paper
  • In this work we propose a general theoretical framework for analyzing a wide class of methods that can be written in the error-feedback form (4)-(5); a minimal sketch of this form is given after this list
  • We prove a single theorem (Theorem 3.1) from which all our complexity results follow as special cases
  • For each existing or new specific method, we prove that one of our parametric assumptions holds, and specify the parameters for which it holds
  • A summary of the values of the parameters for all methods developed in this paper is provided in Table 5 in the appendix
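
To make the error-feedback form (4)-(5) concrete, below is a minimal single-machine simulation of error-compensated SGD in the shape stated in Table 2, $v_i^k = \mathcal{C}(e_i^k + \gamma g_i^k)$. The compressor, gradient oracle and function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def top_k(v, k):
    """Top-K sparsification: keep the K largest-magnitude entries (a biased compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ec_sgd(grads, x0, gamma, k, steps):
    """Sketch of error-compensated SGD with a parameter server.

    grads: list of callables; grads[i](x) returns a stochastic gradient of f_i at x.
    Each worker i keeps an error vector e_i, sends v_i = C(e_i + gamma * g_i) to the
    server, and stores the compression error e_i <- e_i + gamma * g_i - v_i locally.
    """
    n, d = len(grads), x0.size
    x, e = x0.copy(), np.zeros((n, d))
    for _ in range(steps):
        v = np.zeros((n, d))
        for i in range(n):
            g = grads[i](x)
            v[i] = top_k(e[i] + gamma * g, k)  # compressed message sent to the server
            e[i] = e[i] + gamma * g - v[i]     # error kept for later compensation
        x = x - v.mean(axis=0)                 # server averages the compressed updates
    return x
```

For instance, with quadratic losses f_i(x) = ½‖x − b_i‖², passing grads[i] = lambda x: x − b_i gives a toy instance of problem (1).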
Methods
  • Methods with delayed updates

    Following Stich [44], the authors show that the approach covers SGD with delayed updates [1, 3, 10] (D-SGD), and the analysis shows the best-known rate for this method.
  • Due to space limitations, the authors put these methods together with their convergence analyses in the appendix.
  • The authors introduce an assumption on f which is a relaxation of μ-strong convexity.
  • Assumption 3.1 (μ-strong quasi-convexity).
  • The authors say that a function f is strongly quasi-convex with parameter μ ≥ 0 if for all $x \in \mathbb{R}^d$: $f(x^*) \ge f(x) + \langle \nabla f(x), x^* - x \rangle + \frac{\mu}{2}\|x^* - x\|^2$.
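
The "delayed updates" mentioned in the first bullet above refer to the standard D-SGD iteration, in which the gradients used at step k were computed at an iterate that is τ steps old. A textbook form is sketched below; the exact parametrization in the paper may differ:

$$x^{k+1} = x^k - \frac{\gamma}{n} \sum_{i=1}^{n} g_i^{k-\tau}, \qquad \mathbb{E}\left[g_i^{k-\tau}\right] = \nabla f_i\!\left(x^{k-\tau}\right),$$

so that for τ = 0 the method reduces to distributed SGD (6).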
Results
  • Results in the convex case

    The authors' theoretical analysis goes beyond distributed optimization and recovers the results of Gorbunov et al. [11] and Khaled et al. [25] in the special case when $v_i^k \equiv \gamma g_i^k$.
  • In this case $e_i^k \equiv 0$ for all i and k, and the error-feedback framework (4)–(5) reduces to distributed SGD (6)
  • In this regime, relation (19) in Assumption 3.4 becomes void, while relations (15) and (16) with $\sigma_{2,k}^2 \equiv 0$ are precisely those used by Gorbunov et al. [11] to analyze a wide array of SGD methods, including vanilla SGD [41], SGD with arbitrary sampling [13], as well as variance-reduced methods such as SAGA [9], SVRG [20], LSVRG [17, 31], JacSketch [12], SEGA [16] and DIANA [37, 19].
  • To illustrate how the framework can be used even in the case when $v_i^k \equiv \gamma g_i^k$ and $e_i^k \equiv 0$, the authors develop and analyze a new version of DIANA called DIANAsr-DQ that uses arbitrary sampling on every node and double quantization, i.e., unbiased compression on both the workers' side and the master's side
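
DIANA-type methods such as DIANAsr-DQ rely on unbiased compressors, i.e. operators Q with E[Q(x)] = x, rather than the biased top-K operator. Below is a minimal sketch of rand-K sparsification with the usual d/K scaling; it is an illustrative implementation, not the authors' code.

```python
import numpy as np

def rand_k(v, k, rng=None):
    """Rand-K sparsification: keep K uniformly chosen coordinates, scaled by d/K.

    The scaling makes the operator unbiased, E[rand_k(v)] = v, with variance
    parameter omega = d/K - 1, so it can be applied both to the workers' messages
    and to the master's aggregated message.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = v.size
    out = np.zeros_like(v)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * v[idx]
    return out
```

Applying such a Q both to the messages sent by the workers and to the aggregated message broadcast by the master is what the summary above calls double quantization.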
Conclusion
  • Summary of Contributions

    The authors summarize the key contributions of this paper.

    General theoretical framework.
  • In this work the authors propose a general theoretical framework for analyzing a wide class of methods that can be written in the error-feedback form (4)-(5).
  • For each existing or new specific method, the authors prove that one of the parametric assumptions holds, and specify the parameters for which it holds.
  • These parameters have direct impact on the theoretical rate of the method.
  • The authors remark that the values of the parameters A, A′, B1, B1′, B2, B2′, C1, C2 and ρ1, ρ2 influence the theoretical stepsize.
Tables
  • Table 1: Complexity of Error-Compensated SGD methods established in this paper.
  • Table 2: Error-compensated methods developed in this paper. In all cases, $v_i^k = \mathcal{C}(e_i^k + \gamma g_i^k)$. The full descriptions of the algorithms are included in the appendix
  • Table 3: Summary of datasets: N = total # of data samples; d = # of features
  • Table 4: Complexity of SGD methods with delayed updates established in this paper
  • Table 5: The parameters for which the methods from Tables 1 and 4 satisfy Assumption 3.4. The meaning of the expressions appearing in the table, as well as their justification, is given in detail in Sections J and K. Symbols: ε = error tolerance; δ = contraction factor of compressor C; ω = variance parameter of compressor Q; κ = L/μ; L = expected smoothness constant; σ²_* = variance of the stochastic gradients at the solution; ζ²_* = average of ‖∇f_i(x^*)‖²; σ² = average of the uniform bounds on the variances of the workers' stochastic gradients
Funding
  • Acknowledgments and Disclosure of Funding: The work of Peter Richtárik, Eduard Gorbunov and Dmitry Kovalev was supported by the KAUST Baseline Research Fund
  • Gorbunov was also partially supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03 and RFBR, project number 19-31-51001
Study subjects and analysis
Workers: 20 and 100
We simulate a parameter-server architecture using one machine with an Intel(R) Core(TM) i7-9750 CPU @ 2.60 GHz in the following way. First of all, we always choose N such that N = n · m and consider n = 20 and n = 100 workers. The choice of N for each dataset that we consider is stated in Table 3.
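
A minimal illustration of how N = n · m data points can be split evenly across n simulated workers; the helper name and signature are hypothetical, not the authors' script.

```python
def split_among_workers(A, b, n):
    """Split N = n * m samples evenly into n shards, one per simulated worker."""
    N = A.shape[0]
    assert N % n == 0, "N is chosen so that N = n * m"
    m = N // n
    return [(A[i * m:(i + 1) * m], b[i * m:(i + 1) * m]) for i in range(n)]

# Example: shards = split_among_workers(A, b, n=20)  # or n=100, as in the experiments
```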

Figures: normalized suboptimality (f(x^k) − f(x^*)) / (f(x^0) − f(x^*)) versus the number of bits sent per worker and the number of passes through the data, for GD, EC-GD, EC-GD-star, EC-GD-DIANA, EC-DIANA, EC-SGD, EC-SGD-DIANA, EC-L-SVRG and EC-L-SVRG-DIANA with top-K sparsification, rand-K sparsification and ℓ2-quantization compressors, on the a9a, w8a, madelon, phishing, mushrooms and gisette datasets with 20 and 100 workers.

References
  • [1] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
  • [2] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
  • [3] Y. Arjevani, O. Shamir, and N. Srebro. A tight convergence analysis for stochastic gradient descent with delayed updates. arXiv preprint arXiv:1806.10188, 2018.
  • [4] D. Basu, D. Data, C. Karakus, and S. Diggavi. Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations. In Advances in Neural Information Processing Systems, pages 14668–14679, 2019.
  • [5] J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar. SignSGD with majority vote is communication efficient and fault tolerant. In ICLR, 2019.
  • [6] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.
  • [7] A. Beznosikov, S. Horváth, P. Richtárik, and M. Safaryan. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020.
  • [8] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
  • [9] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
  • [10] H. R. Feyzmahdavian, A. Aytekin, and M. Johansson. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Transactions on Automatic Control, 61(12):3740–3754, 2016.
  • [11] E. Gorbunov, F. Hanzely, and P. Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), 2020.
  • [12] R. M. Gower, P. Richtárik, and F. Bach. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. arXiv preprint arXiv:1805.02632, 2018.
  • [13] R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik. SGD: General analysis and improved rates. In International Conference on Machine Learning, pages 5200–5209, 2019.
  • [14] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • [15] F. Haddadpour and M. Mahdavi. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.
  • [16] F. Hanzely, K. Mishchenko, and P. Richtárik. SEGA: Variance reduction via gradient sketching. In Advances in Neural Information Processing Systems, pages 2082–2093, 2018.
  • [17] T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pages 2305–2313, 2015.
  • [18] S. Horváth, C.-Y. Ho, Ľ. Horváth, A. N. Sahu, M. Canini, and P. Richtárik. Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988, 2019.
  • [19] S. Horváth, D. Kovalev, K. Mishchenko, S. Stich, and P. Richtárik. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019.
  • [20] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.
  • [21] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
  • [22] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. arXiv preprint arXiv:1910.06378, 2019.
  • [23] S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi. Error feedback fixes signSGD and other gradient compression schemes. In International Conference on Machine Learning, pages 3252–3261, 2019.
  • [24] A. Khaled, K. Mishchenko, and P. Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), 2020.
  • [25] A. Khaled, O. Sebbouh, N. Loizou, R. M. Gower, and P. Richtárik. Unified analysis of stochastic gradient methods for composite convex and smooth optimization. arXiv preprint arXiv:2006.11573, 2020.
  • [26] S. Khirirat, H. R. Feyzmahdavian, and M. Johansson. Distributed learning with compressed gradients. arXiv preprint arXiv:1806.06573, 2018.
  • [27] A. Koloskova, S. Stich, and M. Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning, pages 3478–3487, 2019.
  • [28] A. Koloskova, T. Lin, S. U. Stich, and M. Jaggi. Decentralized deep learning with arbitrary communication compression. In ICLR, 2020. URL https://arxiv.org/abs/1907.09356.
  • [29] A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. U. Stich. A unified theory of decentralized SGD with changing topology and local updates. arXiv preprint arXiv:2003.10422, 2020.
  • [30] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
  • [31] D. Kovalev, S. Horváth, and P. Richtárik. Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020.
  • [32] R. Leblond, F. Pedregosa, and S. Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods. The Journal of Machine Learning Research, 19(1):3140–3207, 2018.
  • [33] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.
  • [34] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
  • [35] X. Liu, Y. Li, J. Tang, and M. Yan. A double residual compression algorithm for efficient distributed learning. arXiv preprint arXiv:1910.07561, 2019.
  • [36] H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 27(4):2202–2229, 2017.
  • [37] K. Mishchenko, E. Gorbunov, M. Takáč, and P. Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
  • [39] L. Nguyen, P. H. Nguyen, M. Dijk, P. Richtárik, K. Scheinberg, and M. Takáč. SGD and Hogwild! convergence without the bounded gradients assumption. In International Conference on Machine Learning, pages 3750–3758, 2018.
  • [40] C. Philippenko and A. Dieuleveut. Artemis: Tight convergence guarantees for bidirectional compression in federated learning. arXiv preprint arXiv:2006.14591, 2020.
  • [41] H. Robbins and S. Monro. A stochastic approximation method. In Herbert Robbins Selected Papers, pages 102–109.
  • [42] O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pages 1000–1008, 2014.
  • [43] S. U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations (ICLR), 2019. URL https://arxiv.org/abs/1805.09767.
  • [44] S. U. Stich. Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232, 2019.
  • [45] S. U. Stich and S. P. Karimireddy. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.
  • [46] S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pages 4447–4458, 2018.
  • [47] H. Tang, C. Yu, X. Lian, T. Zhang, and J. Liu. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning, pages 6155–6165, 2019.
  • [48] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.
  • [49] B. Woodworth, K. K. Patel, S. U. Stich, Z. Dai, B. Bullins, H. B. McMahan, O. Shamir, and N. Srebro. Is local SGD better than minibatch SGD? arXiv preprint arXiv:2002.07839, 2020.
Authors
Eduard Gorbunov
Dmitry Kovalev
Dmitry Makarenko
Peter Richtárik