TL;DR:
We give a new algorithm for Byzantine-resilient non-convex distributed optimization, with strong theoretical guarantees, which improves on the performance of prior methods for training deep neural networks against Byzantine attacks.

Byzantine-Resilient Non-Convex Stochastic Gradient Descent

International Conference on Learning Representations (ICLR), 2021


Abstract

We study adversary-resilient stochastic distributed optimization, in which m machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions. However, an α-fraction of the machines are Byzantine, in that they may behave in arbitrary, adversarial ways. We consider a variant of...

Introduction
  • Motivated by the pervasiveness of large-scale distributed machine learning, there has recently been significant interest in providing distributed optimization algorithms with strong fault-tolerance guarantees
  • In this context, the strongest, most stringent fault model is that of Byzantine faults (Lamport et al, 1982): given m machines, each having access to private data, at most an α fraction of the machines can behave in arbitrary, possibly adversarial ways, with the goal of breaking or at least slowing down the algorithm.
  • The above description only applies to honest workers; Byzantine workers may deviate arbitrarily and return adversarial “gradient” vectors to the master in every iteration
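The fault model above can be made concrete with a toy simulation (illustrative only; the objective, worker count, attack, and step size below are our own choices, not the paper's): honest workers return noisy gradients of a simple quadratic, Byzantine workers return amplified negated vectors, and naive mean aggregation is driven away from the optimum.

```python
import numpy as np

rng = np.random.default_rng(0)

def honest_gradient(w):
    # Noisy stochastic gradient of f(w) = 0.5 * ||w||^2, i.e. grad = w + noise.
    return w + rng.normal(scale=0.1, size=w.shape)

def byzantine_gradient(w):
    # Adversarial worker: send the negated, amplified gradient.
    return -10.0 * w

m, alpha = 10, 0.2          # 10 workers, 20% of them Byzantine
n_bad = int(alpha * m)
w = np.ones(4)

for _ in range(100):
    grads = [honest_gradient(w) for _ in range(m - n_bad)]
    grads += [byzantine_gradient(w) for _ in range(n_bad)]
    w = w - 0.1 * np.mean(grads, axis=0)  # naive mean aggregation

# Per step the mean is roughly (8*w - 20*w)/10 = -1.2*w, so the update
# multiplies w by ~1.12 and the iterate diverges instead of converging.
```

This is exactly the failure that robust aggregation rules are designed to prevent.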
Highlights
  • Motivated by the pervasiveness of large-scale distributed machine learning, there has recently been significant interest in providing distributed optimization algorithms with strong fault-tolerance guarantees
  • The strongest, most stringent fault model is that of Byzantine faults (Lamport et al, 1982): given m machines, each having access to private data, at most an α fraction of the machines can behave in arbitrary, possibly adversarial ways, with the goal of breaking or at least slowing down the algorithm. This fault model is the “gold standard” in distributed computing (Lynch, 1996; Lamport et al, 1982; Castro et al, 1999), as algorithms proven to be correct in this setting are guaranteed to converge under arbitrary system behaviour
  • We focus on the more challenging non-convex setting, and shoot for the strong goal of finding approximate local minima (a.k.a. second-order critical points)
  • Our experiments show that SafeguardSGD generally outperforms previous methods in convergence speed and final accuracy, sometimes by a wide accuracy margin
  • To prove this, motivated by Jin et al (2017), we study two executions of Algorithm 2 whose randomness is coupled
  • We argue that at least one of them has to escape from w0
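One way to picture the safeguard idea, as our own rough reconstruction rather than the paper's Algorithm 2 (which uses carefully chosen, time-varying thresholds and windows): the master accumulates each worker's reported gradients and filters out workers whose accumulation drifts too far from the median accumulation. The names and threshold here are illustrative.

```python
import numpy as np

def filter_by_accumulation(acc, threshold):
    """Flag workers whose accumulated gradient strays far from the median.

    acc: (m, d) array, acc[i] = sum of gradients reported by worker i so far.
    Returns a boolean mask of workers considered good. Illustrative only:
    the paper's algorithm uses principled, time-varying thresholds.
    """
    med = np.median(acc, axis=0)              # coordinate-wise median accumulation
    dist = np.linalg.norm(acc - med, axis=1)  # each worker's deviation from it
    return dist <= threshold

rng = np.random.default_rng(1)
d = 5
honest = rng.normal(size=(8, d)) * 0.1 + 1.0  # accumulations near the true sum
bad = np.full((2, d), 50.0)                   # attackers drifted far away
acc = np.vstack([honest, bad])

mask = filter_by_accumulation(acc, threshold=5.0)
# The two drifted workers (indices 8 and 9) are filtered out.
```

The intuition is that a Byzantine worker can lie a little in every round, but to do real damage its lies must accumulate, and accumulated deviation is easy to detect.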
Methods
  • [Figure legend residue: methods Krum*, Safeguard, GeoMed, coordinate-wise Median, and the Ideal Baseline, under the Variance attack (st-dev factor = 0.3).]
Results
  • The faulty nodes execute the Variance attack throughout the execution.
  • This explains the zig-zagging pattern of the accuracy, where drops correspond to epochs where the bad nodes are re-enabled, and temporarily manage to shift the gradient mean.
  • (A milder attack would correspond to a single attack epoch, which is handled.) Safeguard maintains good accuracy: its best Top-1 validation accuracy is 87%; the best alternative method reaches 29% accuracy under this attack
Conclusion
  • Figure 1 compares the convergence curves, while Table 1 compares the best test accuracy.
  • SafeguardSGD generally outperforms the previous methods in test accuracy and convergence, and closely tracks the performance of the ideal baseline, across all attacks.
  • The test accuracy difference can be > 10% between SafeguardSGD and the best prior work.
  • SafeguardSGD slightly outperforms all other algorithms even for the customized safeguard attacks, which were designed to maximally impact its performance.
  • The single-safeguard algorithm is close to double-safeguard, except for the label-flipping attack.
  • The authors conclude that SafeguardSGD can be practical, and outperforms previous approaches
Tables
  • Table1: Test accuracy comparison under different attacks. For full results see Table 2 in the Appendix
  • Table2: Table of results for CIFAR10 dataset and ResNet20 model
Study subjects and analysis
workers: 10
Our algorithm is practical: it improves upon the performance of prior methods when training deep neural networks, it is relatively lightweight, and it is the first method to withstand two recently-proposed Byzantine attacks.

[Figure/table residue: legend entries include the Delayed-gradient, Label-flipping, Sign-flipping, Zero-gradient, and Variance attacks (rescaling-factor = 0.4, st-dev factor = 0.3), and the methods Krum (with 3 faulty nodes)*, Single and Double Safe-guard, GeoMed, coordinate-wise Median, and the Ideal Baseline; x-axis is epochs. Best other method: 79.6 (Zeno), 85.3 (Zeno), 74.4 (Zeno).]

In all of our experiments, we use 10 workers and the mini-batch size (per worker) is 8. We run all algorithms for 200 epochs, with initial learning rate η = 0.1; the learning rate decreases by a factor of 10 at epochs 80, 120, and 160.

Byzantine workers: 4
The ideal baseline is gradient “mean without attacks”, and we compare against (Naive) Mean, Geometric Median Chen et al (2017), Coordinate-wise Median Yin et al (2018; 2019), Krum Blanchard et al (2017), and Zeno Xie et al (2018b) with attacks. We set α = 0.4 so there are 4 Byzantine workers. (This exceeds the fault-tolerance of Krum, and so we also tested Krum with only 3 Byzantine workers.) We formally define those prior works as follows. Definition D.1 (GeoMed Chen et al (2017))
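The robust aggregators named above can be sketched roughly as follows; the geometric median is approximated with a few Weiszfeld iterations. This is an illustrative implementation, not the authors' code or the appendix definitions.

```python
import numpy as np

def coordinate_wise_median(grads):
    # Yin et al. (2018): take the median of each coordinate independently.
    return np.median(grads, axis=0)

def geometric_median(grads, iters=100, eps=1e-8):
    # Chen et al. (2017): the point minimizing the sum of Euclidean
    # distances to all gradients, approximated by Weiszfeld iterations.
    y = np.mean(grads, axis=0)
    for _ in range(iters):
        d = np.linalg.norm(grads - y, axis=1)
        w = 1.0 / np.maximum(d, eps)   # guard against division by zero
        y = (w[:, None] * grads).sum(axis=0) / w.sum()
    return y

grads = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.0], [100.0, -100.0]])
cm = coordinate_wise_median(grads)   # stays close to (1, 1) despite the outlier
gm = geometric_median(grads)         # pulled only slightly toward the outlier
```

Both rules tolerate a minority of arbitrary outliers, which is why they serve as natural baselines here.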

Byzantine workers: 3
Note that Krum requires 2b + 2 < m. So, we have also repeated the experiments for Krum with 3 Byzantine workers (out of 10 workers), for a fairer comparison. Definition D.4 (Zeno Xie et al (2018b))
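Krum's selection rule, and the 2b + 2 < m requirement, can be sketched as follows (an illustrative implementation of Blanchard et al.'s rule, not the paper's code): each gradient is scored by its summed squared distance to its m − b − 2 nearest peers, and the lowest-scoring gradient is chosen, so with b = 3 out of m = 10 workers a clustered honest gradient wins.

```python
import numpy as np

def krum(grads, b):
    """Krum (Blanchard et al., 2017); requires 2b + 2 < m."""
    m = len(grads)
    assert 2 * b + 2 < m, "Krum needs 2b + 2 < m"
    scores = []
    for i in range(m):
        # Squared distances from gradient i to every other gradient.
        dists = sorted(
            float(np.sum((grads[i] - grads[j]) ** 2))
            for j in range(m) if j != i
        )
        # Score = sum of distances to the m - b - 2 closest neighbours.
        scores.append(sum(dists[: m - b - 2]))
    return grads[int(np.argmin(scores))]

grads = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.0], [1.0, 1.1],
                  [1.0, 0.9], [1.05, 0.95], [0.95, 1.05],
                  [50.0, 50.0], [-50.0, 50.0], [50.0, -50.0]])
g = krum(grads, b=3)   # selects one of the clustered honest gradients
```

With b = 4 out of 10 the assertion fails, which is exactly why Krum was also run with only 3 Byzantine workers in these experiments.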

workers: 10
A full experimental report is given in the Appendix. We instantiate m = 10 workers and one master executing data-parallel SGD for 200 passes (i.e. epochs) over the training dataset. The results for a higher number of workers are similar

Byzantine workers: 4
Attackers. We set α = 0.4 so there are 4 Byzantine workers. (This exceeds the fault-tolerance of Krum, and so we also tested Krum with only 3 Byzantine workers.)
  • SIGN-FLIPPING ATTACK: each Byzantine worker sends the negative gradient to the master
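The attacks mentioned in these experiments can be sketched as follows. The variance attack follows the spirit of Baruch et al. (2019), shifting the mean by a multiple of the coordinate-wise standard deviation so the lie hides inside the honest gradients' natural noise; the rescaling factor and function names are our illustrative choices, not the paper's exact definitions.

```python
import numpy as np

def sign_flipping(honest_grad):
    # Byzantine worker sends the negated gradient.
    return -honest_grad

def zero_gradient(honest_grad):
    # Byzantine worker sends the all-zeros vector.
    return np.zeros_like(honest_grad)

def variance_attack(honest_grads, rescale=0.4):
    # Shift the mean by `rescale` standard deviations per coordinate:
    # small enough to evade simple filters, large enough to bias training.
    mu = honest_grads.mean(axis=0)
    sigma = honest_grads.std(axis=0)
    return mu - rescale * sigma

honest = np.random.default_rng(2).normal(loc=1.0, scale=0.5, size=(6, 3))
bad = variance_attack(honest)   # stays within one std of the honest mean
```

Sign-flipping and zero-gradient are crude and easy to filter; the variance attack is the kind of stealthy, cumulative manipulation that motivates tracking deviations over time rather than per round.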

References
  • Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 4613–4623, 2018.
  • Zeyuan Allen-Zhu. Natasha 2: Faster Non-Convex Optimization Than SGD. In NeurIPS, 2018a. Full version available at http://arxiv.org/abs/1708.08694.
  • Zeyuan Allen-Zhu. How To Make the Gradients Small Stochastically. In NeurIPS, 2018b. Full version available at http://arxiv.org/abs/1801.02982.
  • Gilad Baruch, Moran Baruch, and Yoav Goldberg. A little is enough: Circumventing defenses for distributed learning. In Advances in Neural Information Processing Systems, pages 8635–8645, 2019.
  • Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In NIPS, pages 118–128, 2017.
  • Saikiran Bulusu, Prashant Khanduri, Pranay Sharma, and Pramod K Varshney. On distributed stochastic gradient descent for nonconvex functions in the presence of byzantines. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3137–3141. IEEE, 2020.
  • Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Fall of empires: Breaking Byzantine-tolerant SGD by inner product manipulation. Volume 115 of Proceedings of Machine Learning Research, pages 261–270, Tel Aviv, Israel, 22–25 Jul 2020. PMLR. URL http://proceedings.mlr.press/v115/xie20a.html.
  • Haibo Yang, Xin Zhang, Minghong Fang, and Jia Liu. Byzantine-resilient stochastic gradient descent for distributed learning: A lipschitz-inspired coordinate-wise median approach. arXiv preprint arXiv:1909.04532, 2019.
  • Dong Yin, Yudong Chen, Kanna Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498, 2018.
  • Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Defending against saddle point attack in byzantine-robust distributed learning. In International Conference on Machine Learning, pages 7074–7084, 2019.
Author
Faeze Ebrahimianghazani