# Federated Learning Based on Dynamic Regularization

International Conference on Learning Representations, 2020.

Abstract:

We propose a novel federated learning method for distributively training neural network models, where the server orchestrates cooperation between a subset of randomly chosen devices in each round. We view the federated learning problem primarily from a communication perspective and allow more device-level computations to save transmission cos...

Introduction

- In (McMahan et al, 2017), the authors proposed federated learning (FL), a concept that leverages data spread across many devices, to learn classification tasks distributively without recourse to data sharing.
- Data is massively distributed: the number of devices is large, while the amount of data per device is small.
- Device data is heterogeneous, in that data on different devices is sampled from different parts of the sample space.
- Data is unbalanced, in that the amount of data per device is highly variable.

Highlights

- We provide an analysis of our proposed FL algorithm and demonstrate convergence of the local device models to models that satisfy conditions for local minima of the global empirical loss with a rate of $O(1/T)$, where $T$ is the number of rounds
- We propose the Federated Dynamic Regularizer (FedDyn), a novel FL method for distributively training neural network models
- FedDyn is based on exact minimization: in each round, each participating device dynamically updates its regularizer so that the optimal model for the regularized loss conforms to the global empirical loss
- Our approach differs from prior works that attempt to parallelize gradient computation and, in doing so, trade off target accuracy against communication and necessitate inexact minimization
- All the methods require more communications to achieve a reasonable accuracy in the massive setting as the dataset is more decentralized
- We investigate different characteristic FL settings to validate our method

Methods

- The penalized risk, which is dynamically updated, is based on the current local device model and the received server model: $\theta_k^t = \arg\min_\theta \ldots$
- To build intuition for the method, the authors first highlight a fundamental issue with the federated learning setup.
- The issue is that stationary points of the device losses do not, in general, coincide with stationary points of the global loss.
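The mismatch between device and global stationary points, and the way a dynamically updated penalty can correct it, can be sketched on a toy problem. The example below is a minimal illustration, not the authors' implementation. It assumes scalar quadratic device losses `L_k(x) = 0.5 * a[k] * (x - c[k])**2`, and it assumes the device objective has the form `L_k(x) - grad[k] * x + (alpha / 2) * (x - theta)**2`, together with a server correction state `h`; these update rules are one reading of the FedDyn paper and are not spelled out in this summary. The names `feddyn_quadratic`, `alpha`, `frac`, and `h` are hypothetical.

```python
import random


def feddyn_quadratic(a, c, alpha=1.0, rounds=50, frac=1.0, seed=0):
    """Run assumed FedDyn-style updates on L_k(x) = 0.5 * a[k] * (x - c[k])**2."""
    m = len(a)
    rng = random.Random(seed)
    theta = 0.0          # server model
    grad = [0.0] * m     # per-device stored gradient (the dynamic part of the penalty)
    h = 0.0              # server-side correction state (assumed form)
    n_active = max(1, round(frac * m))
    for _ in range(rounds):
        active = rng.sample(range(m), n_active)  # randomly chosen devices this round
        local = {}
        for k in active:
            # Exact minimizer of L_k(x) - grad[k] * x + (alpha / 2) * (x - theta)**2,
            # obtained by setting the derivative to zero.
            x = (a[k] * c[k] + grad[k] + alpha * theta) / (a[k] + alpha)
            # After this update grad[k] equals the gradient of L_k at x.
            grad[k] -= alpha * (x - theta)
            local[k] = x
        # Both lines below still use the server model from the start of the round.
        h -= alpha * sum(x - theta for x in local.values()) / m
        theta = sum(local.values()) / len(local) - h / alpha
    return theta


a, c = [1.0, 3.0], [0.0, 4.0]
# Minimizer of the average loss: curvature-weighted mean of the device optima.
global_min = sum(ak * ck for ak, ck in zip(a, c)) / sum(a)
# Naively averaging each device's own minimizer gives a different point.
avg_of_local_minima = sum(c) / len(c)
theta_feddyn = feddyn_quadratic(a, c)
```

In this toy case the device optima are 0 and 4, their plain average is 2, and the global minimizer is 3. With full participation the sketched updates settle on the global minimizer rather than on the average of the device optima.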

Results

- The layers have 200 and 100 neurons, and the models achieve 98.4% and 95.0% test accuracy on MNIST and EMNIST-L respectively.
- For the character prediction task (Shakespeare), the authors use a stacked LSTM, similar to (Li et al, 2020a).
- This architecture achieves test accuracies of 50.8% and 51.2% in the IID and non-IID settings respectively.
- The standard goal in FL is to minimize the number of bits transferred.
- For this reason, the authors adopt the number of models transmitted to reach a target accuracy as the metric in the comparisons.
- All the methods require more communications to achieve a reasonable accuracy in the massive setting as the dataset is more decentralized
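The communication metric described above can be sketched as a small helper. This is an illustrative sketch only: the function name `models_to_target`, its parameters, and the accuracy curves are hypothetical and invented for the example; they are not results from the paper. The `models_per_round=2` case mirrors a method that sends both a model and an associated gradient in each round.

```python
def models_to_target(accuracy_per_round, target, models_per_round=1):
    """Models transmitted (in units of one single-model round) until the
    target accuracy is first reached, or None if it is never reached."""
    for round_idx, acc in enumerate(accuracy_per_round, start=1):
        if acc >= target:
            return round_idx * models_per_round
    return None


# Hypothetical accuracy curves, made up for illustration only.
curve_a = [0.40, 0.62, 0.75, 0.81, 0.86]
curve_b = [0.45, 0.70, 0.85, 0.88, 0.90]

cost_a = models_to_target(curve_a, target=0.85)                      # one model per round
cost_b = models_to_target(curve_b, target=0.85, models_per_round=2)  # model plus gradient per round
```

Counting transmitted models rather than rounds keeps the comparison fair when one method sends more per round than another: a method that needs fewer rounds can still cost more in communication.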

Conclusion

- The authors proposed FedDyn, a novel FL method for distributively training neural network models.
- The authors investigate different characteristic FL settings to validate the method.
- The authors demonstrate, through empirical results on real and synthetic data as well as analytical results, that the scheme leads to efficient training with a convergence rate of $O(1/T)$, where $T$ is the number of rounds, in both convex and non-convex settings, while being fully agnostic to device heterogeneity and robust to a large number of devices, partial participation, and unbalanced data.

Summary

## Objectives:

- Training instances take the form of features $x \in \mathcal{X}$ and corresponding labels $y \in \mathcal{Y}$, drawn IID from a device-indexed joint distribution, $(x, y) \sim P_k$.
- In each round, the server samples devices $\mathcal{P}_t \subseteq [m]$ and transmits $\theta^{t-1}$ to each selected device.
- The authors are not after state-of-the-art model performance on these datasets; the aim is to compare the relative performance of these models in the federated setting using FedDyn and other strong baselines.
- The authors aim to solve the FL problem under four principal characteristics: partial participation due to unreliable communication links, a massive number of devices, heterogeneous device data, and unbalanced data amounts per device.
- Since the authors aim to find a stationary point in the nonconvex case, let’s define a new Ct and keep t the same as,

- Table 1: Number of parameters transmitted, relative to one round of FedAvg, to reach the target test accuracy for moderate and large numbers of devices in the IID and Dirichlet 0.3 settings. SCAFFOLD communicates the current model and its associated gradient per round, while the others communicate only the current model. As such, the number of rounds for SCAFFOLD is one half of those reported.
- Table 2: Number of parameters transmitted, relative to one round of FedAvg, to reach the target test accuracy for the 100% and 10% participation regimes in the IID and non-IID settings. SCAFFOLD communicates the current model and its associated gradient per round, while the others communicate only the current model. As such, the number of rounds for SCAFFOLD is one half of those reported.
- Table 3: Datasets.
- Table 4: Number of parameters transmitted, relative to one round of FedAvg, to reach the target test accuracy for balanced and unbalanced data in the IID and Dirichlet 0.3 settings with 10% participation. SCAFFOLD communicates the current model and its associated gradient per round, while the others communicate only the current model. As such, the number of rounds for SCAFFOLD is one half of those reported.
- Table 5: Number of parameters transmitted, relative to one round of FedAvg, to reach the target test accuracy for the 1% participation regime in the IID and non-IID settings. SCAFFOLD communicates the current model and its associated gradient per round, while the others communicate only the current model. As such, the number of rounds for SCAFFOLD is one half of those reported.
- Table 6: Number of parameters transmitted, relative to one round of FedAvg, to reach the target test accuracy for the convex synthetic problem under different types of heterogeneity. SCAFFOLD communicates the current model and its associated gradient per round, while the others communicate only the current model. As such, the number of rounds for SCAFFOLD is one half of those reported.

Related work

- FL is a fast-evolving topic, and we only describe closely related approaches here. Comprehensive field studies have appeared in (Kairouz et al, 2019; Li et al, 2020). The general FL setup involves two types of updates, the server update and the device update, and each of these is associated with minimizing some local loss function, which may itself be updated dynamically over different rounds. At any round, some methods attempt to fully optimize this loss, while others propose inexact optimization. We specifically focus on relevant works that address the four FL scenarios (massive distribution, heterogeneity, unreliable links, and unbalanced data) here.

One line of work proposes updates based on local SGD (Stich, 2019), wherein each participating device performs a single local SGD step. The server then averages the received models. In contrast to local SGD, our method proposes to minimize a local penalized empirical loss.

Funding

- The layers have 200 and 100 neurons, and the models achieve 98.4% and 95.0% test accuracy on MNIST and EMNIST-L respectively
- The model achieves 85.2% and 55.3% test accuracy on CIFAR-10 and CIFAR-100 respectively
- For the next-character prediction task (Shakespeare), we use a stacked LSTM, similar to (Li et al, 2020a). This architecture achieves test accuracies of 50.8% and 51.2% in the IID and non-IID settings respectively
- The standard goal in FL is to minimize the number of bits transferred. For this reason, we adopt the number of models transmitted to reach a target accuracy as the metric in our comparisons

Reference

- Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.
- Sebastian Caldas, Peter Wu, Tian Li, Jakub Konecny, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. LEAF: A benchmark for federated settings. CoRR, abs/1812.01097, 2018. URL http://arxiv.org/abs/1812.01097.
- Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE, 2017.
- Laurent Condat, Grigory Malinovsky, and Peter Richtarik. Distributed proximal splitting algorithms with rates and acceleration. arXiv preprint arXiv:2010.00952, 2020.
- Aritra Dutta, El Houcine Bergou, Ahmed M Abdelmoniem, Chen-Yu Ho, Atal Narayan Sahu, Marco Canini, and Panos Kalnis. On the discrepancy between the theoretical analysis and practical implementations of compressed communication for distributed deep learning. arXiv preprint arXiv:1911.08250, 2019.
- Eduard Gorbunov, Filip Hanzely, and Peter Richtarik. A unified theory of sgd: Variance reduction, sampling, quantization and coordinate descent. In International Conference on Artificial Intelligence and Statistics, pp. 680–690. PMLR, 2020.
- Malka N Halgamuge, Moshe Zukerman, Kotagiri Ramamohanarao, and Hai L Vu. An estimation of sensor energy consumption. Progress in Electromagnetics Research, 12:259–295, 2009.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. CoRR, abs/1909.06335, 2019. URL http://arxiv.org/abs/1909.06335.
- Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurelien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
- Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: stochastic controlled averaging for on-device federated learning. CoRR, abs/1910.06378, 2019. URL http://arxiv.org/abs/1910.06378.
- Ahmed Khaled, Konstantin Mishchenko, and Peter Richtarik. Tighter theory for local sgd on identical and heterogeneous data. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, pp. 4519–4529, Online, 26–28 Aug 2020a. PMLR. URL http://proceedings.mlr.press/v108/bayoumi20a.html.
- Ahmed Khaled, Othmane Sebbouh, Nicolas Loizou, Robert M Gower, and Peter Richtarik. Unified analysis of stochastic gradient methods for composite convex and smooth optimization. arXiv preprint arXiv:2006.11573, 2020b.
- Jakub Konecny, H Brendan McMahan, Daniel Ramage, and Peter Richtarik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
- Alex Krizhevsky et al. Learning multiple layers of features from tiny images. Technical report, 2009.
- Benoît Latré, Bart Braem, Ingrid Moerman, Chris Blondia, and Piet Demeester. A survey on wireless body area networks. Wireless Networks, 17(1):1–18, 2011.
- Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- T. Li, A. K. Sahu, A. Talwalkar, and V. Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.
- Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Feddane: A federated newton-type method. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pp. 1227–1231. IEEE, 2019.
- Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems 2020, pp. 429–450, 2020a.
- Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=HJxNAnVtDS.
- Zhize Li and Peter Richtarik. A unified analysis of stochastic gradient methods for nonconvex federated optimization. arXiv preprint arXiv:2006.07013, 2020.
- Zhize Li, Dmitry Kovalev, Xun Qian, and Peter Richtarik. Acceleration for compressed gradient descent in distributed and federated optimization. arXiv preprint arXiv:2002.11364, 2020c.
- Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, and Yifei Cheng. Variance reduced local sgd with lower communication complexity. arXiv preprint arXiv:1912.12844, 2019.
- Ali Makhdoumi and Asuman Ozdaglar. Convergence rate of distributed admm over networks. IEEE Transactions on Automatic Control, 62(10):5082–5095, 2017.
- Grigory Malinovsky, Dmitry Kovalev, Elnur Gasanov, Laurent Condat, and Peter Richtarik. From local sgd to local fixed point methods for federated learning. arXiv preprint arXiv:2004.01442, 2020.
- Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282, 2017.
- Konstantin Mishchenko, Eduard Gorbunov, Martin Takac, and Peter Richtarik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
- Yurii Nesterov, Alexander Gasnikov, Sergey Guminov, and Pavel Dvurechensky. Primal–dual accelerated gradient methods with small-dimensional relaxation oracle. Optimization Methods and Software, pp. 1–38, 2020.
- Reese Pathak and Martin J Wainwright. Fedsplit: An algorithmic framework for fast federated optimization. arXiv preprint arXiv:2005.05238, 2020.
- Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konecny, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
- William Shakespeare. The complete works of william shakespeare, 1994. URL http://www.gutenberg.org/files/100/old/1994-01-100.zip.
- Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate newton-type method. In International conference on machine learning, pp. 1000–1008, 2014.
- Sebastian Urban Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations (ICLR), 2019. URL https://arxiv.org/abs/1805.09767.
- Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pp. 3–19, 2018.
- Sarika Yadav and Rama Shankar Yadav. A review on energy efficient protocols in wireless sensor networks. Wireless Networks, 22(1):335–350, 2016.
- Honglin Yuan and Tengyu Ma. Federated accelerated stochastic gradient descent. arXiv preprint arXiv:2006.08950, 2020.
- Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pp. 7252–7261, 2019.
- Xinwei Zhang, Mingyi Hong, Sairaj Dhople, Wotao Yin, and Yang Liu. Fedpd: A federated learning framework with optimal rates and adaptivity to non-iid data. arXiv preprint arXiv:2005.11418, 2020.
- Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.

Under review as a conference paper at ICLR 2021
- Dataset. We introduce a synthetic dataset to reflect different properties of FL by using a similar process as in (Li et al., 2020a). The datapoints $(x_j, y_j)$ of device $i$ are generated based on $y_j = \arg\max(\theta_i^* x_j + b_i^*)$, where $x_j \in \mathbb{R}^{30 \times 1}$, $y_j \in \{1, 2, \ldots, 5\}$, $\theta_i^* \in \mathbb{R}^{5 \times 30}$, and $b_i^* \in \mathbb{R}^{5 \times 1}$. The tuple $(\theta_i^*, b_i^*)$ represents the optimal parameter set for device $i$, and each element of these tuples is randomly drawn from $\mathcal{N}(\mu_i, 1)$, where $\mu_i \sim \mathcal{N}(0, \gamma_1)$. The features of the datapoints are modeled as $x_j \sim \mathcal{N}(\nu_i, \sigma)$, where $\sigma$ is a diagonal covariance matrix with elements $\sigma_{k,k} = k^{-1.2}$ and each element of $\nu_i$ is drawn from $\mathcal{N}(\beta_i, 1)$, where $\beta_i \sim \mathcal{N}(0, \gamma_2)$. The number of datapoints on device $i$ follows a lognormal distribution with variance $\gamma_3$. In this generation process, $\gamma_1$, $\gamma_2$ and $\gamma_3$ regulate the relation between the optimal models of the devices, the distribution of the features on each device, and the amount of data per device, respectively.
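The generation process described above can be sketched in plain Python. This is a rough sketch under several assumptions, not the authors' code: the second argument of each normal distribution is treated as a variance, the lognormal "variance" is taken to be the variance of the underlying normal, and the lognormal location parameter `log_mean` is a hypothetical value because the text does not give one. The function name `make_device` is also hypothetical.

```python
import math
import random


def make_device(rng, gamma1, gamma2, gamma3, log_mean=3.0, dim=30, classes=5):
    """Generate one device's datapoints following the described process."""
    mu = rng.gauss(0.0, math.sqrt(gamma1))    # controls how device optima differ
    beta = rng.gauss(0.0, math.sqrt(gamma2))  # controls how feature means differ
    # Optimal parameters for this device: every entry drawn from N(mu, 1).
    theta = [[rng.gauss(mu, 1.0) for _ in range(dim)] for _ in range(classes)]
    b = [rng.gauss(mu, 1.0) for _ in range(classes)]
    # Feature mean: every entry drawn from N(beta, 1).
    nu = [rng.gauss(beta, 1.0) for _ in range(dim)]
    # Diagonal covariance with entries k^(-1.2), k = 1..dim (std is its square root).
    std = [math.sqrt((k + 1) ** -1.2) for k in range(dim)]
    # Device size from a lognormal; log_mean is an assumed, hypothetical parameter.
    n = max(1, int(rng.lognormvariate(log_mean, math.sqrt(gamma3))))
    data = []
    for _ in range(n):
        x = [rng.gauss(nu[k], std[k]) for k in range(dim)]
        scores = [sum(w * v for w, v in zip(row, x)) + bias
                  for row, bias in zip(theta, b)]
        y = scores.index(max(scores)) + 1     # labels in {1, ..., classes}
        data.append((x, y))
    return data


rng = random.Random(0)
devices = [make_device(rng, gamma1=1.0, gamma2=1.0, gamma3=0.5) for _ in range(4)]
```

Setting `gamma1` or `gamma2` to zero removes the corresponding source of heterogeneity across devices, while a larger `gamma3` makes device sizes more uneven.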
