# Personalized Federated Learning with Theoretical Guarantees: A Model-Agnostic Meta-Learning Approach

NeurIPS 2020

Abstract

In Federated Learning, we aim to train models across multiple computing units (users), while users can only communicate with a common central server, without exchanging their data samples. This mechanism exploits the computational power of all users and allows users to obtain a richer model, as their models are trained over a larger set of…

Introduction

- In Federated Learning (FL), the authors consider a set of n users that are all connected to a central node, where each user has access only to its local data [1].
- The focus of this paper is on a data-heterogeneous setting where the probability distributions p_i of the users are not identical
- To illustrate this formulation, consider the example of training a Natural Language Processing (NLP) model over the devices of a set of users.
- In this problem, pi represents the empirical distribution of words and expressions used by user i.

Highlights

- In Federated Learning (FL), we consider a set of n users that are all connected to a central node, where each user has access only to its local data [1]
- We focus on the convergence of Model-Agnostic Meta-Learning (MAML) methods for the FL setting that is more challenging as nodes perform multiple local updates before sending their updates to the server, which is not considered in previous theoretical works on meta-learning
- We focus on nonconvex settings, and characterize the overall number of communication rounds between the server and users needed to find an ε-approximate first-order stationary point, whose formal definition follows
- We considered the Federated Learning (FL) problem in the heterogeneous case, and studied a personalized variant of the classic FL formulation in which our goal is to find a proper initialization model for the users that can be quickly adapted to the local data of each user after the training phase
- Federated Learning (FL) provides a framework for training machine learning models efficiently and in a distributed manner. Due to these favorable properties, it has gained significant attention and has been deployed in a broad range of applications with critical societal benefits. These applications range from healthcare systems, where machine learning models can be trained while preserving patients’ privacy, to image classification and Natural Language Processing (NLP) models, where tech companies can improve their neural networks without requiring users to share their data with a server or with other users
- We divide the test data over the nodes with the same distribution as the training data. Note that for this particular example, in which the users’ distributions are significantly different, our goal is not to achieve state-of-the-art accuracy
- We show the answer is positive, and provide rigorous theoretical guarantees for algorithms that can be used in all the applications mentioned above to achieve more personalized models in the FL framework

Results

- The authors study the convergence properties of the Personalized FedAvg (Per-FedAvg) method.
- The authors focus on nonconvex settings, and characterize the overall number of communication rounds between the server and users needed to find an ε-approximate first-order stationary point, whose formal definition follows.
- A random vector w_ε ∈ ℝᵈ is called an ε-approximate First-Order Stationary Point (FOSP) for problem (3) if it satisfies E[‖∇F(w_ε)‖²] ≤ ε.
- Agent i sends w^i_{k+1,τ} back to the server; the server then updates its model by averaging the received models: w_{k+1} = (1/(rn)) Σ_{i∈A_k} w^i_{k+1,τ}.
- The authors formally state the assumptions required for proving the main results.
- The functions f_i are bounded below, i.e., min_{w∈ℝᵈ} f_i(w) > −∞
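The update rule above can be sketched end-to-end. Below is a minimal Python sketch of the Per-FedAvg loop on toy quadratic losses f_i(w) = ½‖w − c_i‖² (a hypothetical stand-in for each user's local objective; the function names, the toy data, and the use of exact meta-gradients are illustrative assumptions — the paper's actual method works with stochastic first-order or Hessian-free approximations of this meta-gradient):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy heterogeneous objectives: f_i(w) = 0.5 * ||w - c_i||^2, where c_i plays
# the role of user i's data distribution p_i (illustrative assumption).
centers = rng.normal(size=(50, 5))            # n = 50 users, dimension d = 5

def grad_f(i, w):
    return w - centers[i]                     # gradient of the toy quadratic

def hess_f(i, w):
    return np.eye(w.size)                     # Hessian of the toy quadratic

def grad_meta(i, w, alpha):
    # Gradient of the MAML-style objective F_i(w) = f_i(w - alpha * grad f_i(w)):
    # (I - alpha * Hessian) applied to grad f_i at the adapted point.
    w_adapted = w - alpha * grad_f(i, w)
    return (np.eye(w.size) - alpha * hess_f(i, w)) @ grad_f(i, w_adapted)

def per_fedavg_round(w, alpha=0.1, beta=0.5, tau=10, r=0.2):
    n = centers.shape[0]
    chosen = rng.choice(n, size=int(r * n), replace=False)  # sampled agents A_k
    local_models = []
    for i in chosen:
        w_i = w.copy()
        for _ in range(tau):                  # tau local meta-updates per agent
            w_i = w_i - beta * grad_meta(i, w_i, alpha)
        local_models.append(w_i)              # agent i sends w^i_{k+1,tau} back
    # Server averages received models: w_{k+1} = (1/(rn)) * sum over A_k.
    return np.mean(local_models, axis=0)

w = 5.0 * np.ones(5)                          # initial server model
for _ in range(200):                          # K communication rounds
    w = per_fedavg_round(w)
```

Each sampled agent runs τ local steps on f_i(w − α∇f_i(w)) and the server averages the returned models, matching the update w_{k+1} = (1/(rn)) Σ_{i∈A_k} w^i_{k+1,τ} stated above.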

Conclusion

- The authors considered the Federated Learning (FL) problem in the heterogeneous case, and studied a personalized variant of the classic FL formulation in which the goal is to find a proper initialization model for the users that can be quickly adapted to the local data of each user after the training phase.
- Federated Learning (FL) provides a framework for training machine learning models efficiently and in a distributed manner
- Due to these favorable properties, it has gained significant attention and has been deployed in a broad range of applications with critical societal benefits.
- The authors show the answer is positive, and provide rigorous theoretical guarantees for algorithms that can be used in all the applications mentioned above to achieve more personalized models in the FL framework
- This result could have a broad impact on improving the quality of users’ models in several applications that deploy federated learning such as healthcare systems

Summary

## Objectives:

In Federated Learning, the authors aim to train models across multiple computing units (users), while users can only communicate with a common central server, without exchanging their data samples.
- The authors study a personalized variant of federated learning in which the goal is to find an initial shared model that current or new users can adapt to their local dataset by performing one or a few steps of gradient descent with respect to their own data
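Formally, the personalized objective described above is a MAML-style problem; sketched here from the paper's description, with α denoting the local adaptation step size:

```latex
\min_{w \in \mathbb{R}^d} \; F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i\!\left(w - \alpha \nabla f_i(w)\right)
```

Minimizing F finds an initialization that performs well after each user takes one gradient step on its own loss f_i.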

- Table 1: Comparison of test accuracy of different algorithms given different parameters

Related work

- Recently we have witnessed significant progress in developing novel methods that address different challenges in FL; see [4, 5]. In particular, there have been several works on various aspects of FL, including preserving the privacy of users [6,7,8,9] and lowering communication cost [10,11,12,13]. Several works develop algorithms for the homogeneous setting, where the data points of all users are sampled from the same probability distribution [14,15,16,17]. More related to our paper, there are several works that study statistical heterogeneity of users’ data points in FL [18,19,20,21,22,23], but they do not attempt to find a personalized solution for each user.

The centralized version of the model-agnostic meta-learning (MAML) problem was first proposed in [2] and followed by a number of papers studying its empirical characteristics [24,25,26,27,28,29] as well as its convergence properties [30, 31]. In this work, we focus on the convergence of MAML methods in the FL setting, which is more challenging because nodes perform multiple local updates before sending their models to the server, a regime not considered in previous theoretical works on meta-learning.

Funding

- Research was sponsored by the United States Air Force Research Laboratory and was accomplished under Cooperative Agreement Number FA8750-19-2-1000
- Alireza Fallah acknowledges support from MathWorks Engineering Fellowship
- The research of Aryan Mokhtari is supported by NSF Award CCF-2007668

Study subjects and analysis

users: 50

We use a neural network with two hidden layers of sizes 80 and 60, with the Exponential Linear Unit (ELU) activation function. We take n = 50 users in the network and run all three algorithms for K = 1000 rounds. At each round, rn agents with r = 0.2 are chosen to run τ local updates.
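The described architecture can be sketched as a minimal NumPy forward pass (two hidden layers of sizes 80 and 60 with ELU activations). The input dimension 784 and output dimension 10 are assumptions matching an MNIST-style task; the excerpt does not state them, and the weight initialization here is arbitrary:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Exponential Linear Unit: x for x > 0, alpha * (exp(x) - 1) otherwise.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

rng = np.random.default_rng(1)

# Layer sizes follow the experiment: input -> 80 -> 60 -> classes.
# Input size 784 and output size 10 are assumed (MNIST-style task).
sizes = [784, 80, 60, 10]
params = [(rng.normal(scale=0.1, size=(m, k)), np.zeros(k))
          for m, k in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    for W, b in params[:-1]:
        x = elu(x @ W + b)        # ELU on both hidden layers
    W, b = params[-1]
    return x @ W + b              # raw logits for the output layer

logits = forward(rng.normal(size=(4, 784)), params)   # batch of 4 dummy inputs
```

In the experiment, each user would train a copy of this network on its local data during the τ local updates of a round.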

main observations: 3

In particular, there are three main observations: (i) For α = 0.001 and τ = 10, Per-FedAvg (FO) and Per-FedAvg (HF) perform almost similarly, and both outperform FedAvg. In addition, decreasing τ decreases the performance of all three algorithms, which is expected as the total number of iterations decreases. (ii) Next, we study the role of α…

Reference

- J. Konecny, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, 2016.
- C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proceedings of the 34th International Conference on Machine Learning, (Sydney, Australia), 06–11 Aug 2017.
- B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54 of Proceedings of Machine Learning Research, (Fort Lauderdale, FL, USA), pp. 1273–1282, PMLR, 20–22 Apr 2017.
- P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al., “Advances and open problems in federated learning,” arXiv preprint arXiv:1912.04977, 2019.
- T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50–60, 2020.
- J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Privacy aware learning,” Journal of the ACM (JACM), vol. 61, no. 6, p. 38, 2014.
- H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, “Learning differentially private recurrent language models,” arXiv preprint arXiv:1710.06963, 2017.
- N. Agarwal, A. T. Suresh, F. X. X. Yu, S. Kumar, and B. McMahan, “cpsgd: Communicationefficient and differentially-private distributed sgd,” in Advances in Neural Information Processing Systems, pp. 7564–7575, 2018.
- W. Zhu, P. Kairouz, B. McMahan, H. Sun, and W. Li, “Federated heavy hitters discovery with differential privacy,” in International Conference on Artificial Intelligence and Statistics, pp. 3837–3847, 2020.
- A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization,” in International Conference on Artificial Intelligence and Statistics, pp. 2021–2031, 2020.
- X. Dai, X. Yan, K. Zhou, K. K. Ng, J. Cheng, and Y. Fan, “Hyper-sphere quantization: Communication-efficient sgd for federated learning,” arXiv preprint arXiv:1911.04655, 2019.
- D. Basu, D. Data, C. Karakus, and S. Diggavi, “Qsparse-local-sgd: Distributed sgd with quantization, sparsification and local computations,” in Advances in Neural Information Processing Systems, pp. 14668–14679, 2019.
- Z. Li, D. Kovalev, X. Qian, and P. Richtárik, “Acceleration for compressed gradient descent in distributed and federated optimization,” arXiv preprint arXiv:2002.11364, 2020.
- S. U. Stich, “Local sgd converges fast and communicates little,” arXiv preprint arXiv:1805.09767, 2018.
- J. Wang and G. Joshi, “Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms,” arXiv preprint arXiv:1808.07576, 2018.
- F. Zhou and G. Cong, “On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 3219–3227, 2018.
- T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi, “Don’t use large mini-batches, use local SGD,” in 8th International Conference on Learning Representations, ICLR, 2020.
- Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” arXiv preprint arXiv:1806.00582, 2018.
- A. K. Sahu, T. Li, M. Sanjabi, M. Zaheer, A. Talwalkar, and V. Smith, “On the convergence of federated optimization in heterogeneous networks,” arXiv preprint arXiv:1812.06127, 2018.
- S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for on-device federated learning,” arXiv preprint arXiv:1910.06378, 2019.
- F. Haddadpour and M. Mahdavi, “On the convergence of local descent methods in federated learning,” arXiv preprint arXiv:1910.14425, 2019.
- X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of fedavg on non-iid data,” arXiv preprint arXiv:1907.02189, 2019.
- A. K. R. Bayoumi, K. Mishchenko, and P. Richtarik, “Tighter theory for local sgd on identical and heterogeneous data,” in International Conference on Artificial Intelligence and Statistics, pp. 4519–4529, 2020.
- A. Antoniou, H. Edwards, and A. Storkey, “How to train your MAML,” in International Conference on Learning Representations, 2019.
- Z. Li, F. Zhou, F. Chen, and H. Li, “Meta-SGD: Learning to learn quickly for few-shot learning,” arXiv preprint arXiv:1707.09835, 2017.
- E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths, “Recasting gradient-based metalearning as hierarchical bayes,” in International Conference on Learning Representations, 2018.
- A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” arXiv preprint arXiv:1803.02999, 2018.
- L. Zintgraf, K. Shiarli, V. Kurin, K. Hofmann, and S. Whiteson, “Fast context adaptation via meta-learning,” in Proceedings of the 36th International Conference on Machine Learning, pp. 7693–7702, 2019.
- H. S. Behl, A. G. Baydin, and P. H. S. Torr, “Alpha MAML: adaptive model-agnostic metalearning,” 2019.
- P. Zhou, X. Yuan, H. Xu, S. Yan, and J. Feng, “Efficient meta learning via minibatch proximal update,” in Advances in Neural Information Processing Systems 32, pp. 1534–1544, Curran Associates, Inc., 2019.
- A. Fallah, A. Mokhtari, and A. Ozdaglar, “On the convergence theory of gradient-based modelagnostic meta-learning algorithms,” in International Conference on Artificial Intelligence and Statistics, pp. 1082–1092, 2020.
- F. Chen, M. Luo, Z. Dong, Z. Li, and X. He, “Federated meta-learning with fast convergence and efficient communication,” arXiv preprint arXiv:1802.07876, 2018.
- Y. Jiang, J. Konecny, K. Rush, and S. Kannan, “Improving federated learning personalization via model agnostic meta learning,” arXiv preprint arXiv:1909.12488, 2019.
- T. Li, M. Sanjabi, and V. Smith, “Fair resource allocation in federated learning,” arXiv preprint arXiv:1905.10497, 2019.
- S. Lin, G. Yang, and J. Zhang, “A collaborative learning framework via federated meta-learning,” arXiv preprint arXiv:2001.03229, 2020.
- M. Khodak, M.-F. F. Balcan, and A. S. Talwalkar, “Adaptive gradient-based meta-learning methods,” in Advances in Neural Information Processing Systems, pp. 5915–5926, 2019.
- J. Li, M. Khodak, S. Caldas, and A. Talwalkar, “Differentially private meta-learning,” arXiv preprint arXiv:1909.05830, 2019.
- V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Advances in Neural Information Processing Systems, pp. 4424–4434, 2017.
- F. Hanzely and P. Richtárik, “Federated learning of a mixture of global and local models,” arXiv preprint arXiv:2002.05516, 2020.
- Y. Deng, M. M. Kamani, and M. Mahdavi, “Adaptive personalized federated learning,” arXiv preprint arXiv:2003.13461, 2020.
- E. del Barrio, E. Giné, and C. Matrán, “Central limit theorems for the wasserstein distance between the empirical and the true distributions,” Annals of Probability, pp. 1009–1071, 1999.
- Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth, “Lower bounds for non-convex stochastic optimization,” arXiv preprint arXiv:1912.02365, 2019.
- Y. LeCun, “The mnist database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
- A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
- J. Langelaar, “Mnist neural network training and testing,” MATLAB Central File Exchange, 2019.
- C. Villani, Optimal transport: old and new, vol. 338. Springer Science & Business Media, 2008.
