# Salvaging Federated Learning by Local Adaptation

Abstract:

Federated learning (FL) is a heavily promoted approach for training ML models on sensitive data, e.g., text typed by users on their smartphones. FL is expressly designed for training on data that are unbalanced and non-iid across the participants. To ensure privacy and integrity of the federated model, the latest FL approaches use different…

Introduction

- Federated learning (McMahan et al, 2017) is a framework for large-scale, distributed learning on sensitive data: for example, training a next-word prediction model on texts typed by users into their smartphones or training a medical treatment model on patient records from multiple hospitals.
- In the original design (McMahan et al, 2017), the federated model is created by repeatedly averaging model updates from small subsets of participants
- Both the updates and the final model can leak participants’ training data, violating privacy (Shokri et al, 2017; Melis et al, 2019).
- Federated learning is a distributed learning paradigm for training a model on multiple participants’ data (McMahan et al, 2017)
- It consists of local training and aggregation.
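
The local-training-plus-aggregation loop above can be sketched minimally. This is not the paper's code: `local_update` is a hypothetical stand-in for each participant's SGD and the data are toy lists; only the federated-averaging structure follows McMahan et al. (2017).

```python
import random

def local_update(weights, data, lr=0.1, epochs=2):
    # Stand-in for a participant's local SGD: nudge the weights toward
    # the mean of the participant's (toy) data for `epochs` steps.
    for _ in range(epochs):
        mean = sum(data) / len(data)
        weights = [w + lr * (mean - w) for w in weights]
    return weights

def fedavg_round(global_weights, participants, m=2):
    # One federated-averaging round: a random subset of m participants
    # trains locally, and the server averages their resulting models.
    chosen = random.sample(participants, m)
    local_models = [local_update(list(global_weights), p) for p in chosen]
    return [sum(ws) / len(ws) for ws in zip(*local_models)]

weights = [0.0, 0.0]
participants = [[1.0, 2.0], [3.0], [0.5, 0.5, 0.5]]
for _ in range(10):
    weights = fedavg_round(weights, participants, m=2)
```

It is this averaging step that later privacy (DP) and robustness mechanisms modify, at a cost in per-participant accuracy.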

Highlights

- Federated learning (McMahan et al, 2017) is a framework for large-scale, distributed learning on sensitive data: for example, training a next-word prediction model on texts typed by users into their smartphones or training a medical treatment model on patient records from multiple hospitals
- Differentially private federated learning (McMahan et al, 2018) bounds how much the federated model can reveal about the input from any individual participant
- We demonstrate that privacy and robustness protections destroy the accuracy of federated models for many individual participants, removing their main incentive to join federated learning
- Adapted DP-FED (the differentially private federated model) outperforms the local models of all participants
- Federated learning is a promising approach to large-scale model training on sensitive data
- Mean accuracy improvements due to adaptation are 2.32%, 2.12%, and 2.12% for BASIC-FED, DP-FED and ROBUST-FED, respectively
- We showed how local adaptation techniques based on fine-tuning, multi-task learning, and knowledge distillation help improve the accuracy of private and robust federated models for individual participants, enabling them to reap the benefits of federated learning without compromising privacy or integrity of their models

Results

- For next-word prediction, mean accuracy improvements due to adaptation are 2.32%, 2.12%, and 2.12% for BASIC-FED, DP-FED, and ROBUST-FED, respectively
- These improvements make up for the loss of accuracy due to differential privacy (-1.42%) and robust aggregation (-2.81%).
- For image classification, mean accuracy improvements due to adaptation are 2.98%, 6.83%, and 6.34% for BASIC-FED, DP-FED, and ROBUST-FED, respectively
- These improvements make up for the loss of accuracy due to differential privacy (-7.83%) and robust aggregation (-11.89%).
- Adapted DP-FED outperforms the local models of all participants

Conclusion

- Federated learning is a promising approach to large-scale model training on sensitive data.
- Differential privacy and robust aggregation reduce accuracy of federated models below that of the locally trained models of many participants, removing their main incentive to join federated learning.
- The authors showed how local adaptation techniques based on fine-tuning, multi-task learning, and knowledge distillation help improve the accuracy of private and robust federated models for individual participants, enabling them to reap the benefits of federated learning without compromising privacy or integrity of their models
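
Of the three adaptation techniques, knowledge distillation is the easiest to show in isolation. Below is a minimal sketch of the distillation loss (Hinton et al., 2015), with the federated model as teacher and the locally adapted model as student; the logits and temperature T = 2.0 are illustrative values, not taken from the paper.

```python
import math

def softmax(logits, T=1.0):
    # Softmax with temperature T; higher T yields softer distributions.
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy between the softened teacher and student outputs
    # (Hinton et al., 2015); the T*T factor keeps gradient magnitudes
    # comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -T * T * sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]  # outputs of the federated (global) model
student = [1.8, 0.7, -0.9]  # outputs of the locally adapted model
loss = distillation_loss(student, teacher)
```

Minimizing this loss on local data keeps the adapted model close to the federated teacher while letting it specialize.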


Related work

- Privacy and integrity of federated learning. The original federated learning framework suffers from privacy and integrity problems. Participants’ model updates leak their training data (Melis et al, 2019), and malicious participants can inject unwanted behaviors into the model (Bagdasaryan et al, 2018; Bhagoji et al, 2019). Secure aggregation (Bonawitz et al, 2017) prevents the global server from observing individual updates, but it also makes attacks on integrity impossible to detect and the final federated model may still leak training data.

To limit the leakage of training data, federated learning has been combined with differential privacy (McMahan et al, 2018). To limit the influence of individual participants on the federated model, several robust, “Byzantine-tolerant” aggregation schemes have been proposed (Blanchard et al, 2017; El Mhamdi et al, 2018; Damaskinos et al, 2019; Rajput et al, 2019; Chen et al, 2017). Alternative aggregation schemes (Yurochkin et al, 2019; Guha et al, 2019; Hsu et al, 2019) for various flavors of federated learning provide neither privacy nor robustness; because our focus is on mitigating the damage from privacy and robustness mechanisms, we do not analyze them in this paper.
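
The clip-and-noise step at the heart of differentially private aggregation (McMahan et al, 2018) can be sketched as follows. The clipping bound S, noise multiplier sigma, and toy updates here are illustrative assumptions; a real deployment would calibrate sigma to a target (ε, δ) privacy budget.

```python
import math, random

def clip_update(update, S=1.0):
    # Bound each participant's influence by clipping the L2 norm of
    # its model update to at most S.
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, S / norm) if norm > 0 else 1.0
    return [u * scale for u in update]

def dp_aggregate(updates, S=1.0, sigma=0.5, seed=0):
    # Average the clipped updates, then add Gaussian noise calibrated
    # to the clipping bound: the clip-and-noise core of differentially
    # private federated averaging.
    rng = random.Random(seed)
    clipped = [clip_update(u, S) for u in updates]
    avg = [sum(us) / len(us) for us in zip(*clipped)]
    return [a + rng.gauss(0.0, sigma * S / len(updates)) for a in avg]

updates = [[3.0, 4.0], [0.1, -0.2], [1.0, 1.0]]
noisy_avg = dp_aggregate(updates)
```

Clipping is also what hurts participants with atypical data: their large, informative updates are scaled down the most.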

Funding

- This research was supported in part by NSF grants 1704296 and 1916717, the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program, and a Google Faculty Research Award

Study subjects and analysis

participants: 80000

Next-word prediction. We train word-prediction models on a randomly chosen month (November 2017) of the Reddit dataset (Reddit) with 80,000 participants (i.e., Reddit users) who have between 150 and 500 posts, treating each post as one sentence. We compiled a dictionary of the 50,000 most frequent words and replaced all others with the unk token.

participants: 100

To create BASIC-FED, DP-FED, and ROBUST-FED models, we train 2-layer LSTM models with 200 hidden units and 10 million parameters (PyTorch). Following (McMahan et al, 2018), we run federated learning for 5,000 rounds with m = 100 participants per round, aggregation learning rate η = 1, batch size 20, and B = 2 internal epochs using SGD. For training participants’ models, we tried inner learning rates of 0.1, 1, 10, 20, and 40, yielding global test accuracy of, respectively, 9.07%, 14.34%, 18.83%, 19.20%, and 19.29%.

participants: 100

Image classification. We split the CIFAR-10 (Krizhevsky, 2009) training set into 100 participants. To simulate a non-iid distribution, we allocate images from each class to participants using Dirichlet distribution with α = 0.9, similar to (Hsu et al, 2019)
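
The Dirichlet allocation used to simulate a non-iid split can be sketched without any ML framework. `dirichlet_split` is a hypothetical helper (Dirichlet proportions drawn via normalized Gamma samples), and the 4-participant toy labels stand in for CIFAR-10.

```python
import random

def dirichlet_split(labels, n_participants=4, alpha=0.9, seed=0):
    # For each class, draw participant proportions from Dirichlet(alpha)
    # (via normalized Gamma samples) and slice that class's indices
    # accordingly: small alpha -> skewed, non-iid shards; large alpha
    # -> nearly uniform shards.
    rng = random.Random(seed)
    shards = [[] for _ in range(n_participants)]
    for c in sorted(set(labels)):
        idxs = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idxs)
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_participants)]
        total = sum(gammas)
        cuts, acc = [], 0.0
        for g in gammas[:-1]:
            acc += g / total
            cuts.append(int(acc * len(idxs)))
        bounds = [0] + cuts + [len(idxs)]
        for shard, lo, hi in zip(shards, bounds, bounds[1:]):
            shard.extend(idxs[lo:hi])
    return shards

labels = [i % 10 for i in range(1000)]  # stand-in for CIFAR-10 labels
shards = dirichlet_split(labels, n_participants=4, alpha=0.9)
```

With α = 0.9 each participant sees all classes in uneven proportions, which matches the moderately non-iid setting the paper evaluates.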

randomly selected participants: 10

We train all federated models for 1,000 rounds with the aggregation learning rate η = 1 and batch size of 32. Following (McMahan et al, 2017), in every round we aggregate 10 randomly selected participants, each of whom trains a ResNet-18 model (with 11.2 million parameters) with the inner learning rate of 0.1 and B = 2 internal epochs using SGD with momentum 0.9 and weight decay 0.0005. Unlike the Reddit data, CIFAR-10 does not come divided into distinct participants with their own training and test sets.

participants: 1000

Freeze-base (FB) is a variant that freezes the base layers of the federated model and fine-tunes only the top layer. When using fine-tuning for local adaptation, we experimented on 1,000 participants with learning rates of 0.1, 1, and 10, yielding mean accuracy of, respectively, 20.58%, 20.99%, and 18.28%. Therefore, we set lr = 1.
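
Freeze-base adaptation amounts to taking gradient steps only on the unfrozen top-layer parameters. A toy sketch follows; the four-weight "model", the gradients, and the `finetune_step` helper are all illustrative, not the paper's PyTorch code.

```python
def finetune_step(params, frozen, grads, lr=1.0):
    # One gradient step that updates only the unfrozen (top-layer)
    # parameters; frozen base-layer weights pass through unchanged.
    return [p if f else p - lr * g for p, f, g in zip(params, frozen, grads)]

params = [0.5, -0.3, 1.2, 0.8]       # toy model: two base, two top weights
frozen = [True, True, False, False]  # freeze the base layers
grads  = [0.1, 0.1, 0.2, -0.1]       # illustrative gradients
adapted = finetune_step(params, frozen, grads, lr=1.0)
```

In PyTorch this corresponds to setting `requires_grad = False` on the base layers before fine-tuning, so the optimizer never touches them.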

participants: 80000

Dots and bars in the figures are color-coded according to the adaptation technique that yielded the best accuracy improvement. For the word-prediction task, there are 80,000 participants, but we only adapt the models for the 79,097 participants whose vocabulary size (i.e., number of unique symbols) is over 100, whose percentage of utility symbols (e.g., punctuation) is under 40%, and whose difference between total and utility symbols is over 1,000.

participants: 7377 disincentivized, 72623 remaining

Removing disincentivized participants. As shown in section 5, there are 7,377 participants in the word-prediction task whose local models have higher accuracy on their own data than the federated model and who thus have no incentive to participate. If we re-train the federated model on the remaining 72,623 participants, it achieves mean accuracy of 20.008% and median accuracy of 19.570% vs., respectively, 20.021% and 19.563% achieved by the original model on 80,000 participants. The new model performs well even on the removed 7,377 participants, with mean accuracy of 20.076% vs. 20.301% for the original. Among the 72,623 participants used to train both models, the new model underperforms the original on only 974 (1.34%) participants. As discussed in subsection 7.2, the removed participants have (a) simpler and fewer words, and (b) sentences that are outliers, very different from those of the other participants.

Reference

- Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In CCS, 2016.
- Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. How to backdoor federated learning. arXiv:1807.00459, 2018.
- Bagdasaryan, E., Poursaeed, O., and Shmatikov, V. Differential privacy has disparate impact on model accuracy. In NeurIPS, 2019.
- Bhagoji, A. N., Chakraborty, S., Mittal, P., and Calo, S. Analyzing federated learning through an adversarial lens. In ICML, 2019.
- Blanchard, P., El Mhamdi, E., Guerraoui, R., and Stainer, J. Machine learning with adversaries: Byzantine tolerant gradient descent. In NIPS, 2017.
- Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for privacypreserving machine learning. In CCS, 2017.
- Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., Kiddon, C., Konecny, J., Mazzocchi, S., McMahan, H. B., Van Overveldt, T., Petrou, D., Ramage, D., and Roselander, J. Towards federated learning at scale: System design. In SysML, 2019.
- Chen, X., Chen, T., Sun, H., Wu, Z. S., and Hong, M. Distributed training with heterogeneous data: Bridging median and mean based algorithms. arXiv:1906.01736, 2019.
- Chen, Y., Su, L., and Xu, J. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. In POMACS, 2017.
- Damaskinos, G., El Mhamdi, E., Guerraoui, R., Guirguis, A. H. A., and Rouault, S. L. A. AGGREGATHOR: Byzantine machine learning via robust gradient aggregation. In SysML, 2019.
- https://doc.ai/blog/federated-future-ready-shipping/, 2019.
- Dwork, C. Differential privacy: A survey of results. In TAMC, 2008.
- El Mhamdi, E., Guerraoui, R., and Rouault, S. The hidden vulnerability of distributed learning in Byzantium. In ICML, 2018.
- French, R. M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
- Grosse, K., Trost, T. A., Mosbach, M., Backes, M., and Klakow, D. Adversarial initialization–when your network performs the way I want. arXiv:1902.03020, 2019.
- Guha, N., Talwalkar, A., and Smith, V. One-shot federated learning. arXiv:1902.11175, 2019.
- Hanin, B. and Rolnick, D. How to start training: The effect of initialization and architecture. In NIPS, 2018.
- Hard, A., Rao, K., Mathews, R., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated learning for mobile keyboard prediction. arXiv:1811.03604, 2018.
- Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
- Hsu, T.-M. H., Qi, H., and Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. arXiv:1909.06335, 2019.
- Huang, Z., Li, J., Siniscalchi, S. M., Chen, I.-F., Wu, J., and Lee, C.-H. Rapid adaptation for deep neural networks through multi-task learning. In INTERSPEECH, 2015.
- Jiang, Y., Konecny, J., Rush, K., and Kannan, S. Improving federated learning personalization via model agnostic meta learning. arXiv:1909.12488, 2019.
- Kairouz, P. et al. Advances and open problems in federated learning. arXiv:1912.04977, 2019.
- Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. Proc. NAS, 114(13):3521–3526, 2017.
- Krizhevsky, A. Learning multiple layers of features from tiny images, 2009.
- McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Aguera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.
- McMahan, H. B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. In ICLR, 2018.
- Melis, L., Song, C., De Cristofaro, E., and Shmatikov, V. Exploiting unintended feature leakage in collaborative learning. In S&P, 2019.
- Miao, Y., Zhang, H., and Metze, F. Speaker adaptive training of deep neural network acoustic models using i-vectors. IEEE/ACM Trans. AUDIO SPE, 23(11):1938–1949, 2015.
- io19-helpful-google-everyone/, 2019.
- pytorch. PyTorch examples. https://github.com/pytorch/examples/tree/master/word_language_model/, 2019.
- Rajput, S., Wang, H., Charles, Z., and Papailiopoulos, D. Detox: A redundancy-based framework for faster and more robust gradient aggregation. In NeurIPS, 2019.
- Reddit. Reddit comments. https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_comments, 2019.
- Samarakoon, L. and Sim, K. C. Factorized hidden layer adaptation for deep neural network based acoustic modeling. IEEE/ACM Trans. AUDIO SPE, 24(12):2241–2250, 2016.
- Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models. In S&P, 2017.
- Song, C., Ristenpart, T., and Shmatikov, V. Machine learning models that remember too much. In CCS, 2017.
- Tan, T., Qian, Y., Yin, M., Zhuang, Y., and Yu, K. Cluster adaptive training for deep neural network. In ICASSP, 2015.
- Wang, K., Mathews, R., Kiddon, C., Eichner, H., Beaufays, F., and Ramage, D. Federated evaluation of on-device personalization. arXiv:1910.10252, 2019.
- Yin, D., Chen, Y., Ramchandran, K., and Bartlett, P. Byzantine-robust distributed learning: Towards optimal statistical rates. In ICML, 2018.
- Yu, D. and Li, J. Recent progresses in deep learning based acoustic models. IEEE/CAA J. Automatica Sinica, 4(3): 396–409, 2017.
- Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, T. N., and Khazaeni, Y. Bayesian nonparametric federated learning of neural networks. In ICML, 2019.
