Salvaging Federated Learning by Local Adaptation

Tao Yu, Eugene Bagdasaryan
We show that on standard tasks such as next-word prediction, many participants gain no benefit from federated learning because the federated model is less accurate on their data than the models they can train locally on their own.

Abstract:

Federated learning (FL) is a heavily promoted approach for training ML models on sensitive data, e.g., text typed by users on their smartphones. FL is expressly designed for training on data that are unbalanced and non-iid across the participants. To ensure privacy and integrity of the federated model, the latest FL approaches use differential privacy or robust aggregation.

Introduction
  • Federated learning (McMahan et al., 2017) is a framework for large-scale, distributed learning on sensitive data: for example, training a next-word prediction model on texts typed by users into their smartphones, or training a medical treatment model on patient records from multiple hospitals.
  • In the original design (McMahan et al., 2017), the federated model is created by repeatedly averaging model updates from small subsets of participants.
  • Both the updates and the final model can leak participants’ training data, violating privacy (Shokri et al., 2017; Melis et al., 2019).
  • Federated learning is a distributed learning paradigm for training a model on multiple participants’ data (McMahan et al., 2017).
  • It consists of local training and aggregation, as in the sketch below.
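
A minimal sketch of one federated-averaging round in the spirit of McMahan et al. (2017). All helper names (local_train, fedavg_round, participant_loaders) are illustrative, not the authors' implementation:

    import copy
    import random
    import torch

    def local_train(model, data_loader, lr=0.1, epochs=2):
        """One participant's local step: a few epochs of SGD on private data."""
        model = copy.deepcopy(model)          # never touch the global weights directly
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in data_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        return model.state_dict()

    def fedavg_round(global_model, participant_loaders, m=10, eta=1.0):
        """Aggregation: average m sampled participants' updates into the global model."""
        sampled = random.sample(participant_loaders, m)
        global_state = global_model.state_dict()
        updates = [local_train(global_model, dl) for dl in sampled]
        for name in global_state:
            delta = torch.stack([u[name].float() - global_state[name].float()
                                 for u in updates]).mean(dim=0)
            global_state[name] = global_state[name] + eta * delta
        global_model.load_state_dict(global_state)
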
Highlights
  • Federated learning (McMahan et al., 2017) is a framework for large-scale, distributed learning on sensitive data: for example, training a next-word prediction model on texts typed by users into their smartphones or training a medical treatment model on patient records from multiple hospitals.
  • Differentially private federated learning (McMahan et al., 2018) bounds how much the federated model can reveal about the input from any individual participant.
  • We demonstrate that privacy and robustness protections destroy the accuracy of federated models for many individual participants, removing their main incentive to join federated learning.
  • Adapted differentially private federated learning (DP-FED) outperforms the local models of all participants.
  • Federated learning is a promising approach to large-scale model training on sensitive data.
  • Mean accuracy improvements due to adaptation are 2.32%, 2.12%, and 2.12% for BASIC-FED, DP-FED, and ROBUST-FED, respectively.
  • We show how local adaptation techniques based on fine-tuning, multi-task learning, and knowledge distillation help improve the accuracy of private and robust federated models for individual participants, enabling them to reap the benefits of federated learning without compromising the privacy or integrity of their models.
Results
  • Mean accuracy improvements due to adaptation are 2.32%, 2.12%, and 2.12% for BASIC-FED, DP-FED, and ROBUST-FED, respectively.
  • These improvements make up for the loss of accuracy due to differential privacy (-1.42%) and robust aggregation (-2.81%).
  • Mean accuracy improvements due to adaptation are 2.98%, 6.83%, and 6.34% for BASIC-FED, DP-FED, and ROBUST-FED, respectively.
  • These improvements make up for the loss of accuracy due to differential privacy (-7.83%) and robust aggregation (-11.89%).
  • Adapted DP-FED outperforms the local models of all participants.
Conclusion
  • Federated learning is a promising approach to large-scale model training on sensitive data.
  • Differential privacy and robust aggregation reduce the accuracy of federated models below that of the locally trained models of many participants, removing their main incentive to join federated learning.
  • The authors showed how local adaptation techniques based on fine-tuning, multi-task learning, and knowledge distillation help improve the accuracy of private and robust federated models for individual participants, enabling them to reap the benefits of federated learning without compromising the privacy or integrity of their models; a distillation sketch follows.
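
As one concrete illustration, a hedged sketch of distillation-based local adaptation: the federated model serves as a frozen teacher and the participant's adapted copy as the student. It is written for a classifier-style model; alpha, T, lr, and epochs are illustrative hyperparameters, not the paper's exact values:

    import copy
    import torch
    import torch.nn.functional as F

    def distill_adapt(federated_model, local_loader, alpha=0.5, T=2.0,
                      lr=0.01, epochs=5):
        """Fit local data while staying close to the teacher's soft predictions."""
        teacher = federated_model.eval()               # frozen federated model
        student = copy.deepcopy(federated_model).train()
        opt = torch.optim.SGD(student.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in local_loader:
                with torch.no_grad():
                    soft_targets = F.softmax(teacher(x) / T, dim=-1)
                logits = student(x)
                # Distillation term: KL between softened student and teacher outputs.
                kd = F.kl_div(F.log_softmax(logits / T, dim=-1),
                              soft_targets, reduction="batchmean") * T * T
                ce = F.cross_entropy(logits, y)        # fit the local labels
                loss = alpha * kd + (1 - alpha) * ce
                opt.zero_grad()
                loss.backward()
                opt.step()
        return student
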
Related work
  • Privacy and integrity of federated learning. The original federated learning framework suffers from privacy and integrity problems. Participants’ model updates leak their training data (Melis et al., 2019), and malicious participants can inject unwanted behaviors into the model (Bagdasaryan et al., 2018; Bhagoji et al., 2019). Secure aggregation (Bonawitz et al., 2017) prevents the global server from observing individual updates, but it also makes attacks on integrity impossible to detect, and the final federated model may still leak training data.

    To limit the leakage of training data, federated learning has been combined with differential privacy (McMahan et al., 2018). To limit the influence of individual participants on the federated model, several robust, “Byzantine-tolerant” aggregation schemes have been proposed (Blanchard et al., 2017; El Mhamdi et al., 2018; Damaskinos et al., 2019; Rajput et al., 2019; Chen et al., 2017). Alternative aggregation schemes (Yurochkin et al., 2019; Guha et al., 2019; Hsu et al., 2019) for various flavors of federated learning provide neither privacy nor robustness; because our focus is on mitigating the damage from privacy and robustness mechanisms, we do not analyze them in this paper.
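
Hedged sketches of the two server-side defenses just described, operating on flattened participant updates (1-D tensors): participant-level differential privacy in the spirit of McMahan et al. (2018) via norm clipping plus Gaussian noise, and a simple coordinate-wise median standing in for the cited Byzantine-tolerant rules. The clip bound and noise scale are illustrative; the cited mechanisms differ in detail:

    import torch

    def dp_aggregate(updates, clip_norm=1.0, noise_sigma=0.01):
        """Clip each update's L2 norm, average, then add Gaussian noise."""
        clipped = [u * torch.clamp(clip_norm / (u.norm() + 1e-12), max=1.0)
                   for u in updates]
        avg = torch.stack(clipped).mean(dim=0)
        # Noise scaled to the clipping bound and the number of participants.
        return avg + torch.randn_like(avg) * noise_sigma * clip_norm / len(updates)

    def median_aggregate(updates):
        """Byzantine-tolerant aggregation: coordinate-wise median of updates."""
        return torch.stack(updates).median(dim=0).values
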
Funding
  • This research was supported in part by NSF grants 1704296 and 1916717, the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program, and a Google Faculty Research Award.
Study subjects and analysis
participants: 80000
Next-word prediction. We train word-prediction models on a randomly chosen month (November 2017) of the Reddit dataset (Reddit) with 80,000 participants (i.e., Reddit users) who have between 150 and 500 posts, treating each post as one sentence. We compiled a dictionary of the 50,000 most frequent words and replaced all others with the <unk> token.
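
A minimal sketch of this vocabulary step, assuming posts arrive as whitespace-tokenizable strings (the helper name and corpus format are illustrative):

    from collections import Counter

    def build_vocab(posts, vocab_size=50_000, unk="<unk>"):
        """Keep the vocab_size most frequent words; map everything else to <unk>."""
        counts = Counter(word for post in posts for word in post.split())
        vocab = {w for w, _ in counts.most_common(vocab_size)}
        return [[w if w in vocab else unk for w in post.split()] for post in posts]
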

participants: 100
To create the BASIC-FED, DP-FED, and ROBUST-FED models, we train 2-layer LSTM models with 200 hidden units and 10 million parameters (pytorch). Following (McMahan et al., 2018), we run federated learning for 5,000 rounds with m = 100 participants per round, aggregation learning rate η = 1, batch size 20, and B = 2 internal epochs using SGD. For training participants’ models, we tried inner learning rates of 0.1, 1, 10, 20, and 40, yielding global test accuracy of, respectively, 9.07%, 14.34%, 18.83%, 19.20%, and 19.29%.
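
A model of this shape, in the style of the cited PyTorch word_language_model example. Tying the input and output embeddings is an assumption that brings the parameter count close to the quoted 10 million (roughly 10M for the 50,000 x 200 embedding plus about 0.64M for the two LSTM layers):

    import torch.nn as nn

    class WordLSTM(nn.Module):
        """2-layer, 200-unit LSTM next-word model over a 50,000-word vocabulary."""
        def __init__(self, vocab_size=50_000, hidden=200, layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)
            self.lstm = nn.LSTM(hidden, hidden, num_layers=layers, batch_first=True)
            self.decoder = nn.Linear(hidden, vocab_size)
            self.decoder.weight = self.embed.weight   # weight tying (assumption)

        def forward(self, tokens, state=None):
            out, state = self.lstm(self.embed(tokens), state)
            return self.decoder(out), state
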

participants: 100
Image classification. We split the CIFAR-10 (Krizhevsky, 2009) training set into 100 participants. To simulate a non-iid distribution, we allocate images from each class to participants using a Dirichlet distribution with α = 0.9, similar to (Hsu et al., 2019).
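
A sketch of this non-iid split: for each class, draw participant shares from Dirichlet(α = 0.9) and allocate that class's images accordingly (similar in spirit to Hsu et al., 2019; the details here are illustrative):

    import numpy as np

    def dirichlet_split(labels, n_participants=100, alpha=0.9, seed=0):
        """Return per-participant index lists drawn via a Dirichlet class split."""
        rng = np.random.default_rng(seed)
        parts = [[] for _ in range(n_participants)]
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            rng.shuffle(idx)
            shares = rng.dirichlet(alpha * np.ones(n_participants))
            cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
            for p, chunk in enumerate(np.split(idx, cuts)):
                parts[p].extend(chunk.tolist())
        return parts
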

randomly selected participants: 10
We train all federated models for 1,000 rounds with the aggregation learning rate η = 1 and a batch size of 32. Following (McMahan et al., 2017), in every round we aggregate 10 randomly selected participants, each of whom trains a ResNet-18 model (with 11.2 million parameters) with an inner learning rate of 0.1 and B = 2 internal epochs using SGD with momentum 0.9 and weight decay 0.0005. CIFAR-10 is not divided into distinct participants with their own training and test sets.
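
The per-participant optimizer settings quoted above, as they might look in PyTorch (torchvision's ResNet-18; whether a CIFAR-specific variant of the architecture was used is an assumption left open here):

    import torch
    from torchvision.models import resnet18

    model = resnet18(num_classes=10)                  # ~11.2M parameters
    opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=5e-4)
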

participants: 1000
Freezebase (FB) is a variant that freezes the base layers of the federated model and fine-tunes only the top layer. When using fine-tuning for local adaptation, we experimented on 1,000 participants with learning rates of 0.1, 1, and 10, yielding mean accuracy of, respectively, 20.58%, 20.99%, and 18.28%. Therefore, we set lr = 1.
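
A sketch of the Freezebase variant, assuming the top layer is named "decoder" (as in the LSTM sketch above); only that layer's parameters stay trainable:

    import torch

    def freezebase(model, top_layer_name="decoder", lr=1.0):
        """Freeze all base layers; return an optimizer over the top layer only."""
        for name, param in model.named_parameters():
            param.requires_grad = name.startswith(top_layer_name)
        trainable = [p for p in model.parameters() if p.requires_grad]
        return torch.optim.SGD(trainable, lr=lr)
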

participants: 80000
Dots and bars in the figures are color-coded according to the adaptation technique that yielded the best accuracy improvement. For the word-prediction task, there are 80,000 participants, but we only adapt the models for the 79,097 participants whose vocabulary size (i.e., number of unique symbols) is over 100, whose percentage of utility symbols (e.g., punctuation) is under 40%, and whose difference between total and utility symbols is over 1,000.
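
The same filter expressed as a simple predicate; which symbols count as "utility" symbols is an assumption here:

    def keep_participant(tokens, utility=frozenset({".", ",", "!", "?", ";", ":"})):
        """Apply the three adaptation-eligibility thresholds described above."""
        vocab = set(tokens)                              # unique symbols
        n_utility = sum(1 for t in tokens if t in utility)
        return (len(vocab) > 100
                and n_utility / max(len(tokens), 1) < 0.40
                and len(tokens) - n_utility > 1_000)
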

participants: 7377
Removing disincentivized participants. As shown in section 5, there are 7,377 participants in the word-prediction task whose local models have higher accuracy on their data than the federated model and who thus have no incentive to participate. If we re-train the federated model on the remaining 72,623 participants, it achieves mean accuracy of 20.008% and median accuracy of 19.570% vs., respectively, 20.021% and 19.563% achieved by the original model on 80,000 participants. The new model performs well even on the removed 7,377 participants, with mean accuracy of 20.076% vs. 20.301% for the original. Among the 72,623 participants used to train both models, the new model underperforms the original on only 974 (1.34%) participants. As discussed in subsection 7.2, the removed participants have (a) simpler and fewer words, and (b) sentences that are outliers, very different from those of the rest of the participants.

Reference
  • Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In CCS, 2016.
  • Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. How to backdoor federated learning. arXiv:1807.00459, 2018.
  • Bagdasaryan, E., Poursaeed, O., and Shmatikov, V. Differential privacy has disparate impact on model accuracy. In NeurIPS, 2019.
  • Bhagoji, A. N., Chakraborty, S., Mittal, P., and Calo, S. Analyzing federated learning through an adversarial lens. In ICML, 2019.
  • Blanchard, P., El Mhamdi, E., Guerraoui, R., and Stainer, J. Machine learning with adversaries: Byzantine tolerant gradient descent. In NIPS, 2017.
  • Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for privacy-preserving machine learning. In CCS, 2017.
  • Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., Kiddon, C., Konecny, J., Mazzocchi, S., McMahan, H. B., Van Overveldt, T., Petrou, D., Ramage, D., and Roselander, J. Towards federated learning at scale: System design. In SysML, 2019.
  • Chen, X., Chen, T., Sun, H., Wu, Z. S., and Hong, M. Distributed training with heterogeneous data: Bridging median and mean based algorithms. arXiv:1906.01736, 2019.
  • Chen, Y., Su, L., and Xu, J. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. In POMACS, 2017.
  • Damaskinos, G., El Mhamdi, E., Guerraoui, R., Guirguis, A. H. A., and Rouault, S. L. A. AGGREGATHOR: Byzantine machine learning via robust gradient aggregation. In SysML, 2019.
  • doc.ai. https://doc.ai/blog/federated-future-ready-shipping/, 2019.
  • Dwork, C. Differential privacy: A survey of results. In TAMC, 2008.
  • El Mhamdi, E., Guerraoui, R., and Rouault, S. The hidden vulnerability of distributed learning in Byzantium. In ICML, 2018.
  • French, R. M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
  • Grosse, K., Trost, T. A., Mosbach, M., Backes, M., and Klakow, D. Adversarial initialization: when your network performs the way I want. arXiv:1902.03020, 2019.
  • Guha, N., Talwalkar, A., and Smith, V. One-shot federated learning. arXiv:1902.11175, 2019.
  • Hanin, B. and Rolnick, D. How to start training: The effect of initialization and architecture. In NIPS, 2018.
  • Hard, A., Rao, K., Mathews, R., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated learning for mobile keyboard prediction. arXiv:1811.03604, 2018.
  • Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
  • Hsu, T.-M. H., Qi, H., and Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. arXiv:1909.06335, 2019.
  • Huang, Z., Li, J., Siniscalchi, S. M., Chen, I.-F., Wu, J., and Lee, C.-H. Rapid adaptation for deep neural networks through multi-task learning. In INTERSPEECH, 2015.
  • Jiang, Y., Konecny, J., Rush, K., and Kannan, S. Improving federated learning personalization via model agnostic meta learning. arXiv:1909.12488, 2019.
  • Kairouz, P. et al. Advances and open problems in federated learning. arXiv:1912.04977, 2019.
  • Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. Proc. NAS, 114(13):3521–3526, 2017.
  • Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
  • McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Aguera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.
  • McMahan, H. B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. In ICLR, 2018.
  • Melis, L., Song, C., De Cristofaro, E., and Shmatikov, V. Exploiting unintended feature leakage in collaborative learning. In S&P, 2019.
  • Miao, Y., Zhang, H., and Metze, F. Speaker adaptive training of deep neural network acoustic models using i-vectors. IEEE/ACM Trans. AUDIO SPE, 23(11):1938–1949, 2015.
  • io19-helpful-google-everyone/, 2019.
  • pytorch. PyTorch examples. https://github.com/pytorch/examples/tree/master/word_language_model/, 2019.
  • Rajput, S., Wang, H., Charles, Z., and Papailiopoulos, D. DETOX: A redundancy-based framework for faster and more robust gradient aggregation. In NeurIPS, 2019.
  • Reddit. Reddit comments. https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_comments, 2019.
  • Samarakoon, L. and Sim, K. C. Factorized hidden layer adaptation for deep neural network based acoustic modeling. IEEE/ACM Trans. AUDIO SPE, 24(12):2241–2250, 2016.
  • Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models. In S&P, 2017.
  • Song, C., Ristenpart, T., and Shmatikov, V. Machine learning models that remember too much. In CCS, 2017.
  • Tan, T., Qian, Y., Yin, M., Zhuang, Y., and Yu, K. Cluster adaptive training for deep neural network. In ICASSP, 2015.
  • Wang, K., Mathews, R., Kiddon, C., Eichner, H., Beaufays, F., and Ramage, D. Federated evaluation of on-device personalization. arXiv:1910.10252, 2019.
  • Yin, D., Chen, Y., Ramchandran, K., and Bartlett, P. Byzantine-robust distributed learning: Towards optimal statistical rates. In ICML, 2018.
  • Yu, D. and Li, J. Recent progresses in deep learning based acoustic models. IEEE/CAA J. Automatica Sinica, 4(3):396–409, 2017.
  • Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, T. N., and Khazaeni, Y. Bayesian nonparametric federated learning of neural networks. In ICML, 2019.