# Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification

International Conference on Learning Representations (ICLR), 2021.

Abstract:

Differentially private SGD (DP-SGD) is one of the most popular methods for solving differentially private empirical risk minimization (ERM). Due to its noisy perturbation on each gradient update, the error rate of DP-SGD scales with the ambient dimension p, the number of parameters in the model. Such dependence can be problematic for over...

Introduction

- Many fundamental machine learning tasks involve solving empirical risk minimization (ERM): given a loss function $\ell$, find a model $w \in \mathbb{R}^p$ that minimizes the empirical risk $L_n(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i)$, where $z_1, \ldots, z_n$ are i.i.d. examples drawn from a distribution $\mathcal{P}$.
- One of the most commonly used algorithms for solving private ERM is differentially private stochastic gradient descent (DP-SGD) (Abadi et al, 2016; Bassily et al, 2014; Song et al, 2013), a private variant of SGD that perturbs each gradient update with a random noise vector drawn from an isotropic Gaussian distribution $\mathcal{N}(0, \sigma^2 I_p)$ with appropriately chosen variance $\sigma^2$.
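
The noisy update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the per-example clipping step and the choice to scale the noise by the clipping norm follow the common convention of Abadi et al (2016) and are assumptions here, as are the names `dp_sgd_step`, `clip_norm`, and `lr`.

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, lr, clip_norm, sigma, rng):
    """One DP-SGD update (sketch): clip, sum, add isotropic Gaussian noise, average."""
    # L2 norm of each per-example gradient, shape (batch,)
    norms = np.linalg.norm(per_example_grads, axis=1)
    # Shrink any gradient whose norm exceeds clip_norm (assumed clipping rule)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale[:, None]
    # Isotropic Gaussian noise in all p coordinates (assumed std: sigma * clip_norm)
    noise = rng.normal(0.0, sigma * clip_norm, size=w.shape)
    batch = per_example_grads.shape[0]
    noisy_grad = (clipped.sum(axis=0) + noise) / batch
    return w - lr * noisy_grad
```

Because the noise is drawn independently in each of the p coordinates, its expected squared norm grows linearly with p, which is the source of the dimension dependence discussed in this summary.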

Highlights

- Many fundamental machine learning tasks involve solving empirical risk minimization (ERM): given a loss function $\ell$, find a model $w \in \mathbb{R}^p$ that minimizes the empirical risk $L_n(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i)$, where $z_1, \ldots, z_n$ are i.i.d. examples drawn from a distribution $\mathcal{P}$
- We think DP-SGD performs better than PDP-SGD because the subspace reconstruction error dominates the error from the injected noise, since the noise scale is small for large ε
- While differentially private stochastic gradient descent (DP-SGD) and its variants have been well studied for private ERM, the error rate of DP-SGD depends on the ambient dimension p
- We propose PDP-SGD, which projects the noisy gradient onto an approximate subspace estimated from a public dataset
- We show theoretically that PDP-SGD can obtain a dimension-independent error rate
- We provide a theoretical analysis and empirical evaluations to show that our method can substantially improve the accuracy of DP-SGD in the high privacy regime (corresponding to low privacy loss ε)
- We evaluate the proposed algorithms on two popular deep learning tasks and demonstrate the empirical advantages of PDP-SGD
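
The projection step in the highlights can be sketched as follows. This is a hedged illustration under stated assumptions: the subspace is taken to be the span of the top-k eigenvectors of the second-moment matrix of gradients computed on the public samples, and the function names `top_k_subspace` and `project` are hypothetical. Forming the full p×p matrix is only practical for small p; it is used here for clarity.

```python
import numpy as np

def top_k_subspace(public_grads, k):
    """Return a (p, k) orthonormal basis for the top-k eigenspace of the
    second-moment matrix of the public gradients (rows of public_grads)."""
    m = public_grads.shape[0]
    second_moment = public_grads.T @ public_grads / m  # shape (p, p)
    # eigh returns eigenvalues in ascending order for symmetric matrices
    _, eigvecs = np.linalg.eigh(second_moment)
    return eigvecs[:, -k:]

def project(noisy_grad, basis):
    """Project a noisy gradient onto the subspace spanned by the basis columns."""
    return basis @ (basis.T @ noisy_grad)
```

After projection, only the k noise components inside the subspace survive, so the injected noise no longer scales with the full parameter count.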

Methods

- The authors empirically evaluate PDP-SGD on training neural networks with two datasets: MNIST (LeCun et al, 1998) and Fashion MNIST (Xiao et al, 2017).
- The authors also explore a heuristic variant, DP-SGD with random projection, which replaces the projector with a $k \times p$ Gaussian random projector (Bingham and Mannila, 2001; Blocki et al, 2012).
- The authors call this method randomly projected DP-SGD (RPDP-SGD).
- The authors present the experimental results after discussing the experimental setup.
- More details and additional results are in Appendix D.
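
The random-projection heuristic (RPDP-SGD) can be sketched as below. The source only states that the projector is replaced by a k×p Gaussian random matrix; the entry variance 1/k (chosen so that the expectation of RᵀR is the identity) and the map g ↦ RᵀRg are assumptions made for this illustration, and both function names are hypothetical.

```python
import numpy as np

def gaussian_projector(k, p, rng):
    """Draw a k x p matrix with i.i.d. N(0, 1/k) entries (assumed scaling)."""
    return rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, p))

def random_project(noisy_grad, R):
    """Map the gradient down to k dimensions and back up to p dimensions."""
    return R.T @ (R @ noisy_grad)
```

Unlike the public-data subspace, this projector ignores the gradient structure, which makes it a useful baseline for checking whether the learned subspace matters.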

Results

- The training accuracy and test accuracy for different ε are reported in Figure 3.
- For small ε, i.e., ε ≤ 0.42 with MNIST (Figure 3(a)) and ε ≤ 0.72 with Fashion MNIST (Figure 3(b)), PDP-SGD outperforms DP-SGD.
- Training dynamics of DP-SGD and PDP-SGD at different privacy levels are presented in Figure 8 for MNIST and Figure 9 for Fashion MNIST.
- Among the choices of k, PDP-SGD with k = 50 performs better than the others in terms of training and test accuracy.

Conclusion

- While DP-SGD and its variants have been well studied for private ERM, the error rate of DP-SGD depends on the ambient dimension p.
- The authors propose PDP-SGD, which projects the noisy gradient onto an approximate subspace estimated from a public dataset.
- The authors show theoretically that PDP-SGD can obtain a dimension-independent error rate.
- The authors evaluate the proposed algorithms on two popular deep learning tasks and demonstrate the empirical advantages of PDP-SGD.

Summary

## Objectives:

The authors aim to bypass the dependence on the ambient dimension p by leveraging the low-dimensional structure of the gradients observed when training deep networks.

- Table 1: Network architecture for MNIST and Fashion MNIST
- Table 2: Neural network and datasets setup
- Table 3: Hyper-parameter settings for DP-SGD and PDP-SGD for MNIST
- Table 4: Hyper-parameter settings for DP-SGD and PDP-SGD for Fashion MNIST

Related work

- Beyond the aforementioned work on private ERM, there has been recent work on private ERM that also leverages the low-dimensional structure of the problem. Jain and Thakurta (2014); Song et al (2020) show dimension independent excess empirical risk bounds for convex generalized linear problems, when the input data matrix is low-rank. Kairouz et al (2020) show that for unconstrained convex empirical risk minimization if the gradients along the path of optimization lie in a low-dimensional subspace, then a noisy version of AdaGrad method which only operates on private data achieves dimension-free excess risk bound. In comparison, our work studies both convex and non-convex problems and our analysis applies for more general low-dimensional structures that can be characterized by small γ2 functions (Talagrand, 2014) (e.g., low-rank gradients and fast decay in the gradient coordinates).

Given a private dataset $S = \{z_1, \ldots, z_n\}$ drawn i.i.d. from the underlying distribution $\mathcal{P}$, we want to solve the following empirical risk minimization (ERM) problem subject to differential privacy:

$$\min_{w} L_n(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i),$$

where the parameter $w \in \mathbb{R}^p$.

We optimize this objective with an iterative algorithm. At each iteration $t$, we write $w_t$ to denote the algorithm's iterate and use $g_t$ to denote the mini-batch gradient.
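
To make the role of the dimension concrete, the two update rules can be written side by side. This is a sketch in notation that is partly assumed: the step size $\eta_t$, the noise vector $b_t$, and the symbol $\hat{V}_k$ for an orthonormal basis of the estimated $k$-dimensional subspace do not appear in the text above.

```latex
% DP-SGD: isotropic noise in all p coordinates
w_{t+1} = w_t - \eta_t \,(g_t + b_t), \qquad b_t \sim \mathcal{N}(0, \sigma^2 I_p),
\qquad \mathbb{E}\,\|b_t\|_2^2 = \sigma^2 p .

% PDP-SGD (assumed form): project the noisy gradient onto the estimated subspace
w_{t+1} = w_t - \eta_t \,\hat{V}_k \hat{V}_k^{\top} (g_t + b_t),
\qquad \mathbb{E}\,\|\hat{V}_k \hat{V}_k^{\top} b_t\|_2^2 = \sigma^2 k .
```

The second expectation holds because projecting an isotropic Gaussian onto a fixed $k$-dimensional subspace leaves a $k$-dimensional isotropic Gaussian, so the noise energy scales with $k$ rather than $p$. The price is the subspace reconstruction error incurred when $\hat{V}_k$ does not capture $g_t$ exactly.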

Study subjects and analysis

samples: 100

We provide an empirical evaluation of PDP-SGD on two real datasets. In our experiments, we construct the "public" datasets by taking very small random sub-samples of these two datasets (100 samples). While these two public datasets are not sufficient for training an accurate predictor, we demonstrate that they provide a useful gradient subspace projection and a substantial accuracy improvement over DP-SGD.

samples: 10000

The MNIST and Fashion MNIST datasets both consist of 60,000 training examples and 10,000 test examples. To construct the private training set, we randomly sample 10,000 examples from the original training set of MNIST and Fashion MNIST, then randomly sample 100 examples from the rest to construct the public dataset. Note that the smaller private datasets make the private learning problem more challenging.
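
The split described above can be sketched with index arrays. This is a minimal illustration, not the authors' code: the function name `split_private_public` and the use of a single random permutation are assumptions. Drawing the private indices first and the public indices from the remainder is equivalent to taking disjoint slices of one permutation.

```python
import numpy as np

def split_private_public(n_total, n_private, n_public, rng):
    """Return disjoint index arrays for the private and public subsets."""
    perm = rng.permutation(n_total)
    private_idx = perm[:n_private]
    public_idx = perm[n_private:n_private + n_public]
    return private_idx, public_idx

# Sizes taken from the text: 60,000 training examples, 10,000 private, 100 public
rng = np.random.default_rng(0)
private_idx, public_idx = split_private_public(60_000, 10_000, 100, rng)
```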

training samples: 10000

We follow the Moment Accountant (MA) method (Bu et al, 2019) to calculate the accumulated privacy cost, which depends on the number of epochs, the batch size, δ, and the noise scale σ. With 30 epochs, batch size 250, 10,000 training samples, and $\delta = 10^{-5}$ fixed, ε is {2.41, 1.09, 0.72, 0.42, 0.30, 0.23} for σ ∈ {2, 4, 6, 10, 14, 18} for Fashion MNIST. For MNIST, ε is {1.09, 0.72, 0.53, 0.42, 0.30, 0.23} for σ ∈ {4, 6, 8, 10, 14, 18}.
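
The quantities that feed the accountant follow from simple arithmetic on the stated setup. The sketch below computes them and records the reported privacy costs as lookup tables. It does not implement the accountant itself, the function name `training_schedule` is hypothetical, and the table values are copied from the text rather than recomputed.

```python
def training_schedule(n_samples, batch_size, epochs):
    """Derive the sampling rate and step counts from the training setup."""
    steps_per_epoch = n_samples // batch_size
    return {
        "sampling_rate": batch_size / n_samples,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }

# Setup from the text: 10,000 training samples, batch size 250, 30 epochs
schedule = training_schedule(n_samples=10_000, batch_size=250, epochs=30)

# Reported privacy cost per noise scale sigma, copied from the text (delta = 1e-5)
FASHION_MNIST_EPS = {2: 2.41, 4: 1.09, 6: 0.72, 10: 0.42, 14: 0.30, 18: 0.23}
MNIST_EPS = {4: 1.09, 6: 0.72, 8: 0.53, 10: 0.42, 14: 0.30, 18: 0.23}
```

In both tables a larger noise scale gives a smaller privacy cost, which is why the small-cost settings are the ones where the injected noise dominates the error.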

samples: 10000

Note that the ε presented in this paper is with respect to a subset, i.e., 10,000 samples from MNIST and Fashion MNIST.

samples: 20000

Figure 7: (a) MNIST with 20,000 samples and (b) Fashion MNIST with 50,000 samples. The X-axis and Y-axis are the same as in Figure 3.

samples: 20000

Figure 6 reports PDP-SGD with s = {1, 10, 20} for (a) MNIST and (b) Fashion MNIST, showing that PDP-SGD with reduced eigen-space computation also improves accuracy over DP-SGD, even though there is a mild decay for PDP-SGD with fewer eigen-space computations. We also compare PDP-SGD and DP-SGD for different numbers of training samples, i.e., MNIST with 20,000 samples (Figure 7(a)) and Fashion MNIST with 50,000 samples (Figure 7(b)), with 100 public samples in both cases. The observation that PDP-SGD outperforms DP-SGD in the small-ε regime in Figure 3 also holds for other numbers of training samples.

Reference

- M. Abadi, A. Chu, I. Goodfellow, B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In 23rd ACM Conference on Computer and Communications Security, pages 308–318, 2016. URL https://arxiv.org/abs/1607.00133.
- B. Avent, A. Korolova, D. Zeber, T. Hovden, and B. Livshits. BLENDER: enabling local search with a hybrid differential privacy model. In 26th USENIX Security Symposium, pages 747–764, 2017. URL https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/avent.
- R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
- R. Bassily, K. Nissim, A. D. Smith, T. Steinke, U. Stemmer, and J. Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pages 1046–1059, 2016. doi: 10.1145/2897518.2897566. URL https://doi.org/10.1145/2897518.2897566.
- R. Bassily, V. Feldman, K. Talwar, and A. G. Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, pages 11282–11291, 2019a.
- Limits of private learning with access to public data. In Advances in Neural Information Processing Systems, pages 10342–10352, 2019b. URL http://papers.nips.cc/paper/
- R. Bassily, A. Cheu, S. Moran, A. Nikolov, J. Ullman, and Z. S. Wu. Private query release assisted by public data. CoRR, abs/2004.10941, 2020. URL https://arxiv.org/abs/2004.10941.
- E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 245–250, 2001.
- J. Blocki, A. Blum, A. Datta, and O. Sheffet. The johnson-lindenstrauss transform itself preserves differential privacy. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 410–419. IEEE, 2012.
- Z. Bu, J. Dong, Q. Long, and W. J. Su. Deep learning with gaussian differential privacy. arXiv preprint arXiv:1911.11607, 2019.
- C. Dwork, G. N. Rothblum, and S. P. Vadhan. Boosting and differential privacy. In 51th Annual IEEE Symposium on Foundations of Computer Science, pages 51–60. IEEE Computer Society, 2010. doi: 10.1109/FOCS.2010.12. URL https://doi.org/10.1109/FOCS.2010.12.
- C. Dwork, K. Talwar, A. Thakurta, and L. Zhang. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Symposium on Theory of Computing, pages 11–20. ACM, 2014. doi: 10.1145/2591796.2591883. URL https://doi.org/10.1145/2591796.2591883.
- C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the 47th Annual ACM on Symposium on Theory of Computing, pages 117–126. ACM, 2015. doi: 10.1145/2746539.2746580. URL https://doi.org/10.1145/2746539.2746580.
- V. Feldman, I. Mironov, K. Talwar, and A. Thakurta. Privacy amplification by iteration. In 59th IEEE Annual Symposium on Foundations of Computer Science, pages 521–532, 2018. doi: 10.1109/FOCS.2018.00056. URL https://doi.org/10.1109/FOCS.2018.00056.
- S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013. doi: 10.1137/120880811. URL https://doi.org/10.1137/120880811.
- G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996.
- G. Gur-Ari, D. A. Roberts, and E. Dyer. Gradient descent happens in a tiny subspace. CoRR, abs/1812.04754, 2018. URL http://arxiv.org/abs/1812.04754.
- R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 2012.
- P. Jain and A. G. Thakurta. (near) dimension independent risk bounds for differentially private learning. volume 32 of Proceedings of Machine Learning Research, pages 476–484, Bejing, China, 22–24 Jun 2014. PMLR. URL http://proceedings.mlr.press/v32/jain14.html.
- C. Jung, K. Ligett, S. Neel, A. Roth, S. Sharifi-Malvajerdi, and M. Shenfeld. A new analysis of differential privacy’s generalization guarantees. volume 151, pages 31:1–31:17, 2020. doi: 10.4230/LIPIcs.ITCS.2020.31. URL https://doi.org/10.4230/LIPIcs.ITCS.2020.31.
- P. Kairouz, M. Ribero, K. Rush, and A. Thakurta. Dimension independence in unconstrained private ERM via adaptive preconditioning. CoRR, abs/2008.06570, 2020. URL https://arxiv.org/abs/2008.06570.
- S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? In 2008 49th Annual IEEE Symposium on Foundations of Computer Science, 2008.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- X. Li, Q. Gu, Y. Zhou, T. Chen, and A. Banerjee. Hessian based analysis of SGD for deep nets: Dynamics and generalization. In Proceedings of the 2020 SIAM International Conference on Data Mining, pages 190–198. SIAM, 2020. doi: 10.1137/1.9781611976236.22. URL https://doi.org/10.1137/1.9781611976236.22.
- F. McSherry. Spectral methods for data analysis. PhD thesis, University of Washington, 2004.
- Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition, 2014. ISBN 1461346916.
- N. Papernot, M. Abadi, U. Erlingsson, I. J. Goodfellow, and K. Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In 5th International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=HkwoSDPgg.
- N. Papernot, A. Thakurta, S. Song, S. Chien, and U. Erlingsson. Tempered sigmoid activations for deep learning with differential privacy, 2020.
- V. Papyan. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet hessians. In International Conference on Machine Learning, pages 5012–5021, 2019.
- B. Polyak. Gradient methods for the minimisation of functionals. Ussr Computational Mathematics and Mathematical Physics, 3:864–878, 12 1963. doi: 10.1016/0041-5553(63)90382-3.
- S. Song, K. Chaudhuri, and A. D. Sarwate. Stochastic gradient descent with differentially private updates. In IEEE Global Conference on Signal and Information Processing, pages 245–248. IEEE, 2013. doi: 10.1109/GlobalSIP.2013.6736861. URL https://doi.org/10.1109/GlobalSIP.2013.6736861.
- S. Song, O. Thakkar, and A. Thakurta. Characterizing private clipped gradient descent on convex generalized linear problems. arXiv preprint arXiv:2006.06783, 2020.
- M. Talagrand. Upper and Lower Bounds for Stochastic Processes. Springer, 2014.
- R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018. doi: 10.1017/9781108231596.
- M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
- D. Wang and J. Xu. Differentially private empirical risk minimization with smooth non-convex loss functions: A non-stationary view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1182–1189, 2019.
- D. Wang, M. Ye, and J. Xu. Differentially private empirical risk minimization revisited: Faster and more general. In Advances in Neural Information Processing Systems, pages 2722–2731, 2017.
- H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
- (Talagrand, 2014). Typically the results in generic chaining are characterized by the so-called $\gamma_2$ function (see Definition 2). Talagrand (2014) shows that for a process $(X_t)_{t \in T}$ and a given metric space $(T, d)$, if $(X_t)_{t \in T}$ satisfies the increment condition ...
- We first show that the variable $\|M_t - \Sigma_t\|_2$ satisfies the increment condition as stated in equation 5 in Lemma 1. Before we present the proof of Lemma 1, we introduce the Ahlswede-Winter inequality (Horn and Johnson, 2012; Wainwright, 2019), which will be used in the proof of Lemma 1.
