Understanding Gradient Clipping in Private SGD: A Geometric Perspective

NeurIPS 2020.


Abstract:

Deep learning models are increasingly popular in many machine learning applications where the training data may contain sensitive information. To provide a formal and rigorous privacy guarantee, many learning systems now incorporate differential privacy by training their models with (differentially) private SGD. A key step in each private SGD update is gradient clipping, which shrinks the gradient of an individual example whenever its ℓ2 norm exceeds a given threshold. […]

Introduction
  • Many modern applications of machine learning rely on datasets that may contain sensitive personal information, including medical records, browsing history, and geographic locations.
  • To bound the ℓ2-sensitivity, existing theoretical analyses typically assume that the loss function is L-Lipschitz in the model parameters and that the constant L is known to the algorithm designer for setting the noise rate [Bassily et al., 2014; Wang and Xu, 2019]; in practice, private SGD instead bounds the sensitivity by clipping per-example gradients (see the sketch after this list).
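The following is a minimal NumPy sketch of one private SGD update of this kind, in the spirit of Abadi et al. [2016b]: each per-example gradient is clipped in ℓ2 norm and Gaussian noise is added before the step. The function name and the parameters `clip_norm` and `noise_multiplier` are illustrative choices, not the paper's notation.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.0, rng=None):
    """One DP-SGD update. params: (d,) array; per_example_grads: (batch, d) array."""
    rng = np.random.default_rng() if rng is None else rng
    batch_size, d = per_example_grads.shape
    # Clip each example's gradient so its l2 norm is at most clip_norm;
    # this bounds the l2-sensitivity of the summed gradient to clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    # Add isotropic Gaussian noise calibrated to the clipping threshold,
    # then average over the batch and take a gradient step.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=d)
    noisy_grad = (clipped.sum(axis=0) + noise) / batch_size
    return params - lr * noisy_grad
```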
Highlights
  • Many modern applications of machine learning rely on datasets that may contain sensitive personal information, including medical records, browsing history, and geographic locations
  • To protect the private information of individual citizens, many machine learning systems train their models subject to the constraint of differential privacy [Dwork et al., 2006], which informally requires that no individual training example has a significant influence on the trained model
  • We provide a theoretical analysis of the effect of gradient clipping in SGD and private SGD
  • We provide a new way to quantify the clipping bias by coupling the gradient distribution with a geometrically symmetric distribution
  • Combined with our empirical evaluation showing that the gradient distribution of private SGD exhibits an approximately symmetric structure along the optimization trajectory, these results provide an explanation for why gradient clipping works in practice
  • We provide a perturbation-based technique to reduce the clipping bias even for adversarial instances (a sketch of such a pre-clipping perturbation follows this list)
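As a rough illustration of the perturbation idea, the sketch below adds Gaussian noise to each per-example gradient before clipping and averages over a few independent draws. This is only a hedged sketch of the general mechanism; the noise scale `perturb_std` and the number of draws `n_draws` are assumptions, not the calibration used in the paper.

```python
import numpy as np

def perturbed_clip(per_example_grads, clip_norm=1.0, perturb_std=0.1,
                   n_draws=4, rng=None):
    """Perturb each per-example gradient before clipping, averaged over n_draws."""
    rng = np.random.default_rng() if rng is None else rng
    batch, d = per_example_grads.shape
    acc = np.zeros((batch, d))
    for _ in range(n_draws):
        # Gaussian perturbation applied before the clipping operation.
        noisy = per_example_grads + rng.normal(0.0, perturb_std, size=(batch, d))
        norms = np.linalg.norm(noisy, axis=1, keepdims=True)
        acc += noisy * np.minimum(1.0, clip_norm / (norms + 1e-12))
    # Average of clipped, perturbed gradients; these replace the plain
    # clipped gradients in the private update.
    return acc / n_draws
```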
Methods
  • The authors investigate whether the gradient distributions of DP-SGD are approximately symmetric in practice (a simple symmetry diagnostic is sketched after this list).
  • For MNIST, the authors train a CNN with two convolutional layers with 16 4×4 kernels, followed by a fully connected layer with 32 nodes.
  • For CIFAR-10, the authors train a CNN with two convolutional layers with 2×2 max pooling of stride 2, followed by a fully connected layer, all using ReLU activations; each layer uses a dropout rate of 0.5.
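A simple way to probe this kind of approximate symmetry, sketched below, is to project the per-example gradients onto a reference direction and check that the projections are not heavily skewed. This diagnostic is an illustration only, not the paper's evaluation protocol; using the mean gradient as the reference direction is an assumption.

```python
import numpy as np

def projection_skewness(per_example_grads):
    """Skewness of per-example gradient projections along the mean-gradient direction."""
    mean_grad = per_example_grads.mean(axis=0)
    direction = mean_grad / (np.linalg.norm(mean_grad) + 1e-12)
    proj = per_example_grads @ direction          # scalar projection per example
    centered = proj - proj.mean()
    # A value near zero is consistent with (but does not prove) a
    # symmetric distribution along this direction.
    return float(np.mean(centered ** 3) / (np.std(centered) ** 3 + 1e-12))
```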
Conclusion
  • The authors provide a theoretical analysis of the effect of gradient clipping in SGD and private SGD.
  • The authors provide a new way to quantify the clipping bias by coupling the gradient distribution with a geometrically symmetric distribution.
  • Combined with the empirical evaluation showing that the gradient distribution of private SGD exhibits an approximately symmetric structure along the optimization trajectory, these results provide an explanation for why gradient clipping works in practice.
  • The authors provide a perturbation-based technique to reduce the clipping bias even for adversarial instances.
Tables
  • Table 1: Scalability of E_{ξt=0, ζt}[⟨∇f(xt), gt⟩] with respect to d and k; columns range over d ∈ {1, 10, 100, 1,000, 10,000} and rows over values of k starting at k = 1 (cell values omitted).
Related work
  • The divergence caused by the clipping bias was also studied in prior work. In Pichapati et al. [2019], an adaptive gradient clipping method is analyzed and the divergence is characterized by a bias depending on the difference between the clipped and unclipped gradients. However, they study a different variant of clipping that bounds the ℓ∞ norm of the gradient instead of the ℓ2 norm; the latter, which we study in this paper, is the more commonly used clipping operation [Abadi et al., 2016a,b]. In Zhang et al. [2019], the divergence is characterized by a bias depending on the clipping probability. These results suggest that both the clipping probability and the bias are inversely related to the clipping threshold: a small clipping threshold results in a large bias in the gradient estimate, which can potentially lead to worse training and generalization performance. Thakkar et al. [2019] provide another adaptive gradient clipping heuristic that sets the threshold based on a privately estimated quantile, which can be viewed as minimizing the clipping probability (a small sketch of the threshold/clipping-probability relationship follows below).
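To make the threshold/clipping-probability relationship concrete, the sketch below estimates the empirical clipping probability from a batch of per-example gradient norms and, loosely in the spirit of Thakkar et al. [2019] but without their private estimation step, chooses the threshold as a norm quantile so that roughly a target fraction of examples are clipped. The function names and the `target_clip_prob` parameter are illustrative assumptions.

```python
import numpy as np

def clipping_probability(grad_norms, clip_norm):
    """Fraction of per-example gradient norms that exceed the clipping threshold."""
    return float(np.mean(np.asarray(grad_norms) > clip_norm))

def quantile_threshold(grad_norms, target_clip_prob=0.5):
    """Threshold such that roughly target_clip_prob of the examples get clipped."""
    return float(np.quantile(np.asarray(grad_norms), 1.0 - target_clip_prob))
```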
References
  • Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016a. URL http://arxiv.org/abs/1603.04467.
  • Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016b.
  • Raef Bassily, Adam D. Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 55th IEEE Annual Symposium on Foundations of Computer Science (FOCS 2014), pages 464–473, 2014. doi: 10.1109/FOCS.2014.56. URL https://doi.org/10.1109/FOCS.2014.56.
  • Brett K. Beaulieu-Jones, Zhiwei Steven Wu, Chris Williams, Ran Lee, Sanjeev P. Bhavnani, James Brian Byrd, and Casey S. Greene. Privacy-preserving generative deep neural networks support clinical data sharing. Circulation: Cardiovascular Quality and Outcomes, 12(7):e005122, 2019. doi: 10.1161/CIRCOUTCOMES.118.005122. URL https://www.ahajournals.org/doi/abs/10.1161/CIRCOUTCOMES.118.005122.
  • Zhiqi Bu, Jinshuo Dong, Qi Long, and Weijie J Su. Deep learning with Gaussian differential privacy. arXiv preprint arXiv:1911.11607, 2019.
  • Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. CoRR, abs/1812.04754, 2018. URL http://arxiv.org/abs/1812.04754.
  • Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  • Xinyan Li, Qilong Gu, Yingxue Zhou, Tiancong Chen, and Arindam Banerjee. Hessian based analysis of SGD for deep nets: Dynamics and generalization. In Proceedings of the 2020 SIAM International Conference on Data Mining (SDM 2020), pages 190–198. SIAM, 2020. doi: 10.1137/1.9781611976236.22. URL https://doi.org/10.1137/1.9781611976236.22.
  • Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X Yu, Sashank J Reddi, and Sanjiv Kumar. AdaClip: Adaptive clipping for private SGD. arXiv preprint arXiv:1908.07643, 2019.
  • Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pages 245–248. IEEE, 2013.
  • Om Thakkar, Galen Andrew, and H. Brendan McMahan. Differentially private learning with adaptive clipping. CoRR, abs/1905.03871, 2019. URL http://arxiv.org/abs/1905.03871.
  • Di Wang and Jinhui Xu. Differentially private empirical risk minimization with smooth nonconvex loss functions: A non-stationary view. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pages 1182–1189. AAAI Press, 2019. doi: 10.1609/aaai.v33i01.33011182. URL https://doi.org/10.1609/aaai.v33i01.33011182.
  • Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. Subsampled Rényi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1226–1235, 2019.
  • Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2019.
  • Yuqing Zhu and Yu-Xiang Wang. Poisson subsampled Rényi differential privacy. In International Conference on Machine Learning, pages 7634–7642, 2019.