Private Post-GAN Boosting

Marcel Neunhoeffer

ICLR 2021.


Abstract:

Differentially private GANs have proven to be a promising approach for generating realistic synthetic data without compromising the privacy of individuals. However, due to the privacy-protective noise introduced in the training, the convergence of GANs becomes even more elusive, which often leads to poor utility in the output generator …
Introduction
  • The vast collection of detailed personal data, including everything from medical history to voting records, to GPS traces, to online behavior, promises to enable researchers from many disciplines to conduct insightful data analyses.
  • The authors' Private PGB method has a natural non-private variant, which the authors show improves the GAN training without privacy constraints.
  • The authors define a relevant quality score function and show that both the Private and Non-Private PGB methods improve the score of the samples generated from a GAN.
Highlights
  • The vast collection of detailed personal data, including everything from medical history to voting records, to GPS traces, to online behavior, promises to enable researchers from many disciplines to conduct insightful data analyses
  • A natural and promising approach to tackle this challenge is to release differentially private synthetic data—a privatized version of the dataset that consists of fake data records and that approximates the real dataset on important statistical properties of interest
  • Unlike much of the prior work that focuses on fine-tuning of network architectures and training techniques, we propose Private post-generative adversarial networks (GANs) boosting (Private PGB)—a differentially private method that boosts the quality of the generated samples after the training of a GAN
  • We empirically evaluate how both the Private and Non-Private PGB methods affect the utility of the generated synthetic data from GANs
  • We show two appealing advantages of our approach: 1) non-private PGB outperforms the last Generator of GANs, and 2) our approach can significantly improve the synthetic examples generated by a GAN under differential privacy
  • To show how the differentially private version of PGB improves the samples generated from GANs that were trained under differential privacy, we first re-run the experiments.
Results
  • The authors show that the Non-Private PGB method can be used to improve the quality of images generated by GANs using the MNIST dataset.
  • The private algorithm for training a GAN performs the same alternating optimization, but it optimizes the discriminator under differential privacy while keeping the generator optimization unchanged.
  • The noisy gradient updates impede convergence of the differentially private GAN training algorithm, and the generator obtained in the final epoch of the training procedure may not yield a good approximation to the data distribution.
  • The payoff function U measures the predictive accuracy of the distinguisher in classifying whether the examples are drawn from the synthetic data player’s distribution φ or the private dataset X.
  • Note that the synthetic data player’s MW update rule does not involve the private dataset and is just a post-processing step of the selected discriminator Dt. The privacy guarantee follows from applying advanced composition to the T runs of the exponential mechanism.
  • The authors can use D to further improve the data distribution φ by the discriminator rejection sampling (DRS) technique of Azadi et al (2019).
  • The architecture of the GAN is the same across all results. To compare the utility of the synthetic datasets with the real data, the authors inspect the visual quality of the results and calculate the proportion of high-quality synthetic examples, similar to Azadi et al (2019), Turner et al (2019), and Srivastava et al (2017).
  • To show how the differentially private version of PGB improves the samples generated from GANs that were trained under differential privacy, the authors first re-run the experiments.
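The PGB game outlined in these bullets can be illustrated with a toy implementation: each round, the exponential mechanism privately selects one of the stored discriminators, and the synthetic data player applies a multiplicative weights (MW) update over its generated candidate samples, which is pure post-processing. The payoff below is a simplified stand-in (the paper's payoff also involves the private data), and all names and parameters are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def exponential_mechanism(scores, eps0, sensitivity=1.0):
    """Select an index with probability proportional to exp(eps0*score/(2*sensitivity))."""
    logits = eps0 * np.asarray(scores) / (2.0 * sensitivity)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(scores), p=probs)

def pgb(discriminators, candidates, T=50, eps0=0.5, eta=0.1):
    """Toy post-GAN boosting loop over stored discriminator snapshots.

    discriminators: callables mapping an array of samples to scores in (0, 1)
    candidates: generated samples forming the synthetic data player's support
    Returns the final MW distribution phi over the candidates.
    """
    n = len(candidates)
    phi = np.full(n, 1.0 / n)  # start uniform over generated samples
    for _ in range(T):
        # Illustrative payoff: how strongly each discriminator rejects the
        # current mixture phi (a stand-in for the paper's payoff function).
        payoffs = [float(np.dot(phi, 1.0 - D(candidates))) for D in discriminators]
        Dt = discriminators[exponential_mechanism(payoffs, eps0)]  # private pick
        # MW update: upweight samples the selected discriminator scores as real.
        phi = phi * np.exp(eta * Dt(candidates))
        phi /= phi.sum()  # post-processing of Dt, no extra privacy cost
    return phi
```

The MW update reads only the selected discriminator's scores on generated samples, which is why it incurs no additional privacy cost beyond the private selection step.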
Conclusion
  • The last Generator of the differentially private GAN achieves a proportion of 0.031 high quality samples.
  • The lines show the RMSE for predicted income for all linear regression models with three independent variables from the set of attributes, trained on the synthetic data generated with Private PGB, as compared to the last GAN generator and other post-processing methods such as DRS.
  • In a final set of experiments the authors evaluate the performance of machine learning models trained on synthetic data and tested on real out-of-sample data.
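The RMSE comparison described above can be reproduced schematically: fit every three-variable linear regression on the synthetic data, then score each model on held-out real data. This is a generic numpy sketch under assumed data shapes; the paper's exact attribute set and model specification are not reproduced here:

```python
import numpy as np
from itertools import combinations

def rmse(y_true, y_pred):
    """Root mean squared error of predictions against real outcomes."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def regressions_on_synthetic(X_syn, y_syn, X_real, y_real, k=3):
    """Fit every k-variable OLS model on synthetic data; report RMSE on real data."""
    out = {}
    for cols in combinations(range(X_syn.shape[1]), k):
        # Train on synthetic records only (with an intercept column).
        A_syn = np.column_stack([np.ones(len(X_syn)), X_syn[:, cols]])
        beta, *_ = np.linalg.lstsq(A_syn, y_syn, rcond=None)
        # Evaluate on the real held-out records.
        A_real = np.column_stack([np.ones(len(X_real)), X_real[:, cols]])
        out[cols] = rmse(y_real, A_real @ beta)
    return out
```

With p candidate attributes this evaluates all C(p, k) regressions, matching the "all models with three independent variables" protocol.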
Tables
  • Table1: Predicting Titanic Survivors with Machine Learning Models trained on synthetic data and tested on real out-of-sample data. Median scores of 20 repetitions with independently generated synthetic data. With differential privacy, ε is 2 and δ is
  • Table2: Predicting Titanic Survivors with Machine Learning Models trained on real data and tested on real out-of-sample data
Related work
  • Our PGB method can be viewed as a modular boosting method that can improve on a growing line of work on differentially private GANs (Beaulieu-Jones et al, 2019; Xie et al, 2018; Frigerio et al, 2019; Torkzadehmahani et al, 2020). To obtain formal privacy guarantees, these algorithms optimize the discriminators in a GAN under differential privacy, using private SGD, RMSprop, or Adam, and track the privacy cost with the moments accountant (Abadi et al, 2016; Mironov, 2017). Yoon et al (2019) give a private GAN training method by adapting ideas from the PATE framework (Papernot et al, 2018).

    Our PGB method is inspired by the Private Multiplicative Weights method (Hardt & Rothblum, 2010) and its more practical variant MWEM (Hardt et al, 2012), which answer a large collection of statistical queries by releasing a synthetic dataset. Our work also draws upon two recent techniques (Turner et al (2019) and Azadi et al (2019)) that use the discriminator as a rejection sampler to improve the generator distribution. We apply their technique by using the mixture discriminator computed in PGB as the rejection sampler. There has also been work that applies the idea of boosting to (non-private) GANs. For example, Arora et al (2017) and Hoang et al (2018) propose methods that directly train a mixture of generators and discriminators, and Tolstikhin et al (2017) propose AdaGAN, which reweights the real examples during training similarly to what is done in AdaBoost (Freund & Schapire, 1997). These methods may be hard to make differentially private: they either require substantially more privacy budget to train a collection of discriminators or increase the weights on a subset of examples, which requires adding more noise when computing private gradients. In contrast, our PGB method boosts the generated samples after training and does not modify the GAN training procedure.
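As a rough illustration of the rejection-sampling step borrowed from Azadi et al. (2019), the sketch below keeps a generated sample with probability proportional to the density ratio implied by a discriminator assumed to output calibrated probabilities; in Private PGB the mixture discriminator plays this role. Names and the batch-maximum normalization are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator_rejection_sampling(samples, D, eps=1e-8):
    """DRS sketch (after Azadi et al., 2019): keep each generated sample with
    probability proportional to the estimated density ratio D(x) / (1 - D(x))."""
    d = np.clip(D(samples), eps, 1 - eps)
    ratio = d / (1.0 - d)               # estimated p_data(x) / p_gen(x)
    accept_prob = ratio / ratio.max()   # normalize by the batch maximum
    keep = rng.random(len(samples)) < accept_prob
    return samples[keep]
```

Samples the discriminator scores as more "real" survive more often, shifting the kept distribution toward the data distribution at the cost of discarding some draws.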
Funding
  • We show two appealing advantages of our approach: 1) non-private PGB outperforms the last Generator of GANs, and 2) our approach can significantly improve the synthetic examples generated by a GAN under differential privacy
  • To evaluate the quality of the generated images we use a metric that is based on the Inception score (IS) (Salimans et al, 2016), where instead of the Inception Net we use a MNIST Classifier that achieves 99.65% test accuracy
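The MNIST-classifier variant of the Inception score mentioned above can be computed from the classifier's predictive distributions alone. The function name is illustrative and the class probabilities are assumed given; the formula is the standard IS = exp(E_x KL(p(y|x) || p(y))):

```python
import numpy as np

def inception_style_score(class_probs):
    """Inception-score variant: class_probs is an (n_samples, n_classes) array
    of a classifier's predictive distributions (here a MNIST classifier stands
    in for the Inception Net). Returns exp(mean KL(p(y|x) || p(y)))."""
    p = np.clip(class_probs, 1e-12, 1.0)
    p_y = p.mean(axis=0)  # marginal label distribution over all samples
    kl = (p * (np.log(p) - np.log(p_y))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

The score is 1 when every sample gets the same predictive distribution, and reaches the number of classes (10 for MNIST, matching the stated theoretical best) when predictions are one-hot and the labels are uniformly covered.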
Study subjects and analysis
observations: 1000
The left column in Figure 1 displays the training data. Each of the 25 clusters consists of 1,000 observations. The architecture of the GAN is the same across all results.

samples: 5000
The theoretical best score of the MNIST IS is 10, and the real test images achieve a score of 9.93. Without privacy the last GAN Generator achieves a score of 8.41, using DRS on the last Generator slightly decreases the score to 8.21, samples with PGB achieve a score of 8.76, samples with the combination of PGB and DRS achieve a similar score of 8.77 (all inception scores are calculated on 5,000 samples). Uncurated samples for all methods are included in the Appendix

observations: 39660
For 1940 we synthesize an excerpt of the 1% sample of all Californians who were at least 18 years old. Our training sample consists of 39,660 observations and 8 attributes (sex, age, educational attainment, income, race, Hispanic origin, marital status and county). The test set contains another 9,915 observations.

observations: 9915
Our training sample consists of 39,660 observations and 8 attributes (sex, age, educational attainment, income, race, Hispanic origin, marital status and county). The test set contains another 9,915 observations. Our final value of ε is 1 and δ is

observations of Titanic passengers: 891
In a final set of experiments we evaluate the performance of machine learning models trained on synthetic data (with and without privacy) and tested on real out-of-sample data. We synthesize the Kaggle Titanic training set (891 observations of Titanic passengers on 8 attributes) and train three machine learning models (Logistic Regression, Random Forests (RF) (Breiman, 2001) and XGBoost (Chen & Guestrin, 2016)) on the synthetic datasets to predict whether someone survived the Titanic catastrophe. We then evaluate the performance on the test set with 418 observations.

observations: 418
We synthesize the Kaggle Titanic training set (891 observations of Titanic passengers on 8 attributes) and train three machine learning models (Logistic Regression, Random Forests (RF) (Breiman, 2001) and XGBoost (Chen & Guestrin, 2016)) on the synthetic datasets to predict whether someone survived the Titanic catastrophe. We then evaluate the performance on the test set with 418 observations. To address missing values in both the training set and the test set we independently impute values using MissForest (Stekhoven & Buhlmann, 2012).
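The train-on-synthetic, test-on-real protocol can be sketched as below. To stay self-contained this uses a hand-rolled gradient-descent logistic regression rather than the sklearn, Random Forest, and XGBoost models used in the paper; all names, learning rates, and step counts are illustrative assumptions:

```python
import numpy as np

def fit_logreg(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression, trained on synthetic data only."""
    A = np.column_stack([np.ones(len(X)), X])  # add intercept column
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-A @ w))
        w -= lr * A.T @ (p - y) / len(y)       # average log-loss gradient
    return w

def accuracy_on_real(w, X_real, y_real):
    """Score the synthetic-data model on real out-of-sample records."""
    A = np.column_stack([np.ones(len(X_real)), X_real])
    pred = (A @ w > 0).astype(int)
    return float((pred == y_real).mean())
```

If the synthetic data preserves the relationship between features and survival, a model that never saw real records should still predict well on the real test set.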

samples: 5000
We keep track of the values of ε and δ by using the moments accountant (Abadi et al, 2016; Mironov, 2017). All inception scores are calculated on 5,000 samples. Further experiments using data from the 2010 American Census can be found in the appendix. A 1% sample means that the micro data contains 1% of the total American (here Californian) population. The Titanic data is available at https://www.kaggle.com/c/titanic/data. For the private synthetic data our final value of ε is 2 and δ is
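The paper tracks (ε, δ) with the moments accountant. As a simpler reference point, the advanced composition bound mentioned for the T exponential-mechanism selections can be computed directly. This is the generic bound for T adaptive ε₀-DP steps (Dwork et al.), not the authors' accountant:

```python
import math

def advanced_composition(eps0, T, delta_prime):
    """Advanced composition: total epsilon for T adaptive eps0-DP mechanisms,
    holding with an extra failure probability delta_prime added to delta."""
    return (eps0 * math.sqrt(2 * T * math.log(1 / delta_prime))
            + T * eps0 * (math.exp(eps0) - 1))
```

For small ε₀ and large T this grows roughly like ε₀·√T rather than the ε₀·T of basic composition, which is why many cheap private selections remain affordable.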

Reference
  • Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pp. 308–318, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4139-4. doi: 10.1145/2976749.2978318. URL http://doi.acm.org/10.1145/2976749.2978318.
  • John M. Abowd. The U.S. census bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pp. 2867, 2018. doi: 10.1145/3219819.3226070. URL https://doi.org/10.1145/3219819.3226070.
  • Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017. URL http://arxiv.org/abs/1701.07875.
  • Christian Arnold and Marcel Neunhoeffer. Really useful synthetic data–a framework to evaluate the quality of differentially private synthetic data. arXiv preprint arXiv:2004.07740, 2020.
  • Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a metaalgorithm and applications. Theory of Computing, 8(6):121–164, 2012. doi: 10.4086/toc.2012. v008a006. URL http://www.theoryofcomputing.org/articles/v008a006.
  • Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 224–232, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/arora17a.html.
  • Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian J. Goodfellow, and Augustus Odena. Discriminator rejection sampling. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. URL https://openreview.net/forum?id=S1GkToR5tm.
  • Brett K. Beaulieu-Jones, Zhiwei Steven Wu, Chris Williams, Ran Lee, Sanjeev P. Bhavnani, James Brian Byrd, and Casey S. Greene. Privacy-preserving generative deep neural networks support clinical data sharing. Circulation: Cardiovascular Quality and Outcomes, 12(7):e005122, 2019. doi: 10.1161/CIRCOUTCOMES.118.005122.
  • Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • Kamalika Chaudhuri and Staal A Vinterbo. A stability-based validation procedure for differentially private machine learning. In Advances in Neural Information Processing Systems, pp. 2652–2660, 2013.
  • Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.
  • Differential Privacy Team, Apple. Learning with privacy at scale. https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf, December 2017.
  • Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. In Advances in Neural Information Processing Systems 30, NIPS ’17, pp. 3571–3580. Curran Associates, Inc., 2017.
  • Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography Conference, volume 3876, pp. 265–284, 2006.
  • Ulfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM Conference on Computer and Communications Security, CCS ’14, pp. 1054–1067, New York, NY, USA, 2014. ACM.
  • Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119 – 139, 1997. ISSN 0022-0000. doi: https://doi.org/10.1006/jcss.1997.1504. URL http://www.sciencedirect.com/science/article/pii/S002200009791504X.
  • Lorenzo Frigerio, Anderson Santana de Oliveira, Laurent Gomez, and Patrick Duverger. Differentially private generative adversarial networks for time series, continuous, and discrete open data. In ICT Systems Security and Privacy Protection - 34th IFIP TC 11 International Conference, SEC 2019, Lisbon, Portugal, June 25-27, 2019, Proceedings, pp. 151–164, 2019. doi: 10.1007/978-3-030-22312-0_11. URL https://doi.org/10.1007/978-3-030-22312-0_11.
  • Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pp. 2672–2680, Cambridge, MA, USA, 2014. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969033.2969125.
  • Moritz Hardt and Guy N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In 51th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23-26, 2010, Las Vegas, Nevada, USA, pp. 61–70, 2010. doi: 10.1109/FOCS.2010.85. URL https://doi.org/10.1109/FOCS.2010.85.
  • Moritz Hardt, Katrina Ligett, and Frank McSherry. A simple and practical algorithm for differentially private data release. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pp. 2348–2356, 2012.
  • Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung. MGAN: Training generative adversarial nets with multiple generators. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkmu5b0a-.
  • Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
  • Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1 – 63, 1997. ISSN 0890-5401. doi: https://doi.org/10.1006/inco.1996.2612. URL http://www.sciencedirect.com/science/article/pii/S0890540196926127.
  • Jingcheng Liu and Kunal Talwar. Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 298–309, 2019.
  • Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
  • Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’07, pp. 94–103, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-3010-9. doi: 10.1109/FOCS.2007.41. URL http://dx.doi.org/10.1109/FOCS.2007.41.
  • Ilya Mironov. Rényi differential privacy. In 30th IEEE Computer Security Foundations Symposium, CSF 2017, Santa Barbara, CA, USA, August 21-25, 2017, pp. 263–275, 2017. doi: 10.1109/CSF.2017.11. URL https://doi.org/10.1109/CSF.2017.11.
  • Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Ulfar Erlingsson. Scalable private learning with PATE. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=rkZB1XbRZ.
  • Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas, and Matthew Sobek. IPUMS USA: Version 9.0 [dataset]. Minneapolis, MN: IPUMS, 10:D010, 2019. doi: 10.18128/D010.V9.0.
  • Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242, 2016.
  • Joshua Snoke, Gillian M. Raab, Beata Nowok, Chris Dibben, and Aleksandra Slavkovic. General and specific utility measures for synthetic data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(3):663–688, 2018. doi: 10.1111/rssa.12358. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssa.12358.
  • Proof. We will use the seminal result of Freund & Schapire (1997), which shows that if the two players have low regret in the dynamics, then their average plays form an approximate equilibrium. First, we bound the regret of the data player. The regret guarantee of the multiplicative weights algorithm (see, e.g., Theorem 2.3 of Arora et al. (2012)) gives
  • Next, we bound the regret of the distinguisher using the accuracy guarantee of the exponential mechanism (McSherry & Talwar, 2007). For each t, we know that with probability 1 − β/T, max_j U(φt, Dj) …
  • Then, following the result of Freund & Schapire (1997), their average plays (D̄, φ̄) form an α-approximate equilibrium, where α is determined by the log |B| and 2 log(NT/β) terms above.
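Schematically, the multiplicative weights regret bound invoked in this step takes the following form for a minimizing player with N actions (constants per Arora et al. (2012), Theorem 2.3; the paper's exact constants and direction of play may differ):

```latex
\frac{1}{T}\sum_{t=1}^{T} U(\phi_t, D_t)
  \;\le\; \min_{\phi}\,\frac{1}{T}\sum_{t=1}^{T} U(\phi, D_t)
  \;+\; 2\sqrt{\frac{\ln N}{T}}
```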
  • The generator and discriminator are neural nets with two fully connected hidden layers (Discriminator: 128, 256; Generator: 512, 256) with Leaky ReLU activations. The latent noise vector Z is of dimension 2 and independently sampled from a Gaussian distribution with mean 0 and standard deviation 1. For GAN training we use the KL-WGAN loss (Song & Ermon, 2020). Before passing the Discriminator scores to PGB we transform them to probabilities using a sigmoid activation.
  • C.2 GAN ARCHITECTURE FOR THE 1940 AMERICAN CENSUS DATA.
  • The GAN networks consist of two fully connected hidden layers (256, 128) with Leaky ReLU activation functions. To sample from categorical attributes we apply the Gumbel-Softmax trick (Maddison et al., 2016; Jang et al., 2016) to the output layer of the Generator. We run our PGB algorithm over the last 150 stored Generators and Discriminators and train it for T = 400 update steps.
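The Gumbel-Softmax trick applied at the Generator's output layer can be sketched in a few lines. The temperature and shapes below are illustrative; in the GAN the logits come from the Generator and the operation is used because it is differentiable, unlike hard categorical sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=0.5):
    """Gumbel-Softmax trick (Maddison et al., 2016; Jang et al., 2016):
    differentiable approximate sampling from a categorical distribution."""
    g = -np.log(-np.log(rng.random(logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))   # stable softmax
    return y / y.sum(axis=-1, keepdims=True)
```

As the temperature tau goes to 0 the output approaches a one-hot sample; larger tau yields smoother, more uniform relaxations.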
  • D PRIVATE SYNTHETIC 2010 AMERICAN DECENNIAL CENSUS SAMPLES.
    Google ScholarFindings
  • We conducted further experiments on more recent Census files. The 2010 data is similar to the data that the American Census is collecting for the 2020 decennial Census. For this experiment, we synthesize a 10% sample for California with 3,723,669 observations of 5 attributes (gender, age, Hispanic origin, race and puma district membership). Our final value of ε is 0.795 and δ is
  • To assess specific utility we compute three-way marginals (11 answer categories in the 2010 Census) by Hispanic origin (25 answer categories in the 2010 Census) by gender (2 answer categories in the 2010 Census), giving us a total of 550 cells, and calculate the average accuracy across all 550 cells. Compared to the true data, DP GAN achieves 99.82%, DP DRS 99.89%, and DP PGB 99.89%.
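The three-way marginal accuracy check can be sketched as follows. The category counts (11 × 25 × 2 = 550 cells) follow the description above, while the function name, column order, and cell-wise accuracy definition (one minus the absolute error in each cell's proportion) are assumptions for illustration:

```python
import numpy as np

def threeway_marginal_accuracy(real, synth, sizes=(11, 25, 2)):
    """Compare a three-way marginal between real and synthetic records and
    report average cell-wise accuracy over all prod(sizes) cells.

    real, synth: integer-coded arrays of shape (n, 3), column j in range(sizes[j])
    """
    def marginal(data):
        counts = np.zeros(sizes)
        for row in data:
            counts[tuple(row)] += 1
        return counts / len(data)  # cell proportions
    m_real, m_synth = marginal(real), marginal(synth)
    # per-cell accuracy = 1 - absolute error in the cell's proportion
    return float(np.mean(1.0 - np.abs(m_real - m_synth)))
```

Identical datasets score exactly 1.0, and because each of the 550 cells holds only a small share of the records, even moderately good synthetic data scores in the high 0.99s, consistent with the reported numbers.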