# Private Query Release Assisted by Public Data

ICML, pp. 695-703, 2020.

Abstract:

We study the problem of differentially private query release assisted by access to public data. In this problem, the goal is to answer a large class $\mathcal{H}$ of statistical queries with error no more than $\alpha$ using a combination of public and private samples. The algorithm is required to satisfy differential privacy only with respect to the private samples.


Introduction

- The ability to answer statistical queries on a sensitive data set in a privacy-preserving way is one of the most fundamental primitives in private data analysis.
- In this work the authors study the private and public sample complexities of PAP query-release algorithms, and give upper and lower bounds on both.
- The authors describe a construction of a public-data-assisted private query release algorithm that works for any class with a finite VC-dimension.
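As background for the query-release task described above, a single statistical query (the mean of a bounded predicate over the data set) can be answered privately with the standard Laplace mechanism. A minimal sketch, not taken from the paper (the function name and data are hypothetical):

```python
import numpy as np

def laplace_statistical_query(data, predicate, epsilon, rng=None):
    """Answer the statistical query q(D) = mean of a {0,1}-valued
    predicate over D with epsilon-differential privacy.

    Changing one record moves the mean by at most 1/n, so the query
    has sensitivity 1/n and Laplace noise of scale 1/(n*epsilon)
    suffices.
    """
    rng = rng or np.random.default_rng()
    n = len(data)
    true_answer = sum(predicate(x) for x in data) / n
    noise = rng.laplace(loc=0.0, scale=1.0 / (n * epsilon))
    return true_answer + noise

# Example: privately estimate the fraction of records >= 50
# (hypothetical data; true answer is 0.5, noise scale is 0.01).
data = list(range(100))
answer = laplace_statistical_query(data, lambda x: x >= 50, epsilon=1.0)
```

The paper's challenge is answering a large class of such queries simultaneously, where naive composition of this mechanism would be far too costly.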

Highlights

- The ability to answer statistical queries on a sensitive data set in a privacy-preserving way is one of the most fundamental primitives in private data analysis
- A central question in private query release is concerned with characterizing the private sample complexity, which is the least amount of private samples required to perform this task up to some additive error α
- It was shown that the optimal bound on the private sample complexity in terms of |X |, |H|, and the privacy parameters is attained by the Private Multiplicative Weights (PMW) algorithm due to Hardt and Rothblum [HR10]
- Lower bound on private sample complexity: We show that there is a query class H with small VC dimension for which any PAP algorithm must use either many private samples or many public samples (the decision-stumps bound stated under Results)
- Lower bound on public sample complexity: We show that if the class H has infinite Littlestone dimension, any Public-data-Assisted Private (PAP) query-release algorithm for H must have public sample complexity Ω(1/α)
- We describe a construction of a public-data-assisted private query release algorithm that works for any class with a finite VC-dimension
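The PMW algorithm mentioned above maintains a distribution over the data domain as synthetic data and corrects it multiplicatively only on queries whose noisy error is large. The following is a simplified sketch of that update rule only; privacy accounting, the failure/halting condition, and parameter settings from [HR10] are omitted, and all names are hypothetical:

```python
import numpy as np

def pmw_sketch(domain_size, queries, true_answers, epsilon, T, alpha, rng=None):
    """Simplified multiplicative-weights core of PMW.

    queries[i] is a 0/1 vector of length domain_size giving the i-th
    query's value on each domain element; true_answers[i] is its exact
    answer on the private data. Returns answers computed from the
    synthetic distribution.
    """
    rng = rng or np.random.default_rng()
    weights = np.ones(domain_size) / domain_size  # uniform synthetic distribution
    eta = alpha / 2.0                             # learning rate
    answers = []
    for q, a in zip(queries, true_answers):
        noisy_a = a + rng.laplace(scale=T / epsilon)  # per-round Laplace noise
        synth_a = float(weights @ q)
        if abs(synth_a - noisy_a) <= alpha:
            answers.append(synth_a)  # "lazy" round: synthetic data already agrees
            continue
        # "Update" round: shift weight toward elements the synthetic
        # distribution under-counts (or away from over-counted ones).
        sign = 1.0 if noisy_a > synth_a else -1.0
        weights *= np.exp(eta * sign * np.asarray(q, dtype=float))
        weights /= weights.sum()
        answers.append(float(weights @ q))
    return answers

# Example: two counting queries over a 4-element domain.
queries = [[1, 1, 0, 0], [0, 0, 1, 1]]
answers = pmw_sketch(4, queries, [0.9, 0.1], epsilon=1.0, T=2, alpha=0.1)
```

The point of the lazy/update split is that privacy budget is spent mainly on update rounds, and a potential argument bounds their number.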

Results

- The key idea of the construction is to use the public data to create a finite α-cover Ĥ of the input query class H, and then to run the PMW algorithm on this finite cover together with the representative domain X_Ĥ given by the dual of Ĥ.
- For any PAP algorithm that takes n private samples and m public samples, satisfies (1, 1/4n)-differential privacy, and is (α, α)-accurate for the class of decision stumps S_p, either n = Ω(√p/α) or m = Ω(1/α²).
- Note that the accuracy condition on A implies that n + m > t, by the standard lower bound on the sample complexity of query release that holds even without any privacy constraints.
- The goal of this section is to show a general lower bound on the public sample complexity of PAP query release.
- In the proof of the above theorem, the authors will refer to the following notion of private PAC learning with access to public data that was defined in [ABM19].
- The authors prove the above theorem in two simple steps that follow from prior works: the first step shows that PAP query-release implies PAP learning, and the second step invokes a known lower bound on PAP learning of classes with infinite Littlestone dimension.
- Note that Lemma 17 shows that for any class H, a PAP query-release algorithm for H with public sample complexity m implies the existence of a PAP learner for H with the same public sample complexity.
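The first step of the construction, building a finite cover from public data, can be sketched as follows: queries that agree on the public sample are treated as equivalent and one representative is kept per behavior pattern, so for a VC class Sauer's lemma bounds the number of survivors. This is a simplified illustration with hypothetical names; the paper's construction additionally builds the representative dual domain, which is not shown here:

```python
def finite_cover_via_public_data(query_class, public_sample):
    """Collapse a (possibly infinite) query class to finitely many
    representatives by projecting each query onto the public sample:
    keep one query per distinct behavior pattern.
    """
    cover = {}
    for q in query_class:
        pattern = tuple(q(x) for x in public_sample)
        cover.setdefault(pattern, q)  # keep first query realizing each pattern
    return list(cover.values())

# Example: 101 threshold queries form a large class, but only 4
# behavior patterns survive projection onto 3 public points.
thresholds = [lambda x, t=t: int(x >= t) for t in [i / 10 for i in range(101)]]
public_sample = [0.25, 0.5, 0.75]
cover = finite_cover_via_public_data(thresholds, public_sample)
```

The public sample is what makes the projection meaningful: if it is large enough, queries that agree on it are α-close under the underlying distribution, so the representatives form an α-cover.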

Conclusion

- The authors note that the reduction in [BNS13] is for “proper sanitizers,” which are query-release algorithms that are restricted to output a data set from the input domain rather than any data structure that maps H to [−1, 1].
- As discussed in Remark 9, ignoring computational complexity, any PAP query-release algorithm satisfying Definition 8 can be transformed into a PAP query-release algorithm that outputs a data set from the input domain and has the same accuracy.
- Modulo these details, and since any PAP algorithm can be viewed as a differentially private algorithm operating on the private data set, Lemma 17 follows by invoking the reduction in [BNS13].

Funding

- RB’s research is supported by NSF Awards AF-1908281, SHF-1907715, Google Faculty Research Award, and OSU faculty start-up support
- AC and JU were supported by NSF grants CCF-1718088, CCF-1750640, CNS-1816028, and CNS-1916020
- AN is supported by an Ontario ERA, and an NSERC Discovery Grant RGPIN-2016-06333
- ZSW is supported by a Google Faculty Research Award, a J.P. Morgan Faculty Award, and a Mozilla research grant

Reference

- [ABM19] Noga Alon, Raef Bassily, and Shay Moran. Limits of private learning with access to public data. arXiv:1910.11519 [cs.LG] (appeared at NeurIPS 2019), 2019.
- [ALMM19] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In STOC 2019, pages 852–860 (arXiv preprint arXiv:1806.00949), 2019.
- [Ass83] Patrick Assouad. Densité et dimension. In Annales de l’Institut Fourier, volume 33, pages 233–282, 1983.
- [BDPSS09] Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, 2009.
- [BLR13] Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to noninteractive database privacy. Journal of the ACM (JACM), 60(2):1–25, 2013.
- [BNS13] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. In APPROX-RANDOM, 2013.
- [BNSV15] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Differentially private release and learning of threshold functions. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 634–649, 2015.
- [BTT18] Raef Bassily, Abhradeep Guha Thakurta, and Om Dipakbhai Thakkar. Model-agnostic private learning. In Advances in Neural Information Processing Systems, pages 7102–7112, 2018.
- [BUV18] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. SIAM Journal on Computing, 47(5):1888–1938, 2018.
- [CH11] Kamalika Chaudhuri and Daniel Hsu. Sample complexity bounds for differentially private learning. In Proceedings of the 24th Annual Conference on Learning Theory, pages 155–186, 2011.
- [DLS+17] Aref N. Dajani, Amy D. Lauger, Phyllis E. Singer, Daniel Kifer, Jerome P. Reiter, Ashwin Machanavajjhala, Simson L. Garfinkel, Scot A. Dahl, Matthew Graham, Vishesh Karwa, Hang Kim, Philip Leclerc, Ian M. Schmutte, William N. Sexton, Lars Vilhuber, and John M. Abowd. The modernization of statistical disclosure limitation at the U.S. Census Bureau, 2017. Presented at the September 2017 meeting of the Census Scientific Advisory Committee.
- [DN03] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 202–210, 2003.
- [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
- [DSS+15] Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. Robust traceability from trace amounts. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 650–669. IEEE, 2015.
- [HCB16] Jihun Hamm, Yingjun Cao, and Mikhail Belkin. Learning privately from multiparty data. In International Conference on Machine Learning, pages 555–563, 2016.
- [HR10] Moritz Hardt and Guy N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 61–70. IEEE, 2010.
- [Lit88] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.
- [MN12] Shanmugavelayutham Muthukrishnan and Aleksandar Nikolov. Optimal private halfspace counting via discrepancy. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 1285–1292, 2012.
- [NB19] Anupama Nandi and Raef Bassily. Privately answering classification queries in the agnostic PAC model. arXiv preprint arXiv:1907.13553. To appear in ALT 2020, 2019.
- [PAE+17] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In ICLR, 2017.
- [PSM+18] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. Scalable private learning with PATE. arXiv preprint arXiv:1802.08908, 2018.
- [Sau72] Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972.
- [SU15] Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. arXiv preprint arXiv:1501.06095, 2015.
