Privacy preserving mining of association rules

Inf. Syst., no. 4 (2004): 343-364


Abstract

We present a framework for mining association rules from transactions consisting of categorical items where the data has been randomized to preserve privacy of individual transactions. While it is feasible to recover association rules and preserve privacy using a straightforward "uniform" randomization, the discovered rules can unfortunat...

Introduction
  • The explosive progress in networking, storage, and processor technologies is resulting in an unprecedented amount of digitization of information.
  • It is estimated that the amount of information in the world is doubling every 20 months [20].
  • In concert with this dramatic and escalating increase in digital data, concerns about privacy of personal information have emerged globally [15, 17, 20, 24].
  • Privacy issues are further exacerbated now that the internet makes it easy for new data to be automatically collected and added to databases [10, 13, 14, 27, 28, 29].
Highlights
  • The explosive progress in networking, storage, and processor technologies is resulting in an unprecedented amount of digitization of information.
  • (SIGKDD '02, Edmonton, Alberta, Canada.) Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? Specifically, they studied the technical feasibility of building accurate classification models using training data in which the sensitive numeric values in a user's record have been randomized so that the true values cannot be estimated with sufficient precision.
  • The following are the important results contained in this paper: In Section 2, we show that a straightforward uniform randomization leads to privacy breaches.
  • We present experimental results on two real datasets in Section 5, as well as graphs showing the relationship between discoverability, privacy, and data characteristics.
  • We presented experimental results that validated the algorithm in practice by applying it to two real datasets from different domains.
  • Our approach deals with a restricted, albeit important, class of privacy breaches; can we extend it to cover other kinds of breaches? Second, what are the theoretical limits on discoverability for a given level of privacy, and vice versa? Finally, can we combine randomization and cryptographic protocols to get the strengths of both without the weaknesses of either?
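
The claim that a straightforward uniform randomization leads to privacy breaches can be illustrated with a small simulation. This is only a sketch under assumptions that this page does not state: the randomization operator (keep each item with probability `keep_prob`, otherwise replace it with a uniformly random item), the synthetic data, and all function and parameter names are hypothetical and are not taken from the paper.

```python
import random

def uniform_randomize(transaction, universe_size, keep_prob, rng):
    """Assumed operator: keep each item with probability keep_prob,
    otherwise replace it with a uniformly random item."""
    return {
        item if rng.random() < keep_prob else rng.randrange(universe_size)
        for item in transaction
    }

def breach_probability(originals, randomized, itemset):
    """Among randomized transactions that contain `itemset`, return the
    fraction whose original transaction also contained it."""
    hits = [orig for orig, rand in zip(originals, randomized) if itemset <= rand]
    if not hits:
        return None  # the itemset never survived randomization
    return sum(itemset <= orig for orig in hits) / len(hits)

def simulate(num_transactions=20000, universe_size=1000, size=5,
             itemset=frozenset({0, 1, 2}), support=0.05,
             keep_prob=0.5, seed=0):
    """Build synthetic transactions (hypothetical data), randomize them,
    and measure how revealing a surviving itemset is."""
    rng = random.Random(seed)
    originals = []
    for _ in range(num_transactions):
        # A `support` fraction of transactions contains the target itemset.
        t = set(itemset) if rng.random() < support else set()
        while len(t) < size:
            t.add(rng.randrange(universe_size))
        originals.append(t)
    randomized = [uniform_randomize(t, universe_size, keep_prob, rng)
                  for t in originals]
    return breach_probability(originals, randomized, itemset)

if __name__ == "__main__":
    print(simulate())
```

Under these assumed parameters the printed fraction should come out close to 1. Random replacements almost never assemble the three-item set by chance, so seeing it in a randomized transaction strongly suggests it was present in the original, even though half of all items were replaced.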
Results
  • Before turning to the experiments with datasets, the authors first show in Section 5.1 how the ability to recover supports depends on the permitted breach level, as well as on other data characteristics.
  • The authors report the results for both datasets at a minimum support that is close to the lowest discoverable support, in order to show the resilience of the algorithm even at these very low support levels.
  • Given the values of maximum supports, the authors used the methodology from Section 4.4 to find the lowest randomization level such that the breach probability for each itemset size is still below the desired breach level.
Conclusion
  • The authors have presented three key contributions toward mining association rules while preserving privacy.
  • The authors gave a sound mathematical treatment for a class of randomization algorithms, derived formulae for support and variance prediction, and showed how to incorporate these formulae into mining algorithms.
  • The authors presented experimental results that validated the algorithm in practice by applying it to two real datasets from different domains.
  • The authors' approach deals with a restricted, albeit important, class of privacy breaches; can it be extended to cover other kinds of breaches? Second, what are the theoretical limits on discoverability for a given level of privacy, and vice versa? Finally, can randomization and cryptographic protocols be combined to get the strengths of both without the weaknesses of either?
Tables
  • Table 1: Results on real datasets (mailorder)
  • Table 2: Analysis of false drops (mailorder)
  • Table 3: Analysis of false positives (soccer)
  • Table 4: Actual privacy breaches
Related Work
  • There has been extensive research in the area of statistical databases, motivated by the desire to provide statistical information (sum, count, average, maximum, minimum, pth percentile, etc.) without compromising sensitive information about individuals (see surveys in [1, 22]). The proposed techniques can be broadly classified into query restriction and data perturbation. The query restriction family includes restricting the size of the query result, controlling the overlap amongst successive queries, keeping an audit trail of all answered queries and constantly checking for possible compromise, suppression of data cells of small size, and clustering entities into mutually exclusive atomic populations. The perturbation family includes swapping values between records, replacing the original database by a sample from the same distribution, adding noise to the values in the database, adding noise to the results of a query, and sampling the result of a query. There are negative results showing that the proposed techniques cannot satisfy the conflicting objectives of providing high-quality statistics and at the same time preventing exact or partial disclosure of individual information [1].

    The most relevant work from the statistical database literature is the work by Warner [26], where he developed the "randomized response" method for survey results. The method deals with a single boolean attribute (e.g., drug addiction). The value of the attribute is retained with probability p and flipped with probability 1 − p. Warner then derived equations for estimating the true value of queries such as COUNT(Age = 42 & Drug Addiction = Yes). The approach we present in Section 2 can be viewed as a generalization of Warner's idea.
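
The randomized response idea described above can be sketched in a few lines. The estimator follows directly from the retain/flip probabilities stated in the text: if `true` of `n` records hold the attribute, the expected number of observed "yes" values is p·true + (1 − p)·(n − true), which can be solved for `true`. The function names and the simulated data are illustrative assumptions, not code or notation from Warner [26] or from the paper.

```python
import random

def randomize_bits(bits, p, rng):
    """Retain each boolean value with probability p, flip it with probability 1 - p."""
    return [b if rng.random() < p else 1 - b for b in bits]

def estimate_true_count(randomized_bits, p):
    """Estimate how many original bits were 1, given only the randomized bits.

    Uses E[observed] = p * true + (1 - p) * (n - true), solved for true.
    """
    n = len(randomized_bits)
    if abs(2 * p - 1) < 1e-12:
        raise ValueError("p = 0.5 carries no information; the count cannot be recovered")
    observed = sum(randomized_bits)
    return (observed - (1 - p) * n) / (2 * p - 1)

if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical survey: 30,000 of 100,000 respondents truly answer "yes".
    truth = [1] * 30000 + [0] * 70000
    noisy = randomize_bits(truth, p=0.8, rng=rng)
    print(estimate_true_count(noisy, p=0.8))
```

For a conjunctive query such as COUNT(Age = 42 & Drug Addiction = Yes), one plausible use is to apply the same estimator to the randomized bits of only those records with Age = 42, assuming Age itself is stored unrandomized. The estimate gets noisier as p approaches 0.5, which is the trade-off between privacy and accuracy.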
References
  • [1] N. R. Adam and J. C. Wortman. Security-control methods for statistical databases. ACM Computing Surveys, 21(4):515–556, Dec. 1989.
  • [2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proc. of the 20th ACM Symposium on Principles of Database Systems, pages 247–255, Santa Barbara, California, May 2001.
  • [3] R. Agrawal. Data Mining: Crossing the Chasm. In 5th Int'l Conference on Knowledge Discovery in Databases and Data Mining, San Diego, California, August 1999. Available from http:www.almaden.ibm.com cs quest.papers kdd99 chasm.ppt
  • [4] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207–216, Washington, D.C., May 1993.
  • [5] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast Discovery of Association Rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 12, pages 307–328. AAAI/MIT Press, 1996.
  • [6] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.
  • [7] R. Agrawal and R. Srikant. Privacy preserving data mining. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 439–450, Dallas, Texas, May 2000.
  • [8] R. Bayardo. Efficiently mining long patterns from databases. In Proc. of the ACM SIGMOD Conference on Management of Data, Seattle, Washington, 1998.
  • [9] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.
  • [10] Business Week. Privacy on the Net, March 2000.
  • [11] C. Clifton and D. Marks. Security and privacy implications of data mining. In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 15–19, May 1996.
  • [12] R. Conway and D. Strip. Selective partial access to a database. In Proc. ACM Annual Conf., pages 85–89, 1976.
  • [13] L. Cranor, J. Reagle, and M. Ackerman. Beyond concern: Understanding net users' attitudes about online privacy. Technical Report TR 99.4.3, AT&T Labs Research, April 1999.
  • [14] L. F. Cranor, editor. Special Issue on Internet Privacy. Comm. ACM, 42(2), Feb. 1999.
  • [15] The Economist. The End of Privacy, May 1999.
  • [16] V. Estivill-Castro and L. Brankovic. Data swapping: Balancing privacy against precision in mining for logic rules. In M. Mohania and A. Tjoa, editors, Data Warehousing and Knowledge Discovery (DaWaK-99), pages 389–398. Springer-Verlag Lecture Notes in Computer Science 1676, 1999.
  • [17] European Union. Directive on Privacy Protection, October 1998.
  • [18] Y. Lindell and B. Pinkas. Privacy preserving data mining. In CRYPTO, pages 36–54, 2000.
  • [19] T. M. Mitchell. Machine Learning, chapter 6. McGraw-Hill, 1997.
  • [20] Office of the Information and Privacy Commissioner, Ontario. Data Mining: Staking a Claim on Your Privacy, January 1998.
  • [21] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
  • [22] A. Shoshani. Statistical databases: Characteristics, problems and some solutions. In VLDB, pages 208–213, Mexico City, Mexico, September 1982.
  • [23] K. Thearling. Data mining and privacy: A conflict in making. DS*, March 1998.
  • [24] Time. The Death of Privacy, August 1997.
  • [25] J. Vaidya and C. W. Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proc. of the 8th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, Edmonton, Canada, July 2002.
  • [26] S. Warner. Randomized response: A survey technique for eliminating evasive answer bias. J. Am. Stat. Assoc., 60(309):63–69, March 1965.
  • [27] A. Westin. E-commerce and privacy: What net users want. Technical report, Louis Harris & Associates, June 1998.
  • [28] A. Westin. Privacy concerns & consumer choice. Technical report, Louis Harris & Associates, Dec. 1998.
  • [29] A. Westin. Freebies and privacy: What net users think. Technical report, Opinion Research Corporation, July 1999.