Efficient

David Rebollo Monedero, Ahmad Mohamad Mezher, Xavier Casanova Colomé, Jordi Forné, Miguel Soriano

Information Sciences: an International Journal (2019)

Abstract
• The primary goal of this work is to reduce the running time of k-anonymous microaggregation algorithms operating on datasets with a large number of numerical demographic attributes acting as quasi-identifiers. Principal component analysis (PCA), an algebraic-statistical procedure that constructs an orthogonal projection onto a lower-dimensional subspace, permits the effective reduction of the number of attributes of the original dataset. The optimality principles of multivariate PCA strive to preserve Euclidean distances between the projected data points.
• The compressed data is fed to the microaggregation algorithm, but the k-anonymous microcells or groups obtained are directly applied to the original data. The distance-preservation properties of multivariate PCA help construct a micropartition of the set of respondents similar to that obtained when the original data is microaggregated in the conventional fashion, but in fewer dimensions.
• As a result, we achieve significant time gains (≈ 14–31%) with very little impact on information utility (< 2% of the total variance), compared with the traditional procedure on the original data.
• Additional variants of the above method are devised and analyzed with extensive experimentation on standardized datasets, in terms of running time and information loss, pushing the already substantial speed-up even further (≈ 48–64%), with mild distortion impact (< 3% of the total variance).

k-Anonymous microaggregation is a widespread technique to address the problem of protecting the privacy of the respondents involved beyond the mere suppression of their identifiers, in applications where preserving the utility of the information disclosed is critical. Unfortunately, microaggregation methods with high data utility may impose stringent computational demands when dealing with datasets containing a large number of records and attributes.
This work proposes and analyzes various anonymization methods which draw upon the algebraic-statistical technique of principal component analysis (PCA), in order to effectively reduce the number of attributes processed, that is, the dimension of the multivariate microaggregation problem at hand. By preserving to a high degree the energy of the numerical dataset and carefully choosing the number of dominant components to process, we manage to achieve remarkable reductions in running time and memory usage with negligible impact on information utility. Our methods are readily applicable to high-utility statistical disclosure control (SDC) of large-scale datasets with numerical demographic attributes. © 2019 The Authors. Preprint submitted to Elsevier, Inc.
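
The pipeline described above (project with PCA, group the respondents in the reduced space, then apply the resulting cells to the original data) can be illustrated with a minimal sketch. It is not the authors' implementation: the grouping step is a simplified fixed-size heuristic standing in for whatever microaggregation algorithm is actually used, the function names are hypothetical, the 0.9 energy threshold is an arbitrary choice, and measuring distortion as the within-cell sum of squared errors over the total sum of squares is an assumed reading of "with respect to the total variance".

```python
import numpy as np

def pca_project(X, energy=0.9):
    """Project centered data onto the fewest principal components
    whose cumulative variance share reaches `energy` (assumed criterion)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    share = np.cumsum(s ** 2) / np.sum(s ** 2)
    m = min(int(np.searchsorted(share, energy)) + 1, len(s))
    return Xc @ Vt[:m].T, m

def microaggregate_cells(Z, k):
    """Simplified greedy grouping (a stand-in, assumes len(Z) >= k):
    repeatedly take the point farthest from the centroid of the
    remaining points and group it with its nearest neighbours."""
    labels = np.full(len(Z), -1)
    remaining = np.arange(len(Z))
    cell = 0
    while len(remaining) >= 2 * k:
        pts = Z[remaining]
        far = np.argmax(((pts - pts.mean(axis=0)) ** 2).sum(axis=1))
        dist = ((pts - pts[far]) ** 2).sum(axis=1)
        nearest = np.argsort(dist)[:k]
        labels[remaining[nearest]] = cell
        remaining = np.delete(remaining, nearest)
        cell += 1
    labels[remaining] = cell  # last cell holds between k and 2k-1 records
    return labels

def anonymize(X, labels):
    """Replace each ORIGINAL record by the centroid of its cell."""
    Xa = np.empty_like(X, dtype=float)
    for c in np.unique(labels):
        mask = labels == c
        Xa[mask] = X[mask].mean(axis=0)
    return Xa

def relative_distortion(X, Xa):
    """Within-cell squared error as a fraction of the total sum of squares."""
    sse = ((X - Xa) ** 2).sum()
    sst = ((X - X.mean(axis=0)) ** 2).sum()
    return sse / sst

# Synthetic correlated data, used only to exercise the sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

Z, m = pca_project(X, energy=0.9)      # cells are computed in m dimensions
labels = microaggregate_cells(Z, k=5)  # partition found on projected data
Xa = anonymize(X, labels)              # centroids taken on original data
distortion = relative_distortion(X, Xa)
```

The design point is that only the partition is computed in the lower-dimensional space; the centroids that replace each record are computed on the original attributes, so the released data keeps its full dimensionality.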
Keywords
Data privacy, Statistical disclosure control, k-anonymity, Microaggregation, Principal component analysis, Large-scale datasets