Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Patricia C. Arocena, Boris Glavic, Giansalvatore Mecca, Renée J. Miller, Paolo Papotti, Donatello Santoro

Proceedings of the VLDB Endowment (2016)

Abstract

We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
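
To make the idea of controlled error generation concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it only illustrates one simple greedy strategy for injecting errors that are guaranteed to be detectable by a single functional dependency. The function names (`violations`, `inject_errors`), the `zip`/`city` schema, and the choice to mark a dirty value by appending `*` are all assumptions made for illustration.

```python
import random

def violations(rows, lhs, rhs):
    """Return index pairs (i, j) that violate the functional dependency lhs -> rhs."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if rows[i][lhs] == rows[j][lhs] and rows[i][rhs] != rows[j][rhs]:
                pairs.append((i, j))
    return pairs

def inject_errors(rows, lhs, rhs, n_errors, seed=0):
    """Greedily dirty up to n_errors cells in column rhs (hypothetical helper).

    Assumes the input satisfies lhs -> rhs and that rhs values are strings.
    A row is changed only if another, still-clean row shares its lhs value,
    so every injected error takes part in at least one violation.
    """
    rng = random.Random(seed)
    dirty = [dict(r) for r in rows]      # copy so the clean input is untouched
    changed = []
    candidates = list(range(len(dirty)))
    rng.shuffle(candidates)
    for i in candidates:
        if len(changed) == n_errors:
            break
        has_witness = any(
            j != i and j not in changed and dirty[j][lhs] == dirty[i][lhs]
            for j in range(len(dirty))
        )
        if not has_witness:
            continue                     # the change would not be detectable
        dirty[i][rhs] = dirty[i][rhs] + "*"
        changed.append(i)
    return dirty, changed

clean = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
]
dirty, changed = inject_errors(clean, "zip", "city", n_errors=2)
print(changed, violations(dirty, "zip", "city"))
```

The sketch is greedy and incomplete in the same spirit as the abstract describes: it may inject fewer errors than requested when no clean witness row remains, and the pairwise violation check is quadratic, so it would not scale to millions of tuples without further optimization.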
