Cleaning Data With Constraints And Experts

PROCEEDINGS OF THE 21ST WORKSHOP ON THE WEB AND DATABASES (WEBDB 2018)(2018)

引用 16|浏览29
暂无评分
摘要
Popular techniques for data cleaning use integrity constraints to identify errors in the data and to automatically resolve them, e. g. by using predefined priorities among possible updates and finding a minimal repair that will resolve violations. Such automatic solutions however cannot ensure precision of the repairs since they do not have enough evidence about the actual errors and may in fact lead to wrong results with respect to the ground truth. It has thus been suggested to use domain experts to examine the potential updates and choose which should be applied to the database. However, the sheer volume of the databases and the large number of possible updates that may resolve a given constraint violation, may make such a manual examination prohibitory expensive. The goal of the DANCE system presented here is to help to optimize the experts work and reduce as much as possible the number of questions (updates verification) they need to address. Given a constraint violation, our algorithm identifies the suspicious tuples whose update may contribute (directly or indirectly) to the constraint resolution, as well as the possible dependencies among them. Using this information it builds a graph whose nodes are the suspicious tuples and whose weighted edges capture the likelihood of an error in one tuple to occur and affect the other. PageRankstyle algorithm then allows us to identify the most beneficial tuples to ask about first. Incremental graph maintenance is used to assure interactive response time. We implemented our solution in the DANCE system and show its effectiveness and efficiency through a comprehensive suite of experiments.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要