Cleaning Uncertain Data With Crowdsourcing - A General Model With Diverse Accuracy Rates

Chen Zhang, Haodi Zhang, Weiteng Xie, Nan Liu, Qifan Li, Kaishun Wu, Di Jiang, Peiguang Lin, Lei Chen

IEEE Transactions on Knowledge and Data Engineering (2022)

Abstract

Since inaccuracies commonly exist in many applications, data uncertainty has become an important problem in database systems. To deal with data uncertainty, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with confidence. However, the results from a query or mining process may not be reliable when the uncertainty propagates through the system. In this paper, we leverage the power of crowdsourcing by designing a set of Human Intelligence Tasks, or HITs for short, to ask a crowd to improve the quality of uncertain data. In particular, we consider crowds consisting of workers with diverse accuracy rates when answering the HITs. We design solutions to maximize the data quality with a minimal number of HITs. There are two obstacles to this non-trivial optimization, which lead to a very high computational cost for selecting the optimal set of HITs. First, members of a crowd may return incorrect answers with different probabilities. Second, the HITs decomposed from uncertain data are often correlated. We address these challenges by designing an effective approximation algorithm and an efficient heuristic solution, especially for crowds with diverse individual accuracy rates. To further improve efficiency, we derive tight lower and upper bounds for effective filtering and estimation. Extensive experiments on both a simulated crowd and a real crowdsourcing platform are conducted to evaluate our solutions.

Keywords

Crowdsourcing, cleaning uncertain data, approximation algorithm, heuristic algorithm
