Interactive and Deterministic Data Cleaning
SIGMOD/PODS'16: International Conference on Management of Data San Francisco California USA June, 2016, pp. 893-907, 2016.
Abstract:
We present Falcon, an interactive, deterministic, and declarative data cleaning system, which uses SQL update queries as the language to repair data. Falcon does not rely on the existence of a set of pre-defined data quality rules. On the contrary, it encourages users to explore the data, identify possible problems, and make updates to fi...
Introduction
- High quality data is important to all businesses, and data cleaning is an important but tedious step.
- Removing errors in order to get high quality data takes most of data analysts’ time [31], and some studies predict a shortage of people with the skills and the know-how for these tasks [33].
- In the evolving scenario of data cleaning, these approaches show a serious limitation.
- They assume that data quality rules are declared upfront by domain experts who understand the data and write logical formulas or procedural code.
- These systems have fallen short in terms of adoption in industrial tools.
Highlights
- High quality data is important to all businesses, and data cleaning is an important but tedious step
- Besides using traditional one-hop SQL-based traversal algorithms (e.g., breadth-first search or depth-first search), we describe novel multi-hop search algorithms such that can
- Rule-based data repairing consists of using integrity constraints to identify data errors [11, 12, 17, 25, 40], and automated algorithms to enforce these constraints over the data [7, 22, 23, 32, 43]
- In order to efficiently manage all potential updates, and effectively interact with users, we propose Falcon, which works as follows
- Although more sophisticated combinations are possible, we found that the simple sum gives a global overview of the algorithms' behaviour that is close to the real overall experience of the users
- While we discover rules using any combination of columns, Refine either generates rules for the entire column, which is unlikely to hold for data errors, or rules that update a single tuple
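
The idea of generalizing one user repair into rules over "any combination of columns" can be illustrated with a small sketch. It is not Falcon's implementation: the function name `candidate_updates`, the table name, and the example columns are hypothetical, and the sketch assumes that each candidate is an SQL update whose WHERE clause fixes some subset of the repaired tuple's original attribute values. Under that assumption, the empty subset corresponds to updating the entire column and the full subset to updating a single tuple.

```python
from itertools import combinations


def candidate_updates(table, row, attr, new_value):
    """Enumerate candidate UPDATE queries that generalize one user repair.

    `row` holds the tuple's original values; `attr` is the repaired column.
    One candidate is produced per subset of columns (2**n in total).
    This is an illustrative assumption, not the paper's exact procedure.
    """
    attrs = sorted(row)
    candidates = []
    for size in range(len(attrs) + 1):
        for subset in combinations(attrs, size):
            sql = f"UPDATE {table} SET {attr} = ?"
            if subset:
                sql += " WHERE " + " AND ".join(f"{c} = ?" for c in subset)
            # Parameters: the new value first, then the matched old values.
            params = [new_value] + [row[c] for c in subset]
            candidates.append((sql, params))
    return candidates


# Hypothetical repair: the user changes city from "NYC" to "New York".
row = {"name": "a", "city": "NYC", "zip": "10001"}
candidates = candidate_updates("t", row, "city", "New York")
for sql, params in candidates:
    print(sql, params)
```

The number of candidates grows exponentially with the number of columns, which is why the order in which candidates are examined matters.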
Methods
- The authors conducted five experiments.
- (i) Exp-1 compares the benefits of the various lattice-traversal algorithms with different budget values, and shows that CoDive maximizes the benefit.
- (ii) Exp-2 studies the impact of different ...
- [Figure: benefit of the DFS, BFS, Ducc, Dive, and CoDive algorithms on the Soccer, Hospital, Synth 10k, Synth 1M, DBLP, and BUS datasets; panel (a): Budget=2]
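
To make the notion of a budget-limited lattice traversal concrete, here is a minimal sketch of a plain breadth-first baseline. It is not CoDive or any other algorithm from the paper. It assumes that the budget is the maximum number of questions put to the user, that `is_valid` is a hypothetical stand-in for the user's yes/no answer, and that a valid general rule makes every more specific rule redundant.

```python
from collections import deque


def bfs_with_budget(attrs, is_valid, budget):
    """Breadth-first walk over subsets of `attrs`, most general first.

    Asks `is_valid` at most `budget` times and returns the accepted
    subsets together with the number of questions actually asked.
    """
    start = frozenset()
    queue = deque([start])
    seen = {start}
    accepted = []
    asked = 0
    while queue and asked < budget:
        node = queue.popleft()
        # A more general accepted rule already covers this node.
        if any(acc <= node for acc in accepted):
            continue
        asked += 1
        if is_valid(node):
            accepted.append(node)
            continue
        # Specialize by adding one more attribute to the condition.
        for a in attrs:
            child = node | {a}
            if a not in node and child not in seen:
                seen.add(child)
                queue.append(child)
    return accepted, asked


# Hypothetical oracle: a rule is valid only if it conditions on "a".
oracle = lambda node: "a" in node
full = bfs_with_budget(["a", "b", "c"], oracle, budget=10)
tight = bfs_with_budget(["a", "b", "c"], oracle, budget=2)
none = bfs_with_budget(["a", "b", "c"], oracle, budget=1)
```

With a small budget the traversal order decides whether any valid rule is found at all, which is the kind of trade-off the first experiment compares across algorithms.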
Conclusion
- Falcon is an interactive and deterministic data cleaning system.
- The authors have demonstrated that Falcon can effectively interact with users to generalize user-solicited updates, and clean up data with a significant benefit w.r.t. the number of required interactions.
- A number of possible future studies using Falcon are apparent.
- The authors plan to extend it by using external sources, as remarked in Appendix B.
- The authors will leverage the information obtained from previous interactions with the user across multiple data updates.
Tables
- Table 1: Dataset Tdrug with drug tests
- Table 2: A 2-way contingency table
- Table 3: Notations used in the paper
- Table 4: Features of node DML
- Table 5: Correlation of attributes in Soccer dataset when Stadium is updated
- Table 6: Comparison of the lattice search algorithms with B = 3: U is the number of user updates, A is the number of user answers, and |Q(T)| is the total number of errors
- Table 7: Comparison of the baselines. Here T is the total interaction cost for the user, Rep is the number of repaired
Related work
- Data transformation. Interactive systems for data transformation [27,37,44] also reason about the updated attribute to learn transformation rules. They mainly focus on string manipulation and reformatting at the text level. In contrast, we use more expressive SQL scripts. Consequently, we discover not only rules that contain one attribute that is being updated syntactically, but also rules that combine multiple attributes to semantically determine new repairs. Our language and algorithms can lead to smaller interaction cost, as discussed in Section 6 Exp-3.
Funding
- This work was partly supported by the 973 Program of China (2015CB358700), NSF of China (61422205, 61472198), Huawei, Shenzhou, Tencent, FDCT/116/2013/A3, MYRG105(Y1-L3)-FST13-GZ, National High-Tech R&D (863) Program of China (2012AA012600), and the Chinese Special Project of Science and Technology (2013zx01039-002-002).
Reference
- [2] A. Abouzied, J. M. Hellerstein, and A. Silberschatz. Playful query specification with dataplay. PVLDB, 5(12):1938–1941, 2012.
- [4] B. Alexe, L. Chiticariu, R. J. Miller, and W. C. Tan. Muse: Mapping understanding and design by example. In ICDE, pages 10–19, 2008.
- [6] P. C. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, and D. Santoro. Messing up with BART: error generation for evaluating data-cleaning algorithms. PVLDB, 9(2), 2015.
- [7] P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, 2005.
- [8] A. Bonifati, R. Ciucanu, and S. Staworko. Interactive inference of join queries. In EDBT, 2014.
- [9] A. Bonifati, R. Ciucanu, and S. Staworko. Interactive join query inference with JIM. PVLDB, 7(13), 2014.
- [10] C. Chang and C. Lin. LIBSVM: A library for support vector machines. ACMTIST, 2(3):27, 2011.
- [11] F. Chiang and R. J. Miller. Discovering data quality rules. PVLDB, 1(1), 2008.
- [12] X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13), 2013.
- [13] M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: a commodity data cleaning system. In SIGMOD, 2013.
- [14] O. Deshpande, D. S. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan. Building, maintaining, and using knowledge bases: A report from the trenches. In SIGMOD, 2013.
- [15] A. Ebaid, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, J. Quiane-Ruiz, N. Tang, and S. Yin. NADEEF: A generalized data cleaning system. PVLDB, 6(12):1218–1221, 2013.
- [16] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst., 33(2), 2008.
- [17] W. Fan, F. Geerts, J. Li, and M. Xiong. Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng., 23(5), 2011.
- [18] W. Fan, F. Geerts, N. Tang, and W. Yu. Inferring data currency and consistency for conflict resolution. In ICDE, 2013.
- [19] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Interaction between record matching and data repairing. In SIGMOD, 2011.
- [20] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. VLDB J., 21(2), 2012.
- [21] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. In VLDB, 2001.
- [22] F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The LLUNATIC data-cleaning framework. PVLDB, 6(9), 2013.
- [23] F. Geerts, G. Mecca, P. Papotti, and D. Santoro. Mapping and Cleaning. In ICDE, pages 232–243, 2014.
- [24] F. Geerts, G. Mecca, P. Papotti, and D. Santoro. That’s all folks! LLUNATIC goes open source. PVLDB, 7(13):1565–1568, 2014.
- [25] L. Golab, H. J. Karloff, F. Korn, B. Saha, and D. Srivastava. Discovering conservation rules. In ICDE, 2012.
- [27] J. Heer, J. M. Hellerstein, and S. Kandel. Predictive interaction for data transformation. In CIDR, 2015.
- [28] A. Heise, J. Quiane-Ruiz, Z. Abedjan, A. Jentzsch, and F. Naumann. Scalable discovery of unique column combinations. PVLDB, 7(4), 2013.
- [29] I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga. CORDS: automatic discovery of correlations and soft functional dependencies. In SIGMOD, pages 647–658, 2004.
- [30] M. Interlandi and N. Tang. Proof positive and negative in data cleaning. In ICDE, 2015.
- [31] S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer. Enterprise data analysis and visualization: An interview study. IEEE Trans. Vis. Comput. Graph., 18(12), 2012.
- [32] Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, J.-A. Quiane-Ruiz, P. Papotti, N. Tang, and S. Yin. BigDansing: a system for big data cleansing. In SIGMOD, 2015.
- [35] L. Qian, M. J. Cafarella, and H. V. Jagadish. Sample-driven schema mapping. In SIGMOD, pages 73–84, 2012.
- [36] G. Ramalingam and T. W. Reps. A categorized bibliography on incremental computation. In POPL, 1993.
- [37] V. Raman and J. M. Hellerstein. Potter’s wheel: An interactive data cleaning system. In VLDB, pages 381–390, 2001.
- [38] Y. Shen, K. Chakrabarti, S. Chaudhuri, B. Ding, and L. Novik. Discovering queries based on example tuples. In SIGMOD, pages 493–504, 2014.
- [39] D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Commun. ACM, 28(2), 1985.
- [40] S. Song and L. Chen. Efficient discovery of similarity constraints for matching dependencies. Data Knowl. Eng., 87, 2013.
- [41] M. Volkovs, F. Chiang, J. Szlichta, and R. J. Miller. Continuous data cleaning. In ICDE, 2014.
- [42] J. Wang, J. Han, and J. Pei. CLOSET+: searching for the best strategies for mining frequent closed itemsets. In SIGKDD, 2003.
- [43] J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In SIGMOD, 2014.
- [44] B. Wu and C. A. Knoblock. An iterative approach to synthesize data transformation programs. In IJCAI, pages 1726–1732, 2015.
- [45] M. Yakout, L. Berti-Equille, and A. K. Elmagarmid. Don’t be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In SIGMOD, pages 553–564, 2013.
- [46] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 2011.
- [47] Z. Yan, N. Zheng, Z. G. Ives, P. P. Talukdar, and C. Yu. Actively soliciting feedback for query answers in keyword search-based data integration. PVLDB, 6(3):205–216, 2013.
- [48] M. J. Zaki and W. Meira. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, 2014.
- [49] M. Zhang, H. Elmeleegy, C. M. Procopiuc, and D. Srivastava. Reverse engineering complex join queries. In SIGMOD, 2013.