Differential Data Quality Verification on Partitioned Data

2019 IEEE 35th International Conference on Data Engineering (ICDE)(2019)

引用 17|浏览32
暂无评分
摘要
Modern companies and institutions rely on data to guide every single decision. Missing or incorrect information seriously compromises any decision process. In previous work, we presented Deequ, a Spark-based library for automating the verification of data quality at scale. Deequ provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables "unit tests for data". However, we found that the previous computational model of Deequ is not flexible enough for many scenarios in modern data pipelines, which handle large, partitioned datasets. Such scenarios require the evaluation of dataset-level quality constraints after individual partition updates, without having to re-read already processed partitions. Additionally, such scenarios often require the verification of data quality on select combinations of partitions. We therefore present a differential generalization of the computational model of Deequ, based on algebraic states with monoid properties. We detail how to efficiently implement the corresponding operators and aggregation functions in Apache Spark. Furthermore, we show how to optimize the resulting workloads to minimize the required number of passes over the data, and empirically validate that our approach decreases the runtimes for updating data metrics under data changes and for different combinations of partitions.
更多
查看译文
关键词
Computational modeling,Data integrity,Data models,Frequency estimation,Analytical models,Libraries
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要