Data Diff: Interpretable, Executable Summaries of Changes in Distributions for Data Wrangling.

Charles A. Sutton,Timothy Hobson,James Geddes,Rich Caruana

KDD（2018）

引用 49|浏览56

暂无评分

摘要

Many analyses in data science are not one-off projects, but are repeated over multiple data samples, such as once per month, once per quarter, and so on. For example, if a data scientist performs an analysis in 2017 that saves a significant amount of money, then she will likely to be asked to perform the same analysis on data from 2018. But more data analyses means more effort spent in data wrangling. We introduce the data diff problem, which attempts to turn this problem into an opportunity. Comparing the repeated data samples against each other, inconsistencies may be indicative of underlying issues in data quality. By analogy to text \textttdiff, the data diff problem is to find a "patch", that is, transformation in a specified domain-specific language, that transforms the data samples so that they are identically distributed. We present a prototype tool for data diff that formalizes the problem as a bipartite matching problem, calibrating its parameters using a bootstrap procedure. The tool is evaluated quantitatively and through a case study on an open government data set.

查看译文

关键词

automl,data mining,data wrangling,sequential analysis

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要