DataWig: Missing Value Imputation for Tables

Felix Biessmann, Tammo Rukat, Phillipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, David Salinas

Journal of Machine Learning Research (2019)

Abstract

With the growing importance of machine learning (ML) algorithms for practical applications, reducing data quality problems in ML pipelines has become a major focus of research. In many cases, missing values can break data pipelines, which makes completeness one of the most impactful data quality challenges. Current missing value imputation methods focus on numerical or categorical data and can be difficult to scale to datasets with millions of rows. We release DataWig, a robust and scalable approach for missing value imputation that can be applied to tables with heterogeneous data types, including unstructured text. DataWig combines deep learning feature extractors with automatic hyperparameter tuning. This enables users without a machine learning background, such as data engineers, to impute missing values with minimal effort in tables with more heterogeneous data types than existing libraries support, while requiring less glue code for feature engineering and offering more flexible modelling options. We demonstrate that DataWig compares favourably to existing imputation packages. Source code, documentation, and unit tests for this package are available at: github.com/awslabs/datawig
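
To make the idea of imputing a column from unstructured text concrete, the following is a minimal sketch in plain Python. It is not DataWig's implementation or API: it swaps in character n-gram hashing and a nearest-centroid classifier in place of the deep learning feature extractors and hyperparameter tuning described in the abstract. The function names (`ngram_hash_features`, `cosine`, `impute_column`) and the row-as-dict table layout are assumptions made for illustration.

```python
import zlib


def ngram_hash_features(text, n=3, dim=64):
    """Hash character n-grams of a string into a fixed-size count vector."""
    vec = [0.0] * dim
    padded = f" {text.lower()} "
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def impute_column(rows, input_column, output_column):
    """Return copies of rows with missing (None) output values filled in.

    Hypothetical helper: rows is a list of dicts, input_column holds text,
    output_column holds a categorical label or None.
    """
    # "Training": sum the feature vectors of all rows sharing an observed label.
    centroids = {}
    for row in rows:
        label = row[output_column]
        if label is None:
            continue
        feats = ngram_hash_features(str(row[input_column] or ""))
        acc = centroids.setdefault(label, [0.0] * len(feats))
        for j, value in enumerate(feats):
            acc[j] += value

    # "Prediction": assign each missing cell the label of the most similar centroid.
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[output_column] is None and centroids:
            feats = ngram_hash_features(str(row[input_column] or ""))
            new_row[output_column] = max(
                centroids, key=lambda lab: cosine(feats, centroids[lab])
            )
        filled.append(new_row)
    return filled
```

Calling `impute_column(rows, "title", "color")` on a list of product rows would fill each missing `color` from the text in `title` and leave observed values untouched. A real system would add numerical and categorical featurizers, a learned model, and validation of imputation quality, which is the kind of work the package is described as automating.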

Keywords

missing value imputation, deep learning, heterogeneous data
