A batch process for high dimensional imputation

Computational Statistics (2024)

Abstract
This paper describes a correlation-based batch process for addressing high dimensional imputation problems. Relatively few algorithms are designed to handle imputation of missing data efficiently in high dimensional contexts. Fewer still natively handle mixed-type data; most require lengthy pre-processing to get the data into proper shape and post-processing to return it to a usable form. Such requirements, along with the assumptions many methods make about the data generating process, limit their performance, flexibility, and usability. Building on a set of complementary algorithms for nonparametric imputation via chained random forests, I introduce a batching process that eases the computational cost of high dimensional imputation by subsetting data based on ranked cross-feature absolute correlations. The algorithm imputes each batch separately and joins the imputed subsets in the final step. The process, hdImpute, is fast and accurate. As a result, high dimensional imputation is more accessible, and researchers are not forced to choose between speed and accuracy. Complementary software is available as an R package, openly developed on GitHub under the MIT license. In the spirit of open science, collaboration and engagement with the actively developing software are encouraged.
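The batching idea described in the abstract can be illustrated with a short R sketch. This is a minimal illustration under simplifying assumptions (numeric columns only, mean absolute correlation as the ranking criterion, roughly equal-sized batches), not the hdImpute package's implementation or API; missRanger stands in for the chained random forest imputer, and the function name batch_impute and its arguments are hypothetical.

library(missRanger)

batch_impute <- function(data, n_batches = 4, seed = 1) {
  set.seed(seed)

  # Cross-feature absolute correlations, tolerating missing cells
  cmat <- abs(cor(data, use = "pairwise.complete.obs"))

  # Rank features by their mean absolute correlation with the others
  ranked <- names(sort(rowMeans(cmat, na.rm = TRUE), decreasing = TRUE))

  # Cut the ranked feature list into roughly equal batches
  batches <- split(ranked, cut(seq_along(ranked), n_batches, labels = FALSE))

  # Impute each batch separately with chained random forests
  imputed <- lapply(batches, function(cols) missRanger(data[, cols, drop = FALSE]))

  # Join the imputed subsets and restore the original column order
  do.call(cbind, unname(imputed))[, names(data)]
}

In this sketch, a larger n_batches trades a smaller per-batch memory and runtime footprint against less cross-batch information being available to each imputation model.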
Keywords
Imputation, High dimensional data, Chained random forests, Missing data