A Method of Pruning and Random Replacing of Known Values for Comparing Missing Data Imputation Models for Incomplete Air Quality Time Series

Luis Alfonso Menendez Garcia, Marta Menendez Fernandez, Violetta Sokola-Szewiola, Laura Alvarez de Prado, Almudena Ortiz Marques, David Fernandez Lopez, Antonio Bernardo Sanchez

Applied Sciences-Basel (2022)

Abstract

Data from air quality monitoring stations, which are widely used in data mining studies, suffer from missing values. This paper presents a study on missing data imputation that analyses which of the most common methods best imputes values for the available data set. The comparison uses an algorithm that randomly replaces every known value in the data set exactly once with an imputed value, forming several subsets, and compares the imputed values with the actual known values. Data from seven stations in the Silesian region (Poland) were analysed, covering five years of hourly concentrations of four pollutants: nitrogen dioxide (NO2), nitrogen oxides (NOx), particles of 10 µm or less (PM10) and sulphur dioxide (SO2). Imputations were performed using linear imputation (LI), predictive mean matching (PMM), random forest (RF), k-nearest neighbours (k-NN) and Kalman smoothing on structural time series (Kalman), and the performance of each method was evaluated. Once the comparison method was validated, it was determined that, in general, Kalman structural smoothing and linear imputation best fitted the imputed values to the data pattern. Each imputation method was observed to behave in an analogous way across the different stations. The variables with the best results are NO2 and SO2. The UMI method is the worst imputer for missing values in the data sets.
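
The pruning and random replacing idea described in the abstract can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not the authors' implementation: the function names (`linear_impute`, `mean_impute`, `masked_rmse`), the number of subsets, and the use of RMSE as the error measure are all hypothetical choices made here. The known values are shuffled and split into disjoint subsets, each subset is pruned in turn and re-imputed, so every known value is replaced exactly once and can be compared with its actual value.

```python
import numpy as np


def linear_impute(values):
    """Fill NaNs by linear interpolation (edge gaps take the nearest known value)."""
    values = np.asarray(values, dtype=float)
    idx = np.arange(values.size)
    known = ~np.isnan(values)
    filled = values.copy()
    filled[~known] = np.interp(idx[~known], idx[known], values[known])
    return filled


def mean_impute(values):
    """Illustrative baseline only (an assumption, not a method from the paper)."""
    values = np.asarray(values, dtype=float)
    filled = values.copy()
    filled[np.isnan(values)] = np.nanmean(values)
    return filled


def masked_rmse(series, impute_fn, n_subsets=5, seed=0):
    """Prune each known value exactly once, re-impute it, and return the RMSE."""
    series = np.asarray(series, dtype=float)
    rng = np.random.default_rng(seed)
    known_idx = np.flatnonzero(~np.isnan(series))
    shuffled = rng.permutation(known_idx)
    predicted = np.full(series.size, np.nan)
    # Disjoint subsets: every known index lands in exactly one of them.
    for subset in np.array_split(shuffled, n_subsets):
        pruned = series.copy()
        pruned[subset] = np.nan  # hide this subset of known values
        predicted[subset] = impute_fn(pruned)[subset]
    errors = predicted[known_idx] - series[known_idx]
    return float(np.sqrt(np.mean(errors ** 2)))


if __name__ == "__main__":
    # Synthetic hourly-like signal with a few genuinely missing points.
    t = np.arange(200)
    series = np.sin(2 * np.pi * t / 50)
    series[[10, 11, 60, 120]] = np.nan
    for name, fn in [("linear", linear_impute), ("mean", mean_impute)]:
        print(name, masked_rmse(series, fn))
```

Any imputer with the same array-in, array-out shape could be passed as `impute_fn`, so the same masking scheme ranks different models on identical pruned subsets.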

Keywords

imputation, linear imputation, predictive mean matching, random forest, k-nearest neighbours, Kalman smoothing, air quality, air pollution
