An Iterative Cleaning Method for Redundant Data in Multi-Source Heterogeneous Data Based on Decision Tree Algorithm

Jiaxuan Hou, Wenqi Huang, Lingyu Liang, Shang Cao, Huanming Zhang, Xiangyu Zhao, Hanju Li

2023 4th International Conference on Information Science, Parallel and Distributed Systems (ISPDS) (2023)

Abstract

Conventional iterative cleaning methods for multi-source heterogeneous data reject redundant data mainly by calculating the degree of data repetition. Because they do not reduce the data first, their cleaning effect is poor. To address this, an iterative cleaning method based on the decision tree algorithm is proposed for redundant data in multi-source heterogeneous data. The information gain of the redundant data is calculated, and the attribute with the highest value is used as the classification criterion to construct a decision tree that classifies the data. A deep belief network model is then constructed to extract the mixed features of the data. Finally, the data is reduced and iteratively cleaned by applying repetition judgment rules to the data tables. Experiments verify the cleaning effect of the proposed method. The results show that, when the proposed method is used to clean multi-source heterogeneous data, the data checking accuracy is high and the cleaning performance is good.
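The information-gain step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record fields (`source`, `same_key`, `redundant`), the toy data, and the function names are hypothetical, chosen only to show how the attribute with the highest information gain would be selected as a split criterion.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, label_key):
    """Entropy reduction obtained by splitting `rows` on `attribute`."""
    base = entropy([row[label_key] for row in rows])
    # Group the class labels by the value each row has for the attribute.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[label_key])
    # Weighted entropy of the groups that remain after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_split_attribute(rows, attributes, label_key):
    """Return the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, label_key))


# Hypothetical toy records merged from two sources (assumed for illustration).
records = [
    {"source": "crm", "same_key": "yes", "redundant": "yes"},
    {"source": "crm", "same_key": "no", "redundant": "no"},
    {"source": "erp", "same_key": "yes", "redundant": "yes"},
    {"source": "erp", "same_key": "no", "redundant": "no"},
]

best = best_split_attribute(records, ["source", "same_key"], "redundant")
```

In this toy data, splitting on `same_key` separates redundant from non-redundant records perfectly, while `source` carries no information about redundancy, so `same_key` would become the root split of the tree.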

Keywords

Decision tree algorithm, Multi-source heterogeneous data, Redundant data, Data cleaning
