Unsupervised blocking and probabilistic parallelisation for record matching of distributed big data

Chenxiao Dou, Yi Cui, Daniel Sun, Raymond Wong, Muhammad Atif, Guoqiang Li, Rajiv Ranjan

The Journal of Supercomputing (2017)

Abstract

Record matching refers to identifying pairs of records that relate to the same entities across different data sources. In many data mining applications, record matching is usually associated with quadratic complexity, since every record may have to be compared with every other. In practice, the number of non-matching record pairs far exceeds the number of matching pairs; this is called the imbalance problem. Blocking is a data reduction technique that filters out unlikely matching pairs before record matching. However, there is not yet a fast and effective blocking algorithm for big data. In this paper, we report on using big data infrastructure to improve the efficiency of blocking. Our approach runs the blocking process independently, in a distributed manner, on partitions of the whole data set. To improve efficiency further, we adopt a probabilistic technique to balance the speed and the effectiveness of the distributed blocking algorithm that we propose. Our experimental analysis supports the superiority of our technique and demonstrates its scalability.
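
To make the idea of blocking concrete, the following is a minimal sketch and not the algorithm proposed in the paper, whose details the abstract does not give. It assumes a hypothetical blocking key (the first three letters of a lower-cased name field), hypothetical record fields `id` and `name`, and two hand-made partitions. Each partition is blocked independently, and only records that share a key within a partition become candidate pairs.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key for illustration only: first three letters of the name.
    return record["name"].strip().lower()[:3]


def block_partition(records):
    """Group one partition's records by key and return candidate id pairs."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared later.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Two illustrative partitions of the whole data set (made-up records).
partitions = [
    [
        {"id": 1, "name": "Jonathan Smith"},
        {"id": 2, "name": "Jon Smith"},
        {"id": 3, "name": "Maria Lopez"},
    ],
    [
        {"id": 4, "name": "maria lopes"},
        {"id": 5, "name": "Marc Chen"},
        {"id": 6, "name": "Wei Zhang"},
    ],
]

n = sum(len(p) for p in partitions)
all_pairs = n * (n - 1) // 2  # quadratic number of comparisons without blocking

candidates = set()
for part in partitions:  # each partition is processed on its own
    candidates |= block_partition(part)

print(all_pairs, sorted(candidates))
```

In this toy data, blocking reduces 15 possible comparisons to 2 candidate pairs, but the likely match between records 3 and 4 is missed because they sit in different partitions. This illustrates the general tension between speed and effectiveness that the abstract mentions; how the paper's probabilistic technique addresses it is not described here.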

Keywords

Big data, Record matching, Blocking, Density, Parallelisation
