K-Means Parallel Acceleration for Sparse Data Dimensions on Flink.

Zihao Zeng, Kenli Li, Mingxing Duan, Chubo Liu, Xiangke Liao

HPCC/SmartCity/DSS (2019)

Abstract

The K-means algorithm is a clustering algorithm that is widely used in various applications, and its running time increases dramatically as the data size grows. When the volume of data exceeds what a single machine can handle, the algorithm must be run in parallel on a distributed computing framework. During parallel execution, task running times generally differ because of data skew, and the progress of the entire job is determined by the task with the longest running time. In this paper, we propose an optimal data partitioning method for applying the K-means algorithm to datasets with sparse dimensions, which eliminates the data skew problem and further accelerates the parallel execution of the algorithm. Experimental evaluation on large-scale text datasets demonstrates the effectiveness of our partitioning approach on Flink.
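To make the skew argument concrete, here is a minimal sketch of why the slowest task dominates and how balancing by workload helps. It rests on assumptions that are not stated in the abstract: the cost of processing a sparse record is taken to be proportional to its number of non-zero entries, and the balancing step is a generic greedy longest-processing-time heuristic used purely for illustration, not the partitioning method proposed in the paper. The function names and the sample data are hypothetical.

```python
import heapq


def round_robin_loads(costs, n_parts):
    """Assign records to partitions by position only, ignoring their cost."""
    loads = [0] * n_parts
    for i, cost in enumerate(costs):
        loads[i % n_parts] += cost
    return loads


def greedy_balanced_loads(costs, n_parts):
    """Generic greedy heuristic (assumption, not the paper's method):
    place the most expensive records first, each on the partition
    that currently has the smallest total load."""
    heap = [(0, p) for p in range(n_parts)]  # (current load, partition id)
    heapq.heapify(heap)
    loads = [0] * n_parts
    for cost in sorted(costs, reverse=True):
        load, p = heapq.heappop(heap)
        loads[p] = load + cost
        heapq.heappush(heap, (loads[p], p))
    return loads


# Hypothetical non-zero counts per sparse record (assumed proxy for cost).
nnz = [90, 5, 80, 5, 70, 5, 60, 5]

skewed = round_robin_loads(nnz, 2)
balanced = greedy_balanced_loads(nnz, 2)

# The job finishes when the most heavily loaded partition finishes.
print("round-robin loads:", skewed, "slowest task:", max(skewed))
print("balanced loads:   ", balanced, "slowest task:", max(balanced))
```

Both assignments give each partition four records, yet the position-based split puts every dense record on the same partition. Under the assumed cost model, equal record counts therefore do not imply equal running times, which is the skew the abstract refers to.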

Keywords

K-means, Flink, Sparse vector, Data skew
