ImRP: A Predictive Partition Method for Data Skew Alleviation in Spark Streaming Environment

Zhongming Fu,Zhuo Tang,Li Yang,Kenli Li,Keqin Li

parallel computing（2020）

引用 6|浏览17

暂无评分

摘要

Abstract Spark Streaming is an extension of the core Spark engine that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It treats stream as a series of deterministic batches and handles them as regular jobs. However, for a stream job responsible for a batch, data skew (i.e., the imbalance in the amount of data allocated to each reduce task), can degrade the job performance significantly because of load imbalance. In this paper, we propose an improved range partitioner (ImRP) to alleviate the reduce skew for stream jobs in Spark Streaming. Unlike previous work, ImRP does not require any pre-run sampling of input data and generates the data partition scheme based on the intermediate data distribution estimated by the previous batch processing, in which a prediction model EWMA (Exponentially Weighted Moving Average) is adopted. To lighten the data skew, ImRP presents a novel method of calculating the partition borders optimally, and a mechanism of splitting the border key clusters when the semantics of shuffle operators permit. Besides, ImRP considers the integrated partition size and heterogeneity of computing environments when balancing the load among reduce tasks appropriately. We implement ImRP in Spark-3.0 and evaluate its performance on four representative benchmarks: wordCount, sort, pageRank, and LDA. The results show that by mitigating the data skew, ImRP can decrease the execution time of stream jobs substantially compared with some other partition strategies, especially when the skew degree of input batch is serious.

查看译文

关键词

MapReduce, Data skew, Load imbalance, Spark Streaming

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要