Spongefiles: Mitigating Data Skew In Mapreduce Using Distributed Memory

SIGMOD/PODS'14: International Conference on Management of Data, Snowbird, Utah, USA, June 2014

Abstract
Data skew is a major problem for data processing platforms like MapReduce. Skew causes worker tasks to spill to disk whatever they cannot fit in memory, which slows down both the task and the overall job. Moreover, the performance of other jobs sharing the same disk degrades. In many cases, this situation occurs even though the cluster has plenty of spare memory; it is just not used evenly. We introduce SpongeFiles, a novel distributed-memory abstraction tailored to data processing environments like MapReduce. A SpongeFile is a logical byte array composed of large chunks that can be stored in a variety of locations in the cluster. Spilled data goes to SpongeFiles, which route it to the nearest location with sufficient capacity (local memory, remote memory, local disk, or remote disk as a last resort). By enabling memory-sapped nodes to tap into the spare capacity of their neighbors, SpongeFiles minimize expensive disk spilling, thereby improving performance. In our experiments with Hadoop and Pig, SpongeFiles reduce overall job runtimes by up to 55%, and by up to 85% under disk contention.
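The routing rule in the abstract (send each spilled chunk to the nearest location with sufficient capacity) can be illustrated with a small sketch. This is not the authors' implementation: the class names `Location` and `SpongeFileSketch`, the `chunk_size` parameter, and the in-process byte accounting are all hypothetical, chosen only to show the tier preference order the abstract lists.

```python
from dataclasses import dataclass, field
from typing import List

# Preference order taken from the abstract: nearest location first,
# remote disk as a last resort.
TIERS = ["local_memory", "remote_memory", "local_disk", "remote_disk"]


@dataclass
class Location:
    """Hypothetical storage location with a fixed free capacity in bytes."""
    tier: str
    free_bytes: int
    chunks: List[bytes] = field(default_factory=list)

    def try_store(self, chunk: bytes) -> bool:
        # Accept the chunk only if it fits entirely in the remaining capacity.
        if len(chunk) > self.free_bytes:
            return False
        self.free_bytes -= len(chunk)
        self.chunks.append(chunk)
        return True


class SpongeFileSketch:
    """Illustrative logical byte array made of chunks placed across tiers."""

    def __init__(self, locations: List[Location], chunk_size: int):
        # Sort candidate locations by the assumed tier preference.
        self.locations = sorted(locations, key=lambda loc: TIERS.index(loc.tier))
        self.chunk_size = chunk_size
        self.placement: List[str] = []  # tier chosen for each chunk, in order

    def write(self, data: bytes) -> None:
        # Split spilled data into chunks and place each in the first
        # (nearest) location that has room for it.
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            for loc in self.locations:
                if loc.try_store(chunk):
                    self.placement.append(loc.tier)
                    break
            else:
                raise IOError("no location has sufficient capacity")
```

In this sketch, a task whose local memory is exhausted falls through to a neighbor's memory before touching any disk, which mirrors the motivation stated in the abstract; a real system would also need to handle remote transfer, failures, and reclamation, none of which are modeled here.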