Improving Shuffle I/O performance for big data processing using hybrid storage

2017 International Conference on Computing, Networking and Communications (ICNC), 2017

Cited by 5 | Viewed 15
Abstract
Big Data analytics is now widely used in many domains, e.g., weather forecasting, social network analysis, scientific computing, and bioinformatics. As an indispensable part of Big Data analytics, MapReduce has become the de facto standard model for distributed computing frameworks. With the growing complexity of software and hardware components, Big Data analytics systems face performance bottlenecks as computing workloads grow. In our study, we reveal that the Shuffle mechanism in the current Spark implementation remains a performance bottleneck due to Shuffle I/O latency, and we demonstrate that the Shuffle stage causes performance degradation in MapReduce jobs. Observing that high-end Solid State Disks (SSDs) handle random writes well, thanks to efficient flash translation layer algorithms and larger on-board I/O caches, we present a hybrid storage solution that uses hard disk drives (HDDs) to store large datasets and SSDs to improve Shuffle I/O performance, mitigating this degradation. Our extensive experiments with both real-world and synthetic workloads show that the hybrid storage approach improves Shuffle-stage performance over the original HDD-based Spark implementation.
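For context, one common way to realize the hybrid layout the abstract describes is to point Spark's local scratch space (where shuffle map outputs and spill files are written) at an SSD mount while the job's input and output data stay on HDD-backed HDFS. The sketch below is illustrative only, not the paper's implementation; `/mnt/ssd/spark-local` and the HDFS paths are hypothetical, and in cluster deployments `spark.local.dir` is typically set in spark-defaults.conf or overridden by the cluster manager (SPARK_LOCAL_DIRS / LOCAL_DIRS).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch, assuming a node with an SSD mounted at /mnt/ssd.
// spark.local.dir controls where Spark writes shuffle and spill files;
// redirecting it to the SSD separates Shuffle I/O from bulk HDD storage.
val conf = new SparkConf()
  .setAppName("ShuffleOnSSD")
  .set("spark.local.dir", "/mnt/ssd/spark-local") // hypothetical SSD path

val sc = new SparkContext(conf)

// A shuffle-heavy job: the reduceByKey stage's intermediate files
// land on the SSD, while input and output remain on HDD-backed HDFS.
val counts = sc.textFile("hdfs:///data/logs")      // bulk data on HDDs
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)                              // Shuffle stage
counts.saveAsTextFile("hdfs:///out/wordcount")     // results back to HDDs
```

This placement mirrors the paper's motivation: shuffle traffic is dominated by many small random writes, which SSDs absorb well, while sequential scans of large datasets remain economical on HDDs.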
Keywords
Shuffle I/O performance improvement, Big Data processing, MapReduce, Spark implementation, solid state disks, random write handling, flash translation layer algorithms, on-board I/O cache, hybrid storage system-based solution, hard disk drives, HDD, SSDs, performance degradation mitigation, real-world workloads, synthetic workloads