Bridging the Gap between HPC and Big Data frameworks.

Michael J. Anderson, Shaden Smith, Narayanan Sundaram, Mihai Capota, Zheguang Zhao, Subramanya Dulloor, Nadathur Satish, Theodore L. Willke

PVLDB (2017)

Abstract

Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower than native implementations written with high-performance computing tools such as MPI. There is a need to bridge this performance gap while retaining the benefits of the Spark ecosystem, such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides 3.1× to 17.7× speedups on the four sparse applications, with all overheads included. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.

Keywords

HPC, data
