Building Pipelines for Heterogeneous Execution Environments for Big Data Processing.

Dongyao Wu, Liming Zhu, Xiwei Xu, Sherif Sakr, Daniel Sun, Qinghua Lu

IEEE Software (2016)

Abstract

Many real-world data analysis scenarios require pipelining and integration of multiple (big) data-processing and data-analytics jobs, which often execute in heterogeneous environments, such as MapReduce; Spark; or R, Python, or Bash scripts. Such pipelines require substantial glue code to move data across environments, and maintaining and evolving them is difficult. Pipeline frameworks that try to solve such problems are usually built on top of a single environment, and they might require rewriting the original job to fit a new API or paradigm. The Pipeline61 framework supports the building of data pipelines involving heterogeneous execution environments. It reuses the existing code of jobs already deployed in different environments and provides version control and dependency management to deal with typical software engineering issues. A real-world case study shows its effectiveness. This article is part of a special issue on Software Engineering for Big Data Systems.

Keywords

Big data, Pipeline processing, Software engineering, Context modeling, Data analysis, Programming, Software development
