A Pipeline Framework for Heterogeneous Execution Environment of Big Data Processing

IEEE Software (2016)

Abstract
Many real-world data analysis scenarios require pipelining and integrating multiple (big) data wrangling and analytics jobs, which often execute in heterogeneous environments such as MapReduce, Spark, and R/Python/Bash scripts. Such a pipeline needs a large amount of glue code to move data across environments, which makes it difficult to maintain and evolve. Existing pipeline frameworks that try to solve these problems are usually built on top of a single environment and/or require the original jobs to be rewritten against a new API or paradigm. In this article, we propose Pipeline61, a framework that supports building data pipelines that involve heterogeneous execution environments. Pipeline61 reuses the existing code of jobs already deployed in different environments and also provides version control and dependency management to deal with typical software engineering issues. A real-world case study shows the effectiveness of Pipeline61 over the state of the art.
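As a rough illustration of the idea of reusing existing jobs behind one interface, the sketch below wraps an external program and an in-process function as interchangeable steps and chains them. This is not Pipeline61's API: the names Pipe, process_pipe, function_pipe, and run_pipeline are hypothetical, data is passed as plain text through stdin/stdout, and a second Python process stands in for a job living in another environment.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pipe:
    """Hypothetical uniform wrapper around one existing job."""
    name: str
    run: Callable[[str], str]  # takes input text, returns output text


def process_pipe(name: str, argv: List[str]) -> Pipe:
    """Wrap an unmodified external program that reads stdin and writes stdout."""
    def run(data: str) -> str:
        done = subprocess.run(
            argv, input=data, capture_output=True, text=True, check=True
        )
        return done.stdout
    return Pipe(name, run)


def function_pipe(name: str, fn: Callable[[str], str]) -> Pipe:
    """Wrap an existing in-process function without changing it."""
    return Pipe(name, fn)


def run_pipeline(pipes: List[Pipe], data: str) -> str:
    """Feed each step's output to the next step, in order."""
    for pipe in pipes:
        data = pipe.run(data)
    return data


# Stand-in for a script in a different environment (assumption: any
# stdin-to-stdout program could be substituted here).
upper = process_pipe(
    "upper",
    [sys.executable, "-c",
     "import sys; sys.stdout.write(sys.stdin.read().upper())"],
)
# Existing in-process analytics step: count whitespace-separated tokens.
count = function_pipe("count", lambda text: str(len(text.split())))

uppercased = run_pipeline([upper], "a b c")
result = run_pipeline([upper, count], "a b c")
```

The point of the design is that the glue (how data moves between steps) lives in the wrappers, so the wrapped jobs themselves do not have to be rewritten.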