Improving the performance of Hadoop Hive by sharing scan and computation tasks

Tansel Dokeroglu, Serkan Ozal, Murat Ali Bayir, Muhammet Serkan Cinar, Ahmet Cosar

J. Cloud Computing (2014)

Abstract

MapReduce is a popular programming model for executing time-consuming analytical queries as a batch of tasks on large scale data clusters. In environments where multiple queries with similar selection predicates, common tables, and join tasks arrive simultaneously, many opportunities can arise for sharing scan and/or join computation tasks. Executing common tasks only once can remarkably reduce the total execution time of a batch of queries. In this study, we propose a Multiple Query Optimization framework, SharedHive, to improve the overall performance of Hadoop Hive, an open source SQL-based data warehouse using MapReduce. SharedHive transforms a set of correlated HiveQL queries into a new set of insert queries that will produce all of the required outputs within a shorter execution time. It is experimentally shown that SharedHive achieves significant reductions in total execution times of TPC-H queries.
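
To make the idea of rewriting correlated queries into a shared set of insert queries more concrete, here is a minimal sketch in Python. It is not SharedHive's actual algorithm: it only groups toy single-table queries by the table they scan and emits one Hive multi-insert statement (`FROM ... INSERT OVERWRITE TABLE ... SELECT ...`) per group, so that each table is read once. The `SimpleQuery` class, the `merge_shared_scans` function, and the output table names are hypothetical, introduced purely for illustration; join sharing is not covered.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SimpleQuery:
    """A hypothetical, heavily simplified single-table query."""
    source_table: str   # table the query scans
    select_list: str    # projected columns or aggregates
    where: str          # selection predicate
    output_table: str   # table that receives the result


def merge_shared_scans(queries):
    """Group queries by source table and build one Hive multi-insert
    statement per table, so each table is scanned once per group."""
    by_table = defaultdict(list)
    for q in queries:
        by_table[q.source_table].append(q)

    statements = []
    for table, group in by_table.items():
        lines = [f"FROM {table}"]
        for q in group:
            lines.append(
                f"INSERT OVERWRITE TABLE {q.output_table} "
                f"SELECT {q.select_list} WHERE {q.where}"
            )
        statements.append("\n".join(lines) + ";")
    return statements


if __name__ == "__main__":
    # Two queries over the TPC-H lineitem table share a single scan.
    batch = [
        SimpleQuery("lineitem", "l_orderkey", "l_quantity > 40", "out_a"),
        SimpleQuery("lineitem", "sum(l_extendedprice)", "l_discount < 0.05", "out_b"),
    ]
    for stmt in merge_shared_scans(batch):
        print(stmt)
```

In this sketch the two `lineitem` queries collapse into a single statement with two `INSERT` clauses, which is the general shape of the "new set of insert queries" the abstract describes.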

Keywords

Hadoop, Hive, Data warehouse, Multiple-query optimization
