Query Optimization Using Column Statistics In Hive

Anja Gruenheid,Edward Omiecinski,Leo Mark

IDEAS（2011）

引用 15|浏览49

暂无评分

摘要

Hive is a data warehousing solution on top of the Hadoop MapReduce framework that has been designed to handle large amounts of data and store them in tables like a relational database management system or a conventional data warehouse while using the parallelization and batch processing functionalities of the Hadoop MapReduce framework to speed up the execution of queries. Data inserted into Hive is stored in the Hadoop FileSystem (HDFS), which is part of the Hadoop MapReduce framework. To make the data accessible to the user, Hive uses a query language similar to SQL, which is called HiveQL. When a query is issued in HiveQL, it is translated by a parser into a query execution plan that is optimized and then turned into a series of map and reduce iterations. These iterations are then executed on the data stored in the HDFS, writing the output to a file.The goal of this work is to to develop an approach for improving the performance of the HiveQL queries executed in the Hive framework. For that purpose, we introduce an extension to the Hive MetaStore which stores metadata that has been extracted on the column level of the user database. These column level statistics are then used for example in combination with join ordering algorithms which are adapted to the specific needs of the Hadoop MapReduce environment to improve the overall performance of the HiveQL query execution.

查看译文

关键词

Query Optimization,Column Statistics,Hive,Hadoop MapReduce

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要