Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance

Parallel Computing (2022)

Abstract
Improving OLAP (Online Analytical Processing) query performance in a distributed system built on Hadoop is a challenging task. An OLAP cube query comprises several relational operations, such as selection, join, and group-by aggregation. It is well known that star join and group-by aggregation are the most costly operations in a Hadoop database system. These operations increase network traffic and may overflow memory; to overcome these difficulties, numerous partitioning and data load balancing techniques have been proposed in the literature. However, some issues remain open, such as reducing the number of Spark stages and the network I/O of an OLAP query executed on a distributed system. In previous work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, such that the system's query optimizer can perform a star join locally, in only one Spark stage and without a shuffle phase. The system can also skip loading unnecessary data blocks when evaluating predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve group-by aggregation. To evaluate our approach, we conduct experiments on a 15-node cluster. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.
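The idea of a star join that runs locally, without a shuffle, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation and does not use Spark or Hadoop: it assumes a single dimension table, integer join keys, and a simple modulo placement rule, and every name in it (`NUM_PARTITIONS`, `co_partition`, `local_join_and_aggregate`, the `cust_id`, `region`, and `amount` fields) is hypothetical. The point is only that when fact rows and dimension rows with the same key are placed in the same partition, each partition can join and pre-aggregate on its own, and only small partial aggregates need to be merged afterwards.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical number of partitions (e.g. one per node)


def co_partition(rows, key_field, n=NUM_PARTITIONS):
    """Place each row in partition (key mod n), so equal keys share a partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key_field] % n].append(row)
    return parts


def local_join_and_aggregate(fact_part, dim_part):
    """Join one fact partition with its co-located dimension partition,
    then compute a partial SUM(amount) GROUP BY region."""
    dim_index = {d["cust_id"]: d for d in dim_part}
    partial = defaultdict(int)
    for f in fact_part:
        d = dim_index.get(f["cust_id"])
        if d is not None:  # inner join: drop fact rows with no matching dimension row
            partial[d["region"]] += f["amount"]
    return dict(partial)


def star_join_group_by(fact_rows, dim_rows):
    """Run the join partition by partition, then merge the partial aggregates."""
    fact_parts = co_partition(fact_rows, "cust_id")
    dim_parts = co_partition(dim_rows, "cust_id")
    totals = defaultdict(int)
    for fact_part, dim_part in zip(fact_parts, dim_parts):
        # No fact or dimension row crosses a partition boundary here.
        for region, amount in local_join_and_aggregate(fact_part, dim_part).items():
            totals[region] += amount
    return dict(totals)


# Illustrative data only (not from the paper).
customers = [
    {"cust_id": 1, "region": "EU"},
    {"cust_id": 2, "region": "US"},
    {"cust_id": 3, "region": "EU"},
    {"cust_id": 4, "region": "ASIA"},
]
sales = [
    {"cust_id": 1, "amount": 10},
    {"cust_id": 2, "amount": 20},
    {"cust_id": 3, "amount": 5},
    {"cust_id": 1, "amount": 7},
    {"cust_id": 4, "amount": 3},
    {"cust_id": 5, "amount": 100},  # no matching customer, dropped by the join
]

result = star_join_group_by(sales, customers)
```

In this sketch the merge of partial aggregates stands in for the final group-by step; a real distributed engine would still need to combine per-partition results, but the volume exchanged is one value per group rather than the joined rows themselves.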
Keywords
Big data warehouse, Spark stage, Star join, OLAP query, Balancing reducer loads, Group-by