I/O load balancing for big data HPC applications

Arnab K. Paul, Arpit Goyal, Feiyi Wang, Sarp Oral, Ali R. Butt, Michael J. Brim, Sangeetha B. Srinivasa

2017 IEEE International Conference on Big Data (Big Data), 2017

Abstract
High Performance Computing (HPC) big data problems require efficient distributed storage systems. However, at scale, such storage systems often experience load imbalance and resource contention due to two factors: the bursty nature of scientific application I/O, and the complex I/O path that lacks centralized arbitration and control. For example, the extant Lustre parallel file system, which supports many HPC centers, comprises numerous components connected via custom network topologies, and serves varying demands of a large number of users and applications. Consequently, some storage servers can be more loaded than others, which creates bottlenecks and reduces overall application I/O performance. Existing solutions typically focus on per-application load balancing, and thus are not as effective given their lack of a global view of the system. In this paper, we propose a data-driven approach to load balance the I/O servers at scale, targeted at Lustre deployments. To this end, we design a global mapper on the Lustre Metadata Server, which gathers runtime statistics from key storage components on the I/O path, and applies Markov chain modeling and a minimum-cost maximum-flow algorithm to decide where data should be placed. Evaluation using a realistic system simulator and a real setup shows that our approach yields better load balancing, which in turn can improve end-to-end performance.
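The abstract's placement step, choosing which storage servers receive new data by solving a minimum-cost flow problem, can be illustrated with a small self-contained sketch. This is not the paper's implementation: the OST names, load values, and per-server stripe capacities below are hypothetical, and the solver is a generic successive-shortest-path routine, shown only to make the idea concrete (edge costs reflect server load, so cheaper paths route data to lightly loaded servers).

```python
from collections import defaultdict

def min_cost_flow(edges, source, sink, amount):
    """Send `amount` units from source to sink at minimum total cost,
    via successive shortest augmenting paths (Bellman-Ford on residuals).
    edges: list of (u, v, capacity, cost). Returns (total_cost, net_flow)."""
    cap = defaultdict(int)
    cost = defaultdict(int)
    nodes = set()
    for u, v, c, w in edges:
        cap[(u, v)] += c
        cost[(u, v)] = w
        cost[(v, u)] = -w            # reverse residual edge refunds cost
        nodes.update((u, v))
    flow = defaultdict(int)
    sent = total_cost = 0
    while sent < amount:
        # Bellman-Ford: cheapest augmenting path in the residual graph
        dist = {n: float("inf") for n in nodes}
        prev = {}
        dist[source] = 0
        for _ in range(len(nodes) - 1):
            for (u, v), c in list(cap.items()):
                if c > 0 and dist[u] + cost[(u, v)] < dist[v]:
                    dist[v] = dist[u] + cost[(u, v)]
                    prev[v] = u
        if sink not in prev:
            break                    # no more augmenting paths
        path, v = [], sink
        while v != source:
            path.append((prev[v], v))
            v = prev[v]
        push = min(amount - sent, min(cap[e] for e in path))
        for u, v in path:
            cap[(u, v)] -= push
            cap[(v, u)] += push
            flow[(u, v)] += push
            flow[(v, u)] -= push
        sent += push
        total_cost += push * dist[sink]
    return total_cost, flow

# Hypothetical OST load metrics (higher = more loaded). Edge costs steer
# new stripes toward lightly loaded servers; capacity bounds stripes/OST.
ost_load = {"ost0": 10, "ost1": 40, "ost2": 25}
edges = [("src", ost, 2, load) for ost, load in ost_load.items()]
edges += [(ost, "sink", 2, 0) for ost in ost_load]

cost, flow = min_cost_flow(edges, "src", "sink", amount=4)
placement = {ost: flow[("src", ost)] for ost in ost_load}
print(placement)  # the heavily loaded ost1 receives no new stripes
```

In this toy instance the four stripes land on the two lightly loaded OSTs (two each) while the heavily loaded one is skipped; the paper's global mapper additionally feeds Markov-chain load predictions into such costs rather than raw instantaneous load.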
Keywords
big data HPC applications,High Performance Computing big data problems,load imbalance,resource contention,scientific application,I/O path,centralized arbitration,HPC centers,numerous components,custom network topologies,storage servers,application load balancing,data-driven approach,I/O servers,Lustre deployments,Lustre Metadata Server,key storage components,realistic system simulator,end-to-end performance,distributed storage systems,extant Lustre parallel file system,Markov chain modeling