Machine Learning Based Parallel I/O Predictive Modeling: A Case Study on Lustre File Systems

High Performance Computing, ISC High Performance 2018 (2018)

Citations: 21 | Views: 76
Abstract
Parallel I/O hardware and software infrastructure is a key contributor to performance variability for applications running on large-scale HPC systems. This variability confounds efforts to predict application performance for characterization, modeling, optimization, and job scheduling. We propose a modeling approach that improves predictive ability by explicitly treating the variability and by leveraging the sensitivity of performance to application parameters in order to group applications with similar characteristics. We develop a Gaussian process-based machine learning algorithm to model I/O performance and its variability as a function of application and file system characteristics. We demonstrate the effectiveness of the proposed approach using data collected from the Edison system at the National Energy Research Scientific Computing Center. The results show that the proposed sensitivity-based models yield better predictions than application-partitioned or unpartitioned models. We highlight modeling techniques that are robust to the outliers that can occur in production parallel file systems. Using the developed metrics and modeling approach, we provide insights into the file system metrics that have a significant impact on I/O performance.
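As a rough illustration of the kind of modeling the abstract describes (not the authors' actual implementation or dataset), the sketch below fits a Gaussian process regressor to hypothetical application/file-system features and predicts both a mean I/O bandwidth and an uncertainty estimate. All feature names, data, and kernel choices here are invented assumptions for demonstration.

```python
# Hypothetical sketch of Gaussian-process I/O performance modeling,
# in the spirit of the approach described in the abstract. Feature
# names and data are invented for illustration only.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

rng = np.random.default_rng(0)

# Assumed features: [number of I/O processes, aggregate write size (GiB),
# Lustre stripe count] -- stand-ins for the "application and file system
# characteristics" mentioned in the abstract.
X = rng.uniform([16, 1, 1], [1024, 256, 64], size=(200, 3))

# Synthetic I/O bandwidth (GiB/s) with input-dependent noise, mimicking
# the run-to-run variability the paper models explicitly.
y = (0.02 * X[:, 0] + 0.5 * np.log1p(X[:, 2])
     + rng.normal(0.0, 0.5 + 0.002 * X[:, 0]))

# An anisotropic RBF kernel captures the mean trend across features,
# while the WhiteKernel term absorbs observation noise, so the GP's
# predictive standard deviation reflects performance variability.
kernel = ConstantKernel(1.0) * RBF(length_scale=[100.0, 50.0, 10.0]) \
         + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Predict mean bandwidth and its standard deviation for a new configuration.
x_new = np.array([[512, 64, 16]])
mean, std = gp.predict(x_new, return_std=True)
print(f"predicted bandwidth: {mean[0]:.2f} +/- {std[0]:.2f} GiB/s")
```

Note that a plain GP like this one is not robust to outliers; the paper highlights robust Gaussian process regression techniques for production file systems, which would replace the Gaussian likelihood assumed here.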
Keywords
I/O performance variability, Parallel file systems, Machine learning, Robust Gaussian process regression