Failure Prediction of Jobs in Compute Clouds: A Google Cluster Case Study

ISSRE Workshops(2014)

Abstract
Most cloud computing clusters are built from unreliable, commercial off-the-shelf components. The high failure rates in their hardware and software components result in frequent node and application failures. Therefore, it is important to predict application failures before they occur to avoid resource wastage. In this paper, we investigate how to identify application failures based on resource usage measurements from the Google cluster traces. We apply recurrent neural networks to the resource usage measures, and generate features to categorize the input resource usage time series into different classes. Our results show that the model is able to predict failures of batch applications, which are the dominant jobs in the Google cluster. Moreover, we explore early classification to identify failures, and find that the prediction algorithm gives the cloud system enough time to take proactive actions well before applications terminate, with average resource savings of 6% to 10%.
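The early-classification idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the recurrent neural network is replaced by a hypothetical threshold rule (`toy_classifier`), and the function names, the usage values, and the definition of "saving" (the share of a job's total usage that falls after the point where failure is flagged) are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence


def early_stop_index(
    usage: Sequence[float],
    predict_failure: Callable[[Sequence[float]], bool],
    min_prefix: int = 1,
) -> Optional[int]:
    """Return the shortest prefix length at which the classifier flags failure.

    Returns None if no prefix of the series is flagged.
    """
    for t in range(min_prefix, len(usage) + 1):
        if predict_failure(usage[:t]):
            return t
    return None


def resource_saving(usage: Sequence[float], stop_at: Optional[int]) -> float:
    """Fraction of total usage avoided if the job is stopped after `stop_at` intervals."""
    total = sum(usage)
    if stop_at is None or total == 0:
        return 0.0
    return sum(usage[stop_at:]) / total


def toy_classifier(prefix: Sequence[float]) -> bool:
    """Hypothetical stand-in for a trained model: flag when mean usage exceeds 5."""
    return sum(prefix) / len(prefix) > 5


# Made-up per-interval resource usage for one job (arbitrary units).
usage = [2, 4, 9, 9, 8, 8]
stop = early_stop_index(usage, toy_classifier)
saving = resource_saving(usage, stop)
print(stop, saving)
```

With a real model, `predict_failure` would wrap the trained classifier's decision on a partial time series; the earlier it fires on jobs that do go on to fail, the larger the share of usage that can be reclaimed.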
Keywords
compute clouds, resource wastage, cloud reliability, application failure, failure prediction, software components, failure rates, google cluster case study, job failure prediction, fault tolerant computing, resource savings, application termination, cloud computing clusters, application failures, hardware components, commercial off-the-shelf components, prediction algorithm, cloud computing, cloud system