Predicting DRAM-Caused Node Unavailability in Hyper-Scale Clouds

Pengcheng Zhang, Yunong Wang, Xuhua Ma, Yaoheng Xu, Bin Yao, Xudong Zheng, Linquan Jiang

2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2022

Abstract
DRAM faults are a major hardware source of cloud node unavailability. To enable early preventive actions and mitigate the impact of DRAM faults, prior studies focus on predicting DRAM uncorrectable errors (UEs), which typically cause immediate node unavailability. In our cloud with over half a million nodes, we observe for the first time that the correctable error (CE) storm, in which numerous CEs occur in a short period, accounts for 56% of DRAM-caused node unavailability (DCNU). Therefore, we propose to predict DCNU, taking into account both UEs and CE storms. Observing that DCNUs are strongly related to the temporal statistics and spatial patterns of CEs, we design novel spatio-temporal features to train the prediction model. Because the model's real-world effect cannot be evaluated by traditional metrics such as F1-score, we propose a new metric, NURR, to quantify the reduction in node unavailability, and we tune the model hyperparameters with NURR. Our approach achieves over 40% better NURR than existing methods on historical data and runs stably in the production environment.
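The abstract describes a CE storm only qualitatively, as numerous CEs in a short period. The following is a minimal sketch of how such a condition might be detected from CE timestamps with a sliding window. The function name `has_ce_storm`, the 60-second window, and the threshold of 10 CEs are hypothetical values chosen for illustration; the abstract does not state the criteria the authors use.

```python
from typing import List


def has_ce_storm(timestamps: List[float],
                 window_s: float = 60.0,   # hypothetical window length, in seconds
                 threshold: int = 10) -> bool:  # hypothetical CE count threshold
    """Return True if any span of at most window_s seconds holds >= threshold CEs."""
    ts = sorted(timestamps)
    start = 0
    for end, t in enumerate(ts):
        # Drop events from the left until the window spans at most window_s seconds.
        while t - ts[start] > window_s:
            start += 1
        # Number of CEs currently inside the window.
        if end - start + 1 >= threshold:
            return True
    return False
```

A per-node CE log could be passed in as a list of epoch timestamps. A production detector would likely also need the spatial location of each CE (for example bank, row, and column), which this sketch ignores.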
Keywords
hyper-scale clouds, DRAM faults, cloud node unavailability, mitigate DRAM fault impacts, DRAM uncorrectable errors, immediate node unavailability, half a million nodes, correctable error storm, 56% DRAM-caused node unavailability, prediction model, node unavailability reduction