Predicting Uncorrectable Memory Errors for Proactive Replacement: An Empirical Study on Large-Scale Field Data

2020 16th European Dependable Computing Conference (EDCC), 2020

Abstract
Uncorrectable memory errors are a leading cause of server failures in datacenters. Predicting uncorrectable errors (UEs) from historical correctable error (CE) information enables proactive replacement of memory hardware before catastrophic events happen. In this paper, we perform an empirical study of UE prediction on large-scale field data from more than 30,000 contemporary servers in Tencent datacenters over an 8-month period. We demonstrate that the traditional approach based on CE rate performs poorly, with low precision. We then leverage detailed micro-level CE information to design several new predictors. The comparative study shows that the new predictor based on column fault identification boosts the baseline precision by a factor of more than 300% and at the same time substantially improves the baseline recall.
Keywords
memory reliability, uncorrectable error prediction, proactive replacement
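
The contrast the abstract draws between a CE-rate predictor and a predictor based on column fault identification can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `CE` record layout, the function names, the thresholds, and in particular the working definition of a column fault (CEs at the same bank and column of a DIMM spread over several distinct rows). The paper's actual criteria and observation windows may differ.

```python
from collections import defaultdict
from typing import NamedTuple


class CE(NamedTuple):
    """One correctable-error record (hypothetical layout)."""
    dimm: str
    bank: int
    row: int
    column: int


def predict_by_ce_rate(events, threshold):
    """Baseline sketch: flag a DIMM whose CE count reaches `threshold`."""
    counts = defaultdict(int)
    for e in events:
        counts[e.dimm] += 1
    return {dimm for dimm, n in counts.items() if n >= threshold}


def predict_by_column_fault(events, min_rows):
    """Micro-level sketch: flag a DIMM when a single (bank, column) pair
    shows CEs in at least `min_rows` distinct rows (assumed definition)."""
    rows_seen = defaultdict(set)
    for e in events:
        rows_seen[(e.dimm, e.bank, e.column)].add(e.row)
    return {key[0] for key, rows in rows_seen.items() if len(rows) >= min_rows}


def precision_recall(predicted, actual):
    """Precision and recall of predicted DIMMs against DIMMs that had a UE."""
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall
```

Under these assumptions, a DIMM with many scattered CEs is flagged by the rate-based rule but not by the column rule, which is one way a micro-level predictor could raise precision without necessarily lowering recall.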