HealthLog Monitor: Errors, Symptoms and Reactions Consolidated

IEEE Transactions on Device and Materials Reliability (2019)

Abstract
Advances in reliability research have produced novel techniques for early identification of upcoming failures (both long-term and short-term), as well as sophisticated isolation and mitigation of error occurrences. Existing error monitoring tools are primitive: they usually deliver only very basic error information and do not allow all system health-related information to be combined under the same context. The absence of common tools becomes more noticeable when newer, more aggressive and dynamic schemes are involved, such as processor undervolting or overclocking. In this paper, we present the HealthLog monitor, a flexible Linux system monitoring service that offers a generic abstraction layer for combining health-related information with action features. HealthLog can monitor hardware metrics (performance, sensors, and errors) as well as external health-related data, allowing combined symptom description and reaction features supported by an API. In addition, HealthLog can be programmed to react and initiate mitigation, isolation, and recovery actions. The aim of the monitor is to offer a universal, platform-independent standard for error reporting and system monitoring mechanisms across all system layers. HealthLog is designed as a cross-platform service with built-in support for the X-Gene processor family and S.M.A.R.T. HDD monitoring, and it can be easily extended to support new processors and interfaces. We showcase how HealthLog can be used in two real-life scenarios that consider system aging and system undervolting.
Keywords
Monitoring, Hardware, Tools, Computer architecture, Materials reliability, Linux