Fault Detection and Localization in Distributed Systems Using Recurrent Convolutional Neural Networks.
ADVANCED DATA MINING AND APPLICATIONS, ADMA 2017 (2017)
Abstract
Early detection of faults is essential to maintaining the reliability of a distributed system. While there are many solutions for detecting faults, handling the high dimensionality and uncertainty of system observations well enough to make an accurate detection remains a challenge. In this paper, we address this challenge with a two-dimensional convolutional neural network in the form of a denoising autoencoder with recurrent neural networks that performs simultaneous fault detection and diagnosis based on real-time system metrics from a given distributed system (e.g., CPU usage and memory consumption). The model provides a unified way to automatically learn useful features and make adaptive inferences regarding the onset of faults, without hand-crafted feature extraction or human diagnostic expertise. In addition, we develop a Bayesian change point detection approach for fault localization, in order to support the fault recovery process. We conducted extensive experiments in a real distributed environment on Amazon EC2, and the results demonstrate that our proposal outperforms a variety of state-of-the-art machine learning algorithms that are used for fault detection and diagnosis in distributed systems.
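The abstract does not spell out the change point formulation, so the following is only a minimal sketch of the general idea behind Bayesian change point detection, not the authors' method. It assumes a single metric series with exactly one shift in its mean, Gaussian noise of known variance, a Gaussian prior on each segment mean, and a uniform prior over the change location. The function names, parameter values, and the synthetic "CPU usage" trace are all hypothetical and chosen for illustration.

```python
import numpy as np


def segment_log_evidence(x, noise_var=1.0, prior_mean=0.0, prior_var=1e4):
    """Log marginal likelihood of one segment (illustrative assumption).

    Model: x_i = mu + eps_i, eps_i ~ N(0, noise_var), mu ~ N(prior_mean, prior_var).
    The unknown mean mu is integrated out in closed form.
    """
    m = x.size
    d = x - prior_mean
    quad = np.sum(d ** 2) - prior_var * np.sum(d) ** 2 / (noise_var + m * prior_var)
    return (
        -0.5 * m * np.log(2.0 * np.pi * noise_var)
        - 0.5 * np.log1p(m * prior_var / noise_var)
        - 0.5 * quad / noise_var
    )


def change_point_posterior(x, **model_kwargs):
    """Posterior over the index tau at which the mean changes.

    tau = t means x[:t] and x[t:] are modeled as separate segments.
    A uniform prior over tau in 1..n-1 is assumed.
    """
    x = np.asarray(x, dtype=float)
    taus = np.arange(1, x.size)
    log_post = np.array(
        [
            segment_log_evidence(x[:t], **model_kwargs)
            + segment_log_evidence(x[t:], **model_kwargs)
            for t in taus
        ]
    )
    log_post -= log_post.max()  # stabilize before exponentiating
    post = np.exp(log_post)
    return taus, post / post.sum()


# Synthetic trace (not data from the paper): the mean jumps at sample 60.
rng = np.random.default_rng(0)
cpu = np.concatenate([rng.normal(20.0, 1.0, 60), rng.normal(35.0, 1.0, 40)])

taus, post = change_point_posterior(cpu)
map_tau = int(taus[np.argmax(post)])
print("most probable change index:", map_tau)
```

In a localization setting, one could run such a posterior per metric or per node and compare where the probability mass concentrates; how the paper actually combines these signals is not stated in the abstract.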