Predicting failures in distributed cloud-based systems

SREIS '13 Proceedings of the 14th Annual Information Security Symposium (2013)

Abstract
Distributed cloud-based systems consist of a set of geographically distributed routers organized in an overlay network, and they promise to deliver high-quality networking services to their customers (e.g., packet delivery within 200 ms to or from any client). To meet this requirement, the overlay network needs to be functional 24/7. Even a minor failure, such as a routing path that goes down for a couple of seconds, can negatively impact the performance of the system. However, to date there are few methods to predict or adaptively prevent failures in these distributed systems. In this poster, we analyzed 2 TB of distributed system log files to identify example "failures" (i.e., signatures) that can be used to develop automated prediction methods via machine learning. Although the majority of the log information consists of normal behavior, we were able to characterize an important "outage" type of event, in which a significant number of customers jointly drop or change configurations in the network. Based on this pattern definition, we identified several new examples of outage problems in the data. Using this new set of training examples, we are now working on automated methods to discriminate among different types of failures and to predict possible outages ahead of time, before they lead to large-scale failures.
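
As a rough illustration of the "outage" pattern described above (many customers dropping or changing configurations at about the same time), a minimal sketch could flag moments at which enough distinct customers have shown such events within a short window. The record layout, the event labels, the window length, and the customer threshold are all assumptions made for illustration; the abstract does not specify them, and this is not presented as the authors' method.

```python
from collections import deque
from typing import Iterable, List, Tuple

# Hypothetical log record: (timestamp_seconds, customer_id, event_type).
Event = Tuple[float, str, str]

# Assumed event labels; the real log vocabulary is not given in the abstract.
OUTAGE_EVENT_TYPES = {"drop", "config_change"}


def find_outage_windows(
    events: Iterable[Event],
    window_s: float = 60.0,   # assumed window length
    min_customers: int = 3,   # assumed threshold for "a significant number"
) -> List[Tuple[float, float, int]]:
    """Return (window_start, window_end, n_customers) for every event at which
    at least `min_customers` distinct customers dropped or changed
    configuration within the preceding `window_s` seconds."""
    recent = deque()  # (timestamp, customer_id) pairs inside the window
    flagged = []
    for ts, customer, kind in sorted(events):
        if kind not in OUTAGE_EVENT_TYPES:
            continue  # normal behavior, which makes up most of the log
        recent.append((ts, customer))
        # Discard events that have fallen out of the sliding window.
        while recent and ts - recent[0][0] > window_s:
            recent.popleft()
        customers = {c for _, c in recent}
        if len(customers) >= min_customers:
            flagged.append((recent[0][0], ts, len(customers)))
    return flagged
```

Windows flagged this way could serve as candidate training examples for a classifier, with the thresholds tuned against outages that have already been identified by hand.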