Onion: identifying incident-indicating logs for cloud systems

Foundations of Software Engineering(2021)

引用 8|浏览43
暂无评分
摘要
ABSTRACTIn cloud systems, incidents affect the availability of services and require quick mitigation actions. Once an incident occurs, operators and developers often examine logs to perform fault diagnosis. However, the large volume of diverse logs and the overwhelming details in log data make the manual diagnosis process time-consuming and error-prone. In this paper, we propose Onion, an automatic solution for precisely and efficiently locating incident-indicating logs, which can provide useful clues for diagnosing the incidents. We first point out three criteria for localizing incident-indicating logs, i.e., Consistency, Impact, and Bilateral-Difference. Then we propose a novel agglomeration of logs, called log clique, based on which these criteria are satisfied. To obtain log cliques, we develop an incident-aware log representation and a progressive log clustering technique. Contrast analysis is then performed on the cliques to identify the incident-indicating logs. We have evaluated Onion using well-labeled log datasets. Onion achieves an average F1-score of 0.95 and can process millions of logs in only a few minutes, demonstrating its effectiveness and efficiency. Onion has also been successfully applied to the cloud system of Microsoft. Its practicability has been confirmed through the quantitative and qualitative analysis of the real incident cases.
更多
查看译文
关键词
Log analysis, Incident-indicating logs identification, Cloud computing, Fault diagnosis
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要