Analyzing WLCG File Transfer Errors Through Machine Learning

Computing and Software for Big Science (2022)

Abstract
The growing scale of modern computing infrastructures calls for smarter, more automated approaches to their management. Our work focuses on file transfer failures within the Worldwide Large Hadron Collider Computing Grid (WLCG) and proposes a pipeline that supports distributed data management operations by suggesting potential issues to investigate. Specifically, we adopt an unsupervised learning approach that uses Natural Language Processing and Machine Learning tools to parse error messages automatically and group similar failures. The results are presented as a summary table of the most common textual patterns, together with time evolution charts. This approach has two main advantages. First, jointly processing the error string and the transfer's source and destination enables more informative and compact troubleshooting than inspecting each site and checking unique messages separately. As a by-product, it also reduces the number of errors to check by orders of magnitude (from unique error strings to unique categories or patterns). Second, the time evolution plots allow operators to immediately filter out secondary issues (e.g. transient ones or those already being resolved) and focus on the most serious problems first (e.g. escalating failures). As a preliminary assessment, we compare our results with the Global Grid User Support ticketing system, showing that most of our suggestions correspond to real issues (direct association) and that they cover 89% of reported incidents (inverse relationship).
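The idea of collapsing unique error strings into shared textual patterns, keyed jointly with source and destination, can be illustrated with a minimal sketch. This is not the pipeline described in the abstract: it swaps the unsupervised NLP and clustering step for simple regex masking of variable tokens, and the masking rules, function names, site names, and sample messages are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace variable parts of a message with
# placeholders so that messages differing only in those parts coincide.
MASKS = [
    (re.compile(r"[a-z][a-z0-9+.-]*://\S+"), "<URL>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_pattern(message):
    """Lowercase the message, mask variable tokens, normalize whitespace."""
    pattern = message.lower()
    for regex, placeholder in MASKS:
        pattern = regex.sub(placeholder, pattern)
    return " ".join(pattern.split())


def summarize(failures):
    """Count failures per (pattern, source, destination), most common first."""
    counts = Counter((to_pattern(msg), src, dst) for msg, src, dst in failures)
    return counts.most_common()


# Made-up sample data: (error message, source site, destination site).
failures = [
    ("Connection timed out after 30 seconds to srm://host-a.example.org/file1",
     "SITE_A", "SITE_B"),
    ("Connection timed out after 45 seconds to srm://host-a.example.org/file2",
     "SITE_A", "SITE_B"),
    ("Checksum mismatch for file 1234", "SITE_C", "SITE_B"),
]

summary = summarize(failures)
for (pattern, src, dst), count in summary:
    print(count, src, dst, pattern)
```

In this toy input, three unique error strings reduce to two (pattern, source, destination) groups, which is the kind of compression the summary table relies on, here at a much smaller scale.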
Keywords
WLCG file transfer errors, machine learning